The key problem is that parsing a document with lxml produces a structured tree, which makes some things easy to scrape (like tables) but can make it harder to find others, such as a single piece of information buried in an element nested inside another element. In such cases, searching the raw text of the document with ordinary regular expressions may be easier - and faster - than using lxml or other tools.
In fact, this worked so quickly and well that it may serve as a useful reminder to others: before investing a lot of time in lxml, BeautifulSoup, or Scrapy, consider whether a regex might be sufficient!
Here is the Python code for scraping standard links (those starting with http):
import re
from urllib2 import urlopen  # Python 2

url = 'http://example.com'  # the address of the web page you want to scrape
html = urlopen(url).read()

# Capture the value of every href attribute, with or without quotes
links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
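The snippet above targets Python 2, where urlopen lives in urllib2 and read() returns a plain string. If you are on Python 3, a roughly equivalent sketch - urlopen comes from urllib.request and the response bytes must be decoded before the regex runs - might look like this (the example.com URL is just a placeholder):

import re
from urllib.request import urlopen

url = 'http://example.com'  # placeholder: the page you want to scrape
html = urlopen(url).read().decode('utf-8', errors='replace')

# Same regex: grab the value of every href attribute, quoted or not
links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
print(links)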