Wednesday 10 July 2013

Scraping: Don't forget regex!

I was using Python to scrape some times related to run pacing and split times. lxml.html worked fine for importing web pages with tables to collect data and links (see Wes McKinney's book Python for Data Analysis, p. 166 ff., for a brief introduction). But there was one problem: an important link to the pages with historical split times for each individual was hidden within a table, and it used JavaScript (a window.open pop-up). No matter how hard I tried, I could not make lxml catch the URL of this link. Then, after a few hours, the solution occurred to me: why not simply use regular expressions?
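
Here is a minimal sketch of that idea. The exact markup of the page is an assumption (and the URL is a placeholder), but assuming the table cell contains something like the comment shows, the regex pulls the pop-up URL straight out of the raw HTML:

import re
from urllib2 import urlopen

url = 'http://example.com/results'  # hypothetical address of the results page
html = urlopen(url).read()

# Assuming the cell contains something like
# <a href="#" onclick="window.open('splits.php?id=123', ...)">,
# this captures the URL inside the window.open call:
popup_links = re.findall(r"window\.open\('([^']+)'", html)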

The key problem is that when you parse a document with lxml, it builds a structured tree. That makes it easy to scrape some things (like tables), but it can make it harder to find a single piece of information buried inside an element contained in another element. In such cases, searching the raw HTML as plain text with standard regular expressions may be easier, and faster, than using lxml or other tools.
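
For contrast, this is roughly what the structured route looks like with lxml (a sketch, not the exact code I used; the URL is again a placeholder):

import lxml.html

url = 'http://example.com/results'  # hypothetical results page

# Parse the page into an element tree. Tables are easy to reach this way,
# but a URL buried inside an onclick attribute is not.
doc = lxml.html.parse(url).getroot()
tables = doc.findall('.//table')                      # every <table> element
hrefs = [a.get('href') for a in doc.findall('.//a')]  # ordinary links only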

In fact, this worked so fast and so well that it may serve as a useful reminder to others: before investing a lot of time in lxml, BeautifulSoup or Scrapy, consider whether regex might be sufficient!

Here is the Python code for scraping ordinary links:

import re
from urllib2 import urlopen

url = 'http://example.com/results'  # the address of the page you want to scrape
html = urlopen(url).read()          # fetch the page as one flat string

# Capture the value of every href attribute, quoted or unquoted:
links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
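
Note that this pattern captures every href value on the page, relative links included. If you only want the links starting with http, a simple filter does the trick:

# Keep only absolute links; the regex above also catches relative hrefs:
http_links = [link for link in links if link.startswith('http')]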