Wednesday, May 9, 2012

Scrape webpages with node.js

I recently had the task of scraping data from a website, so I chose to use node.js in order to get a bit more experience with it. After checking out a few different options for scraping, I finally settled on the project that provided the most robust handling and configuration features I could find.

The hard part about scraping data from websites is coming up with ways to quickly and reliably pick out pieces from the document object model (DOM). These days, I spend a lot of time using the jQuery selector syntax to develop my site, so ideally I wanted a solution that can download a webpage and then provide jQuery-like functions and selectors to pick out pieces from the DOM. For this purpose, the framework uses a project called node-soupselect by default, but I found its selector syntax to be lacking, so I layered another project called cheerio on top. Whatever you do, don't use jsdom: I found it too slow and very strict in its processing of HTML. The framework stands out from the rest of the projects because it applies a 'jobs' approach to scraping, something I used in another project of mine where it worked out really well. In other words, you write a 'job' which gets executed, and if there is a failure during the run, you control what happens next (skip, fail, retry).
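To make the 'jobs' idea concrete, here is a minimal sketch in plain JavaScript (so it runs on node without a compile step). It is not the framework's implementation or API: runJob, the policy callback, the 'skip' / 'fail' / 'retry' strings, and the maxRetries default are all hypothetical names chosen for illustration. It only shows the control flow of deciding what to do after a failure.

```javascript
// Hypothetical sketch of a job runner with a skip / fail / retry policy.
// None of these names come from the scraping framework itself.
async function runJob(inputs, run, policy, maxRetries = 2) {
  const results = [];
  const skipped = [];
  for (const input of inputs) {
    let attempt = 0;
    for (;;) {
      try {
        results.push(await run(input));
        break; // success: move on to the next input
      } catch (err) {
        const action = policy(err, input, attempt);
        if (action === 'retry' && attempt < maxRetries) {
          attempt += 1; // try the same input again
          continue;
        }
        if (action === 'skip' || action === 'retry') {
          // 'retry' reaches this point only when the retry budget is used up
          skipped.push(input);
          break;
        }
        throw err; // 'fail' (or anything else): abort the whole job
      }
    }
  }
  return { results, skipped };
}

// Usage: a stand-in for a page fetch that times out once for input 2
// and always fails for input 3.
const seen = new Set();
async function flakyFetch(id) {
  if (id === 2 && !seen.has(id)) {
    seen.add(id);
    throw new Error('timeout');
  }
  if (id === 3) throw new Error('not found');
  return `page-${id}`;
}
const policy = (err) => (err.message === 'timeout' ? 'retry' : 'skip');

// results: ['page-1', 'page-2'], skipped: [3]
runJob([1, 2, 3], flakyFetch, policy).then(console.log);
```

In this sketch an input whose retries run out is treated as skipped rather than fatal; a stricter job could rethrow instead.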

In order to develop and debug your code, you shouldn't continually hit the website that you are scraping. What I do is download the page I want to scrape and then run a local webserver to serve up the file. I found the Python webserver the easiest to use, as it is just one simple command: python -m SimpleHTTPServer 8000, which serves files from whichever directory you run it in. (That module is Python 2 only; on Python 3 the equivalent is python3 -m http.server 8000.)

For debugging code, I recommend setting up the excellent node-inspector, which lets you set breakpoints in your code and step through to inspect objects as necessary, just as you do in web page JavaScript development. This becomes invaluable with JavaScript because the lack of types makes it hard to know what properties objects have. For logging output to the console so that I could keep track of the execution, I ended up with the nlogger project, which I wasn't super happy with, but it worked well enough for this project.

Writing your first job is easy, and if you are a CoffeeScript (CS) fan, the framework will automatically compile your CS files for you. If you aren't a CS fan, I apologize, as my example is in CS. This simple job uses @get to fetch a page and selects the <title> element:

nodeio = require ''
cheerio = require 'cheerio'
log = require('nlogger').logger(module)

count = 0
class InputScraper extends nodeio.JobClass
    input: [476,1184]
    run: (inputId) ->
        @get "http://localhost:8000/#{inputId}.html", (err, data, headers) =>
            @exit(err) if err?
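            log.info('Started: {}', inputId)
            try
                $ = cheerio.load(data)
                # Assumed from the description that follows: emit the <title> text
                @emit $('title').text()
                # Assumed: bump the counter that the log line reports
                count++
                log.info("(#{count}) Finished inputId: #{inputId}")
            catch error
                log.error('Error: {} : {}', inputId, error.stack)
    output: (data) ->
        # Assumed body: log the emitted data to the console
        log.info('Output: {}', data)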
@class = InputScraper
@job = new InputScraper({spoof:true, max: 1})


The 'run' function is called once for each piece of input data in the array (476 and 1184). @get() fetches the HTML page. On success, @get() executes the callback function, which loads the data into cheerio and uses @emit() to emit the title. When the 'run' function is complete, the framework calls output, which logs the emitted data to the console.

In my own code, the line above the @emit() calls another class function. I pass the cheerio $ object into it, it handles all of my parsing, and the object it returns is what I pass to @emit(). This allows me to re-use the InputScraper boilerplate to parse all sorts of different pages.
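The split between boilerplate and parsing could look something like the following sketch, again in plain JavaScript. The parsers table, the 'article' and 'listing' page kinds, the selectors, and handlePage are all hypothetical; the only thing assumed about $ is the jQuery-style interface that cheerio.load returns (selecting with $('...') and then calling text(), first(), or reading length).

```javascript
// Hypothetical parsers: each takes a cheerio-style `$` and returns a plain object.
// The page kinds and selectors are made up for illustration.
const parsers = {
  article: ($) => ({
    title: $('title').text().trim(),
    heading: $('h1').first().text().trim(),
  }),
  listing: ($) => ({
    title: $('title').text().trim(),
    itemCount: $('li.item').length,
  }),
};

// Shared boilerplate: load the HTML once, delegate to the chosen parser,
// and hand the resulting object to `emit`.
// `load` is expected to behave like cheerio.load; `emit` like the job's emit.
function handlePage(html, kind, load, emit) {
  const $ = load(html);
  emit(parsers[kind]($));
}
```

Inside a job's @get callback this would be called with the downloaded data, cheerio.load, and a function that forwards to @emit, so adding a new kind of page only means adding a new parser.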

I also run things directly with node by calling nodeio.start(@job), so that I can use node-inspector more easily. This means that I end up compiling the CS myself, using the approach from my answer on StackOverflow.

Obviously, this is a pretty simple example, but it should get you up and running with the framework quickly. The speed is fairly impressive: on my desktop and home network, I was able to crawl and extract data from about 2000 webpages in about 1.5 minutes using max: 20. Most of the time is spent downloading the pages; the parsing only takes a few milliseconds per page.
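For a rough sense of those numbers: 2000 pages in about 90 seconds is roughly 22 pages per second, so with 20 requests in flight each page averages a little under a second from start to finish. The sketch below shows one way to cap the number of concurrent downloads in plain JavaScript. It illustrates the general idea behind a setting like max: 20, not how the framework implements it; mapWithLimit and its parameters are hypothetical names.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `max`
// calls in flight at any moment, preserving the order of results.
async function mapWithLimit(items, max, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each lane keeps claiming the next unclaimed item until none are left.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  // Start up to `max` lanes; together they drain the list.
  const lanes = Array.from({ length: Math.min(max, items.length) }, lane);
  await Promise.all(lanes);
  return results;
}
```

With a worker that downloads and parses one page, mapWithLimit(pageIds, 20, worker) would keep 20 downloads going at once, which is where most of the wall-clock time goes.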