Server-side DOM scraping with JavaScript: options

  • Helma NG on App Engine. Looks promising once it's stable; it uses the ServerJS Securable Modules system, Jack and Rhino. I got a basic "fetch a remote file and print it" script working, but haven't yet worked out how to get a DOM from a remote document, or how to import env.js and Sizzle.
  • Stefano Mazzocchi's Sizzle-based scraping app in Acre. Modified NekoHTML parser and env.js, plus Sizzle for selectors. Hosted by Freebase.
  • Jaxer. Develop and run in Aptana Studio, an Eclipse plugin. DOM scraping using the same engine as Firefox 3. Can be deployed to the Aptana Cloud.
  • Headless Firefox 3/XULRunner, with an HTTP socket for communication. Still not sure whether it will run properly headless, without any interaction.

Update: Yahoo! announced YQL Execute yesterday, which allows server-side JavaScript (including CSS 3 selectors) to be executed between DOM fetching and returning YQL results. The only problem with YQL is that, because it obeys robots.txt rules, it's often denied access to web content.
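
To illustrate why a robots.txt-obeying fetcher gets turned away so often, here is a minimal sketch of the kind of check such a service might make before fetching a page. It is a simplification under stated assumptions: only User-agent and Disallow lines are handled, matching is by plain path prefix (no wildcards, no Allow rules), and the names parseRobots, isAllowed and ExampleBot are hypothetical, not anything from YQL.

```javascript
// Simplified robots.txt check: prefix matching only, no wildcards or Allow rules.
// parseRobots and isAllowed are hypothetical helpers, not part of YQL or any library.

function parseRobots(text) {
  const groups = [];        // each group: { agents: [...], disallow: [...] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim();   // strip comments
    if (!line) continue;
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (field === "disallow" && current && value) {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsText, userAgent, path) {
  const groups = parseRobots(robotsText);
  const ua = userAgent.toLowerCase();
  // Prefer a group that names this agent; otherwise fall back to "*".
  const group =
    groups.find(g => g.agents.some(a => a && a !== "*" && ua.includes(a))) ||
    groups.find(g => g.agents.includes("*"));
  if (!group) return true;   // no applicable rules: everything is allowed
  return !group.disallow.some(prefix => path.startsWith(prefix));
}

// Usage with a made-up robots.txt and a made-up crawler name.
const robots = [
  "User-agent: *",
  "Disallow: /private/",
  "",
  "User-agent: ExampleBot",
  "Disallow: /",
].join("\n");

console.log(isAllowed(robots, "ExampleBot/1.0", "/index.html")); // false
console.log(isAllowed(robots, "OtherBot", "/index.html"));       // true
console.log(isAllowed(robots, "OtherBot", "/private/x"));        // false
```

A site that lists a crawler under "Disallow: /" shuts it out completely, so any scraper that honours these rules can't reach that content no matter what selectors run afterwards.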
