
Scrape page after onload JS DOM injection

Problem

I'm building a scraper that finds the main images on a page (currently judged by Content-Length). It goes through all image elements and makes a HEAD request for each. But certain pages, especially mobile ones, have images inserted after page load. Any ideas on how to tackle this?

I'm using node.js.
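
For context, the approach described in the question might be sketched roughly as follows. This is an illustration under assumptions, not the asker's actual code: the helper names (`headContentLength`, `parseContentLength`, `rankBySize`, `rankImages`) are hypothetical, and only Node's built-in `http` and `https` modules are used.

```javascript
// Sketch only: all helper names here are hypothetical.
const http = require('http');
const https = require('https');

// Convert a Content-Length header value to a number, or null if missing/invalid.
function parseContentLength(value) {
  const n = Number.parseInt(value, 10);
  return Number.isNaN(n) ? null : n;
}

// Send a HEAD request and resolve with the reported Content-Length (or null).
function headContentLength(imageUrl) {
  return new Promise((resolve, reject) => {
    const client = imageUrl.startsWith('https:') ? https : http;
    const req = client.request(imageUrl, { method: 'HEAD' }, (res) => {
      res.resume(); // nothing to read for HEAD, but release the socket
      resolve(parseContentLength(res.headers['content-length']));
    });
    req.on('error', reject);
    req.end();
  });
}

// Given [{ url, size }], drop entries of unknown size and sort largest first.
function rankBySize(entries) {
  return entries
    .filter((entry) => entry.size !== null)
    .sort((a, b) => b.size - a.size);
}

// Probe every image URL and return the candidates ordered by size.
async function rankImages(imageUrls) {
  const entries = await Promise.all(
    imageUrls.map(async (url) => ({
      url,
      // Treat a failed request as "size unknown" instead of aborting the scan.
      size: await headContentLength(url).catch(() => null),
    }))
  );
  return rankBySize(entries);
}
```

The first entry of the array resolved by `rankImages` would then be the "main image" candidate. Servers that omit Content-Length on HEAD responses simply drop out of the ranking.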

Problem courtesy of: Jungle Hunter

Solution

I can't be sure that it solves your problem, but you could look into using jsdom, which can fetch and execute the scripts in a page and gives you a DOM on the server side. Something like:

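// Uses the legacy jsdom API (jsdom.jsdom / createWindow).
var request = require('request'),
    jsdom = require('jsdom').jsdom;

// `url` is the address of the page to scrape.
request(url, function(err, response, body) {
  if (err) return console.error(err);

  // Resource options belong under `features` in this API.
  var doc = jsdom(body, null, {
    features: {
      FetchExternalResources: ['script', 'img'],
      ProcessExternalResources: ['script']
    }
  });
  var window = doc.createWindow();

  // External scripts load asynchronously, so images they inject
  // may not be present yet at this point.
  var images = doc.getElementsByTagName('img');
});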
Solution courtesy of: Linus Gustav Larsson Thiel
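
The snippet above relies on an older jsdom API (`createWindow`). As a rough sketch of the same idea with a more recent jsdom release, assuming its `JSDOM.fromURL` API with the `runScripts` and `resources` options, one could wait for the window's `load` event before reading the images. The helper names `loadRenderedDocument` and `collectImageUrls` are hypothetical.

```javascript
// Sketch only, assuming a recent jsdom release; helper names are hypothetical.

// Fetch a page, let its scripts run, and resolve with the document
// once the window's load event has fired.
async function loadRenderedDocument(pageUrl) {
  // Required lazily so collectImageUrls can be used without jsdom installed.
  const { JSDOM } = require('jsdom');
  const dom = await JSDOM.fromURL(pageUrl, {
    runScripts: 'dangerously', // executes the page's own scripts: trusted pages only
    resources: 'usable',       // fetch external scripts so they can run
  });
  const { window } = dom;
  if (window.document.readyState !== 'complete') {
    await new Promise((resolve) => window.addEventListener('load', resolve));
  }
  return window.document;
}

// Collect the absolute URLs of all <img> elements currently in the document.
function collectImageUrls(document, pageUrl) {
  return Array.from(document.getElementsByTagName('img'))
    .map((img) => img.getAttribute('src'))
    .filter(Boolean) // skip images without a src attribute
    .map((src) => new URL(src, pageUrl).href);
}
```

A call such as `collectImageUrls(await loadRenderedDocument(pageUrl), pageUrl)` would then yield the URLs to probe with HEAD requests. Images that a script inserts only after a timer or a later network response may still be missing when `load` fires, so some pages could need additional waiting.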




This post first appeared on Node.js Recipes.
