Scrape page after onload JS DOM injection
Problem
I'm building a scraper that picks out the main images on a page (based on Content-Length right now). It goes through all img elements and makes a HEAD request for each. But certain pages, especially mobile ones, insert images after page load, so they are missing from the fetched HTML. Any ideas on how to tackle this?
I'm using node.js.
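For reference, the approach described above (one HEAD request per image, ranked by Content-Length) could be sketched roughly as follows. This is only an illustration, assuming Node 18+ with the built-in fetch; the function names headContentLength, pickLargest and findMainImage are made up for the example.

```javascript
// Sketch only: assumes Node 18+ (global fetch). All function names are hypothetical.

// Ask the server for headers only and read Content-Length.
// Returns NaN when the header is missing or the request fails.
async function headContentLength(url) {
  try {
    const res = await fetch(url, { method: 'HEAD' });
    const header = res.headers.get('content-length');
    return header === null ? NaN : Number(header);
  } catch (err) {
    return NaN;
  }
}

// Pure helper: return the entry with the largest known size, or null.
function pickLargest(entries) {
  let best = null;
  for (const entry of entries) {
    if (!Number.isFinite(entry.size)) continue; // skip unknown sizes
    if (best === null || entry.size > best.size) best = entry;
  }
  return best;
}

// Issue one HEAD request per URL and return the largest image, or null.
async function findMainImage(imageUrls) {
  const entries = await Promise.all(
    imageUrls.map(async (url) => ({ url, size: await headContentLength(url) }))
  );
  return pickLargest(entries);
}
```

Servers are not required to send Content-Length on a HEAD response, which is why unknown sizes are skipped rather than treated as zero.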
Problem courtesy of: Jungle Hunter
Solution
I can't be sure that it solves your problem, but you could look into using jsdom: it can fetch and execute the scripts in a page, and it gives you a DOM on the server side. Something like:
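var request = require('request'),
    jsdom = require('jsdom').jsdom;
var url = 'http://example.com/'; // placeholder: the page to scrape

request(url, function(err, response, body) {
  if (err) return console.error(err);
  // Legacy jsdom API: resource options belong under `features`.
  var doc = jsdom(body, null, {
    url: url, // lets relative script and image URLs resolve
    features: {
      FetchExternalResources: ['script', 'img'],
      ProcessExternalResources: ['script'] // run the fetched scripts
    }
  });
  var window = doc.createWindow();
  // Scripts load asynchronously, so query the DOM after `load`.
  window.addEventListener('load', function() {
    var images = doc.getElementsByTagName('img');
    console.log(images.length + ' images found');
  });
});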
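Note that the jsdom calls above use an older API (jsdom(...) followed by createWindow()); more recent jsdom releases expose a JSDOM class instead. A rough equivalent, assuming a release that supports JSDOM.fromURL with the runScripts and resources options, might look like the sketch below. The helper names collectImageUrls, waitForLoad and scrapeImages and the 10-second timeout are assumptions for the example, not part of jsdom.

```javascript
// Sketch only: assumes a jsdom release that provides the JSDOM class.
// All function names here are hypothetical.

// Collect unique, non-empty image URLs from a DOM document.
function collectImageUrls(document) {
  const urls = Array.from(document.getElementsByTagName('img'))
    .map((img) => img.src)
    .filter(Boolean);
  return [...new Set(urls)];
}

// Resolve once the window's load event has fired, or after timeoutMs.
function waitForLoad(window, timeoutMs) {
  return new Promise((resolve) => {
    if (window.document.readyState === 'complete') return resolve();
    const timer = setTimeout(resolve, timeoutMs);
    window.addEventListener('load', () => {
      clearTimeout(timer);
      resolve();
    });
  });
}

async function scrapeImages(pageUrl) {
  // Required lazily so the helpers above work without jsdom installed.
  const { JSDOM } = require('jsdom');
  const dom = await JSDOM.fromURL(pageUrl, {
    runScripts: 'dangerously', // execute the page's scripts (untrusted code!)
    resources: 'usable',       // fetch external scripts so they can run
  });
  await waitForLoad(dom.window, 10000);
  const urls = collectImageUrls(dom.window.document);
  dom.window.close(); // stop any timers started by page scripts
  return urls;
}
```

Images that a page injects from a timer or an XHR callback after the load event may still be missed; in that case an extra delay before reading the DOM, or a real headless browser, is probably needed. Running a page's scripts also means running untrusted code inside your Node process, so use this with care.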
Solution courtesy of: Linus Gustav Larsson Thiel