
Zombie.js in node.js fails to scrape certain websites


Problem

The simple script below returns a bunch of rubbish. It works for most websites, but not for William Hill:

var Browser = require("zombie");

// Visit the page, wait for client-side scripts to settle, then dump the HTML
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  browser.wait(function () {
    console.log(browser.html());
  });
});

Run it with node, and the output looks like this:

S����J����ꪙRUݒ�kf�6���Efr2�Riz�����^��0�X� ��{�^�a�yp��p�����Ή��`��(���S]-��'N�8q�����/���?�ݻ��u;�݇�ׯ�Eiٲ>��-���3�ۗG�Ee�,��mF���MI��Q�۲������ڊ�ZG��O�J�^S�C~g��JO�緹�Oݎ���P����ET�n;v������v���D�tvJn��J�8'��햷r�v:��m��J��Z�nh�]�� ����Z����.{Z��Ӳl�B'�.¶D�~$n�/��u"�z�����Ni��"Nj��\00_I\00\��S��O�E8{"�m;�h��,o��Q�y��;��a[������c��q�D�띊?��/|?:�;��Z!}��/�wے�h�

(actual output is much longer)
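One way to tell what these bytes are is to look at the raw HTTP response. As a quick diagnostic (an addition here, not part of the original question), Node's built-in http module can fetch the same URL and print just the response headers; if the server reports content-encoding: gzip, the output above is compressed HTML rather than corrupted data:

var http = require("http");

// Fetch the same URL and inspect only the response headers
http.get("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function (res) {
  console.log(res.statusCode, res.headers["content-encoding"]);
  res.resume(); // discard the body; only the headers matter here
});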

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

Problem courtesy of: Hugh M Halford-Thompson

Solution

I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:

https://github.com/assaf/zombie/issues/251#issuecomment-5969175

He says: "Zombie will now send accept-encoding header to indicate it does not support gzip." In other words, the rubbish above is gzip-compressed HTML: the server compressed the response, and the version of Zombie I was using did not decompress it. Newer versions send an Accept-Encoding header telling the server not to compress.
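For anyone hitting the same symptom, here is a minimal workaround sketch (my addition, not from the linked issue, and assuming the server honours Accept-Encoding): request an uncompressed response with Accept-Encoding: identity, and fall back to decoding with Node's built-in zlib module if the body still comes back gzipped:

var http = require("http");
var zlib = require("zlib");

var options = {
  host: "sports.williamhill.com",
  path: "/bet/en-gb/betting/y/5/et/Football.html",
  headers: { "Accept-Encoding": "identity" } // ask for an uncompressed response
};

http.get(options, function (res) {
  var chunks = [];
  res.on("data", function (chunk) { chunks.push(chunk); });
  res.on("end", function () {
    var body = Buffer.concat(chunks);
    if (res.headers["content-encoding"] === "gzip") {
      // The server compressed the body anyway; inflate it before use
      zlib.gunzip(body, function (err, html) {
        if (err) throw err;
        console.log(html.toString("utf8"));
      });
    } else {
      console.log(body.toString("utf8")); // already plain HTML
    }
  });
});

With a zombie version that includes the fix above, the original script should work unchanged, since the browser now asks the server for uncompressed content itself.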

Thank you all who looked into this.

Solution courtesy of: Hugh M Halford-Thompson




