Zombie.js in node.js fails to scrape certain websites
Problem
The simple script below returns a bunch of rubbish. It works for most websites, but not for William Hill:
var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football page
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  browser.wait(function () {
    console.log(browser.html());
  });
});
Run with node. Output:
S����J����ꪙRUݒ�kf�6���Efr2�Riz�����^��0�X� ��{�^�a�yp��p�����Ή��`��(���S]-��'N�8q�����/���?�ݻ��u;�݇�ׯ�Eiٲ>��-���3�ۗG�Ee�,��mF���MI��Q�۲������ڊ�ZG��O�J�^S�C~g��JO�緹�Oݎ���P����ET�n;v������v���D�tvJn��J�8'��햷r�v:��m��J��Z�nh�]�� ����Z����.{Z��Ӳl�B'�.¶D�~$n�/��u"�z�����Ni��"Nj��\00_I\00\��S��O�E8{"�m;�h��,o��Q�y��;��a[������c��q�D�띊?��/|?:�;��Z!}��/�wے�h�
(actual output is much longer)
Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?
Thanks
Solution
I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:
https://github.com/assaf/zombie/issues/251#issuecomment-5969175
He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
Thank you all who looked into this.