Getting the page title from a scraped webpage

Problem

    var http = require('http');
    var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
    http.get(urlOpts, function (response) {
        response.on('data', function (chunk) {
            var str = chunk.toString();
            // Backslashes are doubled because the pattern is built from a string literal.
            var re = new RegExp("(<\\s*title[^>]*>(.+?)<\\s*/\\s*title)>", "g");
            console.log(str.match(re));
        });
    });

Output

    $ node app.js
    [ '<title>node.js</title>' ]
    null
    null

I only need to get the title.

Problem courtesy of: user1777212

Solution

I would suggest using RegExp.exec instead of String.match, because exec returns the capture groups along with the full match. You can also define the regular expression using the literal syntax, and only once:

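    var http = require('http');
    var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
    // Defined once, outside the callbacks. No g flag, so exec always searches
    // from the start of the string instead of from a remembered lastIndex.
    var re = /(<\s*title[^>]*>(.+?)<\s*\/\s*title)>/i;
    http.get(urlOpts, function (response) {
        response.on('data', function (chunk) {
            var str = chunk.toString();
            var match = re.exec(str);
            // match[1] is the tag plus its text, match[2] is the title text alone.
            if (match && match[2]) {
                console.log(match[2]);
            }
        });
    });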
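As a rough illustration of why exec helps here, the sketch below compares the two calls on a made-up HTML string. The variable names and the sample input are assumptions for the example, not part of the original answer. It also shows where the null lines in the question's output come from: any chunk without a title produces null.

```javascript
// Made-up input and title-matching patterns written for this sketch.
var html = '<head><title>node.js</title></head>';
var globalRe = /<\s*title[^>]*>(.+?)<\s*\/\s*title\s*>/gi;
var plainRe = /<\s*title[^>]*>(.+?)<\s*\/\s*title\s*>/i;

// With the g flag, match() returns the full matches only; capture groups are dropped.
var viaMatch = html.match(globalRe);
console.log(viaMatch); // [ '<title>node.js</title>' ]

// exec() exposes the groups: index 0 is the full match, index 1 the title text.
var viaExec = plainRe.exec(html);
console.log(viaExec[1]); // node.js

// A chunk that contains no title yields null.
var noTitle = '<p>body text</p>'.match(globalRe);
console.log(noTitle); // null
```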

This code still assumes that the whole title arrives in a single chunk. A response can be split at any point, so it is safer to accumulate the chunks and search the accumulated text, in case the title straddles a chunk boundary. You may also want to stop looking once you have found the title.
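One possible way to do both is sketched below. The helper names createTitleFinder and fetchTitle are hypothetical, introduced only for this example. The pattern uses [\s\S] so that a title containing line breaks still matches, which is an assumption that goes beyond the original answer.

```javascript
var http = require('http');

// Non-global pattern, so exec() always searches from the start of the buffer.
var TITLE_RE = /<\s*title[^>]*>([\s\S]*?)<\s*\/\s*title\s*>/i;

// Hypothetical helper: returns a function that accepts chunks one at a time
// and returns the title once a complete <title> element has been seen,
// or null while it is still incomplete.
function createTitleFinder() {
    var buffer = '';
    return function feed(chunk) {
        buffer += chunk.toString();
        var match = TITLE_RE.exec(buffer);
        return match ? match[1].trim() : null;
    };
}

// Hypothetical wrapper: calls callback(err, title) exactly once.
// title is null if the response ends without a title.
function fetchTitle(urlOpts, callback) {
    var done = false;
    function finish(err, title) {
        if (done) { return; }
        done = true;
        callback(err, title);
    }
    var request = http.get(urlOpts, function (response) {
        var feed = createTitleFinder();
        // Decode as UTF-8 so multi-byte characters are not split across chunks.
        response.setEncoding('utf8');
        response.on('data', function (chunk) {
            var title = feed(chunk);
            if (title !== null) {
                finish(null, title);
                response.destroy(); // stop downloading the rest of the page
            }
        });
        response.on('end', function () { finish(null, null); });
    });
    request.on('error', function (err) { finish(err, null); });
}
```

Calling fetchTitle(urlOpts, function (err, title) { console.log(err || title); }) with the urlOpts object from the answer would then print the title or the error. The buffer grows until the title is found, so a size cap may be worth adding for very large pages.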

Solution courtesy of: bdukes


This post first appeared on Node.js Recipes.
