How to convert HTML page to plain text in node.js?

Problem

I know this has been asked before, but I can't find a good answer for Node.js.

I need to extract the plain text (no tags, scripts, etc.) server-side from an HTML page that has been fetched.

I know how to do it client-side with jQuery (get the .text() contents of the body tag), but I do not know how to do this on the server side.

I've tried https://npmjs.org/package/html-to-text, but it doesn't handle scripts.

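    var htmlToText = require('html-to-text');
    var request = require('request');

    // url is the address of the page to fetch
    request.get(url, function (error, result) {
        if (error) {
            return console.error(error);
        }
        // Convert the fetched HTML to text, wrapping lines at 130 characters
        var text = htmlToText.fromString(result.body, {
            wordwrap: 130
        });
        console.log(text);
    });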

I've also tried PhantomJS, but couldn't find a way to get just the plain text.

Problem courtesy of: metalaureate

Solution

Use jsdom and jQuery (server-side).

With jQuery you can remove all scripts, styles, templates, and the like, and then extract the text.

Example

(This is not tested with jsdom and node, only in Chrome)

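    // Remove elements whose contents are not visible page text
    jQuery('script, style, noscript, template').remove();
    // Read the remaining text and collapse runs of whitespace
    jQuery('body').text().replace(/\s{2,}/g, ' ');

Solution courtesy of: hgoebl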
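
For readers who want to try the same idea without installing jsdom or jQuery, here is a minimal, dependency-free sketch. It is not the approach from the answer above: it swaps the DOM for plain regular expressions, which is much cruder and will misread malformed markup or attribute values that contain ">". The names htmlToPlainText, NON_TEXT_ELEMENTS and ENTITIES are hypothetical and chosen only for this illustration. Like the answer, it drops script, style, noscript and template content, then collapses whitespace; it also drops the head element so that only body text remains.

```javascript
'use strict';

// Elements whose contents are not visible page text (assumed list).
// 'head' comes last so scripts and styles inside it are already gone.
const NON_TEXT_ELEMENTS = ['script', 'style', 'noscript', 'template', 'head'];

// A handful of common named character references; a real decoder knows many more.
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Decode one character reference, leaving unknown ones untouched.
function decodeEntity(match, ref) {
  if (ref[0] === '#') {
    const hex = ref[1] === 'x' || ref[1] === 'X';
    const code = parseInt(ref.slice(hex ? 2 : 1), hex ? 16 : 10);
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  }
  return Object.prototype.hasOwnProperty.call(ENTITIES, ref) ? ENTITIES[ref] : match;
}

// Hypothetical helper: crude HTML-to-text conversion using regular expressions.
function htmlToPlainText(html) {
  // Drop HTML comments first.
  let text = html.replace(/<!--[\s\S]*?-->/g, ' ');

  // Drop each non-text element together with everything inside it.
  for (const name of NON_TEXT_ELEMENTS) {
    const element = new RegExp('<' + name + '\\b[\\s\\S]*?</' + name + '\\s*>', 'gi');
    text = text.replace(element, ' ');
  }

  // Replace every remaining tag with a space so adjacent blocks do not run together.
  text = text.replace(/<[^>]*>/g, ' ');

  // Decode character references, then collapse whitespace.
  text = text.replace(/&(#\d+|#x[0-9a-f]+|[a-z]+);/gi, decodeEntity);
  return text.replace(/\s+/g, ' ').trim();
}

const sample = '<body><h1>Hello</h1><script>var x = 1;</script><p>Tom &amp; Jerry</p></body>';
console.log(htmlToPlainText(sample)); // prints "Hello Tom & Jerry"
```

One design choice to be aware of: tags are replaced with a space rather than removed outright, so paragraphs stay separated, but a word split by inline markup (for example "<b>bo</b>ld") comes out as two words. A DOM-based approach such as the jsdom and jQuery one above avoids that kind of problem.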


This post first appeared on Node.js Recipes.
