Extract only javascript from a script tag

Problem

I would like to extract only the JavaScript from script tags in an HTML document so that I can pass it to a JS parser such as esprima. I am using Node.js to write this application and already have the content of each script tag as a string. The problem is that the extracted JavaScript sometimes contains HTML comments, which I want to remove.
should be converted to var a
A simple removal of <!-- and --> does not work, since it fails in the case 0); --> where it also removes the middle -->
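To make the failure concrete, here is a small sketch. The script body below is a hypothetical sample, chosen so that `-->` also occurs inside the code itself (JavaScript reads `a-->0` as `a-- > 0`); it is not taken from a real page, and `naiveStrip` is just an illustrative name.

```javascript
// Hypothetical script body: the JS is wrapped in an HTML comment, and the
// code itself contains "-->" (which JavaScript parses as `a-- > 0`).
var scriptBody = "<!--\nvar a = 3;\nwhile (a-->0);\n-->";

// Naive approach: delete every "<!--" and "-->" wherever it appears.
function naiveStrip(text) {
    return text.replace(/<!--|-->/g, "");
}

var stripped = naiveStrip(scriptBody);

// The wrapper is gone, but so is the "-->" in the loop condition,
// which now reads `while (a0);` and no longer means the same thing.
console.log(JSON.stringify(stripped));
```

The wrapper markers and the operator sequence look identical to a global search-and-replace, which is why position (start or end of the script body versus the middle) has to be taken into account.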
I would also like to remove identifiers like [if !IE] and [endif] which are sometimes found inside script tags. I would also like to extract the JS inside CDATA segments.
should be converted to var a
Is all this possible using a regex or is something more required?
In short I would like to sanitize the JS from script tags so that I can safely pass it into a parser like esprima.
Thanks!

EDIT:
Based on @user568109's answer, this is the rough code that handles HTML comments and CDATA segments inside script tags:

var htmlparser = require("htmlparser2");

// `input` is assumed to hold the HTML document as a string.
var jstext = '';
var inScript = false;
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs) {
        if (name === "script" && attribs.type === "text/javascript") {
            // Start collecting a new script body.
            inScript = true;
            jstext = '';
        }
    },
    ontext: function(text) {
        // Only collect text inside a script tag.
        // In xmlMode, CDATA contents also arrive here as text.
        if (inScript) {
            jstext += text;
        }
    },
    onclosetag: function(tagname) {
        if (tagname === "script" && inScript) {
            console.log(jstext);
            inScript = false;
            jstext = '';
        }
    },
    oncomment: function(data) {
        // Keep JS that was wrapped in <!-- ... --> inside a script tag.
        if (inScript) {
            jstext += data;
        }
    }
}, {
    xmlMode: true
});
parser.write(input);
parser.end();
Problem courtesy of: everconfusedGuy

Solution

That is the job of a parser. See htmlparser2, or esprima itself. Please don't use regex to parse HTML. It is seductive, but you will waste precious time and effort trying to match more and more tag variations.

An example from the htmlparser2 page:

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            console.log("JS! Hooray!");
        }
    },
    ontext: function(text){
        console.log("-->", text);
    },
    onclosetag: function(tagname){
        if(tagname === "script"){
            console.log("That's it?!");
        }
    }
});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</script>");
parser.end();

Output (simplified):

--> Xyz 
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!

It will give you all the tags: divs, comments, scripts, etc. You would still have to validate the script inside the comments yourself. Also, CDATA is only a valid construct in XML (XHTML), so in its default HTML mode htmlparser2 would report it as a comment; you would have to check those too.
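As a rough sketch of that manual check, the function below strips wrapper markers only when they sit at the very start or end of an already-extracted script body, so a `-->` in the middle of the code is left alone. `unwrapScriptBody` is a hypothetical helper, not part of htmlparser2 or esprima, and the markers it recognizes (HTML comment, CDATA, and conditional-comment markers such as `[if !IE]`) are assumptions about what the input looks like; real pages may need more cases.

```javascript
// Hypothetical helper: remove wrapper markers from the edges of a script body.
// It only touches the start and end of the string, never the middle.
function unwrapScriptBody(text) {
    var body = text.trim();
    var previous;
    do {
        previous = body;
        body = body
            // Opening CDATA marker, optionally hidden behind "//".
            .replace(/^(\/\/\s*)?<!\[CDATA\[/, "")
            // Opening HTML comment, optionally followed by "[if ...]>".
            .replace(/^<!--(\s*\[if[^\]]*\]>)?/, "")
            // Closing CDATA marker, optionally hidden behind "//".
            .replace(/(\/\/\s*)?\]\]>$/, "")
            // Closing HTML comment, optionally preceded by "//" or "<![endif]".
            .replace(/(\/\/\s*)?(<!\[endif\])?-->$/, "")
            .trim();
        // Repeat in case wrappers are nested (e.g. a comment around CDATA).
    } while (body !== previous);
    return body;
}

console.log(unwrapScriptBody("<!--\nvar a;\n-->"));        // var a;
console.log(unwrapScriptBody("<![CDATA[\nvar a;\n]]>"));   // var a;
// The "-->" inside the loop condition is preserved:
console.log(unwrapScriptBody("<!--\nvar a = 3;\nwhile (a-->0);\n-->"));
```

This is pattern matching on a string that the HTML parser has already isolated, not an attempt to parse HTML with regex. The cleaned string can then be handed to a JS parser such as esprima, which will still reject anything that is not valid JavaScript.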

Solution courtesy of: user568109

This post first appeared on Node.js Recipes.
