
Analyze your text in R (MicrosoftML)

The “MicrosoftML” package is powerful and is available in recent versions of Microsoft R Server and R Client. Here I describe how to use its text featurization capabilities with a simple sentiment example.

Note: Sorry, but for now it is available on Windows only (not Linux). Please wait for an update.

You can apply text featurization to sentiment analysis, spam filtering (classification), social analytics, and so on. The currently supported languages are English, French, German, Dutch, Italian, Spanish and Japanese (great!). (See MSDN for the latest update.)

Sample Data Set

Here I use a sample dataset of Amazon book reviews, which has 975,194 records, each with a rating and a free-text comment, as follows. (see here for this dataset)

RATING  REVIEW_TEXT
2.0     This book has its good ...
2.0     The fatalistic view of ...
1.0     I was intrigued by the ...
2.0     I admit, I haven't finished ...
...

Train with text featurization

Now let’s look at sample code that uses text featurization.

In MicrosoftML (MML), text featurization is one of the transforms applied during training. rxFastLinear is the function for linear regression (or binary classification), and it is newly introduced in the MML package. (It is faster than ever.)
Before running the linear regression, we featurize the REVIEW_TEXT string and store the featurized data in a column named “Features” using the transforms. (Not all words are kept. In this example, words that are used fewer than 500 times are ignored.)
The rxFastLinear function then takes this featurized data as input to the regression algorithm and creates a model that maps an input REVIEW_TEXT string to a numeric RATING output.
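To make the "ignore rare words" step concrete, here is a minimal base R sketch of a count-based filter on a toy corpus. The reviews and the threshold of 2 are made up for illustration, and the counting rule (total occurrences across all reviews) is an assumption taken from the description above; the package's own rule may differ, for example by counting rows rather than occurrences.

```r
# Toy reviews standing in for REVIEW_TEXT (made-up data, not the Amazon set).
reviews <- c(
  "good book good story",
  "bad book",
  "good plot bad ending"
)

# Lowercase and split on whitespace to get one token vector per review.
tokens <- strsplit(tolower(reviews), "\\s+")

# Count how often each word occurs across the whole corpus.
term_counts <- table(unlist(tokens))

# Keep only words seen at least min_count times (the article uses 500).
min_count <- 2
kept_terms <- names(term_counts)[term_counts >= min_count]
print(kept_terms)
```

With a real corpus the same idea removes the long tail of words that occur too rarely to get a reliable coefficient.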

Real applications need more tuning; this is a deliberately straightforward example to aid understanding. (For example, the model below will work poorly when predicting very low or very high ratings, and its predictions can exceed the 5.0 rating.)

library(MicrosoftML)

# read data (you can also use RxTextData)
data 
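The listing above is cut off after `data`, so the original training code is not shown here. As a stand-in, the following is a minimal sketch of how the steps described earlier (featurize REVIEW_TEXT into "Features" with 2-grams, drop terms below a count of 500, then fit rxFastLinear as a regression) might be written. It is not the author's code: the file name `amazon_review.tsv`, the tab-separated layout, the helper name `train_rating_model`, and every parameter not mentioned in the text are assumptions.

```r
# Hypothetical helper: fits a rating regression on a data frame that has a
# numeric RATING column and a character REVIEW_TEXT column.
train_rating_model <- function(train, ngram_length = 2, min_count = 500) {
  MicrosoftML::rxFastLinear(
    RATING ~ Features,
    data = train,
    type = "regression",
    mlTransforms = list(
      # Turn the raw text into n-gram features stored in "Features".
      MicrosoftML::featurizeText(
        vars = c(Features = "REVIEW_TEXT"),
        language = "English",
        wordFeatureExtractor = MicrosoftML::ngramCount(ngramLength = ngram_length)
      ),
      # Drop features that fall below the count threshold.
      MicrosoftML::selectFeatures("Features", mode = MicrosoftML::minCount(min_count))
    )
  )
}

# Assumed input: a tab-separated file with no header (hypothetical path).
input_path <- "amazon_review.tsv"
can_train <- requireNamespace("MicrosoftML", quietly = TRUE) && file.exists(input_path)

if (can_train) {
  train <- read.delim(input_path, header = FALSE, quote = "",
                      col.names = c("RATING", "REVIEW_TEXT"),
                      stringsAsFactors = FALSE)
  model <- train_rating_model(train)
  print(summary(model))
} else {
  message("MicrosoftML or the input file is not available; skipping training.")
}
```

The guard keeps the script from failing on machines without MicrosoftML or without the data file.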

The summary function shows the coefficients, and you can see many negative and positive words (which I surrounded with a green rectangle) in the output.
In this example, we use 2 as the n-gram value for the text featurization (see the source code above), so word pairs like “not|disappointed” are also estimated correctly.
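To show why an n-gram value of 2 lets a pair such as “not|disappointed” become its own feature, here is a small base R sketch of word n-gram extraction. The helper `word_ngrams` is hypothetical and written only for this illustration; it is not how MicrosoftML tokenizes internally, and the sample sentence is made up.

```r
# Hypothetical helper: returns all word n-grams of exactly length n,
# with the words of each n-gram joined by "|".
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) {
    return(character(0))
  }
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = "|"),
    character(1)
  )
}

review <- "I was not disappointed"
unigrams <- word_ngrams(review, 1)
bigrams <- word_ngrams(review, 2)
print(unigrams)
print(bigrams)
```

With single words only, "not" and "disappointed" would each get a separate weight. The 2-gram gives the model a separate feature for the pair, so it can learn that the combination is positive.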

Predict (Scoring)

Now let’s predict some input text using the generated model.
The following scores three input texts, and the screenshot shows the scored (predicted) rating results.

pred1 
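The scoring listing above is also cut off. As a hedged sketch of what scoring could look like, the code below wraps a call to rxPredict in a hypothetical helper, `score_reviews`. It assumes that a fitted model named `model` already exists in the session, that rxPredict is on the search path once MicrosoftML is loaded, and that the new data only needs a REVIEW_TEXT column. The three sample sentences are made up.

```r
# Hypothetical helper: scores raw review texts with a fitted MicrosoftML model.
score_reviews <- function(model, texts) {
  new_data <- data.frame(REVIEW_TEXT = texts, stringsAsFactors = FALSE)
  # extraVarsToWrite copies the input text next to the predicted score.
  rxPredict(model, data = new_data, extraVarsToWrite = "REVIEW_TEXT")
}

# Made-up inputs for illustration.
sample_texts <- c(
  "I loved this book and could not put it down",
  "The plot was boring and I was disappointed",
  "It was fine, but not as good as the first volume"
)

if (requireNamespace("MicrosoftML", quietly = TRUE) && exists("model")) {
  library(MicrosoftML)
  print(score_reviews(model, sample_texts))
} else {
  message("No fitted MicrosoftML model is available; skipping scoring.")
}
```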

The next sample predicts ratings for 700 new records and plots the relationship between the actual and predicted ratings.

test 
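Once predictions for a test set are available, the comparison of actual and predicted ratings can be sketched in base R as follows. The helper `compare_ratings` and the five data points are made up for illustration; with real results you would pass the RATING column and the score column returned by the scoring step instead.

```r
# Hypothetical helper: summarizes how far predictions are from actual ratings.
compare_ratings <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  errors <- predicted - actual
  list(
    rmse = sqrt(mean(errors^2)),  # root mean squared error
    mae = mean(abs(errors)),      # mean absolute error
    correlation = cor(actual, predicted)
  )
}

# Made-up values standing in for the 700 test records.
actual <- c(1, 2, 3, 4, 5)
predicted <- c(1.5, 2.0, 2.5, 4.5, 4.0)

metrics <- compare_ratings(actual, predicted)
print(metrics)

# Scatter plot of actual vs. predicted; points on the dashed line are exact.
plot(actual, predicted, xlab = "Actual rating", ylab = "Predicted rating")
abline(0, 1, lty = 2)
```

Points far from the dashed diagonal at the low and high ends would show the weakness on extreme ratings mentioned earlier.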

I will show some social analytics examples (with real data) using the MicrosoftML package at an upcoming event (in Japan).

[Reference]

MSDN : Introduction to MicrosoftML
https://msdn.microsoft.com/en-us/microsoft-r/microsoftml-introduction

Revolution Analytics Blog : Building a machine learning model with the MicrosoftML package
http://blog.revolutionanalytics.com/2017/01/microsoftml-taxi-trips.html
