October 23rd 2017

Overview

Data scientists are using Python and R to solve data problems due to the ready availability of these packages. These languages are often limited as the data is processed on a single machine, where the movement of data from the development environment to production environment is time-consuming and requires extensive re-engineering.

To overcome this problem, Spark provides a powerful, unified engine that is both fast (100x faster than Hadoop for large-scale data processing) and easy to use by the data scientists and data engineers. It is simple, scalable, and easy to integrate with other tools.

Seahorse, a scalable data analytics workbench, allows the data scientists to visually build Spark applications. It allows the data scientists to perform data preparation, data transformation, data modeling, data training, data analysis, and data visualization collaboratively. Seahorse has built-in operations to allow the data scientists to customize parameter values.

In this blog, let us discuss predicting loan default of Lending Club. Lending Club is the world’s largest online marketplace to connect borrowers and investors.

Pre-requisites

VirtualBox (version 5.0.10)
Vagrant (version 1.8.1)
Google Chrome (60.0.3112.113)

Data Description

Loan data of Lending Club, from 2007-2011, with 40K records is used the source file. Each loan has more than 100 characteristics of loan and borrower.

Use Case

Analyze loan data of Lending Club.
Predict loan default in Lending Club dataset by building data model using Logistic Regression.

Loan status falls under two categories such as Charged Off (default loan) and Fully Paid (desirable loan). Lending Club defines Charged Off loans as loans that are non-collectable and the lender has no hope of recovering money.

Synopsis

Read Data from Source
Prepare Data
Train and Evaluate Data Model
Visualize Data

Workflow Operations

In Seahorse, all the machine learning processes are made as operations. R Transformation operations are used to clean and prepare the data. The operations used for Lending Club loan data analysis are as follows:

Input / Output – Read DataFrame
Action – Fit, Transform, Evaluate
Set Operation – Split
Filtering – Filter Columns, Handle Missing Values
Transformation – SQL Transformation, R Transformation
Feature Conversion – String Indexer, One Hot Encoder, Assemble Vector
Machine Learning – Logistic Regression from Classification, Binary Classification Evaluator from Evaluation

Reading Data from Source

Seahorse supports three different file formats such as CSV, Parquet, and JSON from different types of data sources such as HDFS, Database, Local, and Google Spreadsheets. Read DataFrame operation is used to read the files from the data sources and upload it into Seahorse library.

Preparing Data

To prepare the data for analysis, perform the following:

Remove irrelevant data (loan ID, URL, and so on), poorly documented data (average current balance), and less important features (payment plan, home state) from the source data.
Use Filter Columns operation to select 17 key features from the dataset as shown in the below diagram:

Use R Transformation operation to write any custom function in R.
Convert string columns into numeric columns by removing special characters and duplicate data.
For Example, convert int_rate and revol_util columns into numeric by removing special characters (%).

transform

Derive new features from the date columns by applying feature engineering.
For example, derive issue_month and issue_year from issue_d feature and for earliest_cr_line feature.

transform The derived features are shown in the below diagram:

After Preprocessing

After preprocessing, perform the following:

Use Handle Missing Values operation to find the rows with missing values and to handle them with the selected strategy such as remove row, remove column, custom value, and mode.
For example, provide custom values for NAs and empty string.
Select numeric and string columns from the DataFrame and select remove row as strategy as shown in the below diagram:

Use String Indexer to map the categorical features into numbers.
Choose the columns from the DataFrame using name, index, or type.
Select string type columns from the DataFrame and apply string indexer to those columns.
For example, after the String Indexer execution, Fully Paid will become 0 and Charge Off will become 1 in loan_status column.
Use One Hot Encoder operation to convert categorical values into numbers in a fixed range of values.
A vector will be produced in each column corresponding to one possible value of the feature.

Use Assemble Vector operation to group all relevant columns together and to form a column with a single vector of all the features.
For example, the loan_status column is prediction variable and all other columns are features.
Use excluding mode to select all the columns other than the prediction variable.

Training and Evaluating Data Model

To split the dataset into training set and validation set using Split operation based on split ratio, perform the following:

Use 0.7 as a split ratio to split 70 percentage of data in the training set and 30 percentage of data in the validation set.

Use Logistic Regression and Fit operations to perform model training.
Use Fit operation to fit on estimator so as to produce Transformer.
In Fit Operation, select features columns and prediction variable.
Select maximum iterations and threshold value for the model.
Fit operation provides prediction variable with predicted values and confidence score in raw prediction and probability columns.

Use Evaluate action with Binary Classification Evaluator to find the performance of the model.
Find AUC, F-Score, and Recall values from the Binary Classification Evaluator and select AUC as a metric for the model.

Use custom functions (R or Python Transformation) to find the confusion matrix of the model and derive the metrics for that model.
Use SQL Transformation to write custom Spark SQL query and to get correctly predicted values and wrongly predicted values from the DataFrame.

Visualizing Data

DataFrame Report

In DataFrame Report, every column has some plots based on the datatype.

Int_rate Column Visualization

For Continuous features, the bar chart is used for data visualization as shown in the below diagram:

Grade Column Visualization

For Discrete features, the pie chart is used for data visualization as shown in the below diagram:

To create a custom plot like the combination of two column values, use custom operations such as R, Python, SQL Transformation or Python or R Notebook.

References

Sample Loan Data
Seahorse Overview
Getting started with Seahorse

The Ultimate Gaming Experience: Explo…
best projectors for home
Key Tips for Renting an Electric Vehi…
Another Appliance Company Now Files A…

This post first appeared on Front-end Code Review & Validation Tools | Treselle Systems, please read the originial post: here

People also like

The Ultimate Gaming Experience: Exploring Xbox in the Cloud

best projectors for home

Key Tips for Renting an Electric Vehicle in Europe

Another Appliance Company Now Files An Unexpected Bankruptcy

Predict Lending Club Loan Default Using Seahorse and SparkR

Overview

Pre-requisites

Data Description

Use Case

Workflow Operations

Reading Data from Source

Preparing Data

After Preprocessing

Training and Evaluating Data Model

Visualizing Data

References

Related Articles

Predict Lending Club Loan Default Using Seahorse and SparkR

Overview

Pre-requisites

Data Description

Use Case

Workflow Operations

Reading Data from Source

Preparing Data

After Preprocessing

Training and Evaluating Data Model

Visualizing Data

References

Related Articles

Share the post

Subscribe to Front-end Code Review & Validation Tools | Treselle Systems

Thank you for your subscription