Learning Logistic Regression in R

Logistic Regression is a classification technique that predicts a categorical outcome from one or more input variables. For binary outcomes such as 0/1, TRUE/FALSE or YES/NO, we can use a Binomial Logistic Regression Model. For outcomes with more than two categories, for example classifying films as "Horror", "Drama" or "Romantic", there is the Multinomial Logistic Regression Model, which we can look at another time. Let's first concentrate on the Binomial Model.
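To make the idea concrete, here is a minimal sketch of how a binomial logistic model turns a linear score (the log-odds) into a probability and then into a class. The example scores and the 0.5 cutoff are assumptions chosen for illustration, not values from the Titanic analysis.

```r
# Illustrative log-odds values (hypothetical, not fitted from data)
log_odds <- c(-2, 0, 2)

# The logistic function maps any real number to a probability in (0, 1)
prob <- 1 / (1 + exp(-log_odds))

# Base R ships the same function as plogis()
same_as_plogis <- isTRUE(all.equal(prob, plogis(log_odds)))

# Assumed cutoff of 0.5: probabilities above it are classified as 1
predicted_class <- ifelse(prob > 0.5, 1, 0)

print(round(prob, 3))
print(predicted_class)
```

A log-odds of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is a common (but not mandatory) cutoff.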
Problem:
Create a Logistic Regression Model Example in R
Solution:
Below are the steps that we will follow to create our first Logistic Regression Model.
1. Pick a DataSet:
For this Example, we are going to use the Kaggle Titanic Datasets. There are two Datasets "Train.csv" and "Test.csv". We are going to build a Logistic Regression Model using the Training Set. Then we will use the Model to predict Survival Probability for each passenger in the Test Dataset. 
Below is a brief description of the Data:
Variable   Definition                                     Values
survival   Survival                                       0 = No, 1 = Yes
pclass     Ticket class                                   1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex                                            Male/Female
Age        Age in years
sibsp      No. of siblings / spouses aboard the Titanic
parch      No. of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                            C = Cherbourg, Q = Queenstown, S = Southampton
2. Load Data:
First, we will load the "Train.csv" and "Test.csv" data files using the function read.csv(). While loading them, we will pass na.strings = "" so that empty fields in the data are read as NA. We will then bind these two datasets into one for further Data Cleaning.
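The loading step might look like the sketch below. So that it runs anywhere, it writes two tiny stand-in CSV files with made-up rows to temporary paths; with the real Kaggle files you would point train_path and test_path at the downloaded CSVs instead. The variable names (train_path, test_path, full) are assumptions for illustration.

```r
# Tiny stand-in files with made-up rows; replace these paths with the
# real Kaggle CSV files when running the actual analysis.
train_path <- tempfile(fileext = ".csv")
test_path  <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Survived,Pclass,Sex,Age,Embarked",
             "1,0,3,male,22,S",
             "2,1,1,female,38,C",
             "3,1,3,female,,"), train_path)
writeLines(c("PassengerId,Pclass,Sex,Age,Embarked",
             "4,3,male,34.5,Q",
             "5,2,female,,S"), test_path)

# na.strings = "" makes empty fields load as NA
train <- read.csv(train_path, na.strings = "", stringsAsFactors = FALSE)
test  <- read.csv(test_path,  na.strings = "", stringsAsFactors = FALSE)

# The test file has no Survived column, so add it as NA before binding
test$Survived <- NA
full <- rbind(train, test[, names(train)])

# Missing values per column in the combined data
print(colSums(is.na(full)))
```

Keeping Survived as NA for the test rows also gives a simple way to split the combined data back into the two sets later.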


3. Data Cleaning:

To keep this Example simple, we are not going to add any new features. We will do the following cleanups instead:
  • Remove a few variables which may not be beneficial for our analysis.
  • Check for the NA values and replace the NA values with some meaningful data.
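The two cleanup steps above can be sketched as follows. The post does not say which columns are dropped or how NA values are filled, so the choices here (dropping Ticket and Cabin, filling Age with the median and Embarked with the most frequent port) are assumptions, and the data frame is a small made-up stand-in for the combined Titanic data.

```r
# Made-up stand-in for the combined Titanic data
full <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 0, NA, NA),
  Pclass      = c(3, 1, 3, 2, 3, 1),
  Sex         = c("male", "female", "female", "male", "male", "female"),
  Age         = c(22, 38, NA, 35, NA, 47),
  Ticket      = c("T1", "T2", "T3", "T4", "T5", "T6"),
  Cabin       = c(NA, "C85", NA, "C123", NA, NA),
  Embarked    = c("S", "C", NA, "S", "Q", "S"),
  stringsAsFactors = FALSE
)

# Step 1: see how many NA values each column has
print(colSums(is.na(full)))

# Step 2 (assumed choice): drop columns that are unlikely to help
full <- full[, !(names(full) %in% c("Ticket", "Cabin"))]

# Step 3 (assumed choice): fill missing Age with the median age
full$Age[is.na(full$Age)] <- median(full$Age, na.rm = TRUE)

# Step 4 (assumed choice): fill missing Embarked with the most common port
most_common_port <- names(which.max(table(full$Embarked)))
full$Embarked[is.na(full$Embarked)] <- most_common_port
```

Survived is left untouched on purpose: its NA values mark the test rows rather than missing data.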
4. Create a Model:
To create our first Logistic Regression Model, we need to split the combined dataset back into the Train and Test sets we started with. We will then fit the model on the Train set using glm() (Generalized Linear Model) with family = binomial.
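A minimal sketch of this step is shown below. It uses simulated data in place of the cleaned Titanic data so that it is self-contained; the simulated effects, the choice of predictors in the formula and the trick of marking test rows with Survived = NA are assumptions for illustration.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the cleaned, combined data
full <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
log_odds <- 3 - 0.9 * full$Pclass - 2.5 * (full$Sex == "male") - 0.03 * full$Age
full$Survived <- rbinom(n, 1, plogis(log_odds))

# Pretend the last 100 rows came from the test file (no known outcome)
full$Survived[201:300] <- NA

# Split back into Train and Test using the missing outcome as the marker
train <- full[!is.na(full$Survived), ]
test  <- full[is.na(full$Survived), ]

# Fit the binomial logistic regression on the Train set only
model <- glm(Survived ~ Pclass + Sex + Age, data = train, family = binomial)
print(summary(model))
```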

5. Understand The Model:
Let's look at the Coefficients of the above Model and try to understand the meaning of the predictor variables.

  • It looks like the variables "Pclass", "Sex" and "Age" are highly significant in the above Model. The predictor "SibSp" is also significant, but "Parch" and "Embarked" have little or no significance in this Model.
  • The Estimate column is negative for all the variables. A negative estimate lowers the log-odds of survival, so, for example, with all other variables being equal, a MALE passenger is less likely to have survived.
  • The P-value is lowest for the Age variable, which signifies a strong association with the probability of survival.
  • The AIC value measures the quality of the Model by balancing goodness of fit against the number of parameters. When comparing Models fitted to the same data, the preferred Model is the one with the minimum AIC value.
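The quantities discussed above can be pulled out of a fitted model as in this sketch. It fits a small model on simulated data, so the numbers will not match the Titanic output; the data and the second, smaller model used for the AIC comparison are assumptions for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-in data
d <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
d$Survived <- rbinom(n, 1, plogis(1.5 - 2.5 * (d$Sex == "male") - 0.02 * d$Age))

model <- glm(Survived ~ Sex + Age, data = d, family = binomial)

# Coefficient table: Estimate, Std. Error, z value and P-value per predictor
coefs <- coef(summary(model))
print(coefs)

# Estimates are on the log-odds scale; exponentiating gives odds ratios.
# An odds ratio below 1 corresponds to a negative estimate.
odds_ratios <- exp(coef(model))
print(odds_ratios)

# Compare AIC across models fitted to the same data (lower is preferred)
smaller_model <- glm(Survived ~ Sex, data = d, family = binomial)
aic_table <- AIC(model, smaller_model)
print(aic_table)
```

AIC values are only meaningful relative to each other, so the comparison needs both models to be fitted on the same rows.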
6. Predict Survival for Test Data:
Now, let's use the above Model to predict survival in the Test DataSet. To do this, we use the predict() function.
If we check the summary of our prediction, the minimum value is about 0 and the maximum is about 1. That is expected: we are calculating the probability of survival for each passenger, which always lies between 0 and 1. In this example, the Model predicts that 156 passengers survived and 262 did not.
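A sketch of the prediction step follows, again on simulated data so that it runs on its own. Note that predict() on a glm returns log-odds by default; type = "response" is what returns probabilities. The 0.5 cutoff used to turn probabilities into 0/1 predictions is an assumption, since the post does not state which cutoff was used.

```r
set.seed(7)
n <- 300

# Simulated stand-in data
d <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
d$Survived <- rbinom(n, 1, plogis(3 - 0.9 * d$Pclass - 2.5 * (d$Sex == "male") - 0.03 * d$Age))

train <- d[1:200, ]
test  <- d[201:300, names(d) != "Survived"]  # test set has no outcome column

model <- glm(Survived ~ Pclass + Sex + Age, data = train, family = binomial)

# type = "response" returns probabilities instead of log-odds
test$SurvivalProb <- predict(model, newdata = test, type = "response")
print(summary(test$SurvivalProb))

# Assumed cutoff: classify as survived when the probability exceeds 0.5
test$Survived <- ifelse(test$SurvivalProb > 0.5, 1, 0)
print(table(test$Survived))
```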
The predictions and the accuracy of the Model can be improved by engineering some new features and by experimenting with adding or removing predictor variables. I have added my detailed analysis of the Titanic Dataset on GitHub.


This post first appeared on What The Data Says.
