Machine Learning Logistic Regression In Python: From Theory To Trading

By Vibhu Singh

In this blog post, we will learn how logistic regression works and implement it in Python to predict stock price movement.

Machine learning tasks fall roughly into two categories:

  1. The expected outcome is defined
  2. The expected outcome is not defined

The first category, where the data consists of inputs together with labelled outputs, is called supervised learning. The second, where the dataset consists of inputs without labelled responses, is called unsupervised learning. There is also a third category, reinforcement learning, in which the model is improved using feedback on its own actions.

Logistic regression falls under the category of supervised learning. It measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using the logistic (sigmoid) function. In spite of the name ‘logistic regression’, it is not used for regression problems, where the task is to predict a real-valued output. It is a classification algorithm, used to predict a binary outcome (1/0, -1/1, True/False) given a set of independent variables.

Logistic regression is similar to linear regression and can be viewed as a generalized linear model. In linear regression, we predict a real-valued output ‘y’ based on a weighted sum of the input variables x1, x2, …, xn:

$$y = c + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

The aim of linear regression is to estimate values for the model coefficients c, w1, w2, w3, …, wn that fit the training data with minimal squared error, and then to use them to predict the output y.

Logistic regression does the same thing, but with one addition. The logistic regression model computes a weighted sum of the input variables, just as linear regression does, but it then runs the result through a special non-linear function, the logistic or sigmoid function, to produce the output y. This output lies between 0 and 1 and is mapped to a binary class such as 0/1 or -1/1.

The sigmoid/logistic function is given by the following equation.

$$y = \frac{1}{1 + e^{-x}}$$

Its graph is an S-shaped curve that gets closer to 1 as the input variable increases above 0 and closer to 0 as the input variable decreases below 0. The output of the sigmoid function is 0.5 when the input variable is 0.

Thus, if the output is more than 0.5, we can classify the outcome as 1 (or positive) and if it is less than 0.5, we can classify it as 0 (or negative).
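The sigmoid function and the 0.5 cut-off described above can be sketched in a few lines. This is a minimal illustration rather than the article's own code; the helper names ‘sigmoid’ and ‘classify’ are assumptions, and an output of exactly 0.5 is assigned to class 0 here, a choice the text leaves open.

```python
import math

def sigmoid(x):
    """Logistic function: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    """Return 1 if sigmoid(x) is above the threshold, else 0 (assumed tie rule)."""
    return 1 if sigmoid(x) > threshold else 0

# The output moves towards 0 for negative inputs and towards 1 for positive ones
for x in (-4, 0, 4):
    print(x, round(sigmoid(x), 3), classify(x))
```

For very large negative inputs ‘math.exp’ can overflow, so a production implementation would need a numerically safer form.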

Now, let us consider the task of predicting the stock price movement. If tomorrow’s closing price is higher than today’s closing price, then we will buy the stock (1), else we will sell it (-1). If the model’s output is 0.7, we can say that there is an estimated 70% chance that tomorrow’s closing price will be higher than today’s closing price, and we classify the outcome as 1.
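To make the link between the weighted sum, the probability and the buy/sell label concrete, here is a small sketch. The function ‘predict_signal’, the weights and the intercept are hypothetical values chosen only for illustration; they are not fitted coefficients from the model built later in this post.

```python
import math

def predict_signal(features, weights, intercept, threshold=0.5):
    """Weighted sum -> sigmoid -> probability of an up move -> signal (1 or -1)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability_up = 1.0 / (1.0 + math.exp(-z))
    signal = 1 if probability_up > threshold else -1
    return probability_up, signal

# Hypothetical coefficients, for illustration only
weights = [0.8, -0.5]
intercept = 0.1

p_a, signal_a = predict_signal([1.0, 2.0], weights, intercept)
p_b, signal_b = predict_signal([2.0, 1.0], weights, intercept)
print(round(p_a, 3), signal_a)
print(round(p_b, 3), signal_b)
```

The first input gives a probability below 0.5 and therefore a sell signal, while the second gives a probability above 0.5 and a buy signal.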

Now, we have a basic intuition behind the logistic regression and the sigmoid function. We will learn how to implement logistic regression in Python and predict the stock price movement using the above condition.

Code Overview

Import The Libraries

We will start by importing the necessary libraries.

# Data manipulation
import numpy as np
import pandas as pd

# Technical indicators
import talib as ta

# Plotting graphs
import matplotlib.pyplot as plt

# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score

# Data fetching
from pandas_datareader import data as pdr
import yfinance as yf  # the fix_yahoo_finance package was renamed to yfinance
yf.pdr_override()

Import The Data

We will import the Nifty 50 data from 01-Jan-2000 to 01-Jan-2018. The data is fetched from Yahoo Finance using ‘pandas_datareader’.

df = pdr.get_data_yahoo('^NSEI', '2000-01-01', '2018-01-01')
df = df.dropna()
# Keep the four price columns by name rather than by position
df = df[['Open', 'High', 'Low', 'Close']]
df.head()

The code above prints the first five rows of the ‘Open’, ‘High’, ‘Low’ and ‘Close’ columns.

Define Predictor/Independent Variables

We will use the following indicators to make the prediction: the 10-day moving average, the correlation between the close price and that moving average, the relative strength index (RSI), the difference between today’s and yesterday’s open price, the difference between today’s open price and yesterday’s close price, and the open, high, low, and close prices.

df['S_10'] = df['Close'].rolling(window=10).mean()
df['Corr'] = df['Close'].rolling(window=10).corr(df['S_10'])
df['RSI'] = ta.RSI(np.array(df['Close']), timeperiod=10)
df['Open-Close'] = df['Open'] - df['Close'].shift(1)
df['Open-Open'] = df['Open'] - df['Open'].shift(1)
df = df.dropna()
X = df.iloc[:,:9]

You can print and check all the predictor variables used to make a prediction.
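To see what the rolling and shift operations produce, the same kind of features can be built on a tiny data frame. This is a sketch on hypothetical prices: it uses a 3-day window instead of the 10-day window above so that the toy data is long enough, and it leaves out the RSI and correlation columns.

```python
import pandas as pd

# Hypothetical prices, for illustration only
toy = pd.DataFrame({
    'Open':  [100.0, 102.0, 101.0, 105.0, 107.0],
    'Close': [101.0, 103.0, 104.0, 106.0, 105.0],
})

# 3-day simple moving average of the close (the post uses a 10-day window)
toy['S_3'] = toy['Close'].rolling(window=3).mean()
# Today's open minus yesterday's close
toy['Open-Close'] = toy['Open'] - toy['Close'].shift(1)
# Today's open minus yesterday's open
toy['Open-Open'] = toy['Open'] - toy['Open'].shift(1)

# Rows without a full window or a previous day contain NaN and are dropped
features = toy.dropna()
print(features)
```

Only the rows from the third day onwards survive ‘dropna’, which is why the real data frame also loses its first few rows.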

Define Target/Dependent Variable

The dependent variable is the same as in the example discussed above. If tomorrow’s closing price is higher than today’s closing price, then we will buy the stock (1), else we will sell it (-1).

y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)
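The labelling rule can be checked on a short series of hypothetical closing prices. The sketch below also shows an edge case worth knowing about: the last row has no next-day close, so the comparison is made against NaN and the label falls back to -1.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([100.0, 101.0, 99.0, 102.0])

# 1 if the next day's close is higher than today's close, else -1
y_toy = np.where(close.shift(-1) > close, 1, -1)
print(y_toy)

# For the last day, shift(-1) yields NaN, the comparison is False,
# and the label defaults to -1 even though the true outcome is unknown.
print(close.shift(-1).isna().iloc[-1])
```

One possible way to handle this, not done in the original code, is to drop the last row from both X and y before training.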

Split The Dataset

We will split the dataset into a training dataset and a test dataset, using the first 70% of the data to train and the remaining 30% to test. To do this, we create a ‘split’ variable that divides the data frame in a 70-30 ratio. ‘X_train’ and ‘y_train’ form the training dataset; ‘X_test’ and ‘y_test’ form the test dataset.

split = int(0.7*len(df))

X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]
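Note that this split is chronological: the model is trained on the earlier observations and tested on the later ones, with no shuffling. A minimal sketch of the same idea on a plain list, where the helper name ‘chronological_split’ is an assumption made for illustration:

```python
def chronological_split(rows, train_fraction=0.7):
    """Split an ordered sequence into an earlier training part and a later test part."""
    split = int(train_fraction * len(rows))
    return rows[:split], rows[split:]

days = list(range(10))  # stand-in for ten consecutive trading days
train, test = chronological_split(days)
print(train)
print(test)
```

Keeping the order intact means that no observation from the test period is seen during training.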

Instantiate The Logistic Regression

We will instantiate the logistic regression model using the ‘LogisticRegression’ class and fit it on the training dataset using the ‘fit’ method.

model = LogisticRegression()

model = model.fit(X_train, y_train)

Examine The Coefficients

# Pair each feature name with its fitted coefficient
pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})

Calculate Class Probabilities

We will calculate the probabilities of the class for the test dataset using ‘predict_proba’ function.

probability = model.predict_proba(X_test)

print(probability)

Predict Class Labels

Next, we will predict the class labels for the test dataset using the ‘predict’ function.

predicted = model.predict(X_test)

If you print the ‘predicted’ variable, you will observe that the classifier predicts 1 when the probability in the second column of ‘probability’ is greater than 0.5, and -1 when it is less than 0.5. The second column corresponds to class 1 because the columns follow the sorted class labels (-1, 1).
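The relationship between the probability columns and the predicted labels can be reproduced by hand. The probabilities below are hypothetical stand-ins for the output of ‘predict_proba’, and ‘classes’ plays the role of the fitted model’s sorted class labels.

```python
import numpy as np

classes = np.array([-1, 1])  # sorted class labels: column 0 is -1, column 1 is 1

# Hypothetical class probabilities; each row sums to 1
probability_toy = np.array([
    [0.62, 0.38],
    [0.45, 0.55],
    [0.30, 0.70],
])

# Pick the label whose column holds the larger probability in each row
predicted_toy = classes[np.argmax(probability_toy, axis=1)]
print(predicted_toy)
```

For two classes, choosing the column with the larger probability is the same as checking whether the second column exceeds 0.5.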

Evaluate The Model

Confusion Matrix

The confusion matrix describes the performance of a classification model on a test dataset for which the true values are known. We will calculate it using the ‘confusion_matrix’ function.

print(metrics.confusion_matrix(y_test, predicted))

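You can interpret the above matrix as follows: each row corresponds to an actual class and each column to a predicted class, with the labels in sorted order (-1, then 1). The diagonal entries count the correct predictions and the off-diagonal entries count the misclassifications.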
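As an illustration of how such a matrix is laid out, here is a hand-rolled version on a few hypothetical labels, with rows for the actual classes and columns for the predicted classes. The function ‘toy_confusion_matrix’ is written for this example and is not the scikit-learn implementation.

```python
def toy_confusion_matrix(actual, predicted, labels=(-1, 1)):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical labels, for illustration only
actual_toy    = [1, -1, 1, 1, -1, -1, 1, -1]
predicted_sig = [1, -1, -1, 1, 1, -1, 1, 1]

cm = toy_confusion_matrix(actual_toy, predicted_sig)
for row in cm:
    print(row)
```

Here the first row describes the days that were actually -1 and the second row the days that were actually 1; the diagonal holds the correct predictions.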

Classification Report

The classification report is another way to examine the performance of a classification model.

print(metrics.classification_report(y_test, predicted))

The f1-score summarizes how well the classifier identifies the data points of a particular class. It is the harmonic mean of precision (the share of predictions for that class that are correct) and recall (the share of actual members of that class that are found). The support is the number of samples in the test data whose true label is that class.
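The arithmetic behind these figures is short enough to work through by hand. The counts below are hypothetical and refer to a single class: true positives (TP), false positives (FP) and false negatives (FN).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (f1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for class 1, for illustration only
tp, fp, fn = 3, 2, 1
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
support = tp + fn  # number of samples whose true label is this class

print(round(precision, 3), round(recall, 3), round(f1, 3), support)
```

Because the harmonic mean is pulled towards the smaller of the two values, a class with high recall but low precision (or the reverse) still gets a modest f1-score.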

Model Accuracy

We will calculate the model accuracy on the test dataset using ‘score’ function.

print(model.score(X_test, y_test))

0.528

The model’s accuracy on the test dataset is roughly 52.8%.

Cross-Validation

We will cross-check the accuracy of the model using 10-fold cross-validation. For this, we will use the ‘cross_val_score’ function imported earlier.

cross_val = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)

print(cross_val)

print(cross_val.mean())

The mean cross-validated accuracy is again about 52%, in line with the accuracy measured on the test dataset.
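To illustrate what k-fold cross-validation does with the data, the sketch below splits sample indices into k contiguous folds; each fold serves once as the test set while the remaining folds form the training set. The helper ‘k_fold_indices’ is an assumption written for this example. scikit-learn’s own splitting differs in details: for classifiers given an integer ‘cv’, it uses a stratified variant that keeps the class proportions similar across folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
print(len(folds))
print([len(test_idx) for _, test_idx in folds])
```

The reported cross-validation score is the mean of the accuracies obtained on the individual folds.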

Create Trading Strategy Using The Model

We will predict the signal to buy (1) or sell (-1) and calculate the cumulative Nifty 50 returns for the test dataset. Next, we will calculate the cumulative strategy returns based on the signal predicted by the model in the test dataset. We will also plot the cumulative returns.

df['Predicted_Signal'] = model.predict(X)
df['Nifty_returns'] = np.log(df['Close']/df['Close'].shift(1))
Cumulative_Nifty_returns = np.cumsum(df[split:]['Nifty_returns'])
df['Strategy_returns'] = df['Nifty_returns'] * df['Predicted_Signal'].shift(1)
Cumulative_Strategy_returns = np.cumsum(df[split:]['Strategy_returns'])
plt.figure(figsize=(10,5))
plt.plot(Cumulative_Nifty_returns, color='r',label = 'Nifty Returns')
plt.plot(Cumulative_Strategy_returns, color='g', label = 'Strategy Returns')
plt.legend()
plt.show()
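The return calculation can be traced on a handful of hypothetical prices and signals. As in the code above, the signal produced at yesterday’s close decides today’s position, so each day’s log return is multiplied by the previous day’s signal.

```python
import math

# Hypothetical closing prices and predicted signals, for illustration only
closes  = [100.0, 102.0, 101.0, 103.0]
signals = [1, -1, 1, 1]  # signal generated at each day's close

# Daily log returns for days 1..3 (day 0 has no previous close)
market_returns = [math.log(closes[i] / closes[i - 1]) for i in range(1, len(closes))]

# Yesterday's signal times today's return
strategy_returns = [signals[i - 1] * market_returns[i - 1] for i in range(1, len(closes))]

print(round(sum(market_returns), 4))
print(round(sum(strategy_returns), 4))
```

In this toy case the sell signal coincides with the one falling day, so the cumulative strategy return ends above the cumulative market return. Transaction costs are ignored, as they are in the code above.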

Conclusion

It can be observed that the logistic regression model predicts the classes with an accuracy of approximately 52% and, on the test dataset, generates good returns. Now it’s your turn to play with the code by changing the parameters and creating a trading strategy based on it.

Next Step

Are you keen to learn various aspects of Algorithmic trading to enhance your existing skill set or to start trading on your own? Check out the Executive Programme in Algorithmic Trading (EPAT). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now to begin your career in Algorithmic Trading.

This post first appeared on Best Algo Trading Platforms Used In Indian Market.