
Practical Data Mining with Python - Part 3

In the previous posts we discussed how to use the pandas library for data pre-processing and how to check correlations with scikit-learn. We can now continue our discussion and predict future dengue cases based on historical data.

We will be using the same dataset from the DrivenData website in this discussion too. In addition to the datasets we used in the previous post (the training features set and the training labels set), we will use the testing features set and the submission format in this tutorial. All of these can be downloaded from this competition on the DrivenData website. From the previous post, we identified some features that have a higher correlation with the number of reported dengue cases than the others. For the city of San Juan, Puerto Rico (indicated as sj in the dataset), the highly correlated features are:

  • reanalysis_specific_humidity_g_per_kg
  • reanalysis_dew_point_temp_k
  • station_avg_temp_c
  • reanalysis_max_air_temp_k

And for Iquitos in Peru, the four most correlated features are:

  • reanalysis_specific_humidity_g_per_kg
  • reanalysis_dew_point_temp_k
  • reanalysis_min_air_temp_k
  • station_min_temp_c

The following sections describe the code in this GitHub gist.

Now that we have a set of features and the datasets in hand, we can focus on building a machine learning model to predict future dengue cases.

As usual, we load the training dataset into the application using the pandas.read_csv() function and then fill the missing values (lines 31-32). Here too I'm using the forward-fill method from the pandas library for simplicity; we will discuss different techniques for filling missing values in an upcoming blog post. That said, forward filling is well suited to time-series prediction scenarios like this one, where the most recent values are used to fill the missing ones.

import pandas as pd

df = pd.read_csv('Data/lag_dengue_features_train.csv', index_col=[0, 1, 2])
df.fillna(method='ffill', inplace=True)

Then we need to filter out the data for each city, as we are going to build a separate machine learning model for each of the two cities. All features apart from the selected features mentioned above are then removed from the dataframe, as below.

sj = df.loc['sj']
iq = df.loc['iq']
features_sj = ['reanalysis_specific_humidity_g_per_kg', 'reanalysis_dew_point_temp_k',
               'station_avg_temp_c', 'reanalysis_max_air_temp_k']
features_iq = ['reanalysis_specific_humidity_g_per_kg', 'reanalysis_dew_point_temp_k',
               'reanalysis_min_air_temp_k', 'station_min_temp_c']
sj = sj[features_sj]
iq = iq[features_iq]

Then we need to do the same to the test dataset. The code from line 43 to line 50 does that, as described above.

df_test = pd.read_csv('Data/lag_dengue_features_test.csv', index_col=[0, 1, 2])
df_test.fillna(method='ffill', inplace=True)
sj_test = df_test.loc['sj']
iq_test = df_test.loc['iq']
sj_test = sj_test[features_sj]
iq_test = iq_test[features_iq]

Then the labels, i.e. the expected results for the training data, are also loaded, to be used with their corresponding training data to train the models.

df_labels = pd.read_csv('Data/lag_dengue_labels_train.csv', index_col=[0, 1, 2])
sj_labels = df_labels.loc['sj']
iq_labels = df_labels.loc['iq']

Now that we have loaded all the necessary data, we can proceed to evaluate different models to find the best set of parameters. We will then use those parameters (alpha, in this scenario) to build a better model and produce predictions for the test data.

To evaluate a model we will use the evaluate method (lines 6 to 29) in the code.

from sklearn import linear_model
from sklearn.model_selection import train_test_split

def evaluate(train_set, features, a):
    total_score = 0
    for x in range(10):
        # split into 80% training and 20% testing data
        train, test = train_test_split(train_set, train_size=0.8)
        train_data = train[features]
        train_target = train.total_cases
        test_data = test[features]
        test_target = test['total_cases']
        testModel = linear_model.Lasso(alpha=a)
        testModel.fit(train_data, train_target)
        test_results = testModel.predict(test_data)
        test_results = [int(round(i)) for i in test_results]
        # accumulate absolute errors, then average them (mean absolute error)
        MAE = 0
        for index in range(len(test_results)):
            MAE += abs(test_results[index] - test_target.iloc[index])
        total_score += MAE / float(len(test_results))
    return total_score / 10.0

Let me explain this function line by line. It takes a dataset, a feature set and an alpha value, and produces an average score that can later be used to measure the accuracy of a model built with that alpha value.
First we set the total score to 0; this variable accumulates the score of each model. Then we loop 10 times, evaluating 10 similar models (this number can be changed according to the sample size you want). In each iteration the dataset is split into a train set and a test set using the train_test_split method provided with scikit-learn. The expected results (the dengue cases, in this scenario) are then separated out from the training and test data. Then we create a model as testModel, using the Lasso algorithm, a common algorithm in scikit-learn. (I will post a short blog post about this algorithm in the near future; until then you can visit the scikit-learn documentation or this video tutorial to learn more about the Lasso algorithm.) Then we fit the training data with the training target values, as shown in line 18.
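For reference, the objective that scikit-learn's Lasso minimizes is

(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1

so larger alpha values shrink more of the coefficients w towards (and to exactly) zero, while very small alpha values make the model behave almost like ordinary least-squares regression.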
The newly built model is then used to predict the test data values, as shown in line 20, and the results are stored as test_results. Then, from line 23 to line 27, each predicted value is compared against its target value and the absolute difference is accumulated in MAE. After all instances have been evaluated, the averaged MAE value is added to total_score. Finally, the average of the ten scores is returned from the function.
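As a side note, the manual loop above computes the mean absolute error, so the same score could also be obtained with scikit-learn's built-in helper:

from sklearn.metrics import mean_absolute_error

score = mean_absolute_error(test_target, test_results)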

Now that we have understood how the evaluate() method works, let's look at how it is used in this program.

We define the set of possible alpha values as a list, together with best-score and best-alpha variables for the two cities, as below.

alphas = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001, 0.00000001]
bestScore_sj = 1000
bestScore_iq = 1000
bestAlpha_sj = 0.1
bestAlpha_iq = 0.1

Then we loop through the alpha values and pass each of them to evaluate, together with the training dataset of each city, to find the best alpha value per city, as below (lines 66 to 75).

for alpha in alphas:
    sj_score = evaluate(sj.join(sj_labels), features_sj, alpha)
    if sj_score < bestScore_sj:
        bestScore_sj = sj_score
        bestAlpha_sj = alpha
    iq_score = evaluate(iq.join(iq_labels), features_iq, alpha)
    if iq_score < bestScore_iq:
        bestScore_iq = iq_score
        bestAlpha_iq = alpha

Now that we know the best alpha values for each city's dataset, we can train a model on the whole training dataset and predict the test data values as below.

model_sj = linear_model.Lasso(alpha=bestAlpha_sj)
model_iq = linear_model.Lasso(alpha=bestAlpha_iq)
model_sj.fit(sj.values, sj_labels.total_cases)
model_iq.fit(iq.values, iq_labels.total_cases)
results_sj = model_sj.predict(sj_test)
results_iq = model_iq.predict(iq_test)

Now we have the predicted results, which need to be saved as output according to the requirements of the competition. First, we join the results of the two cities. Then we round the result values to integers, as the competition requires, and convert any negative values to 0, since the number of dengue cases cannot be less than 0.
Finally, we load the submission format and save our results as a separate file for evaluation purposes.
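The gist contains the exact code; a minimal sketch of this post-processing step could look like the following (the 'Data/submission_format.csv' path, the output file name and the total_cases column are assumptions based on the competition's submission format):

results = list(results_sj) + list(results_iq)
# round to integers and clip negative predictions to 0
results = [max(0, int(round(r))) for r in results]

submission = pd.read_csv('Data/submission_format.csv', index_col=[0, 1, 2])
submission.total_cases = results
submission.to_csv('Data/predictions.csv')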
This pattern of loading data, evaluating to find the best configuration, and then building the model can be used with most algorithms, and keeping the structure as it is while swapping algorithms simplifies most of the tasks. We will discuss more algorithms, while improving other aspects such as adding lagged values and building more complex features, in upcoming posts.
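For example, trying a different regressor (Ridge here, purely as an illustration, not part of the original gist) only requires changing the model-creation line inside evaluate() and in the final model-building step:

testModel = linear_model.Ridge(alpha=a)  # instead of linear_model.Lasso(alpha=a)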



