...but still you may think you have a heck of a model.
In the latest post of our Predicting Churn series of articles, we sliced and diced the data from Mailchimp to gain some insight and to try to predict which users are likely to churn. In principle, defining churn is a difficult problem; it was even the subject of a lawsuit against Netflix [1].
However, in the case of email marketing, the task is seemingly easier, as a user can be considered churned when they unsubscribe from the list. With a clear definition of churn in our case, we can proceed and start working with the available data. In the previous post, after a long process, we ended up with a satisfactory result.
In data science it's imperative to have a feedback loop in place, where we try something -> get feedback & results -> learn from feedback & results and then try something new.
Like many other problems in data science, there is no silver bullet method for predicting churn. My feedback loop was the awesome Redditors in the Data Science subreddit. Their feedback pointed out that we had not taken into account that the data are serially correlated. That means they may have an internal structure such as autocorrelation, trend or seasonal variation.
There is a crack in everything. That's how the light gets in. [2]
After this realization, we will try a different approach to understand the underlying forces and structures that produced the observed data and to fit a model for forecasting.
To reflect the sequential nature of the data in an email campaign, each subscriber must be represented by multiple data points in the dataset, each corresponding to a certain value of the time indicator we choose.
Monthly or yearly intervals, days of subscription, or the "serial number" of each email received can all serve as “time indicators”. The appropriate indicator depends on the data we have.
For example, in a case where we have only two years of data from Mailchimp, the yearly intervals may be too broad.
Based on this, the selected indicators are:
- Days of subscription
- Email serial number: It is an ordinal number where 1 corresponds to the 1st email, 2 to the 2nd, etc.
For this analysis, we consider the email serial number more suitable. In the context of an email marketing campaign, a subscriber is more likely to churn right after receiving an email they consider irrelevant than on a random day.
Moreover, this choice may lead to more actionable insights, as it tells a marketer which subscribers are most likely to churn at the next email.
Of course, we encourage you to make different choices and evaluate which indicator works best for you.
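To make the "one subscriber, many data points" idea concrete, here is a minimal Python sketch (our own analysis was done in R, so this is only an illustration). The `Subscriber` class, its fields and the sample addresses are hypothetical; `mailCount` and `status` follow the names used later in this post, and we assume `status` is 1 only on the email that preceded the unsubscribe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subscriber:
    # Hypothetical record; field names are assumptions for illustration.
    email: str
    emails_received: int          # campaign emails sent to this subscriber so far
    churned_after: Optional[int]  # serial number of the last email before unsubscribing


def to_time_series(sub: Subscriber) -> list[dict]:
    """Return one row per email received, indexed by the email serial number."""
    last = sub.churned_after if sub.churned_after is not None else sub.emails_received
    rows = []
    for serial in range(1, last + 1):
        rows.append({
            "email": sub.email,
            "mailCount": serial,
            # status is 1 only on the email after which the subscriber churned
            "status": int(serial == sub.churned_after),
        })
    return rows


subscribers = [
    Subscriber("ann@example.com", emails_received=3, churned_after=None),
    Subscriber("bob@example.com", emails_received=4, churned_after=2),
]
dataset = [row for s in subscribers for row in to_time_series(s)]
for row in dataset:
    print(row)
```

Swapping the serial number for days of subscription would only change what the loop iterates over; the long, one-row-per-time-step shape stays the same.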
The first steps of our analysis are much the same as in our previous post, so we will run through them in less detail.
The data from Mailchimp we utilize are the same as before. We also included the serial number of the emails each member of a mailing list received before churning. We also have the corresponding sign-out timestamp from the `Unsubscribes` table. The `Unsubscribes` table keeps track of the email of each churned recipient, the time at which they churned, the reason and some other information.
We used Blendo, an ETL-as-a-service platform, to consistently store our email marketing data from Mailchimp in a PostgreSQL database, so we can move forward and join the two tables together.
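As a rough illustration of such a join, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The table names `members` and `unsubscribes`, the column `timestamp_in` and the sample rows are assumptions made up for this example; only `email_address` and `timestamp_out` are names that appear in our data.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (email_address TEXT, timestamp_in TEXT);
    CREATE TABLE unsubscribes (email_address TEXT, timestamp_out TEXT, reason TEXT);
    INSERT INTO members VALUES
        ('ann@example.com', '2016-01-10'),
        ('bob@example.com', '2016-02-01');
    INSERT INTO unsubscribes VALUES
        ('bob@example.com', '2016-03-15', 'no longer interested');
""")

# LEFT JOIN keeps every member; timestamp_out stays NULL for those who have not churned.
rows = conn.execute("""
    SELECT m.email_address, m.timestamp_in, u.timestamp_out, u.reason
    FROM members AS m
    LEFT JOIN unsubscribes AS u ON u.email_address = m.email_address
    ORDER BY m.email_address
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

A left join is used here so that subscribers who never churned are kept in the result, which is what produces the missing sign-out timestamps discussed later.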
Now, we are going to evaluate the underlying structure of the data from the perspective of our new variables. The conclusions we drew in the previous post still hold, and in addition to those, we will investigate our new variables and maximize our insight into them.
The new measure we introduced, `days_since`, reveals that there are certain time periods during which members are more likely to churn. For example, the “high risk” periods seem to be:
- The early days of subscription: During these days a subscriber is likely to realize that they are not interested in the content of the emails they receive and so unsubscribe.
- After a year of subscription: Perhaps these are users who have already interacted with the company and are unhappy with its services.
As far as the serial number of the email is concerned, we can see that the max density is reached at less than five emails both for those who have churned and those who have not. That means there is a significant number of recipients who have subscribed quite recently and thus haven’t yet received many emails from the campaigns.
The last observation may prove to be a problem in our analysis, as it seems that our time series have very few data points, so predicting the future behavior of customers based on them will be difficult.
On the other hand, having many years of good data makes predictions more accurate and reveals the actual predictive ability of each feature.
Feature Engineering is about combining existing features into new ones that are more meaningful. That is a crucial part of our analysis, as the quantity and quality of the features we use will influence the results we achieve. The way in which the existing features can be combined depends entirely on the problem we are trying to solve.
| Feature | Description |
|---|---|
| `mailCount` | The serial number of each email for each recipient. |
| `personalMail` | Detection of personal/business email. If the service provider of a recipient belongs to a list of the most common ones, the email is classified as personal, otherwise as business. |
| `daysSub` | The number of days the recipient remained subscribed. |
| `totalpractions` | The total number of actions the user has performed until now. |
| `avgactions` | The average number of actions the user performs per email until now. |
| `days_since` | Number of days since the last email the recipient received. |
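Two of the features above can be sketched in a few lines of Python (again, only as an illustration of the idea, not our R code). The provider list is a hypothetical example, and we assume that "until now" includes the current email.

```python
# Hypothetical list of common personal email providers (assumption).
COMMON_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def personal_mail(email_address: str) -> bool:
    """Classify an address as personal if its domain is a common provider."""
    domain = email_address.rsplit("@", 1)[-1].lower()
    return domain in COMMON_PROVIDERS


def running_action_features(actions_per_email: list[int]) -> list[dict]:
    """Build cumulative features, one row per email received by a recipient."""
    rows = []
    total = 0
    for mail_count, actions in enumerate(actions_per_email, start=1):
        total += actions  # actions performed up to and including this email
        rows.append({
            "mailCount": mail_count,
            "totalpractions": total,
            "avgactions": total / mail_count,
        })
    return rows


print(personal_mail("ann@gmail.com"))
print(personal_mail("bob@acme-corp.com"))
for row in running_action_features([2, 0, 1]):
    print(row)
```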
At this stage, we transform the data so that we can handle them in R. The two steps we followed are:
Dealing with NAs
The existence of NAs is a very common problem in every dataset in the real world. Especially when the data come from forms the users fill in, blank fields are in most cases present.
Depending on the feature, NAs are handled differently.
For example, the `timestamp_out` variable will have NAs in cases where the recipient has not churned. The same applies to the variable `totalpractions` for users with not even one interaction with the email.
On the other hand, missingness in `email_address`, which is the primary identifier for the users, cannot be handled, so the record is removed as we do not know who it belongs to.
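A minimal Python sketch of this per-feature handling could look as follows. Filling a missing `totalpractions` with 0 and deriving a `churned` flag from `timestamp_out` are our assumptions for this example, and the sample records are made up.

```python
def clean_records(records: list[dict]) -> list[dict]:
    """Apply feature-specific NA handling and return new, cleaned records."""
    cleaned = []
    for rec in records:
        if not rec.get("email_address"):
            continue  # no identifier: we cannot tell who this is, so drop the record
        rec = dict(rec)  # copy so the input is left untouched
        if rec.get("totalpractions") is None:
            rec["totalpractions"] = 0  # assumption: missing means no interaction
        # A missing timestamp_out is informative: the recipient has not churned.
        rec["churned"] = int(rec.get("timestamp_out") is not None)
        cleaned.append(rec)
    return cleaned


records = [
    {"email_address": "ann@example.com", "timestamp_out": None, "totalpractions": 4},
    {"email_address": "bob@example.com", "timestamp_out": "2016-03-15", "totalpractions": None},
    {"email_address": None, "timestamp_out": None, "totalpractions": 1},
]
cleaned = clean_records(records)
for rec in cleaned:
    print(rec)
```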
Splitting in train and test dataset
At this stage, the initial dataset is split into train and test dataset. The train dataset is used for model construction and the test for model evaluation. Evaluation of the model’s performance on unseen data through test dataset is necessary before deploying the model to assess the quality and the trustworthiness of our results.
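Because every recipient now contributes several rows, one reasonable way to split (an assumption on our side, not necessarily the only valid choice) is by recipient, so that all rows of a given recipient land in either the train or the test set. The function name, the 30% test fraction and the seed in this Python sketch are arbitrary.

```python
import random


def split_by_recipient(rows, test_fraction=0.3, seed=42):
    """Split rows into train/test so that no recipient appears in both sets."""
    recipients = sorted({row["email_address"] for row in rows})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(recipients)
    n_test = max(1, round(len(recipients) * test_fraction))
    test_ids = set(recipients[:n_test])
    train = [r for r in rows if r["email_address"] not in test_ids]
    test = [r for r in rows if r["email_address"] in test_ids]
    return train, test


# Toy data: 10 recipients with 3 emails each.
rows = [
    {"email_address": f"user{i}@example.com", "mailCount": k}
    for i in range(10)
    for k in range(1, 4)
]
train, test = split_by_recipient(rows)
print(len(train), len(test))
```

Splitting row by row instead would let the model see earlier emails of a recipient during training and be evaluated on later emails of the same recipient, which tends to make the evaluation look better than it is.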
At this point, we have to construct appropriate models all over again, as we have completely altered the form of the dataset since our previous post. In the new dataset every recipient is represented by a time series, which leads to a whole new problem.
So here we go...
After having defined our output variable as the `status` (subscribed/unsubscribed) of the recipient after the receipt of an email, we can move on to choosing an appropriate method for modeling. The paths we can follow are many, but here we are going to present only some of them.
When working with linear models such as logistic regression, it is a good idea to utilize a variable selection method like the lasso, which leads to model simplification, reduced overfitting and shorter training time.
Additionally, and as we will see below, such methods might help you better understand the data you are working with and whether this type of model is a good choice for the problem at hand. In this way, a data scientist can save a lot of time otherwise spent on models that will not work in the end.
The lasso works by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, which leads to some coefficients being set to exactly zero, effectively choosing a simpler model that includes only the relevant features.
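To see how coefficients end up at exactly zero, here is a small pure-Python sketch of the lasso fitted by coordinate descent. It uses the equivalent penalized form (squared error plus `lam` times the sum of absolute coefficients) and a linear rather than logistic model to keep things short, so it illustrates the mechanism and is not the procedure we ran in R. The toy data and the values of `lam` are made up.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * sum(|b_j|), assuming centered data."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Toy data: x1 drives y strongly, x2 only weakly.
x1 = [-2, -1, 0, 1, 2]
x2 = [1, -2, 0, 2, -1]
X = [[a, b] for a, b in zip(x1, x2)]
y = [3 * a + 0.2 * b for a, b in zip(x1, x2)]

beta_moderate = lasso_coordinate_descent(X, y, lam=0.5)
beta_strong = lasso_coordinate_descent(X, y, lam=10.0)
print(beta_moderate)  # the weak feature x2 is dropped
print(beta_strong)    # with a strong enough penalty every coefficient is zero
```

The second call mirrors the situation where no variable survives the selection: when no feature is related to the output strongly enough relative to the penalty, the lasso returns an empty model.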
In our case, none of the variables was included in our model. That means that none of them seems to have predictive ability for the variable `status`, so trying to build a linear model will not work with the data we currently have.
Despite the strong evidence we have, from both the plots and the lasso, that our data are insufficient, we are going to continue the analysis and see what we come up with.
The first model we considered was the logistic regression. The output of the model is the probability of the positive class, i.e. the probability that a recipient will churn after receiving the next email. However, before moving on, we should check if the statistical assumptions of the model are satisfied. Violation of any of the following assumptions may lead to misleading or biased results.
Assumption #1: The dependent variable (in our case `status`) should be measured on a dichotomous scale. The assumption is satisfied, as the variable `status` takes only two values (0 or 1).
Assumption #2: The observations should be independent, and the dependent variable should have mutually exclusive and exhaustive categories. In our case, of course, the observations are not independent: for every recipient, observation i is highly serially correlated with observation i-1, as both are snapshots of a user’s interaction with the email campaign at a given moment.
Based on the above, a simple logistic model will give us misleading results unless the violation of the independence assumption is addressed.
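One quick way to see the dependence between consecutive observations is the lag-1 autocorrelation of a feature within one recipient's series. The Python sketch below uses made-up numbers: a cumulative count such as `totalpractions` for a single hypothetical recipient, compared with an artificial alternating series.

```python
def lag1_autocorrelation(series):
    """Sample autocorrelation between each observation and the one before it."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return num / denom


# Cumulative actions of one hypothetical recipient over eight emails.
total_actions = [0, 1, 1, 2, 3, 3, 4, 5]
print(lag1_autocorrelation(total_actions))

# For contrast: a series that flips sign at every step.
print(lag1_autocorrelation([1, -1, 1, -1]))
```

A value well above zero for the cumulative feature means that knowing observation i-1 tells us a lot about observation i, which is exactly what the independence assumption rules out.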
However, we can go on and try a decision tree or a random forest, if the number of variables is sufficient.
Decision Tree & Random Forest
Although this is the approach we followed in the previous post too, it is worth examining it again now that our data are expressed as time series. This change in the perspective from which we examine the data might alter the outcome of the method drastically.
In a classification decision tree, each node corresponds to one of the input variables, and there are edges to children nodes for each of the possible values of the input. Each leaf represents a possible decision of the tree given the input values for each feature represented by the path from the root to the leaf.
Random Forest, on the other hand, is nothing more than a set of trees, each of them trained differently, as explained in our previous post. The final decision is computed as the most common answer among the individual decisions of all trees.
Although random forests have certain advantages over decision trees, such as resistance to overfitting and more robust results, it is important to make sure that we have a fairly large number of different variables so that the trees can be trained differently.
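The voting step can be sketched in Python as follows. The three "trees" are hand-written rules with invented thresholds, standing in for trained trees; they are not findings from our data.

```python
from collections import Counter

# Three hypothetical "trees", each reduced to a single made-up rule (1 = churn).
trees = [
    lambda row: int(row["avgactions"] < 0.2),
    lambda row: int(row["mailCount"] <= 2),
    lambda row: int(row["daysSub"] > 365),
]


def forest_predict(trees, row):
    """Return the most common answer among the individual tree decisions."""
    votes = [tree(row) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


row = {"avgactions": 0.1, "mailCount": 2, "daysSub": 40}
print(forest_predict(trees, row))
```

Using an odd number of trees avoids ties in a two-class problem like ours.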
Getting back to our data, we tried to grow a decision tree, but we got only the root, meaning that the tree did not split on any of our variables and thus didn’t grow. That is not so uncommon, since according to the documentation: “Any split that does not decrease the overall lack of fit by a factor of cp is not attempted”.
Of course, we can loosen the control parameters to force the tree to grow, but this will probably lead to totally misleading results with a high Type I error rate.
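The effect of `cp` can be illustrated with a simplified Python analogue. This is not the actual implementation behind the quoted documentation: here "lack of fit" is taken to be the number of misclassified rows, only single-threshold splits on one variable are tried, and the data are made up. The default of 0.01 for `cp` matches the default of `rpart.control` in R.

```python
def misclassified(labels):
    """Errors made when predicting the majority class for these labels."""
    ones = sum(labels)
    return min(ones, len(labels) - ones)


def best_split_improvement(values, labels):
    """Relative decrease in root error achieved by the best single split."""
    root_err = misclassified(labels)
    if root_err == 0:
        return 0.0  # the root is already pure
    best = root_err
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        best = min(best, misclassified(left) + misclassified(right))
    return (root_err - best) / root_err


def would_split(values, labels, cp=0.01):
    """A split is attempted only if it improves the fit by at least a factor of cp."""
    return best_split_improvement(values, labels) >= cp


mail_count = [1, 2, 3, 4, 5, 6, 7, 8]
scattered_churn = [0, 1, 0, 0, 0, 1, 0, 0]  # churners spread across the range
early_churn = [1, 1, 1, 0, 0, 0, 0, 0]      # churners concentrated at the start

print(would_split(mail_count, scattered_churn))
print(would_split(mail_count, early_churn))
```

In the first case no threshold reduces the number of errors, so the tree stays a bare root, much like ours did. Lowering `cp` towards zero would let such splits through, but they would be fitting noise.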
Overall Data Evaluation
Now we have strong evidence that the data we are using are not capable of predicting whether a recipient will or will not churn after receiving the next email. This conclusion was derived from the plot of `mailCount` against subscription status, the lasso output and the fact that the decision tree did not grow.
To define the reasons why this is happening, we need to perform further investigation. However, the most probable ones are:
- Lack of sufficient historical data
- Lack of variables with enough information about the dependent variable.
Trying different methods will not provide us with good results as now we are convinced that the problem lies in the quality of the data we have at our disposal. To overcome the obstacle, it is necessary that we enrich our data before moving to other approaches or reevaluate those mentioned before.
Although it seems that the data we have are not sufficient to develop a model with good predictive ability, don’t be too disappointed as this is a very common case in real world data.
Even in our case, this is not very surprising as the dataset contained no information about the context of the emails being sent, the recipient's demographics and the rest of their interaction with the company that launched the email marketing campaigns. For example, the disappointment of a customer regarding support is not reflected in any variable in our dataset.
To overcome that barrier, we need to enrich the dataset with data from other sources such as customer support events. That is an area where a data platform like Blendo really shines, as you can easily pull more data into your data warehouse and focus on experimenting with your models.
After resolving the data quality problem, we can move forward and reevaluate the above techniques and also experiment with others like ARIMA, also known as the Box-Jenkins approach, conditional logit models, or time series classification methods like SVM, k-NN and neural networks with the Discrete Wavelet Transform (DWT).
Note: Many thanks to redditors: easy_being_green & neo82087
[1] Definition of churn rate and the case of Netflix
[2] Leonard Cohen, Selected Poems, 1956-1968