
Boost your career: Interesting insights into Data Mining & Machine Learning for economists


Author: Kateryna Volkovska
 
 
 
With this post, I want to share some useful information and introduce a couple of Machine Learning (ML) concepts, as well as highlight differences and similarities between ML and econometrics. I also aim to show how Data Mining (DM) and ML can be used by economists.

Basics

So first, let's clarify the definitions:
  • ML uses data to predict some variable as a function of other variables (it focuses on computing a good prediction of y given new values of x).
  • Econometrics uses statistical methods for prediction and inference about economic relationships.
In general, econometricians are thought to start with a theoretical model and then build an empirical model that validates or invalidates the theory. Machine learners always start from the data.
Machine learning techniques (such as decision trees, support vector machines (SVM), neural networks and deep learning) allow for more effective ways to model complex economic relationships.
 
Table 1. The comparison of the aims of Econometrics, Machine Learning and Data Mining.

Econometrics           | Machine Learning        | Data Mining
prediction             | prediction              | summarization
summarization          | extract info from data  | finding patterns
estimation             |                         | visualization
hypothesis testing     |                         | data manipulation
extract info from data |                         |

Source: Varian (2014)

So the main difference is that ML, for the most part, deals with pure prediction, while econometrics cares more about causal inference.

Predict & Classify

When an econometrician faces a prediction problem, he or she usually employs linear or logit regression. However, ML offers more advanced nonlinear methods that are better suited to big data sets. Here are some of them (a quick sketch follows the list):
  • Regression trees;
  • Random forests;
  • Least absolute shrinkage and selection operator (LASSO), a regularized regression method;
  • Least-angle regression (LARS).
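As a quick illustration of how accessible these methods are, here is a minimal sketch, using scikit-learn on made-up data (so the numbers themselves mean nothing), that fits a random forest and a LASSO to the same synthetic regression problem and compares their out-of-sample fit:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration: 500 observations, 20 regressors,
# only 5 of which actually matter.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: an average over many regression trees.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# LASSO with the penalty parameter chosen by cross-validation.
lasso = LassoCV(cv=5).fit(X_train, y_train)

print("Random forest out-of-sample R^2:", rf.score(X_test, y_test))
print("LASSO out-of-sample R^2:        ", lasso.score(X_test, y_test))
```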
What economists know as the problem of out-of-sample prediction is closely related to what machine learners call overfitting: a model that fits the training data too closely tends to predict poorly on new data. A common task for both fields is classification. While an econometrician usually reaches for logit or probit here, ML suggests using decision trees to classify observations in a way that leads to good out-of-sample predictions (in the literature you will find the abbreviation CART, for classification and regression trees). The key feature of decision trees is that they capture non-linearities in the data, while logistic regression does not, so the ML tool often does better here (see the sketch below). Another nice aspect of ML is its preference for averaging over many small models, which usually gives better out-of-sample predictions than choosing a single model.
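To see why trees help with classification, here is another minimal sketch on a deliberately non-linear synthetic data set (again made up, so treat it only as an illustration); a single decision tree typically beats the linear logit here, and a random forest is exactly the "average over many small models" mentioned above:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately non-linear classification problem (illustration only).
X, y = make_moons(n_samples=1000, noise=0.25, random_state=0)

models = {
    "logit": LogisticRegression(),
    "decision tree (CART)": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest (model averaging)": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Cross-validated accuracy stands in for out-of-sample performance.
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```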

Data Structures and Dimensionality Reduction

So what are the differences between the data structures most commonly used in ML and econometrics? Econometricians usually deal with time-series and panel data, while machine learners prefer cross-sectional data with independent, identically distributed observations. For time series, however, ML offers a method called Bayesian structural time series (BSTS), designed to handle variable selection problems in time-series applications (a rough sketch of a related model follows).
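BSTS itself is usually run through Google's bsts package in R, and there is no single standard Python implementation; as a rough, non-Bayesian stand-in, here is a minimal structural time-series sketch with statsmodels on made-up monthly data, fitting a local linear trend plus a seasonal component, the same building blocks BSTS starts from:

```python
import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

# Made-up monthly series with trend and seasonality (illustration only).
rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, size=120)

# Local linear trend + seasonal structural model, fitted by maximum likelihood
# (BSTS would instead sample these components with Bayesian methods and add
# spike-and-slab variable selection for regressors).
model = UnobservedComponents(y, level="local linear trend", seasonal=12)
result = model.fit(disp=False)
print(result.summary())
print(result.forecast(steps=12))  # one-year-ahead forecast
```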
I think all of you have heard about principal component analysis (PCA) for dimensionality reduction. It is in fact an ML method, but it is widely used by econometricians and mathematicians. I also used it in my Bachelor thesis when analysing which particular factors most influence the costs of insurance companies in the USA (a minimal sketch is below).
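For completeness, here is a minimal PCA sketch with scikit-learn on a hypothetical, randomly generated data matrix (so the resulting components carry no real meaning); the one practical point it encodes is that variables should be standardised before applying PCA:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data matrix: 200 firms described by 15 cost-related indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))

# Standardise first: PCA is sensitive to the scale of the variables.
X_std = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_std)

print("Components kept:", pca.n_components_)
print("Explained variance ratios:", pca.explained_variance_ratio_)
```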

Regression Everywhere

Another common tool for machine learning specialists and econometricians is regression analysis. Its primary goal is to understand, as far as possible with the available data, how the conditional distribution of the response y varies across subpopulations determined by the possible values of the predictor or predictors (Cook and Weisberg (1999)). In this post I want to draw your attention to the following economic example, which illustrates methods for variable selection in the context of growth regressions (Varian 2014).
In the example, Varian uses the dataset from Sala-i-Martin (1997) of 72 countries and 42 variables in order to determine the most important variables for economic growth. Sala-i-Martin (1997) computed all possible subsets of regressors and used the results to construct a measure called CDF(0). In the table below you can see the variables with the highest CDF(0), which are therefore the most useful in explaining economic growth according to Sala-i-Martin (1997). For this problem, Ley and Steel (2009) used Bayesian model averaging, LASSO and spike-and-slab regression (also a Bayesian technique). In the table, the LASSO column shows the ordinal importance of each variable, with a dash meaning the variable was not included in the chosen model. The other columns show the posterior probability of inclusion in the model.
 
Table 2. Comparing Variable Selection Algorithms: Which Variables Appeared as Important Predictors of Economic Growth?
 

Source: Ley and Steel (2009), data from Sala-i-Martin (1997).
These methods are efficient and useful for economic research whenever you face the problem of determining which variables matter most for a particular model.
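To make the idea concrete, here is a minimal sketch of LASSO-based variable selection on hypothetical data shaped like the growth-regression problem (72 observations, 42 candidate regressors, with a "true" growth equation I simply invented); the actual exercise in Ley and Steel (2009) also involves Bayesian model averaging and spike-and-slab priors, which are not shown here:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the growth-regression setting (not the real
# Sala-i-Martin data): 72 "countries", 42 candidate regressors,
# only a handful of which actually drive "growth".
rng = np.random.default_rng(0)
n_countries, n_vars = 72, 42
X = rng.normal(size=(n_countries, n_vars))
true_coefs = np.zeros(n_vars)
true_coefs[:5] = [1.5, -1.0, 0.8, 0.6, -0.5]   # only the first 5 variables matter
growth = X @ true_coefs + rng.normal(0, 0.5, size=n_countries)

# LASSO with a cross-validated penalty: coefficients of irrelevant variables
# are shrunk exactly to zero, which is what performs the variable selection.
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_std, growth)
selected = np.flatnonzero(lasso.coef_)
print("Variables selected by LASSO:", selected)
```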

Must-have Software

So what about software and programming languages? For ML it is definitely R and Python (check the packages scikit-learn and statsmodels). For econometrics, R, Stata and EViews are the best. The last two are proprietary statistical packages and are not free, so in my opinion R is the most suitable for both purposes. For those who are interested, I highly recommend the book "An Introduction to Statistical Learning" by Gareth James et al. (https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370).

Conclusion and Further Inspiration

We can see that econometrics and ML are very closely related. However, I consider econometrics to be a subpart of ML. Other important applications of ML include:
  • Computer vision;
  • Speech recognition (e.g. Siri and "OK Google", which you all know);
  • Artificial intelligence (check the game "Just Dance" :D).
To summarize, what the econometric community can learn from the ML community:
  • Tests to avoid overfitting; 
  • Nonlinear estimations; 
  • Model averaging; 
  • Tools for manipulating big data (SQL, NoSQL databases); 
  • Computational Bayesian methods.
I am convinced that ML tools should be more widely known among young economists and researchers. I hope this post has given you some interesting ideas about what to learn to grow further in your future career.
I also want to share with you this beautiful mind map, which provides a great overview of machine learning techniques: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Grab a blanket and a cup of tea and start watching the Data Mining video lectures by Jeff Leek (https://www.youtube.com/user/jtleek2007).

I would be very happy to hear your feedback and comments. Let's share ideas!

Happy blogging!:)

