Data is the lifeblood of Machine Learning (ML) projects. At the same time, the data preparation process is one of the main challenges that plague most projects. According to a recent study, data preparation tasks take more than 80% of the time spent on ML projects. Data scientists spend most of their time on data cleaning (25%), labeling (25%), augmentation (15%), aggregation (15%), and identification (5%).
Percentage of time allocated to machine learning projects (source)
This article covers the most common data preparation challenges that force data scientists and machine learning engineers to spend so much time on this stage. We’ll also look at how self-service data preparation tools can help overcome these challenges.
The data preparation process
Essentially, data preparation refers to a set of procedures that readies data to be consumed by machine learning algorithms. Here are the typical steps involved in preparing data for machine learning.
Data Preparation Process (based on Jason Brownlee’s article)
1. Data collection
This is a critical first step that involves gathering data from various sources such as databases, files, and external repositories. Before starting data collection, it’s important to articulate the problem you want to solve with an ML model. Knowing the objectives you intend to achieve with your algorithm helps determine what type of data the project requires. Importantly, a structured approach to data collection gives you a clear picture of which data is available, which is required, and which is missing.
With this clear picture, you can move to the next set of questions: if there is data that is not available but you would like to have it to achieve the established goal, is it possible to derive or simulate it?
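One way to get this picture is to compare the fields the project needs with the fields each source actually provides. The snippet below is a minimal sketch of that comparison; the field names and source names are hypothetical and serve only as an illustration.

```python
# Hypothetical field names, used only for illustration.
required_fields = {"age", "height_m", "weight_kg", "glucose_level"}

# Fields that each (hypothetical) source actually provides.
available_by_source = {
    "clinic_db": {"age", "height_m", "weight_kg"},
    "survey_csv": {"age", "smoker"},
}


def summarize_inventory(required, sources):
    """Return the fields that are covered, missing, and collected but unused."""
    available = set().union(*sources.values())
    return {
        "covered": sorted(required & available),
        "missing": sorted(required - available),
        "unused": sorted(available - required),
    }


inventory = summarize_inventory(required_fields, available_by_source)
print(inventory)
```

The "missing" entry is the list of fields to collect, derive, or simulate before modeling can start.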
Data augmentation. In some cases, data augmentation might be required to expand the size of the existing dataset without gathering more data. For example, if a dataset of images were collected, they can be augmented by rotating the original versions, cropping them differently, or altering the lighting conditions.
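As a rough illustration of these three augmentations, the sketch below treats a grayscale image as a plain list of pixel rows. This is a simplifying assumption to keep the example dependency-free; real projects would typically rely on an image-processing library, and the helper names here are hypothetical.

```python
def rotate_90(image):
    """Rotate a 2-D image (a list of pixel rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def adjust_brightness(image, factor):
    """Scale pixel intensities by factor, clamping results to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


# A tiny 3x3 grayscale "image" used only as toy input.
original = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

# Each variant is a new training example derived from the same original.
augmented = [
    rotate_90(original),
    crop(original, top=0, left=0, height=2, width=2),
    adjust_brightness(original, factor=3.0),
]
```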
Data labeling. In a supervised machine learning setup, data labeling may also be part of the data preparation process. It can be performed manually by crowd workers or automatically using specialized frameworks (e.g., Snorkel). Data labeling for machine learning is well covered in AltexSoft’s article, which discusses different approaches and tools for data annotation.
2. Data preprocessing
Since the collected data may be in an undesired format, unorganized, or extremely large, further steps are needed to enhance its quality. The three common steps for preprocessing data are formatting, cleaning, and sampling.
Formatting is required to ensure that all values within the same attribute are written consistently. For example, all phone numbers, addresses, or sums of money should follow the same format, so that a single column does not mix values like 56.02, $275, and 43 dollars, 3 cents.
Data preparation with data wrangling tool Trifacta (source)
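To make the formatting step concrete, here is a minimal sketch that converts the three money notations mentioned above into one consistent decimal format. The function name and the regular expression are illustrative assumptions and cover only these notations, not every way an amount can be written.

```python
import re
from decimal import Decimal

# Matches spelled-out amounts such as "43 dollars, 3 cents" or "12 dollars".
_WORDS = re.compile(r"(\d+)\s*dollars?(?:\D+(\d+)\s*cents?)?", re.IGNORECASE)


def parse_amount(raw):
    """Convert a money string to a Decimal with two decimal places.

    Unrecognized input raises decimal.InvalidOperation instead of being
    silently guessed.
    """
    text = raw.strip()
    match = _WORDS.fullmatch(text)
    if match:
        dollars = Decimal(match.group(1))
        cents = Decimal(match.group(2) or "0")
        return (dollars + cents / 100).quantize(Decimal("0.01"))
    # Plain numeric notations: drop the currency sign and thousands separators.
    cleaned = text.replace("$", "").replace(",", "")
    return Decimal(cleaned).quantize(Decimal("0.01"))


raw_column = ["56.02", "$275", "43 dollars, 3 cents"]
formatted = [str(parse_amount(value)) for value in raw_column]
print(formatted)
```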
Secondly, data cleaning is applied to remove messy data and manage missing values. At this step, you can remove duplicates, and even outliers if thorough exploratory data analysis backed by domain expertise shows that they are not important for the model. Data cleaning also includes filling in missing values with the mean, the most frequent value, or a dummy value (e.g., 0).
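The cleaning operations described above can be sketched with the standard library alone. The records, column names, and strategy labels below are hypothetical; dataframe libraries offer equivalent operations for larger datasets.

```python
from collections import Counter
from statistics import mean

# Toy records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "city": "Berlin"},
    {"id": 2, "age": None, "city": "Paris"},
    {"id": 2, "age": None, "city": "Paris"},  # exact duplicate of the previous row
    {"id": 3, "age": 50, "city": None},
    {"id": 4, "age": 42, "city": "Paris"},
]


def drop_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def fill_missing(records, column, strategy):
    """Replace None in one column using 'mean', 'most_frequent', or 'zero'."""
    present = [r[column] for r in records if r[column] is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


cleaned = drop_duplicates(rows)
cleaned = fill_missing(cleaned, "age", "mean")
cleaned = fill_missing(cleaned, "city", "most_frequent")
```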
Lastly, sampling might be required if you have too much data. During exploring and prototyping, a smaller representative sample can be fed into the model to save time and costs.
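A minimal sketch of this sampling step, assuming simple random sampling is representative enough for prototyping (a stratified sample may be a better choice when some classes are rare). The function name and the default seed are illustrative.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Draw a reproducible random sample containing about `fraction` of the rows."""
    rng = random.Random(seed)  # a fixed seed makes prototyping runs repeatable
    k = min(len(rows), max(1, round(len(rows) * fraction)))
    return rng.sample(rows, k)


full_dataset = list(range(1000))  # stand-in for 1,000 records
prototype_set = sample_rows(full_dataset, fraction=0.1)
```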
3. Data transformation
Also called feature engineering, this last stage in preparing data for machine learning tasks involves transforming the preprocessed data into forms that are more suitable for a specific ML algorithm. The data can be transformed through scaling, decomposition, or aggregation.
Since most datasets contain features that vary greatly in range, units, or magnitude, feeding such unscaled data into machine learning algorithms is a recipe for disaster. For example, if you predict diabetes risk from attributes such as height in meters and weight in kilograms, the latter attribute will outweigh the former simply because of its larger numerical values. Therefore, scaling is necessary to suppress this effect by normalizing all features to a similar level of magnitude.
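Two common ways to bring features to a similar magnitude are min-max scaling and standardization (z-scores). The sketch below applies both to made-up height and weight values; the numbers and function names are assumptions for illustration.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the 0..1 range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Made-up measurements on very different numeric scales.
heights_m = [1.60, 1.75, 1.90]
weights_kg = [55.0, 80.0, 105.0]

scaled_heights = min_max_scale(heights_m)
scaled_weights = min_max_scale(weights_kg)
standardized_weights = standardize(weights_kg)
```

After min-max scaling, both features span the same 0 to 1 range, so neither dominates purely because of its unit.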
Secondly, if some values in the dataset are complicated, decomposing them into various constituent parts may be more meaningful to an ML model. For example, if a date has a day component and a time component, and only the time part is relevant to the problem being addressed, splitting it can capture more specific data.
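As a small illustration of decomposition, the sketch below splits an ISO 8601 timestamp into separate date and time features. The input format and the feature names are assumptions.

```python
from datetime import datetime


def decompose_timestamp(raw):
    """Split an ISO 8601 timestamp string into separate features."""
    ts = datetime.fromisoformat(raw)
    return {
        "date": ts.date().isoformat(),
        "hour": ts.hour,
        "minute": ts.minute,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
    }


features = decompose_timestamp("2021-03-15T14:30:00")
print(features)
```

If only the time of day matters for the problem, the model can then be given `hour` and `minute` while the `date` part is dropped.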
Lastly, feature aggregation can be performed to bring related features together and decrease the dimensionality of an input set. In some cases, combining multiple features into a single feature can be more useful for an algorithm.
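For instance, in the diabetes example above, height and weight could be combined into a single body mass index (BMI) feature, computed as weight in kilograms divided by the square of height in meters. Choosing BMI is an illustrative assumption here, as are the record layout and function name.

```python
def add_bmi(record):
    """Replace the height and weight features with a single BMI feature."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    reduced = {
        name: value
        for name, value in record.items()
        if name not in ("height_m", "weight_kg")
    }
    reduced["bmi"] = round(bmi, 1)
    return reduced


patient = {"age": 52, "height_m": 1.80, "weight_kg": 81.0}
aggregated = add_bmi(patient)
print(aggregated)
```

The record goes from three input features to two, which is the dimensionality reduction the aggregation step aims for.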
Data preparation challenges
Now let’s summarize the major challenges that make the data preparation process so difficult and time-consuming.
- Lack of necessary data. Even though companies collect vast amounts of data these days, you’ll usually still be missing pieces of data required for your ML project.
- No clear picture of the data available. When you have data input coming from different databases, warehouses, and systems, it’s sometimes difficult to get a full understanding of what data is available and what is missing.
- Data that’s not ready for use. A lot of time is spent on preparing data because the raw data is typically not ready-made for analysis or consumption by algorithms. ML practitioners typically retrieve data from platforms, applications, and systems that were not developed with machine learning applications in mind. Such platforms focus on fulfilling a particular function, for example, tracking web visitors’ activities or monitoring advertisement performance. Consequently, it is difficult to use data from sources designed for business purposes and not intended for analytics.
- Incompatible data formats. Data is usually collected in different formats from different sources. Therefore, converting all the attributes into a consistent format that is suitable for a machine learning model is usually taxing and time-consuming.
- Messy data. Data scientists often spend weeks solving issues with duplicates, missing data points, errors, and inconsistencies. This is one of the major data preparation challenges for ML practitioners.
- Unbalanced data. Unbalanced datasets can significantly hinder the performance of ML models. That’s why it’s important to conduct thorough exploratory data analysis to discover such issues early and consider solutions that can help with balancing datasets (e.g., data augmentation).
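To illustrate the last point, the sketch below first checks the class distribution and then balances it with random oversampling, which simply repeats minority-class examples. Note that this is a simpler stand-in for the data augmentation mentioned above, not the same technique; the labels and function name are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Repeat randomly chosen minority-class samples until all classes match the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, label in zip(samples, labels) if label == cls]
        extra = rng.choices(pool, k=target - count)  # empty for the majority class
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy dataset: 8 negative examples and only 2 positive ones.
samples = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2

print(Counter(labels))  # inspect the imbalance before fixing it
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into evaluation data.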
Solutions to accelerate data preparation
Self-service data preparation tools assist data scientists and other ML practitioners with the most tedious stage of any ML project – data cleaning and preparation. We are lucky to have such solutions today that make this whole process much more efficient.
Providers evaluated in The Forrester Wave™: Data Preparation Solutions, Q4 2018
Here are the ways data preparation tools assist in ingesting and orchestrating the data pulled from unstructured sources:
- Combining data. Data preparation tools assist with merging data from various sources, resolving data conflicts, and providing a comprehensive view of the data you have access to.
- Exploratory data analysis. Data preparation solutions allow data monitoring in real time by assisting with data analysis and visualization. It’s especially important when you use dynamic attributes for your ML model and need to discover any anomalies in your data on the fly.
- Data preprocessing. Self-service data preparation tools help data scientists with almost all data preprocessing steps, including standardization of data formats, normalization, removing or replacing invalid and duplicate data, filling in missing values, augmenting data, reducing data noise, anonymizing data, etc.
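As a rough idea of what the "combining data" step involves under the hood, the sketch below merges records from two sources by a shared key, fills gaps from the second source, and reports conflicting values instead of overwriting them. The source names, fields, and the "primary source wins" rule are all assumptions for illustration, not how any particular tool works.

```python
def merge_sources(primary, secondary, key="customer_id"):
    """Merge two lists of records by key; primary values win, conflicts are reported."""
    merged = {row[key]: dict(row) for row in primary}
    conflicts = []
    for row in secondary:
        target = merged.setdefault(row[key], {})
        for field, value in row.items():
            if target.get(field) is None:
                target[field] = value  # fill a gap from the secondary source
            elif value is not None and value != target[field]:
                conflicts.append((row[key], field, target[field], value))
    return list(merged.values()), conflicts


# Hypothetical exports from two systems.
crm = [
    {"customer_id": 1, "email": "a@example.com", "country": None},
    {"customer_id": 2, "email": "b@example.com", "country": "DE"},
]
billing = [
    {"customer_id": 1, "email": "a@example.com", "country": "FR"},
    {"customer_id": 2, "email": "b@old.example.com", "country": "DE"},
    {"customer_id": 3, "email": "c@example.com", "country": "US"},
]

combined, conflicts = merge_sources(crm, billing)
```

The returned conflict list gives a reviewer the cases that need a decision, which is the kind of comprehensive view the tools aim to provide.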
Well-prepared data is crucial for the success of machine learning models. However, data preparation is a time-intensive and sensitive process that is full of challenges. Self-service data preparation tools have therefore been designed to make data scientists more productive and, through better-prepared data, to improve the performance of ML models.
Such tools empower practitioners to work within an easy-to-use visual application for cleaning, preparing, and deploying data using clicks, not code, without compromising on governance and security.
In the end, irrespective of the terabytes of data collected and the extent of machine learning expertise, the success of an ML algorithm is only as good as the quality of the data used.