
Master SparkML: Practical Guide for Machine Learning

By Armin Norouzi, Ph.D. (Level Up Coding)

Welcome to this introductory SparkML tutorial. The world of data is growing exponentially, and traditional data analysis tools often fall short when dealing with big data. This is where Apache Spark comes into play. With its ability to perform in-memory processing and run complex algorithms at scale, Spark is an essential tool for every data scientist and big data enthusiast.

This tutorial will demonstrate how to install and use PySpark in a Google Colab environment, load a real-world dataset of Data Science Salaries 2023, perform data preprocessing, and build machine learning models with SparkML. Whether you're a beginner stepping into the field of data science, a data analyst looking to dive deeper into big data analytics, or a seasoned data scientist wanting to harness the power of Spark for machine learning, this tutorial is designed for you. By the end of this tutorial, you will have a strong understanding of how to install and run PySpark in Google Colab, load and process data in Spark, and use SparkML for predictive modelling.

You can run this post in Google Colab using this link: colab.research.google.com

Before jumping to the main topic, here is what this post covers: installing and running PySpark in Google Colab, loading and exploring the Data Science Salaries 2023 dataset, preprocessing the data, and building and evaluating machine learning models with SparkML. Now, let's get started.

Apache Spark is an open-source, distributed computing system for big data processing and analytics. SparkML is the machine learning library that ships with Spark, providing a range of algorithms for classification, regression, clustering, collaborative filtering, and much more. SparkML was developed to address the need to process large-scale data using machine learning algorithms in a distributed environment. As datasets have continued to grow, traditional machine learning libraries like Scikit-learn, which are excellent for small to medium-sized data, may not scale effectively. SparkML, with its distributed computing capabilities, enables the processing of big data across a cluster of computers, thereby significantly speeding up the machine learning process.

At its core, SparkML works by dividing data across multiple nodes in a cluster and processing it in parallel; the partial results are then combined to produce the final output. This map-and-reduce style of parallel processing allows SparkML to handle large datasets efficiently. If you want to learn more about Spark, with some great visualizations, I suggest these two posts on towardsdatascience.com.

PySpark is the Python library for Apache Spark that allows developers to use Spark's API from Python, combining the simplicity and accessibility of Python with the power and speed of Spark. PySpark supports powerful libraries, including MLlib for machine learning, GraphX for graph processing, and Spark Streaming. It can be used to write Spark applications with Python APIs and allows data scientists to create complex data pipelines and analytics applications.

While both SparkML and Scikit-learn are powerful tools for machine learning, there are some differences between the two: Scikit-learn remains a great tool for traditional machine learning on data that fits on a single machine, while SparkML has a definite edge on big data. By using SparkML, you can leverage the power of distributed computing for machine learning tasks, making it a powerful tool in the era of big data. If you want to learn more about HDFS, I suggest this post on towardsdatascience.com.

Before starting, let's install PySpark first.
Installing PySpark on Google Colab is very simple; we can use pip install. Now that Spark is installed, let's create a SparkSession, the entry point to any Spark functionality.

We will be using the "Data Science Salaries 2023" dataset, available at: https://raw.githubusercontent.com/arminnorouzi/sparkml/main/Data/ds_salaries.csv

Let's load this CSV data into a Spark DataFrame. We will use the spark.read.csv function, passing the path to the CSV file and setting inferSchema to True so that Spark automatically detects the data type of each column. Now our data is loaded into a Spark DataFrame named df. You can display the first few records of this DataFrame using the show method, and you can view the schema (the Spark schema is the structure of the DataFrame or Dataset) using the printSchema method.

The Data Science Job Salaries dataset contains 11 columns: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, and company_size. This gives us an idea of the structure of our data, the number of records, and the types of variables we're working with. This knowledge is crucial when preparing our data for machine learning algorithms.

Exploratory Data Analysis (EDA) is an essential step before building a model. It helps us understand the dataset, brings important aspects of the data into focus, and provides valuable insights. Let's start by examining the overall summary statistics of the DataFrame, and then see how many unique values each of the categorical columns has. Understanding the number of unique values in each categorical column helps us decide whether to use one-hot encoding or other encoding techniques when preprocessing the data for our machine learning models.

Next, let's examine the correlation between the numerical variables by calculating the correlation matrix. Based on the correlation matrix, work_year has the strongest (though still moderate) relationships with salary_in_usd and remote_ratio. However, none of the variables show a strong correlation with each other, so while building the model it will be important to consider other features, such as the categorical variables.

Let's also explore some of the categorical variables in more detail by visualizing the distribution of job titles and employee residences. Note: we will use the matplotlib and seaborn libraries for the visualizations, so we first convert the Spark DataFrame to a Pandas DataFrame. As the plots show, we have far more data for the US, so let's filter the data to predict salaries for US-based employees.

After filtering, let's check the count of unique categories in the categorical features again. Restricting the data to the US leaves some redundant columns with only one unique value, so let's remove employee_residence and salary_currency (everything is now in USD), as well as salary (which is the same as salary_in_usd after filtering). Checking the counts of each column one more time confirms we are good for now; we can display the DataFrame once more and visualize the distribution of salaries.

Although this was the EDA section, we already did some data cleaning and filtering! The code sketches below walk through these setup and EDA steps.
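Here is a minimal sketch of the installation and data-loading steps described above, assuming a Google Colab environment. The original notebook passes a path to spark.read.csv, so the file is downloaded locally first here; the notebook may load it differently.

```python
# Install PySpark in a Colab notebook cell:
# !pip install pyspark

import urllib.request
from pyspark.sql import SparkSession

# Entry point to all Spark functionality
spark = SparkSession.builder.appName("DataScienceSalaries").getOrCreate()

# Fetch the CSV locally, then read it with an inferred schema
url = "https://raw.githubusercontent.com/arminnorouzi/sparkml/main/Data/ds_salaries.csv"
urllib.request.urlretrieve(url, "ds_salaries.csv")

df = spark.read.csv("ds_salaries.csv", header=True, inferSchema=True)

df.show(5)        # first few records
df.printSchema()  # column names and inferred types
```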
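A sketch of the summary statistics, unique-value counts, and pairwise correlations discussed above. The column names are taken from the ds_salaries dataset; the original post may compute the correlation matrix differently.

```python
from pyspark.sql import functions as F

# Overall summary statistics
df.describe().show()

# Number of unique values in each categorical column
cat_cols = ["experience_level", "employment_type", "job_title", "salary_currency",
            "employee_residence", "company_location", "company_size"]
df.agg(*[F.countDistinct(c).alias(c) for c in cat_cols]).show()

# Pairwise Pearson correlations between the numerical columns
num_cols = ["work_year", "salary_in_usd", "remote_ratio"]
for i, c1 in enumerate(num_cols):
    for c2 in num_cols[i + 1:]:
        print(f"corr({c1}, {c2}) = {df.stat.corr(c1, c2):.3f}")
```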
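And a sketch of the categorical-distribution plot, the US filter, and the column clean-up, using matplotlib and seaborn on small aggregates converted to Pandas. The filter on employee_residence and the plotted columns are assumptions based on the description above.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import functions as F

# Top employee residences: most records come from the US
res_counts = (df.groupBy("employee_residence").count()
                .orderBy(F.desc("count")).limit(15).toPandas())
sns.barplot(data=res_counts, x="count", y="employee_residence")
plt.title("Records per employee residence")
plt.show()

# Keep only US-based employees
df = df.filter(F.col("employee_residence") == "US")

# Re-check unique counts, then drop columns that now hold a single value
df.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).show()
df = df.drop("employee_residence", "salary_currency", "salary")

# Salary distribution of the filtered data
sns.histplot(df.select("salary_in_usd").toPandas()["salary_in_usd"], bins=30)
plt.title("US salary distribution (USD)")
plt.show()
```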
But we still need to handle duplicates and null values and create the features and labels for our modelling. Let's dive into the data preprocessing part.

Data preprocessing is a crucial step in the machine learning pipeline. It involves cleaning the data and transforming it into a format that machine learning algorithms can use. First, let's handle missing values. A PySpark DataFrame provides the na property (an instance of DataFrameNaFunctions) with many useful functions for handling missing or null data. To remove duplicates from a DataFrame in PySpark, you can use the dropDuplicates() function; if you call it without any parameters, it drops any rows that have exactly the same values in all columns. We had 1111 duplicates and removed them; duplicates like these could cause problems in modelling.

Instead of predicting the continuous variable salary_in_usd, which would make this a regression problem, we can predict a range of salaries, turning it into a classification problem. One way to do that is to convert salary_in_usd into classes based on income brackets; here we can make use of the U.S. Federal Tax Brackets and assign each record a class from 1 to 7 based on the salary_in_usd column. First, let's find the minimum and maximum salary. We can then use the 2023 single-filer tax brackets to define the classes; since the data does not reach the first and last brackets, and I also squashed brackets 2 and 3 together, the final set of classes is smaller. Now that we are done with salary_in_usd, we can safely drop it and visualize the class distribution (it should look very close to the salary histogram we saw earlier).

Great, let's convert the categorical values to numerical values. Most machine learning models require numerical input. Tree-based models can handle categorical features, but it is still safe to encode them: older versions of Spark's tree-based models, such as Decision Trees and Random Forests, did not handle categorical variables directly, whereas in Spark 3.4.0 the SparkML package supports decision trees and random forests for binary and multiclass classification and for regression, using both continuous and categorical features. Encoding the categorical features also lets us try non-tree models. To do so, we can use One-Hot Encoding or String Indexing.

The encoding code performs two steps: it first converts each categorical column into numerical indices with StringIndexer, and then one-hot encodes those indices. These transformations are combined into a pipeline to ensure that they are applied in the correct sequence; using a pipeline also keeps the code clean and helps prevent errors. Checking what happened to df, the encoding process adds a set of new columns, each suffixed with _index or _ohe. To clean up our DataFrame and get rid of the original, non-encoded categorical columns, we can drop them. In general, once you have transformed categorical variables using one-hot encoding, there is usually no need to keep the intermediate indexed columns created by StringIndexer: they are only an intermediate step that converts string categorical values into numerical ones, and they are not typically used in the final machine learning model. You can safely drop the indexed columns after the one-hot encoding to avoid redundancy and potential multicollinearity issues. The sketches below illustrate these preprocessing steps.
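A sketch of the null-handling and de-duplication step described above, using the na property and dropDuplicates():

```python
from pyspark.sql import functions as F

# Count nulls per column, then drop any rows containing null values
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()
df = df.na.drop()

# Drop exact duplicate rows (all columns identical)
before = df.count()
df = df.dropDuplicates()
print(f"Removed {before - df.count()} duplicate rows")
```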
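A sketch of building the income_bracket label with when/otherwise. The cut-offs below loosely follow the 2023 single-filer tax brackets with the lower brackets merged; the exact thresholds and number of classes used in the original notebook are not shown above, so treat these values as illustrative only.

```python
from pyspark.sql import functions as F

# Range of salaries
df.agg(F.min("salary_in_usd"), F.max("salary_in_usd")).show()

# Illustrative income brackets (hypothetical cut-offs, not the post's exact ones)
df = df.withColumn(
    "income_bracket",
    F.when(F.col("salary_in_usd") <= 95375, 1)
     .when(F.col("salary_in_usd") <= 182100, 2)
     .when(F.col("salary_in_usd") <= 231250, 3)
     .when(F.col("salary_in_usd") <= 578125, 4)
     .otherwise(5),
)

# salary_in_usd is no longer needed once the label exists
df = df.drop("salary_in_usd")
df.groupBy("income_bracket").count().orderBy("income_bracket").show()
```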
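And a sketch of the two-step encoding pipeline (StringIndexer followed by OneHotEncoder), assuming the categorical columns left after the earlier clean-up are those listed below:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

cat_cols = ["experience_level", "employment_type", "job_title",
            "company_location", "company_size"]

# Step 1: map each string category to a numerical index
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_index") for c in cat_cols]

# Step 2: one-hot encode the indices into sparse vectors
encoder = OneHotEncoder(inputCols=[f"{c}_index" for c in cat_cols],
                        outputCols=[f"{c}_ohe" for c in cat_cols])

df = Pipeline(stages=indexers + [encoder]).fit(df).transform(df)

# Keep only the encoded columns: drop the originals and the intermediate indices
df = df.drop(*cat_cols, *[f"{c}_index" for c in cat_cols])
df.show(5, truncate=False)
```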
Now, let's display the DataFrame one more time. A structure such as (3,[0],[1.0]) is the representation of a sparse vector, which PySpark uses to save memory when dealing with high-dimensional data. In Spark ML, one-hot encoding and some other feature transformations create SparseVectors, especially for categorical variables with many levels. In the SparseVector representation, the first number (3 in this example) denotes the size of the vector, the second list denotes the indices at which the vector has non-zero entries, and the third list denotes the values of those non-zero entries. So (3,[0],[1.0]) represents a vector of size 3 where the element at index 0 is 1.0 and the rest of the elements are 0; the full vector would be [1.0, 0.0, 0.0]. This representation is very memory efficient for high-dimensional sparse data (data where most values are zero), since it does not need to store any of the zeros, which becomes especially important for one-hot encoded features of high-cardinality categorical variables.

We are almost done here; let's separate the features from the label and then normalize the features. The label is income_bracket, and we create a features column that combines all the feature vectors. Now we have only two columns in our DataFrame: the first is the feature vector and the second is the label. Finally, we can normalize the feature vectors to bring them onto the same scale using StandardScaler. Let's display df one last time before starting modelling. Smooth, this is the dream of every data scientist and ML engineer in the entire universe! One last thing is to split the data into training and testing sets using randomSplit; then we can move on to modelling. That's it: a magical 80/20 split of the data for training and testing. Let's jump into modelling.

Let's start modelling our multi-class classification with our favourite models: tree-based models. We will go over the decision tree and the random forest. The modelling part is easy; the syntax is very similar to sklearn. Accuracy is the proportion of true results (both true positives and true negatives) in the population; a higher accuracy means the model correctly predicted more instances. As you can see, the Decision Tree Classifier has slightly higher accuracy than the Random Forest Classifier on the training data, but the random forest does better on the test set. The F1 score is the harmonic mean of precision and recall; a higher F1 score signifies better model performance, particularly when class imbalances exist. In general, the Decision Tree Classifier and Random Forest Classifier both show reasonable performance, with the Decision Tree performing slightly better in terms of F1 score. The Naive Bayes Classifier, however, performs poorly, with very low accuracy and F1 score. This could be because Naive Bayes assumes independence between features, which might not hold in our dataset. Here too, the Decision Tree Classifier has a higher F1 score than the Random Forest Classifier.

To improve the performance of the models, we could tune the hyperparameters, address the class imbalance, or engineer additional features. Let's do a grid search on the decision tree (it runs much faster than the random forest) to optimize our tree model further. The result is a slight increase in accuracy, but nothing notable.

Visualizing the predictions versus the actual values can help us understand how well our model performs. Here we can use a confusion matrix, a table often used to describe the performance of a classification model on a set of data for which the true values are known. Unfortunately, PySpark does not have built-in functionality to visualize these metrics, so we need to extract the prediction and label columns, convert them to a Pandas DataFrame, and use Python's popular data visualization library, Matplotlib. The modelling and evaluation steps are sketched below.
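A sketch of assembling the feature vector, scaling it, and making the 80/20 split. The post does not name the assembler explicitly, but VectorAssembler is the standard way to combine columns into a single features vector.

```python
from pyspark.ml.feature import StandardScaler, VectorAssembler

# Combine every column except the label into a single 'features' vector
feature_cols = [c for c in df.columns if c != "income_bracket"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df).select("features", "income_bracket")

# Normalize the features (withMean=False keeps the vectors sparse)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withStd=True, withMean=False)
df = scaler.fit(df).transform(df)

# 80/20 train/test split
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
```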
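A sketch of training and evaluating the three classifiers with accuracy and F1. The evaluator choice (MulticlassClassificationEvaluator) is assumed; swapping featuresCol to "scaled_features" gives the normalized variant tried at the end of the post.

```python
from pyspark.ml.classification import (DecisionTreeClassifier, NaiveBayes,
                                       RandomForestClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluators = {
    "accuracy": MulticlassClassificationEvaluator(
        labelCol="income_bracket", predictionCol="prediction", metricName="accuracy"),
    "f1": MulticlassClassificationEvaluator(
        labelCol="income_bracket", predictionCol="prediction", metricName="f1"),
}

classifiers = {
    "Decision Tree": DecisionTreeClassifier(featuresCol="features", labelCol="income_bracket"),
    "Random Forest": RandomForestClassifier(featuresCol="features", labelCol="income_bracket"),
    "Naive Bayes": NaiveBayes(featuresCol="features", labelCol="income_bracket"),
}

for name, clf in classifiers.items():
    model = clf.fit(train_df)
    for split_name, split_df in [("train", train_df), ("test", test_df)]:
        preds = model.transform(split_df)
        scores = {m: ev.evaluate(preds) for m, ev in evaluators.items()}
        print(f"{name} ({split_name}): "
              f"accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```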
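A sketch of the decision-tree grid search, here implemented with ParamGridBuilder and CrossValidator; the grid values are illustrative and not necessarily those used in the original notebook.

```python
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dt = DecisionTreeClassifier(featuresCol="features", labelCol="income_bracket")
evaluator = MulticlassClassificationEvaluator(
    labelCol="income_bracket", predictionCol="prediction", metricName="f1")

# Illustrative hyperparameter grid
param_grid = (ParamGridBuilder()
              .addGrid(dt.maxDepth, [3, 5, 10])
              .addGrid(dt.minInstancesPerNode, [1, 5, 10])
              .build())

cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)

best_preds = cv_model.bestModel.transform(test_df)
print("Tuned decision tree F1:", evaluator.evaluate(best_preds))
```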
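And a sketch of the confusion-matrix plot, pulling the predictions into Pandas and plotting with seaborn/Matplotlib. It assumes best_preds holds a fitted model's test predictions, for example the tuned tree from the grid-search sketch above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Collect actual vs. predicted labels into Pandas
pdf = best_preds.select("income_bracket", "prediction").toPandas()

# Cross-tabulate actual (rows) vs. predicted (columns) classes
cm = pd.crosstab(pdf["income_bracket"], pdf["prediction"],
                 rownames=["Actual"], colnames=["Predicted"])

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion matrix: income bracket predictions")
plt.show()
```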
Let's break down the confusion matrix above. The numbers on the diagonal of the matrix represent correct predictions, while the off-diagonal numbers represent incorrect predictions. To summarize, the model shows a good true-positive rate for income_bracket 2, with 181 instances correctly identified. There is also a significant number of false negatives for income_bracket 2: it is often misclassified as income_bracket 1, 3, and 4. The model produces many false positives for income_bracket 2, wrongly identifying instances from other classes as income_bracket 2, and income_bracket 1, 3, and 4 have low true-positive rates, indicating weak identification of these classes.

The model is biased towards predicting income_bracket 2, potentially due to class imbalance. We could use class weights, resample the data, try a different algorithm, or collect more diverse data. The practical next step is adding weights to the classes; I will ask the reader to implement that and let me know in the comments how it goes. Now, before wrapping up, let's try all of the above with the normalized features. As I don't see any improvement, I will stop here. Please let me know what we should do to improve the results, and I will keep updating this post!

In this tutorial, we explored PySpark's MLlib for predicting US employee salary brackets, starting from installing PySpark and carrying out exploratory data analysis. We transformed our dataset, implemented multiple classification models, and optimized the models using a grid search. Despite these results, it's worth noting that model performance can always be enhanced with further feature engineering, advanced models, and hyperparameter tuning. This tutorial demonstrated a blueprint for PySpark's machine learning capabilities, and we hope it encourages you to explore and experiment further. I will keep this post open-ended and open to your suggestions on improving the model.

As usual, thank you for reading my post, and I hope it was useful for you. If you enjoyed the article and would like to show your support, please consider taking the following actions:

👏 Give the story a round of applause (clap) to help it gain visibility.
📖 Follow me on Medium to access more of the content on my profile.
🔔 Subscribe to the newsletter to not miss my latest posts.
🛎 Connect with me on LinkedIn for updates.


