
Discover how to use machine learning for software estimation

How complex is a given software development task?

Let’s find out how to use machine learning for software estimation

In this work, we will present some ideas on how to build a smart component that is able to predict the complexity of a software development task. In particular, we will try to automate the process of sizing a task based on the information that is provided as part of its title and description and also leveraging the historical data of previous estimations.

In agile development, this technique is known as story point estimation, and it differs from classic estimation techniques that predict hours: the goal is not to estimate how long a task will take, but to predict how complex the task is.

Typically, this estimation process is done by agile teams in order to decide which tasks they can commit to completing in the next sprint, generally a period of 2 weeks. Based on their experience over the past 2 or 3 sprints, they know in advance the average number of points they are able to complete, so they take that average as the threshold for the next sprint. Then, during the estimation process (generally through a fun activity like planning poker), each member of the team gives a number of points that reflects how complex they think the task is.

There is a set of tasks that the team has designated as the “base stories”: well-known tasks, already labeled with their associated complexity (that the team has agreed on), which can be used later as a baseline for comparison. After each sprint, new tasks can be added to this list. Over time, this list collects examples of tasks with different complexities that the team can use to compare against when estimating new tasks. Teams generally reach a high level of accuracy in their estimations after some time, thanks to the continuous improvement that comes from accumulating more and more estimations.

Generally, the mental process that each team member goes through when estimating a new task is:

  • Based on their previous experience, they look for similar tasks that they have done in the past.
  • They give the new task the same number of points that those similar tasks were assigned in the past.
  • If there isn’t a similar task, they start an ordered comparison from the least to the most complex base tasks. The reasoning is something like this: “Is this task more complex than this one?” If so, they move on through the set of well-known base tasks in order of increasing complexity, repeating the question until the new task falls into one of the complexity categories. By convention, if a new task looks more complex than size X but less complex than size Y (Y being the next size in order of complexity after X), the task is estimated at size Y, as sketched below.
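
As a toy sketch of that rounding-up convention (perceived_complexity is a hypothetical numeric stand-in for the estimator’s judgment, and the base sizes follow the Fibonacci-like scale discussed later):

def estimate(perceived_complexity, base_sizes=(1, 2, 3, 5, 8, 13)):
    #Walk the base sizes from least to most complex and round up to the
    #first size that the new task does not exceed
    for size in base_sizes:
        if perceived_complexity <= size:
            return size
    return base_sizes[-1]

estimate(4)  #falls between 3 and 5, so by convention it is sized 5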

If we look at this process, we can find lots of similarities with a classic machine learning problem, where we have a task, T, that improves over time with experience, E, by a performance measure, P. In our case, T is the task of estimating/predicting the complexity of a new ticket (bug, new feature, improvement, support, etc), the experience, E, is the historical data of previous estimations and the performance measure, P, is the difference between the actual level of complexity and the estimation.

In the following, we present a machine learning approach to predict the complexity of a new task, based on the historical data of previously estimated tasks. We will use the Facebook FastText tool to learn text vector representations (word and sentence embeddings) that will be implicitly used as input for a text classifier that classifies a task into three categories: easy, medium, and complex. Note that we are changing things a bit, going from a point-based estimation to a category-based estimation.

This is because, unfortunately, the estimates in our dataset were very unbalanced, and by grouping tasks into these three categories we can slightly simplify our problem. Anyway, we can think of each of these classes (easy, medium, complex) as points in a simplified version of the story point estimation process (in a real story point estimation, sizes generally follow a Fibonacci sequence: 1, 2, 3, 5, 8, 13, 21, etc., or some minor variation of it).

In the end, we will build a basic web application like the one below, that is able to use the model that we trained so you can see it in action. It will allow you to search and pick up stories from the backlog (testing set) so you can compare the team’s average estimation vs the AI estimation. Sounds cool, yeah? Well, let’s dive in!

Although it’s not required, in order to follow this work better, we recommend that you know some basic concepts around machine learning. Our free eBook, The Business Executive’s Guide to Smart Applications, can give you a quick introduction to the topic if you need it. Also, our practical machine learning tutorial, Soccer and Machine Learning: 2 hot topics for 2018, can be a good resource for learning about a typical machine learning workflow from scratch.

Having said that, now let’s start!

Preparing the data

In order to train a neural network model for text classification, we will use part of the dataset collected during the research presented in the paper A deep learning model for estimating story points, which you can download from this GitHub repository.

Let’s start by loading the appceleratorstudio dataset:

In [22]:
import pandas as pd
import numpy as np

df = pd.read_csv("appceleratorstudio.csv", usecols=['issuekey', 'title', 'description', 'storypoint'])

Since our estimations will be based on the text information provided in the title and description columns, let’s check for null or empty values in the dataset:

In [23]:
df.isnull().sum()
Out[23]:
issuekey        0
title           0
description    43
storypoint      0
dtype: int64

We can see that there are 43 entries that have null values in the description column. So, let’s remove any entry that is not complete:

In [24]:
df = df.dropna(how='any')

Now, let’s see how our data looks in the first few rows:

In [25]:
df.head()
Out[25]:
  issuekey   title                                             description                               storypoint
0 TISTUD-6   Add CA against object literals in function inv…  {html} The idea here is that if our met…  1
1 TISTUD-9   Update branding for Appcelerator plugin to App…  {html} At least fix feature icons, asso…  1
2 TISTUD-11  Create new JSON schema for SDK team              {html} Create JSON schema containing pr…  1
3 TISTUD-13  Create Project References Property Page          {html} Create property page for project…  1
4 TISTUD-16  New Desktop Project Wizard                       {html} Desktop (need to convert existin…  1

A very good approach is to take a look at the main characteristics of the data that you are going to be working on. In order to do this, we can use the describe operation available in any pandas dataframe:

In [26]:
df.storypoint.describe()
Out[26]:
count    2876.000000
mean        5.636300
std         3.309936
min         1.000000
25%         3.000000
50%         5.000000
75%         8.000000
max        40.000000
Name: storypoint, dtype: float64

As you can see above, the describe operation provided by any Pandas dataframe gives us a summary of some important properties of our data, such as:

  • the number of rows (a.k.a. observations)
  • average values
  • minimums and maximums
  • percentile values
  • standard deviation

Another good idea is to plot a histogram. A histogram can give us a good notion of the underlying data distribution. What it basically does is split the possible values/results into different bins and counts the number of occurrences (observations) where the variable under study falls into each bin. We can do this easily by using matplotlib, one of the most popular Python libraries for 2D plotting.

In [27]:
import matplotlib.pyplot as plt
 
plt.hist(df.storypoint, bins=20, alpha=0.6, color='y')
plt.title("#Items per Point")
plt.xlabel("Points")
plt.ylabel("Count")
 
plt.show()

We can easily see that the number of occurrences is not uniform throughout the different size categories (points).

Let’s see the amount of items per point:

In [28]:
df.groupby('storypoint').size()
Out[28]:
storypoint
1      148
2      112
3      571
5     1126
8      751
9        1
13     137
20      22
21       3
34       1
40       4
dtype: int64

Looking at the histogram and the number of examples per point, we can see that we have many more examples of 5 and 8 points than of the others. This is known as an imbalanced dataset, and it can be an issue in classification problems. There are different techniques to deal with imbalanced data, starting with collecting more samples of the low-frequency classes, generating new artificial entries (oversampling), or removing entries from the classes with higher frequency (downsampling).

In our case, we will start by grouping points into three different categories to reduce the imbalance.

In [29]:
df.loc[df.storypoint <= 2, 'storypoint'] = 0 #small
df.loc[(df.storypoint > 2) & (df.storypoint <= 5), 'storypoint'] = 1 #medium
df.loc[df.storypoint > 5, 'storypoint'] = 2 #big
In [30]:
df.groupby('storypoint').size()
Out[30]:
storypoint
0     260
1    1697
2     919
dtype: int64

After grouping we still have imbalanced data, so we will do a basic oversampling and downsampling. We will do that later, though, when we apply cross-validation: a technique for training and evaluating the performance of a machine learning model across different partitions of our dataset.

At this point, it’s important to note that in this work the goal is to solve a classification problem (predicting the class associated with the complexity of a task: 0-easy, 1-medium or 2-complex) instead of a regression problem (predicting a continuous real value) as in the paper A deep learning model for estimating story points.

Before we continue, let’s do some cleanup of our data. This is a common step that almost any machine learning process needs, to address the issues generally faced during the data preparation phase, such as:

  • Format and structure normalization
  • Detect and fix missing values
  • Remove duplicates
  • Normalize units
  • Validate constraints
  • Detect and remove anomalies
  • Study features importance/relevance
  • Dimensionality reduction, feature selection & extraction

For this work, most of these issues were already addressed by the authors of A deep learning model for estimating story points when collecting the dataset. Still, we will need to do some extra cleanup of the data for our purpose: removing some HTML tags as well as English stop words (words like the, this, that, etc.), because they add noise to our problem and it’s better to remove them.

In [31]:
from nltk.corpus import stopwords

#Define some known html tokens that appear in the data to be removed later
htmltokens = ['{html}', '<div>', '<pre>', '<p>', '</div>', '</pre>', '</p>']

#Build the stop word set once instead of on every word
#(requires nltk.download('stopwords') the first time)
stopwords_en = set(stopwords.words('english'))

#Clean operation
#Remove english stop words and html tokens
def cleanData(text):

    result = ''

    for w in htmltokens:
        text = text.replace(w, '')

    text_words = text.split()

    resultwords = [word for word in text_words if word not in stopwords_en]

    if len(resultwords) > 0:
        result = ' '.join(resultwords)
    else:
        print('Empty transformation for: ' + text)

    return result

#Build the "__label__<class> " prefix that FastText expects for supervised training
def formatFastTextClassifier(label):
    return "__label__" + str(label) + " "

Important: since we are removing stop words and HTML tags from our dataset, when we later want to predict on unseen data we will need to apply the same transformation before requesting the model’s prediction for that input.
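
For example, with a hypothetical unseen ticket and the cleanData helper defined above (the lower-casing matches the transformation applied in the next step):

raw_title_desc = "Update branding - {html} <p>At least fix the feature icons</p>"
model_input = cleanData(raw_title_desc.lower())
#model_input is now in the same shape as the text the model was trained on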

In order to work easily, we will also create two new columns:

  • A new column called “title_desc”, which is just the concatenation of the title and description columns
  • A second column called “label_title_desc”, which prefixes the cleaned text with the class in the specific format that FastText expects in order to recognize the labeled information

While doing this, we will also change everything to lower case to make the training phase case insensitive. These new columns will be used later for training our learning algorithms.

In [32]:
df['title_desc'] = df['title'].str.lower() + ' - ' + df['description'].str.lower()
df['label_title_desc'] = df['storypoint'].apply(lambda x: formatFastTextClassifier(x)) + df['title_desc'].apply(lambda x: cleanData(str(x)))

Finally, since we removed some incomplete entries, let’s re-index our dataset so that it has continuous indices again:

In [33]:
df = df.reset_index(drop=True)

Dealing with the imbalanced dataset – Oversampling

As you will see in the final main method, in order to deal with the imbalanced dataset we are doing a basic oversampling that simply consists of adding copies of the data points in the minority classes until reaching the number of items in the majority class.

Other, more complex oversampling techniques exist, like SMOTE, where artificial data points (called synthetic data points) are created by taking two data points in the minority class (one data point and one of its k nearest neighbors) and creating the new artificial point in the space between the two real points. If we think about this technique in a 2D scenario, the new data point is created at some random place on the line between the two points, as you can see in the image below:

SMOTE oversampling (ref: http://rikunert.com/SMOTE_explained)
If you want to try more complex techniques, you can use Python’s excellent imbalanced-learn package, which has several algorithms already implemented for you.
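
For instance, here is a minimal sketch using its RandomOverSampler (SMOTE itself needs numeric features, so for raw text the simplest option is to resample the sentences themselves; the data below is made up):

from imblearn.over_sampling import RandomOverSampler
import numpy as np

#Toy stand-ins for our cleaned sentences and their classes
sentences = ["fix login button", "create json schema", "new desktop wizard", "update branding"]
labels = [0, 1, 1, 1]

#imbalanced-learn expects a 2D feature matrix, so each sentence is wrapped in a
#one-column row; RandomOverSampler duplicates minority rows until classes are balanced
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(
    np.array(sentences).reshape(-1, 1), labels)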

Anyway, for this work we used a basic oversampling technique that creates copies of the existing data. The main reason is simplicity: dealing with artificially created synthetic data points implies finding a text representation of a sentence that maps back to each new vector, because in the end the FastText tool expects sentences as text, not embeddings. Possible workarounds exist, like approximating each synthetic point with a new sentence generated by averaging the embeddings of the words used by the k sentences nearest to that synthetic point. This could be something interesting to try, so if you do it, please let us know your results!

Note: basic random downsampling of the majority class, which is also a common and simple technique, was combined with the oversampling, but it didn’t improve the results. So, in the end, just a basic oversampling was used to minimize the effect of the imbalanced dataset.
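
For reference, a sketch of what such a basic random downsampling could look like (this is our illustration, mirroring the style of the oversampling function below, not the exact code used in the experiment):

import random
from collections import Counter

def SimpleDownSample(_xtrain, _ytrain, seed=42):
    #Randomly keep only as many samples per class as the smallest class has
    random.seed(seed)
    xtrain = list(_xtrain)
    ytrain = list(_ytrain)
    min_samples = min(Counter(ytrain).values())
    indices = list(range(len(ytrain)))
    random.shuffle(indices)
    kept_x, kept_y, seen = [], [], Counter()
    for i in indices:
        if seen[ytrain[i]] < min_samples:
            kept_x.append(xtrain[i])
            kept_y.append(ytrain[i])
            seen[ytrain[i]] += 1
    return kept_x, kept_y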

In [34]:
from collections import Counter

def SimpleOverSample(_xtrain, _ytrain):
    xtrain = list(_xtrain)
    ytrain = list(_ytrain)

    samples_counter = Counter(ytrain)
    max_samples = max(samples_counter.values())
    for sc in samples_counter:
        samples_to_add = max_samples - samples_counter[sc]
        if samples_to_add > 0:
            #select the samples of the current class to copy from
            copy_from = [xtrain[i] for i in range(len(ytrain)) if ytrain[i] == sc]
            #append cyclic copies until the class reaches the majority size
            for i in range(samples_to_add):
                xtrain.append(copy_from[i % len(copy_from)])
                ytrain.append(sc)
    return xtrain, ytrain
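
In the real flow the oversampling is applied per training fold (never to the evaluation data), but applied to the whole dataset it would look like this:

xs, ys = SimpleOverSample(df.label_title_desc, df.storypoint)
Counter(ys)  #every class now has as many samples as the majority class (1697 each)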

Before ending this section, I’d like to highlight that people sometimes underestimate the data preparation stage (mainly those who are just starting their first machine learning project), but you should know that this innocent-seeming first stage often takes more than half of the total project time (sometimes even up to 60-80%!). So keep that in mind and treat this stage with the importance it deserves.

Creating our classifier

In this work, we want to give FastText a try: a tool developed by Facebook that is an extension of the well-known Word2Vec word embedding tool previously created by a research team led by Tomas Mikolov at Google.

Without entering into much detail, we can say that embeddings are techniques that learn vector representations of words, sentences, or documents, so that the vector representations of similar and semantically related words, sentences, or documents end up close together in the high-dimensional vector space. By leveraging this characteristic of the learned vectors, we can use them as features for any kind of machine learning algorithm: to train a classifier, as input to a clustering algorithm, etc.
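
To make “close together in the vector space” concrete, the proximity of two embeddings is usually measured with the cosine of the angle between them:

import numpy as np

def cosine_similarity(u, v):
    #Close to 1 for embeddings of semantically related texts, near 0 for unrelated ones
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))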

Word embedding techniques are not new in Natural Language Processing (NLP), although in recent years new embedding techniques based on predictive neural network models have become very popular, and they have revolutionized machine learning in many domains, not just NLP. Word embeddings have started to be used in other fields like e-commerce and recommender systems, with variations known as prod2vec and meta-prod2vec, or in mobile applications like app2vec, among others. Recently, I applied different embedding techniques to create Internet Domain Name embeddings from DNS trace logs, and they proved to be a good approach for learning semantic similarities and analogy tasks between Internet Domain Names. You can see details about how I used word2vec for learning Internet Domain Names in Vector representation of Internet Domain Names using a Word Embedding technique.

Regarding FastText, its main advantage over word2vec is that it considers the subwords inside a word. Instead of treating each word as a single token, a word is split into a set of substrings called n-grams; the training phase then considers each subword of a word, and the vector representation of a word is formed by averaging the vector representations of its subwords (and the word itself). The most important parameters to tune when using FastText are minn and maxn, which define the minimum and maximum length of the n-grams used when splitting words.
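
A simplified sketch of that subword splitting (the real tool also adds boundary markers, as shown here, and hashes the n-grams into a fixed number of buckets):

def char_ngrams(word, minn=4, maxn=6):
    #FastText pads each word with boundary markers before extracting substrings
    token = "<" + word + ">"
    return [token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]

char_ngrams("wizard")  #['<wiz', 'wiza', 'izar', 'zard', 'ard>', '<wiza', ...]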

Additionally, FastText can be used in either supervised or unsupervised mode. When using FastText in supervised mode, you train a model on a specially prepared (labeled) dataset: a set of sentences, one per line, each along with a label that acts as the class the sentence belongs to. By training a FastText model in supervised mode, you can later perform classification tasks over new, unseen sentences, which is very helpful for a lot of text classification and sentiment analysis problems.
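
For illustration, here are two made-up lines in the labeled format that FastText’s supervised mode expects (the same format our formatFastTextClassifier helper produces):

__label__1 create new json schema sdk team create json schema containing ...
__label__0 update branding appcelerator plugin fix feature icons ...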

Having said this, we present a simple custom Python wrapper for the supervised mode of the native FastText interface. Although there is already a wrapper for FastText and a native module in the well-known Gensim package, neither of them includes support for the supervised mode of FastText (only the unsupervised mode). So, we decided to create a custom and very basic wrapper with the minimum that we need for our purpose, that is:

  • a constructor to create new instances of the wrapper with its own state
  • a fit method to trigger the training process by calling the executable file and passing the required parameters (*)
  • a predict method that receives an array with a list of sentences and returns another array (of the same size) with the integer predictions in {0, 1, 2} for each sentence

(*) The training process is executed with the following parameters:

  • 500 epochs (iterations over the corpus)
  • Vector size of 300 dimensions
  • minn=4 and maxn=6 (minimum and maximum numbers of n-grams respectively)
  • pretrained file used to transfer previous knowledge of the English language and domain-specific knowledge. (I tried using the pretrained vectors for the English language provided by FastText, but in the end, generating my own pretrained models using other system datasets in the same domain worked better. You can download these other datasets from the same GitHub repository in order to build your own pre-trained model.)
In [35]:
import uuid
import subprocess

class FastTextClassifier:

    def __init__(self):
        self.rand = str(uuid.uuid4())
        self.inputFileName = "issues_train_" + self.rand + ".txt"
        self.outputFileName = "supervised_classifier_model_" + self.rand
        self.testFileName = "issues_test_" + self.rand + ".txt"

    def fit(self, xtrain, ytrain):
        #Write the training sentences to disk; the "__label__<class> " prefix is
        #already embedded in xtrain (see label_title_desc), so ytrain is not re-applied
        outfile = open(self.inputFileName, mode="w", encoding="utf-8")
        for i in range(len(xtrain)):
            outfile.write(xtrain[i] + '\n')
        outfile.close()
        #Train the supervised model by calling the native fasttext executable
        p1 = subprocess.Popen(["cmd", "/C", "fasttext supervised -input " + self.inputFileName +
                               " -output " + self.outputFileName +
                               " -epoch 500 -wordNgrams 4 -dim 300 -minn 4 -maxn 6" +
                               " -pretrainedVectors pretrain_model.vec"], stdout=subprocess.PIPE)
        p1.communicate()

    def predict(self, xtest):
        #Write the sentences to classify, one per line
        outfile = open(self.testFileName, mode="w", encoding="utf-8")
        for sentence in xtest:
            outfile.write(sentence + '\n')
        outfile.close()
        #Ask the trained model for one label per line and parse "__label__<class>" back to an int
        p1 = subprocess.Popen(["cmd", "/C", "fasttext predict " + self.outputFileName + ".bin " +
                               self.testFileName], stdout=subprocess.PIPE)
        output = p1.communicate()[0].decode("utf-8").split("\n")
        return [int(label.replace("__label__", "")) for label in output if label.strip() != ""]
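
Finally, to tie the pieces together, the cross-validation loop with oversampling announced earlier could look roughly like this (a minimal sketch under our assumptions, with illustrative parameter choices, not necessarily the exact main method from the original experiment):

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

X = df.label_title_desc.values
y = df.storypoint.values

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    #Oversample the training fold only; the evaluation fold keeps its real distribution
    xtrain, ytrain = SimpleOverSample(X[train_index], y[train_index])
    clf = FastTextClassifier()
    clf.fit(xtrain, ytrain)
    #Strip the "__label__<class> " prefix so the model does not see the answer
    xtest = [s.split(' ', 1)[1] for s in X[test_index]]
    scores.append(accuracy_score(y[test_index], clf.predict(xtest)))

print("Mean accuracy across folds: %.3f" % np.mean(scores))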

