June 10th 2013

Various implementations can be found in Apache Mahout and Weka.

1) k-means

It is a cluster analysis method which aims to partition n observations into k clusters, with each observation belonging to the cluster with the nearest mean. This in turn partitions the data space into Voronoi cells. In a Voronoi diagram, a set of points (called seeds) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. The standard k-means algorithm is also known as Lloyd’s algorithm. It shifts the means through iterative refinement: assign each observation to its nearest mean, recompute each mean from the observations assigned to it, and repeat until the assignments stop changing. It is useful for finding natural groupings, where members of a cluster are more like each other than they are like members of a different cluster.

Ex:- Finding new customer segments.
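
The iterative refinement described above can be sketched in a few lines of Python. This is a minimal illustration, not the Mahout or Weka implementation: the function name `kmeans`, its parameters, the random choice of initial means and the toy customer data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    # Initial means: k distinct points picked at random (one common choice).
    means = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the means are stable.
        assignment = new_assignment
        # Update step: move each mean to the centroid of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # Keep the old mean if a cluster ends up empty.
                means[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return means, assignment


# Hypothetical customers as (visits per month, average basket size).
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, segments = kmeans(customers, k=2)
print(means, segments)
```

On data this well separated the first three and last three customers end up in different segments. On real data the result depends on the initial means, so it is common to run the algorithm several times and keep the best clustering.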

Java implementation of Generalized Lloyd / Linde-Buzo-Gray Algorithm

2) Singular Value Decomposition

In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. The decomposition of an m×n real or complex matrix M is a factorization of the form M = UΣV*, where U is an m×m real or complex unitary matrix, Σ is an m×n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and V* (the conjugate transpose of V) is an n×n real or complex unitary matrix. The diagonal entries Σi,i of Σ are known as the singular values of M. The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of M, respectively.
Research paper on Incremental SVD -
http://www.bradblock.com/Incremental_singular_value_decomposition_of_uncertain_data_with_missing_values.pdf
Implementation of Incremental SVD can be found in the Apache Mahout project.
Installation instructions on Hadoop cluster.
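
For a small real matrix, the singular values in the definition above can be computed directly, because they are the square roots of the eigenvalues of MᵀM. The sketch below does this for the 2×2 case only. The function name is made up for the example, and this is not how Mahout or an incremental SVD computes the decomposition.

```python
import math


def singular_values_2x2(m):
    """Singular values of a real 2x2 matrix, largest first."""
    (a, b), (c, d) = m
    # Entries of the symmetric matrix A = M^T M.
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    # Eigenvalues of A from its trace and determinant (quadratic formula).
    trace = p + r
    det = p * r - q * q
    disc = math.sqrt(max(trace * trace - 4 * det, 0.0))
    eig_hi = (trace + disc) / 2
    eig_lo = max((trace - disc) / 2, 0.0)  # Guard against rounding below zero.
    return math.sqrt(eig_hi), math.sqrt(eig_lo)


# A diagonal matrix with a negative entry: the singular values are still
# nonnegative, as the definition of Σ requires.
print(singular_values_2x2([[3, 0], [0, -2]]))  # (3.0, 2.0)
```

A handy check is that the product of the singular values equals the absolute value of the determinant of M, which is 6 here.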

3) Naive Bayes Classifier

It is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. Wikipedia has an example of a statistical classifier built by assuming a Gaussian distribution on the training set. It’s popular because it is simple to implement and widely applicable to predicting outcomes such as likely to buy / won’t buy, or high/medium/low value customer. This is the algorithm that introduced me to the ML world, through an ML contest held at IISc; the objective of the contest was to categorize tweets into the Sports or Politics category.

The event was won by Mani Kumar Adari, whose Fisher score based naive Bayes implementation, along with an SVM implementation, can be found here - http://events.csa.iisc.ernet.in/opendays2013/twitminer/mani/.
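
To make the classifier concrete, here is a minimal sketch of a plain multinomial naive Bayes text classifier with add-one smoothing, applied to a Sports/Politics toy problem. It is not the contest winner's Fisher score based approach. The function names and the four training tweets are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # log P(label) + sum of log P(word | label): the "naive" assumption
        # treats the words as independent given the label.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training tweets, for illustration only.
tweets = [
    ("great goal in the final match", "Sports"),
    ("the team won the cup", "Sports"),
    ("parliament passed the new bill", "Politics"),
    ("the minister lost the election vote", "Politics"),
]
model = train(tweets)
print(classify("who won the match", model))  # Sports
print(classify("election bill", model))      # Politics
```

Working with log probabilities rather than multiplying raw probabilities avoids numerical underflow on longer texts.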

4) Support Vector Machine

It is a supervised learning model with associated learning algorithms that analyze data and recognize patterns. It is used for classification and regression analysis. In its basic form it is a non-probabilistic binary linear classifier. Training represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. There are many variations and extensions of this model, and in turn many ways of implementing it. Implementations commonly rely on heuristics for dividing and conquering the quadratic programming (QP) problem that arises when training SVMs.
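
The widest-gap idea can be illustrated without a QP solver. The sketch below trains a linear SVM by stochastic sub-gradient descent on the regularized hinge loss, which is a different and simpler technique than the QP decomposition heuristics mentioned above. The function names, hyperparameters and toy data are assumptions made for the example.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Stochastic sub-gradient descent on lam/2 * |w|^2 + hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # Each label y must be +1 or -1.
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: the hinge term is active.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with room to spare: only the regularizer acts.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data centred on the origin.
samples = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, (3, 3)), predict(w, b, (-3, -3)))
```

The `lam` parameter trades margin width against training errors: a larger value shrinks the weights more aggressively, which favors a wider gap.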

Various Implementations -
http://www.mpi-inf.mpg.de/~mtb/svmlight/ – Relatively easier to understand and implement.
https://code.google.com/p/svmlearn/source/browse/trunk/svmlearn/src/svmlearn/SVM.java?r=6

5) Decision Trees

This is very commonly used in statistics and AI. This learning method uses a decision tree as a predictive model (also known as a classification tree or regression tree), which maps observations about an item to conclusions about the item’s target value. There are many variants and related techniques, which are increasingly being discussed in forums and implemented in projects, such as CART (Classification and Regression Tree), Random Forest and Bagging.
Implementation example - http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/example/
Weka is a popular suite used to implement this and other popular models such as k-means – Home and Sourceforge pages.
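
The core step in growing a classification tree is choosing the question to ask at each node. The sketch below picks the split with the lowest weighted Gini impurity, the criterion CART uses for classification. It finds a single split rather than a full tree, and the function names and customer data are invented for the example. It is not the jaDTi or Weka code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best split."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # A split must send rows to both sides.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical customers as (age, monthly spend) with a buy/skip outcome.
rows = [(25, 40), (45, 60), (35, 45), (30, 200), (50, 220), (40, 210)]
labels = ["skip", "skip", "skip", "buy", "buy", "buy"]
print(best_split(rows, labels))  # (1, 60, 0.0)
```

Here the tree would ask "is monthly spend <= 60?" first, since that question separates the two outcomes perfectly while no age threshold does. A full tree repeats the same search on each side of the split until the nodes are pure or a depth limit is reached.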

Sources :- http://bickson.blogspot.in/2011/06/what-are-most-widely-deployed-machine.html - Mined from Mahout user-mailing lists.
http://www.cs.uvm.edu/~icdm/algorithms/index.shtml - IEEE Conference held in 2006 on Data Mining identified 10 top algorithms.
http://www.analytics1305.com/documentation/ – Analytics 1305 Documentation.
http://jtonedm.com/2011/06/07/first-look-11ants-analytics/ - 11 algorithms implemented by 11 Ants.
http://www.oracle.com/technetwork/database/enterprise-edition/odm-techniques-algorithms-097163.html – Oracle Data Mining Techniques and Algorithms

This post first appeared on Night Without End.
