The 5 Most Common Classification Algorithms for Machine Learning

Steven Kyle
4 min read · Nov 14, 2021
Photo by gemma on Unsplash

In this blog post we will go over five of the most common classification algorithms that data analysts and data scientists like to use when building a classification model. If you are stuck on a classification problem and unsure which model to use, I hope this post offers some insight into which approaches to try. The focus here is on the ideas and concepts behind each algorithm rather than full implementations, but each section includes a short illustrative sketch and a link to the respective Python library so that you can read the documentation and implement it yourself.

Classification

So what is classification? In a general sense, classification is the act of grouping things into categories. For example, the picture above shows fruits grouped into categories by type. The same principle of grouping things into categories can be applied to all sorts of data. Data scientists often use classification algorithms to train a model that assigns data points to their respective categories. Typically, they build and train the model on a labeled training set and then evaluate it on a held-out test set.
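
To make that concrete, here is a minimal sketch of the train/test workflow using scikit-learn's built-in iris dataset (chosen here purely for illustration; any labeled dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small example labeled dataset: X holds the features, y holds the class labels
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data as a test set to evaluate the trained model later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```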

The 5 classification algorithms

— Logistic Regression
— Naive Bayes
— Decision Tree
— KNN (K-Nearest Neighbors)
— SVM (Support Vector Machines)

Logistic Regression

The first classification algorithm we will talk about is Logistic Regression. Logistic Regression models the probability of an outcome as a number between 0 and 1, where 0 means the outcome does not happen and 1 means it does. In a classification setting, 0 represents that a data point does not belong to a class and 1 represents that it does.

The independent variables in Logistic Regression can be either continuous or categorical. When a Logistic Regression model is trained, each independent variable is assigned a coefficient, and the model uses those coefficients to calculate the predicted probability of the dependent variable, which falls between 0 and 1.

Logistic Regression is often very easy to implement, so it is a simple go-to if you are unsure which algorithm to use. However, if the number of data points is smaller than the number of features (independent variables), this algorithm should be used with caution, since the model might overfit the training set.
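
Here is a minimal sketch of what that looks like in practice, using scikit-learn's built-in breast cancer dataset purely as an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example binary classification dataset (not from the original post)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_iter is raised so the solver converges on this dataset
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# predict_proba returns the modeled probability of each class, between 0 and 1
print(clf.predict_proba(X_test[:3]))
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
```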

Sklearn link: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Naive Bayes

The second algorithm we will talk about is Naive Bayes. Naive Bayes is a simple technique that calculates the probability that a data point belongs to a given category.

Naive Bayes assumes that all features are independent of one another, with no correlation between them (this is the "naive" part). Even though Naive Bayes is simple, it has historically performed very well against other algorithms, even on more complicated datasets.

Naive Bayes is advantageous in that it is simple but powerful, and it does not need a large training set to estimate the parameters of the model.
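
Here is a minimal sketch using GaussianNB, scikit-learn's Naive Bayes variant for continuous features (the library also offers MultinomialNB and BernoulliNB for count and binary data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Example dataset for illustration only
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# GaussianNB has no hyperparameters that must be tuned to get started
clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
```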

Sklearn link: https://scikit-learn.org/stable/modules/naive_bayes.html

Decision Tree

The third algorithm we will go over is the Decision Tree. A Decision Tree is essentially a flow chart, which makes it very easy to explain a model to a non-technical audience. Each split or "node" in the tree is a "test" of a feature (independent variable), and each branch from that node represents an outcome of that "test". At the end of each branch is either another "node" or a "leaf". A "leaf" represents a category label, and each data point that ends up in that leaf is assigned to the corresponding category. The picture below might clear up any confusion.

This image was taken from http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/decisionTree.html

Decision Trees are a useful algorithm to add to your data science toolkit, especially because they can be used for both classification and regression problems. They also help identify the most important variables, which gives additional insight into your dataset, especially when there are hundreds of features.
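
Here is a minimal sketch; note the feature_importances_ attribute, which is where the "most important variables" insight mentioned above comes from:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Example dataset for illustration only
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth limits how deep the tree can grow, which helps curb overfitting
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))    # accuracy on the held-out test set
print(clf.feature_importances_)     # one importance score per feature
```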

Sklearn link: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

KNN (K-Nearest Neighbors)

The fourth algorithm we will talk about is KNN, or K-Nearest Neighbors. KNN classifies a data point by finding the "k" closest points in the training data and assigning the most common label among those neighbors. Since the algorithm relies on distance to find the closest/most similar data points, it is essential that the features are normalized so that they are on a similar scale. KNN is especially useful when the dataset contains more than two classes, since the majority-vote approach extends naturally to any number of categories.
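
Because of that scaling requirement, a common pattern is to chain a scaler and the classifier together in a pipeline. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset for illustration only
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features first so no single feature dominates the distance metric
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
```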

Sklearn link: https://scikit-learn.org/stable/modules/neighbors.html

SVM (Support Vector Machines)

The fifth algorithm we will talk about is SVM, or Support Vector Machines. An SVM maps data points into space and finds the boundary that maximizes the distance (the margin) between the data from the two categories, so that when new data comes in it falls clearly on one side or the other. That boundary is a hyperplane drawn through the space to split the dataset into its respective categories. Since it assigns categories based on position in space rather than on probability, it is a non-probabilistic binary classifier.

SVM is advantageous in that it is effective in high-dimensional spaces (lots of features), even when there are more dimensions (features) than there are samples.
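
Here is a minimal sketch using scikit-learn's SVC. Like KNN, SVM is distance-based, so scaling the features first is generally a good idea:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Example binary classification dataset (not from the original post)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A linear kernel draws a flat hyperplane; kernel="rbf" allows curved boundaries
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
```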

This image was taken from https://scikit-learn.org/stable/modules/svm.html

Sklearn link: https://scikit-learn.org/stable/modules/svm.html

Conclusion

In this blog post we covered five of the most common classification algorithms. There are many, many more algorithms out there, but these five are a solid starting point for most classification problems.


Steven Kyle

25-year-old Texan in the midst of a career change into Data Science.