
It’s all about Assumptions, Pros & Cons

This blog covers the assumptions made by popular ML algorithms, along with their pros & cons.

source: https://media.geeksforgeeks.org/wp-content/cdn-uploads/machineLearning3.png

Before getting started, let me first explain why I came up with this blog.

There are tons of Data Science enthusiasts who are either looking for a job in Data Science or switching jobs for better opportunities. Every one of them has to go through a strict hiring process with several rounds of interviews.

And there are several basic questions that a recruiter/interviewer expects us to answer. Knowing the assumptions of the popular Machine Learning algorithms, along with their pros & cons, is one of them.

Going ahead in the blog, for each algorithm I will first introduce the assumptions it makes, followed by its pros & cons. It will take just 5 minutes of your precious time, and by the end of the blog you will surely have learned something.

I am going to cover the assumptions, pros & cons in the sequence below.

  1. K-NN (K-Nearest Neighbours)
  2. Logistic Regression
  3. Linear Regression
  4. Support Vector Machines
  5. Decision Trees
  6. Naive Bayes
  7. Random Forest (Bagging Algorithm)
  8. XGBoost (Boosting Algorithm)

1. K-NN

source: https://www.analyticsvidhya.com/wp-content/uploads/2014/10/scenario2.png

Assumptions:

  1. The data lies in a feature space, which means distances between points can be measured with metrics such as Manhattan, Euclidean, etc.
  2. Each training data point consists of a feature vector and the class label associated with it.
  3. ‘K’ is usually chosen as an odd number in the case of 2-class classification, so that ties can’t occur.

Pros:

  1. Easy to understand, implement and explain.
  2. It is a non-parametric algorithm, so it doesn’t make strict assumptions about the data.
  3. No training step is required. It uses the training data directly at prediction time, so there is no training cost, unlike algorithms that need to be trained.
  4. Since it doesn’t need to be trained, new data points can be added easily.

Cons:

  1. Inefficient and slow when the dataset is large, as the cost of calculating the distance between a new point and every training point is high.
  2. Doesn’t work well with high-dimensional data, as distances become less meaningful in higher dimensions.
  3. Sensitive to outliers; predictions are easily thrown off by them.
  4. Can’t work with missing values, so data needs to be imputed first.
  5. Needs feature scaling/normalization.
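
To make this concrete, here is a minimal K-NN sketch using scikit-learn (my own choice of library; the Iris dataset, K=5, and the Euclidean metric are illustrative choices, not prescriptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; any numeric tabular data works.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# K-NN needs feature scaling (see con #5 above).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Odd K avoids ties in 2-class problems; 5 is just a common default.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)  # "fit" only stores the data; no real training happens
print(knn.score(X_test, y_test))
```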

2. Logistic Regression

source: https://miro.medium.com/max/2400/1*RqXFpiNGwdiKBWyLJc_E7g.png

Assumptions:

  1. It assumes that there is minimal or no multicollinearity among the independent variables.
  2. It usually requires a large sample size to predict properly.
  3. It assumes the observations to be independent of each other.

Pros:

  1. Easy to interpret, implement and train. Doesn’t require too much computational power.
  2. Makes no assumptions about the class distributions.
  3. Fast in classifying unknown records.
  4. Can easily accommodate new data points.
  5. Very efficient when the classes are linearly separable.

Cons:

  1. Tries to predict precise probabilistic outcomes, which can lead to overfitting in high dimensions.
  2. Has a linear decision surface, so it can’t solve non-linear problems on its own.
  3. Tough to capture relationships more complex than linear ones.
  4. Requires little or no multicollinearity.
  5. Needs a large dataset, with sufficient training examples for every category, to make correct predictions.
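
As a quick illustration, here is a minimal Logistic Regression sketch in scikit-learn (again my choice of library; the breast-cancer dataset and C=1.0 are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative binary-classification dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)

# C controls L2 regularization strength, which curbs the overfitting noted above.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

# Coefficients are directly interpretable: one weight per feature.
print(clf.coef_[0][:5])
print(clf.score(scaler.transform(X_test), y_test))
```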

3. Linear Regression

source: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/1200px-Linear_regression.svg.png

Assumptions:

  1. There should be a linear relationship between the independent and dependent variables.
  2. There should be little or no multicollinearity.
  3. Homoscedasticity: the variance of the residuals should be the same for any value of X.

Pros:

  1. Performs very well when there is a linear relationship between the independent and dependent variable.
  2. If it overfits, the overfitting can be reduced easily with L1 or L2 regularization.

Cons:

  1. Assumes the observations are independent of each other.
  2. Assumes linearity, so it can’t capture non-linear relationships.
  3. Sensitive to outliers.
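
Here is a minimal sketch of plain Linear Regression next to its L2-regularized variant, Ridge (scikit-learn is my assumed library; the synthetic data has known coefficients so the fit can be checked by eye):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(X, y)
print(lr.coef_)  # should recover roughly [2.0, -1.0, 0.5]

# Ridge adds an L2 penalty (alpha), the easy fix for overfitting noted above.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)
```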

4. Support Vector Machines

source: https://www.researchgate.net/publication/304611323/figure/fig8/AS:668377215406089@1536364954428/Classification-of-data-by-support-vector-machine-SVM.png

Assumptions:

  1. It assumes the data is independent and identically distributed (i.i.d.).

Pros:

  1. Works really well on high dimensional data.
  2. Memory efficient.
  3. Effective in cases where the number of dimensions is greater than the number of samples.

Cons:

  1. Not suitable for large datasets.
  2. Doesn’t work well when the dataset is noisy, i.e. the target classes overlap.
  3. Slow to train.
  4. Provides no direct probabilistic explanation for its classifications.
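
A minimal SVM sketch with scikit-learn (an assumed library choice; the RBF kernel and the fairly high-dimensional synthetic data, 50 features on 300 samples, are illustrative and loosely echo pro #3):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic, fairly high-dimensional data (50 features, 300 samples).
X, y = make_classification(n_samples=300, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# RBF kernel handles non-linear boundaries; scaling matters for SVMs.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)

# SVC returns class labels, not probabilities, by default (see con #4).
print(svm.score(X_test, y_test))
```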

5. Decision Trees

source: https://miro.medium.com/max/3840/1*jojTznh4HOX_8cGw_04ODA.png

Assumptions:

  1. Initially, the whole training set is considered as the root.
  2. Records are distributed recursively on the basis of attribute values.

Pros:

  1. Compared to other algorithms, data preparation requires less time.
  2. Doesn’t require the data to be normalized.
  3. Missing values, up to an extent, don’t affect its performance much.
  4. Very intuitive, as it can be explained as a series of if-else conditions.

Cons:

  1. Training can be slow and comparatively expensive.
  2. A small change in the data can cause a considerably large change in the tree structure.
  3. Not well suited for regression tasks.
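
A minimal Decision Tree sketch with scikit-learn (an assumed library choice; max_depth=3 is illustrative). Printing the tree shows the if-else structure mentioned in the pros:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Limiting depth stabilizes the tree; unconstrained trees change a lot
# with small changes in the data (see con #2).
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)  # note: no normalization needed

# The learned tree reads as nested if-else conditions.
print(export_text(tree))
print(tree.score(X_test, y_test))
```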

6. Naive Bayes

source: https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-0.jpg

Assumptions:

  1. The biggest (and only) assumption is conditional independence: the features are assumed to be independent of each other given the class.

Pros:

  1. Gives high performance when the conditional independence assumption is satisfied.
  2. Easy to implement as only probabilities need to be calculated.
  3. Works well with high-dimensional data, such as text.
  4. Fast in real-time predictions.

Cons:

  1. Performs poorly if the conditional independence assumption does not hold.
  2. Suffers from numerical instability or underflow, because many small probabilities are multiplied together; in practice this is handled by summing log-probabilities instead.
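
To show the text-classification strength, here is a tiny Multinomial Naive Bayes sketch with scikit-learn (the library and the toy spam/ham corpus are my illustrative choices):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus; real use would involve far more data.
docs = ["free money offer now", "meeting schedule tomorrow",
        "win cash prize free", "project deadline meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Only class and per-word probabilities are estimated, hence easy to implement.
nb = MultinomialNB().fit(X, labels)

# scikit-learn works with log-probabilities internally,
# which sidesteps the underflow problem noted above.
print(nb.predict(vec.transform(["free prize now"])))  # expected: [1]
```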

7. Random Forest

source: https://www.researchgate.net/publication/316982197/figure/fig5/AS:559887665303554@1510499029585/The-structure-of-random-forest-algorithm-The-random-forest-is-composed-of-the-generated.png

Assumptions:

  1. Makes no formal distributional assumptions. Being a non-parametric model, it can handle skewed and multi-modal data.

Pros:

  1. Robust to outliers.
  2. Works well for non-linear data.
  3. Low risk of overfitting.
  4. Runs efficiently on large datasets.

Cons:

  1. Slow to train.
  2. Can be biased when dealing with categorical variables.
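
A minimal Random Forest sketch with scikit-learn (assumed library; 100 trees and the synthetic dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is trained on a bootstrap sample of the data (bagging);
# n_jobs=-1 uses all cores, which helps with the slow-training con.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))
print(rf.feature_importances_[:5])  # per-feature importance estimates
```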

8. XGBoost

source: https://miro.medium.com/max/4000/1*IWBGb4PC7F2q0fszK-Iplw.png

Assumptions:

  1. It may assume that the encoded integer values of each categorical variable have an ordinal relationship.

Pros:

  1. Can train in parallel.
  2. Can handle missing values natively.
  3. No need to scale or normalize the data.
  4. Feature importances make its predictions relatively easy to interpret.
  5. Great execution speed.

Cons:

  1. Can easily overfit if parameters are not tuned properly.
  2. Has many hyperparameters, which makes it hard to tune.
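
Finally, a minimal XGBoost sketch (assuming the xgboost Python package is installed; the injected NaNs and the hyperparameter values are illustrative):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X[::50, 0] = np.nan  # inject missing values; XGBoost handles NaNs natively
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three of the many hyperparameters that make tuning hard (see cons);
# n_jobs=-1 enables parallel tree construction.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      n_jobs=-1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```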

These are the assumptions, pros & cons of all the Machine Learning algorithms above. You can always find deeper explanations of these topics on GeeksforGeeks, Stack Exchange, Quora, Stack Overflow, etc.

But my aim in writing this blog is to help anyone preparing for an interview by gathering everything in one place as quick notes.

I hope you like my work. If you find any problems in the details above, please point them out in the comments so that I can improve them. Thanks!
