Pareto Principle in Machine Learning Development: Work Smarter, Not Harder
Be efficient when developing your machine learning models
The Pareto principle, or 80/20 rule, states that roughly 80% of the effects come from 20% of the causes. In layman's terms, 80% of what happens is driven by 20% of the reasons: a small number of inputs can have an outsized impact.
Here are a few real-life examples of the Pareto principle:
80% of a company's profit comes from 20% of its customers,
80% of the bugs come from 20% of the code,
80% of the project time is spent on 20% of the tasks,
80% of crime is caused by 20% of the population,
80% of the world's wealth is held by 20% of the population.
The list could go on, but the point is that a small number of causes often drives the majority of the effects. The Pareto principle doesn't require the split to be exactly 80/20; the idea is simply that a smaller portion accounts for the bigger portion.
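To make the idea concrete, here is a minimal sketch of how you could check a Pareto-like split yourself. It uses hypothetical, randomly generated revenue data (a skewed log-normal distribution), purely for illustration:

```python
import numpy as np

# Hypothetical example: revenue per customer drawn from a skewed
# (log-normal) distribution, purely to illustrate the 80/20 idea.
rng = np.random.default_rng(42)
revenue = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# Sort customers from highest to lowest revenue and take the cumulative share.
sorted_revenue = np.sort(revenue)[::-1]
cumulative_share = np.cumsum(sorted_revenue) / sorted_revenue.sum()

# Smallest fraction of customers that already covers 80% of total revenue.
n_top = np.searchsorted(cumulative_share, 0.80) + 1
print(f"{n_top / len(revenue):.0%} of customers generate 80% of the revenue")
```

The exact fraction depends on how skewed your data is; the point is that a small group usually dominates the total.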
So, how does the Pareto principle work in machine learning development? Let's explore it further.
Pareto Principle in ML Development
As data scientists, we are trained to develop machine learning models with the best possible performance. But we are rarely taught to develop efficiently, as many favor experimentation over strategizing.
True to the Pareto principle, 80% of the model's performance can often be achieved with 20% of the work. By prioritizing our efforts, we can develop a useful model efficiently.
Especially in business, a working model in production is better than a perfect model that never arrives. Spend your effort wisely.
Let's explore the areas within ML development that benefit from applying the Pareto principle.
Features
Many new learners' instinct when developing a machine learning model is to use every feature available without much feature selection. While it's good to experiment with every feature we have, that notion can be misleading.
In machine learning, we have a concept called the Curse of Dimensionality. It states that model performance tends to increase as features are added, up to an optimal number of features, and then diminishes beyond that point. We can see the illustration in the image below.
Using all the features in your machine learning model is not always optimal. Most of the time, a smaller number of impactful features is what you need rather than an enormous number of them: 20% of the available features may account for 80% of the model's performance.
So it would be worthwhile to focus on selecting the most significant features when building a model.
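As a rough illustration, here is a minimal sketch that compares a model trained on all 30 features of the scikit-learn breast cancer dataset with one that keeps only 6 of them (roughly 20%). The dataset and the choice of 6 features are my assumptions for the example; the exact scores will vary, but the small subset typically lands close to the full-feature score:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)  # 30 features in total

# Baseline: use all 30 features.
all_features = RandomForestClassifier(random_state=42)
print("All features :", cross_val_score(all_features, X, y, cv=5).mean())

# Keep only the 6 highest-scoring features (~20%) via an ANOVA F-test.
few_features = make_pipeline(
    SelectKBest(f_classif, k=6),
    RandomForestClassifier(random_state=42),
)
print("Top 6 features:", cross_val_score(few_features, X, y, cv=5).mean())
```

Wrapping the selector and the model in one pipeline keeps the feature selection inside each cross-validation fold, so the comparison stays fair.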
Data Cleaning
It's unusual to have a clean dataset when we start a data project. We often get data with inconsistent categories, wrong feature names, missing values, and more. In other words, we need to clean the data before proceeding further.
If we apply the Pareto principle, 80% of the data errors might come from 20% of the data. Likewise, 80% of the errors in the dataset might come from just 20% of the possible error types.
We can imagine that most of our errors are missing data, outliers, or wrong data types. These common occurrences usually make up the majority of the errors that need cleaning, so we can focus on them first before moving on to rarer kinds of errors. This lets us prioritize which features and types of analysis require the most attention.
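A quick profiling pass often shows that a handful of columns contribute most of the problems. Here is a minimal sketch, on hypothetical data I generate for the example, that ranks columns by their share of all missing values so cleaning effort can go to the worst offenders first:

```python
import numpy as np
import pandas as pd

# Hypothetical messy dataset; in practice this would be your raw data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 10)),
                  columns=[f"col_{i}" for i in range(10)])
df.loc[rng.random(1_000) < 0.40, "col_0"] = np.nan   # heavily affected column
df.loc[rng.random(1_000) < 0.02, "col_5"] = np.nan   # lightly affected column

# Share of all missing values contributed by each column, largest first.
missing_per_column = df.isna().sum().sort_values(ascending=False)
print((missing_per_column / missing_per_column.sum()).round(2))
```

The same ranking idea works for other error types, such as counting out-of-range values or unexpected categories per column.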
Speaking of data cleaning, there are various Python packages that might help you clean the data. You can read more about them in this article.
Model Tuning
Many are trapped in the notion that we need to squeeze out the highest performance possible. But, as I stated above, a working model in production is better than a perfect model that never reaches production.
Hyperparameter tuning might push the model's performance to its peak when given enough time, but it often takes too long, and the gains usually peak early before stabilising. Let's see the example in the graph below.
The time taken to train the model increases with the number of trees, but the performance peaks at a certain point. Training the model further is of little use, as the performance stays around the same numbers.
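Here is a minimal sketch of that effect, assuming a random forest on the scikit-learn breast cancer dataset (my choice for the example): the cross-validated score improves quickly with the first few trees and then barely moves, while training time keeps growing.

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Cross-validated accuracy and wall-clock time for an increasing number of trees.
for n_trees in [5, 25, 100, 500]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    start = time.perf_counter()
    score = cross_val_score(model, X, y, cv=5).mean()
    elapsed = time.perf_counter() - start
    print(f"{n_trees:>4} trees: accuracy={score:.3f}, time={elapsed:.1f}s")
```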
As the Pareto principle suggests, the earlier stages of your model development (20%) might already take your model to "almost" its best performance (80%).
Conclusion
The Pareto principle states that 80% of the effects come from 20% of the causes. In other words, most of what happens is driven by a small number of things.
The Pareto principle can be applied to machine learning development, as it helps with task prioritization. For example, the Pareto principle helps in these ML development areas:
Features
Data Cleaning
Model Tuning