Use XGBoost Like a Pro
There are many ways to improve your XGBoost experience. Let's see if you already know these tips.
XGBoost, or Extreme Gradient Boosting, is a machine learning algorithm under the Gradient Boosting framework. It builds an ensemble of decision trees, training them sequentially so that each new tree corrects the errors of the previous ones to give a better prediction.
It’s a popular model because its performance is stellar compared to other models in practical applications. Besides its performance, XGBoost is also known for its fast (parallelized) training, built-in regularization to prevent overfitting, and native handling of missing and categorical data.
I enjoy XGBoost and use it in my personal and professional life. As Bojan Tunguz said on X,
“XGBoost is all you need.”
Many fancy new models exist, especially with LLMs and Generative AI becoming so popular. However, XGBoost can easily solve 90% of business problems.
While it’s easy to use, many people haven’t utilized the full capability of what XGBoost can do. That’s why I want to share my tips for getting more out of XGBoost.
💪 I assure you these tips will make you a pro at XGBoost.
So, let’s get into it!
Preparation
I assume you already have Python installed. We need the XGBoost package, plus Seaborn to load the example dataset. We can install everything with the following command:
pip install xgboost seaborn pandas scikit-learn numpy
Then, we will use the Titanic dataset for our example, keeping only a few numeric columns and dropping rows with missing data.
import xgboost as xgb
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the Titanic dataset, keep a few numeric columns, and drop rows with missing values
df = sns.load_dataset('titanic')
df = df[['survived', 'pclass', 'age', 'sibsp', 'fare']].dropna()

# 80/20 train-test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('survived', axis=1), df['survived'],
    test_size=0.2, random_state=42
)
With the packages and dataset ready, here are some pro tips for getting the most out of XGBoost.
1. Use DMatrix for Efficient Data Handling
DMatrix is the data structure XGBoost uses internally. It’s designed to handle large datasets, allowing faster computation and better memory management. That’s why it’s preferable to use DMatrix for large-scale machine learning tasks.
To convert a DataFrame into a DMatrix, call xgb.DMatrix and pass in the Pandas DataFrame.
# Wrap the train and test splits, keeping the column names as feature names
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=X_train.columns.to_list())
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=X_test.columns.to_list())
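Once built, it’s worth a quick sanity check that the DMatrix holds what we expect:

# Verify the row and feature counts of the wrapped data
print(dtrain.num_row(), dtrain.num_col())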
Training the model and getting predictions then looks much the same as it does with a Pandas DataFrame.
params = {
    'objective': 'binary:logistic',  # binary classification, outputs probabilities
    'max_depth': 4,
    'learning_rate': 0.1,
    'eval_metric': 'logloss'
}

# Train for 100 boosting rounds on the DMatrix
model = xgb.train(params, dtrain, num_boost_round=100)

y_pred = model.predict(dtest)
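One thing to keep in mind: with the binary:logistic objective, predict returns probabilities rather than class labels, so you threshold them yourself. A minimal sketch of evaluating accuracy this way:

from sklearn.metrics import accuracy_score

# Probabilities above 0.5 become class 1, the rest class 0
y_pred_labels = (y_pred > 0.5).astype(int)
print(f"Accuracy: {accuracy_score(y_test, y_pred_labels):.3f}")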
DMatrix also exposes several parameters we can tweak.
First, we can control how missing data is represented. For example, we can tell XGBoost to treat -999 as the missing-value marker.
dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999)
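To make this concrete, here is a hypothetical variant of our earlier preprocessing that keeps the rows with missing age and encodes the gaps as -999 instead of dropping them (the variable names are just for illustration; any sentinel value unused in your data works):

# Hypothetical alternative to dropna(): fill gaps with the -999 sentinel
df_raw = sns.load_dataset('titanic')[['survived', 'pclass', 'age', 'sibsp', 'fare']]
df_filled = df_raw.fillna(-999)

dtrain_full = xgb.DMatrix(
    df_filled.drop('survived', axis=1),
    label=df_filled['survived'],
    missing=-999  # XGBoost now treats -999 entries as missing
)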
We can also assign a weight to each row in the dataset, which is useful for imbalanced classification problems.
# Down-weight the majority class (label 0) relative to the minority class
weights = [0.1 if label == 0 else 1.0 for label in y_train]
dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights)
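Rather than hard-coding 0.1 and 1.0, one common approach (a sketch, not part of the original tip) is to derive the weights from inverse class frequencies so both classes contribute equally to the loss:

# Weight each row inversely to its class frequency
class_counts = y_train.value_counts()
weights = y_train.map(lambda label: len(y_train) / (2 * class_counts[label]))
dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights)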
Lastly, we can control how many CPU threads are used when building the DMatrix.
# Limit DMatrix construction to 4 threads
dtrain = xgb.DMatrix(X_train, label=y_train, nthread=4)
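If you leave nthread unset, XGBoost generally uses all available threads, so setting it explicitly mostly matters when you want to cap CPU usage on a shared machine.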
Using DMatrix, we can fine-tune the training process and squeeze out even better model performance.