Hyperparameter Tuning in XGBoost using RandomizedSearchCV
In this article, we will tackle one of the most critical and trickiest problems in the XGBoost algorithm: hyperparameter tuning. Let us first understand what hyperparameters are.
Hyperparameters are the parameters whose values control the learning process of an algorithm. They differ from algorithm to algorithm. For example, in a Random Forest, the hyperparameters include the number of decision trees and the depth of each tree, as in the sketch below.
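For instance, here is a minimal sketch (assuming scikit-learn is installed; the values are purely illustrative) of setting those two Random Forest hyperparameters explicitly:
#Setting Random Forest hyperparameters explicitly (illustrative values)
from sklearn.ensemble import RandomForestClassifier

#n_estimators = number of decision trees, max_depth = depth of each tree
rf = RandomForestClassifier(n_estimators = 200, max_depth = 8)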
What is Hyperparameter Tuning?
Hyperparameter tuning or optimization is the process of choosing the right set of hyperparameters for a Machine Learning algorithm. It is a very important task in any Machine Learning use case. Unlike model parameters, hyperparameters are not learned from the data: they have to be specified before training and remain fixed throughout a training pass.
XGBoost Algorithm
XGBoost(Extreme Gradient Boosting) is a decision-tree based Ensemble Machine Learning technique which uses a Gradient Boosting framework. Here, we create decision trees in such a way that the newly created tree depends upon the information obtained from previous tree, meaning that the trees are sequential and dependent upon each other.
Right, now let's dive into our main topic of interest.
Dataset
After importing the necessary libraries, we load the dataset. I'm using the 'Churn Modelling' dataset that is available on Kaggle. The dataset contains some information about customers of a bank, such as Customer ID, Credit Score, Tenure and so on. Here the 'Exited' column is our dependent feature and all the rest are independent features.
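For reference, the loading step might look something like this (a minimal sketch; the file name 'Churn_Modelling.csv' is an assumption based on the Kaggle dataset):
#Importing the libraries and loading the dataset (the file name is an assumption)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Churn_Modelling.csv')
df.head()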
Problem Statement
We will try to perform a classification task where in, we have to predict whether a particular customer will withdraw the bank services in the future or not, based on all the features that are present in the dataset. In this way, the bank can provide better offers to the customer in the future if he/she decides to withdraw.
EDA(Exploratory Data Analysis)
Next, we will see the correlation between various features in the dataset.
Correlation is a statistical measure which expresses the degree to which two variables are related. It ranges between -1 and +1: a value of -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear relationship.
#Plotting a correlation matrix
import seaborn as sns

plt.figure(figsize = (12,10))
c = df.corr()
sns.heatmap(c, annot = True, cmap = 'twilight')
We can visualize the correlation between features using a heatmap.
Here we can see that RowNumber and CustomerId show practically no meaningful relationship with the output feature (they are just identifiers), so, along with the non-numeric Surname column, we won't take them into consideration.
Next, we separate the independent and dependent features and store them in the X and y variables respectively.
#Getting the independent features
X = df.iloc[:,3:-1]
#Getting the dependent feature
y = df.iloc[:,-1]
Let us print the shapes of X and y.
print(X.shape)
print(y.shape)

Here is the output:
(10000, 10)
(10000,)
Feature Engineering
In our dataset, the features Geography and Gender are categorical features. So, we will have to convert them into dummy variables using the get_dummies method from Pandas, and we will set the drop_first parameter to True to avoid the Dummy Variable Trap.
#Dummy variable for 'Geography' column
geography = pd.get_dummies(X['Geography'], drop_first = True)

#Dummy variable for 'Gender' column
gender = pd.get_dummies(X['Gender'], drop_first = True)

#Dropping the original 'Geography' and 'Gender' columns
X = X.drop(['Geography','Gender'], axis = 1)

#Adding the dummy columns to the dataset
X = pd.concat([X,geography,gender], axis = 1)
X.head()
After getting the dummy columns, we drop the original categorical columns and finally we concatenate (add) the dummy columns to the dataset. The new dataset now looks something like this:
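If you print the columns of X at this point, you should see something along these lines (the exact dummy column names depend on the dataset version, so treat the listed output as an assumption):
#Checking the resulting columns
print(X.columns.tolist())
#Expected (assumed) output:
#['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
# 'IsActiveMember', 'EstimatedSalary', 'Germany', 'Spain', 'Male']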
Now comes the most important part.
We import the xgboost package. To install XGBoost, run 'pip install xgboost' in the command prompt. Then we create an instance of the XGBClassifier() class provided by XGBoost.
We will use RandomizedSearchCV for hyperparameter optimization. It samples a fixed number of parameter combinations from the ranges we provide, evaluates each one with cross-validation, and reports the combination that the XGBoost algorithm performs best with.
#Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
import xgboost
classifier = xgboost.XGBClassifier()
So, initially we create a dictionary of the parameters to be searched over. Here the keys are the parameter names and the values are lists of candidate values. RandomizedSearchCV will sample a fixed number of combinations from these lists and find the combination that gives the best score.
params = {
    "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth"        : [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight" : [1, 3, 5, 7],
    "gamma"            : [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree" : [0.3, 0.4, 0.5, 0.7]
}
Model Building
Next, we call RandomizedSearchCV() and pass the following parameters:
rs_model=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)
Here,
- classifier : the XGBClassifier instance created above
- param_distributions : the dictionary of parameters to sample from
- scoring : the metric used to evaluate each candidate (here ROC AUC)
- n_iter : the number of parameter combinations to sample
- n_jobs : the number of CPU cores to use (-1 means all cores)
- cv : the number of cross-validation folds
- verbose : controls how much progress information is printed while training
Now, let us fit the model.
#model fitting
rs_model.fit(X,y)
Ok, so our model has been fitted. Let us now look into all the parameters that have been selected by RandomizedSearchCV for the XGBClassifier. We can do this with the help of the best_estimator_ attribute.
#parameters selected
rs_model.best_estimator_
Here is the output,
We can see that a learning_rate of 0.05 was selected, similarly a max_depth of 6, and so on for the other parameters.
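If you just want the sampled parameter values and the corresponding cross-validated score, RandomizedSearchCV also exposes the standard scikit-learn attributes best_params_ and best_score_:
#Compact view of the tuned parameters and their cross-validated ROC AUC
print(rs_model.best_params_)
print(rs_model.best_score_)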
Now, as we know all the best parameters, we can simply build our final classifier model by passing all those parameters.
#Building final classifier model
classifier = xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                                   colsample_bynode=1, colsample_bytree=0.7, gamma=0.3,
                                   learning_rate=0.05, max_delta_step=0, max_depth=6,
                                   min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
                                   nthread=None, objective='binary:logistic', random_state=0,
                                   reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                                   silent=None, subsample=1, verbosity=1)
Now, we will find the cross validation score of our above model using cross_val_score.
from sklearn.model_selection import cross_val_score
score=cross_val_score(classifier,X,y,cv=10)
After running this, we will get 10 different accuracies, as we have cv = 10.
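A minimal sketch of inspecting those scores and their mean:
#Printing the 10 cross validation scores and their mean
print(score)
print(score.mean())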
Finally, taking the mean of these accuracies, we get an accuracy of about 86.74%.
And that, folks, is how we perform hyperparameter tuning in the XGBoost algorithm using RandomizedSearchCV.
Summary
In this article, we have learnt how to perform hyperparameter tuning in XGBoost. We can perform the same kind of tuning for other algorithms such as Logistic Regression, KNN, or Random Forest, as sketched below.
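For example, here is a minimal sketch (the parameter grid is purely illustrative and not from the original notebook) of applying the same RandomizedSearchCV workflow to a Random Forest:
#Applying the same tuning workflow to a Random Forest (illustrative parameter grid)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_params = {
    "n_estimators" : [100, 200, 500],
    "max_depth"    : [4, 6, 8, 10]
}

rf_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions = rf_params,
                               n_iter = 5, scoring = 'roc_auc', n_jobs = -1, cv = 5)
rf_search.fit(X, y)
print(rf_search.best_params_)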
The entire code can be found in this GitHub link. Let me know if you have any questions or comments.
Thank you for reading this article. Keep learning!
Connect with me on LinkedIn: Jayanta Kumar Pal