My Journey to Kaggle Competitions Grandmaster Status: Hyperparameter Tuning a Random Forest Classifier Model (Titanic dataset)

Picture courtesy of Kaggle.com

Hello and welcome back! This will be my final installment using Kaggle’s ever-popular Titanic dataset. Let’s hope we can make the best of it. For this installment I decided to use the features and preprocessing I used in the first Titanic dataset installment. For a refresher, my dataset looks like so:

My Titanic dataframe and the first five rows of data

Today’s focus is hyperparameter tuning. This is an important concept to understand, as it can easily mean the difference between a great model and a sub-par baseline model. When tuning your model’s parameters it is best to do some research first. Get a feel for the parameters you are working with and the common approaches taken by other data scientists, and most importantly plan, plan, plan. For my model I decided to tune n_estimators, max_depth, min_samples_split, and min_samples_leaf.

I began as any decent data scientist, and indeed computer scientist, should: by planning out a course of action. After listing each parameter of a RandomForestClassifier, I whittled the list down to the parameters I wanted to focus on tuning. I did some research to better understand these parameters and how others have tuned them. I decided to approach the problem using arrays of selected values for each parameter. I worked through one parameter at a time, fitting the model to the training data and then using the validation data to calculate an Area Under the Curve (AUC) score.

To make this process quicker and more efficient, I coded up a function that takes care of most of it. I simply call the function, passing in the parameter I want to tune and the list of values to try. The output is the maximum AUC score achieved in the tuning process and the value that reached it. The function also plots the AUC score for each tuning value.

Function used to automate the model tuning, training, and prediction process
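Since the function itself only appears as an image, here is a minimal sketch of what such a helper might look like. It is not my exact code: the name tune_parameter, the synthetic data from make_classification (standing in for the preprocessed Titanic features), and the choice to compute AUC from predicted probabilities are all assumptions made for illustration. The plotting step is left out to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic features (assumption).
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def tune_parameter(name, values, X_train, y_train, X_val, y_val):
    """Fit one forest per candidate value and score each on the validation set.

    Hypothetical helper: returns (best value, best AUC, list of all AUCs).
    """
    scores = []
    for value in values:
        # Only the parameter being tuned changes; everything else is default.
        model = RandomForestClassifier(**{name: value}, random_state=0)
        model.fit(X_train, y_train)
        # AUC is computed from the predicted probability of the positive class.
        probabilities = model.predict_proba(X_val)[:, 1]
        scores.append(roc_auc_score(y_val, probabilities))
    best_index = max(range(len(values)), key=scores.__getitem__)
    return values[best_index], scores[best_index], scores


best_value, best_auc, all_scores = tune_parameter(
    "max_depth", [1, 2, 4, 8], X_train, y_train, X_val, y_val
)
print(f"best max_depth={best_value}, validation AUC={best_auc:.3f}")
```

Tuning one parameter at a time like this ignores interactions between parameters, which is worth keeping in mind when reading the results.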

I then created my list of values for each parameter, being careful to ensure I used what I felt to be the best range.

Lists of values for each parameter I decided to work on tuning
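For readers who cannot see the image, candidate lists for the four parameters could be laid out roughly as follows. The specific numbers are placeholders chosen for illustration, not the ranges I actually used; the only hard constraints assumed here are the ones scikit-learn imposes on integer values (min_samples_split of at least 2, min_samples_leaf of at least 1).

```python
# Hypothetical candidate values for each parameter (not the article's lists).
param_values = {
    "n_estimators": [1, 2, 4, 8, 16, 32, 64, 100, 200],
    "max_depth": list(range(1, 33)),
    "min_samples_split": list(range(2, 21, 2)),
    "min_samples_leaf": list(range(1, 11)),
}

for name, values in param_values.items():
    print(f"{name}: {len(values)} candidates from {values[0]} to {values[-1]}")
```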

After passing each list through the function I had four graphs visualizing the AUC score for each tuned parameter. Using these graphs I was able to find the maximum for each parameter as well as the point where the score began to flatten out, beyond which larger values add nothing.

Graphs used to track changing AUC score based on parameter. Parameters used: min_samples_leaf(top-left), min_samples_split(top-right), max_depth(bottom-left), and n_estimators(bottom-right)
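Reading the peak and the flattening point off a graph can also be done in code. The sketch below is one possible way to do it, assuming a hypothetical helper best_and_plateau and a tolerance of 0.005 AUC for "close enough to the best". The scores in the example are made-up numbers for illustration only, not results from my model.

```python
def best_and_plateau(values, scores, tolerance=0.005):
    """Return (best value, best score, smallest value within tolerance of best).

    Hypothetical helper; assumes values are sorted in increasing order.
    """
    best_index = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_index]
    for value, score in zip(values, scores):
        # The first value that gets within tolerance marks the plateau start.
        if best_score - score <= tolerance:
            return values[best_index], best_score, value


# Made-up AUC scores for illustration only.
candidate_values = [1, 2, 4, 8, 16, 32]
example_scores = [0.70, 0.78, 0.84, 0.861, 0.863, 0.862]

best_value, best_score, plateau_value = best_and_plateau(
    candidate_values, example_scores
)
print(best_value, best_score, plateau_value)
```

Picking the plateau value rather than the exact peak usually gives a simpler model for nearly the same score.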

Wanting to see the specific values, I printed out each parameter’s best value obtained in the tuning process as well as the AUC score that was achieved. This gave me the following numbers:

Code used to print out the best value for each parameter and its corresponding AUC score (top) and the output I received (bottom)

Using these parameter values I put together my final, hyperparameter-tuned model, fit it to the training data, and ran the validation data through it to get a validation AUC score. This ended up being lower than all of my previous attempts, but not by much, and after all the most important score comes from running the test data through the model.

The code I used to create my RandomForestClassifier model using the tuned parameter values (top) and the output AUC and Accuracy Score of the validation data (bottom)
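As a rough text version of that step, the sketch below builds a forest from a dictionary of tuned values and reports validation AUC and accuracy. The parameter values are placeholders, not the ones my tuning produced, and the data is again a synthetic stand-in from make_classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic data (assumption).
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Placeholder values, NOT the tuned values from this article.
tuned_params = {
    "n_estimators": 64,
    "max_depth": 8,
    "min_samples_split": 4,
    "min_samples_leaf": 2,
}

model = RandomForestClassifier(**tuned_params, random_state=0)
model.fit(X_train, y_train)

# AUC uses positive-class probabilities; accuracy uses hard predictions.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation AUC={val_auc:.3f}, accuracy={val_accuracy:.3f}")
```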

At the last minute I decided, since I had my model built, I wanted to do a quick feature importance check to see if there might be some noise I could possibly get rid of. I then plotted the feature importance results for a more readable visualization.

The code for testing my model’s feature importance (top) and the subsequent output of feature importance results (bottom)
The code used to visualize my feature importance results (top) and the subsequent graph (bottom)
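A feature importance check of this kind can be sketched with the feature_importances_ attribute of a fitted RandomForestClassifier. The column names below are an assumed Titanic feature set (only "Age" and "Parch" are confirmed by this post), and the data is synthetic, so the ranking it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed column names; the real dataframe may differ.
feature_names = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Synthetic stand-in data with one column per assumed feature.
X, y = make_classification(
    n_samples=400, n_features=len(feature_names), random_state=0
)

model = RandomForestClassifier(n_estimators=64, random_state=0)
model.fit(X, y)

# Pair each name with its impurity-based importance, largest first.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<10}{importance:.3f}")
```

The importances sum to 1, so each number can be read as a share of the total. The ranked list can be passed to a bar chart for the visual version.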

Using these results I decided to remove “Age” and “Parch” from the data, cutting out the possible noise they might create. I recreated my model with the modified data and got a new validation AUC and Accuracy Score, slightly lower than the first but again nothing too worrisome.

The code and output of the modified data’s validation AUC and Accuracy Score
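The before-and-after comparison can be sketched as below: train once on all columns, once without "Age" and "Parch", and compare validation scores. The column names, the helper validation_scores, and the synthetic data are assumptions, so the printed numbers are not my actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed column names; only "Age" and "Parch" come from the article.
feature_names = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

X, y = make_classification(
    n_samples=400, n_features=len(feature_names), random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def validation_scores(columns):
    """Train on the given column indices and return (AUC, accuracy)."""
    model = RandomForestClassifier(n_estimators=64, random_state=0)
    model.fit(X_train[:, columns], y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val[:, columns])[:, 1])
    accuracy = accuracy_score(y_val, model.predict(X_val[:, columns]))
    return auc, accuracy


all_columns = list(range(len(feature_names)))
kept_columns = [
    i for i, name in enumerate(feature_names) if name not in {"Age", "Parch"}
]

print("all features:     AUC=%.3f accuracy=%.3f" % validation_scores(all_columns))
print("Age/Parch dropped: AUC=%.3f accuracy=%.3f" % validation_scores(kept_columns))
```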

I took the final step with this data set, ensuring the test subset was preprocessed correctly and the final dataframe was formatted to the Kaggle Gods’ contentment. I then made my final submission and crossed my fingers.

This submission score was an increase of nearly 6% over our previous submission and over 5% over our previous best submission. A score of 77% is still slightly lower than I would like, but it is reasonable and certainly evidence of the importance of hyperparameter tuning. We could probably increase this accuracy by testing out different machine learning algorithms (neural networks anyone?) and tuning their hyperparameters, but we will save all of that for future datasets. For now I can say that I am satisfied with the results, as I hope you are as well. As always, thank you again, brave reader, and I hope you join me next time when I tackle a new dataset. Keep on repping on and happy coding.

I am a Data Scientist and writer prone to excitement and passion. I look forward to a future I am able to focus those characteristics into work I love.
