My Journey To Kaggle Competitions Grandmaster Status: Titanic Baseline

Daniel Benson
Published in Analytics Vidhya
Sep 28, 2020 · 5 min read


I began my journey where many others began theirs: testing out the limits of Kaggle notebooks using the ever-popular Titanic dataset. This dataset includes 11 base attributes, and the task is to decide how useful each one is for predicting a passenger’s survival on the infamous Titanic voyage.

The attributes provided to us are PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. This list and a quick glance at the first five rows of data can be seen in the picture below.

Attribute/column names and their first five rows
Brief description of each attribute
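For anyone following along at home, here is a minimal sketch of that first look, assuming Kaggle’s standard train.csv file name:

```python
import pandas as pd

# Load the competition's training data (standard Kaggle file name)
df = pd.read_csv("train.csv")

print(df.head())            # the first five rows shown above
print(df.columns.tolist())  # all twelve column names
```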

Right off the bat I made the decision to drop the “PassengerId”, “Name”, “Ticket”, and “Embarked” columns from the dataframe, under the assumption that they were “useless” for the goal at hand: creating a highly accurate model that predicts a given passenger’s survival status. My dataframe was thusly whittled down.

My dataframe after removing initial useless columns
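In pandas, that whittling is a single drop call; a sketch of it:

```python
# Drop the columns assumed to be useless for predicting survival
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Embarked"])
```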

At this point we can see a glaring elephant in the room (table? dataframe?) above: the all-too-yummy NaN, or empty, values. I decided to give these values a closer look across the whole dataframe.

A quick glance at the number of empty values in each column of the dataframe

I then compared these numbers to the total number of instances within my dataframe and found the following:

Percent of NaNs found in the “Age” column (top in the output) and “Cabin” column (bottom in the output)

With a scary number of values missing from the “Cabin” column, the next course of action was an easy call: drop that attribute from the dataframe.
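A sketch of that whole inspect-and-drop sequence (the percentage comments assume the standard 891-row training set):

```python
# Count the missing values in each column
print(df.isnull().sum())

# Express the two offenders as percentages of all rows
print(df["Age"].isnull().mean() * 100)    # roughly 20% missing
print(df["Cabin"].isnull().mean() * 100)  # roughly 77% missing

# Far too many holes to patch, so "Cabin" goes overboard
df = df.drop(columns=["Cabin"])
```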

After removing such a hideous beast, I decided to take a look at the dataframe’s statistical summary to get a better feel for the data’s characteristics — kind of like getting to know each other on a first date!

Statistical summary of the numerical data (top picture) and non-numerical data (bottom picture)
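Both summaries come straight from pandas’ describe method:

```python
# Summary statistics for the numerical columns
print(df.describe())

# The same for the remaining non-numerical (object) columns
print(df.describe(include=["object"]))
```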

Eyeing these tables I noticed a few things. First, there is a major outlier in the “Fare” column. Nothing to write home about immediately, but something to take note of. I also noticed the mean age of the passengers was 29.7 and the standard deviation was 14.5. This gave me the confidence to make a baseline decision: set the ever-haunting NaN values in the “Age” column to the mean of that same column, essentially padding the empty data with several instances of 29.7. I felt comfortable with this decision given how close the mean is to the 25th, 50th, and 75th percentile values, along with the fairly low standard deviation.
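The imputation itself is a one-liner:

```python
# Pad the missing ages with the column mean (roughly 29.7)
df["Age"] = df["Age"].fillna(df["Age"].mean())
```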

Next I printed out a quick pair plot (seaborn to the rescue!). This gave me an even more expansive feel for the remaining columns and how they correlated with passenger survivability. For instance, as shown below, I could see slight correlations between higher age groups and low survivability, while the rest of the age groups were fairly equivalent across the board. I also found what appeared to be a correlation between a passenger having family on the ship (sibling, spouse, parent, or child) and lower survivability. I was also able to confirm my initial suspicion about the low usability of the “Fare” column for this model predictor.

A pair plot used to visualize the relationship between each of the columns
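The pair plot takes a single seaborn call, using “Survived” as the hue so the two outcomes are overlaid in each panel:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pair plot of the remaining numerical columns, colored by survival
sns.pairplot(df, hue="Survived")
plt.show()
```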

With the information gathered thus far, I had yet to make a decision regarding the “Pclass” column. Given that both the “Survived” and “Pclass” columns are categorical, I decided to take a closer look using a bar chart. At first glance this suggested a baseline correlation between a passenger’s class and their survivability, with 1st class passengers seemingly surviving at a higher rate than their lower-class counterparts.

Passenger survival compared to passenger’s class
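One way to draw that chart with seaborn (a sketch; the exact plotting call may differ from my notebook):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Survival counts broken out by passenger class
sns.countplot(data=df, x="Pclass", hue="Survived")
plt.show()
```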

I then looked at the percentage of passengers in each class to determine how much confidence I could place in the above information, and I found that 3rd class passengers heavily outnumbered 1st and 2nd class passengers. This could cause problems, but it wasn’t enough for me to throw out the information entirely. Here I should mention that a closer look at survivability within each class could be a fair objective for future runs through the data. That being said, this is simply a quick baseline, and as such we will trudge forward with little caution.

Percent of passengers by class
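A sketch of that check, along with the “Fare” drop the pair plot argued for:

```python
# Share of passengers in each class, as a percentage
# (3rd class is roughly 55%, versus ~24% for 1st and ~21% for 2nd)
print(df["Pclass"].value_counts(normalize=True) * 100)

# The pair plot suggested "Fare" adds little, so off it goes
df = df.drop(columns=["Fare"])
```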

So, after dropping the Fare column, my final attributes had been cut down to “Survived”, “Pclass”, “Sex”, “Age”, “SibSp”, and “Parch”.

The current status of my dataframe

With my baseline cleaning and wrangling done, I split my data into training and validation subsets (and of course their target subsets, taken from the “Survived” column), using 80% of the data for training and 20% for validation. I then used sklearn’s LabelEncoder to encode my one remaining categorical column, “Sex”, into numerical form (switching “male” values to the integer 1 and “female” values to the integer 0) for use in the prediction model (after all, as we all know, prediction models can be oh-so finicky).

Use of LabelEncoder to encode categorical “Sex” values into numerical values
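A sketch of the split and the encoding (the random_state here is my own arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Separate the features from the "Survived" target
X = df.drop(columns=["Survived"])
y = df["Survived"]

# 80% for training, 20% for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# LabelEncoder assigns integers alphabetically: "female" -> 0, "male" -> 1
encoder = LabelEncoder()
X_train["Sex"] = encoder.fit_transform(X_train["Sex"])
X_val["Sex"] = encoder.transform(X_val["Sex"])
```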

Finally, I decided on a RandomForestClassifier as my baseline model, with all hyperparameters left at their default values. Using sklearn.metrics’ accuracy_score I calculated my model’s accuracy against the validation subset, coming up with roughly 82.7%.

A RandomForestClassifier model along with the validation subset accuracy score
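A sketch of the baseline model and its validation score:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Baseline model, every hyperparameter left at its default
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Accuracy against the 20% validation subset (roughly 0.827 here)
print(accuracy_score(y_val, model.predict(X_val)))
```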

To wrap up, I preprocessed the test data provided to me by Kaggle, created my test predictions, and formatted them correctly for acceptance by the Kaggle Gods. After running through Kaggle’s built-in prediction scorer, my baseline model landed at a somewhat reasonable 72.248% prediction accuracy. Obviously this number can and will be increased as I test my model for the best hyperparameter values, but that will be a goal for the next installment.
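The submission step itself isn’t pictured, but it typically looks something like this sketch, with the test-set preprocessing mirroring the training steps (the exact choices below are assumptions on my part):

```python
import pandas as pd

# Kaggle's test set lacks the "Survived" column but keeps PassengerId
test = pd.read_csv("test.csv")
passenger_ids = test["PassengerId"]

# Mirror the training preprocessing: same features, same Age imputation,
# same "Sex" encoding (reusing the encoder fit on the training data)
features = test[["Pclass", "Sex", "Age", "SibSp", "Parch"]].copy()
features["Age"] = features["Age"].fillna(features["Age"].mean())
features["Sex"] = encoder.transform(features["Sex"])

# Kaggle expects a two-column PassengerId/Survived CSV
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": model.predict(features),
})
submission.to_csv("submission.csv", index=False)
```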

Thanks again, brave reader, for joining me on my baseline jump into the world of Kaggle competitions; I certainly hope to see you here again for my next installment.


I am a Data Scientist and writer prone to excitement and passion. I look forward to a future in which I am able to focus those characteristics into work I love.