This post is from a series of posts around the Kaggle Titanic dataset.
To understand the problem better, we try to do some analysis on the training and test data. We’re going to be using Python’s pandas
and numpy
for handling the data.
Reading the Data
First we do some imports:
import numpy as np
import pandas as pd
from tabulate import tabulate
Then we load the data into a pandas.DataFrame
:
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
Let’s look at the dimensions we’re given:
print(data_train.columns.values)
>>
['PassengerId' 'Survived' 'Pclass' 'Name'
'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare'
'Cabin' 'Embarked']
print(data_train['PassengerId'].count())
>>
891
So we have 891 training items. Each has 11 attributes. 10 of them are predictors (PassengerId
, Pclass
, Name
, Sex
, Age
, SibSp
, Parch
, Ticket
, Fare
, Cabin
, Embarked
). And one of them is the target (Survived
). Here’s more explanation of what each attribute means taken from the problem statement:
Survival
Whether the passenger survived or not 0 = No, 1 = Yes
Pclass
Socio-Economic Status 1 = Upper, 2 = Middle, 3 = Lower
Sex
– male, female
Age
– in years
SibSp
– # of siblings / spouses aboard the Titanic
Parch
– # of parents / children aboard the Titanic
Ticket
– Ticket number
Fare
– Passenger fare
Cabin
– Cabin number
Embarked
– Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
What we’re trying to do is to build a mode M
such that:
M(PassengerId, Pclass, Name, Sex, Age, SibSp,
Parch, Ticket, Fare, Cabin, Embarked) ≈ Survived
You may already be excited to start processing the data, but we need to check if the data is complete and sane first!
Sanity Checks
We can do describe()
to know some basic stats about the numeric values:
print(tabulate(
pd.concat([data_train, data_test]).describe(),
headers='keys',
))
>>
Age Fare Parch
----- --------- --------- -----------
count 1046 1308 1309
mean 29.8811 33.2955 0.385027
std 14.4135 51.7587 0.86556
min 0.17 0 0
25% 21 7.8958 0
50% 28 14.4542 0
75% 39 31.275 0
max 80 512.329 9
>>
PassengerId Pclass SibSp
----- ------------- ----------- -----------
count 1309 1309 1309
mean 655 2.29488 0.498854
std 378.02 0.837836 1.04166
min 1 1 0
25% 328 2 0
50% 655 3 0
75% 982 3 1
max 1309 3 8
One thing we notice is that the count
statistic is not the same for all attributes. Maybe this suggests that we have some values missing? Let’s check:
print(
pd.concat([data_train, data_test])
.isnull().sum()
)
>>
Age 263
Cabin 1014
Embarked 2
Fare 1
Name 0
Parch 0
PassengerId 0
Pclass 0
Sex 0
SibSp 0
Survived 418
Ticket 0
Indeed, we are missing values for Age
, Cabin
, and Embarked
. We can either decide to:
- Ignore rows with missing values – we cannot afford to do that since the training dataset is too small.
- Fill those missing values – e.g. with the mean/mode of the corresponding column. This is not ideal for the
Age
column for example, since we would be skewing 177
values out of 891
towards a fake value (177 / 891 = 19.86%
).
- Build a model for predicting the missing value – a very viable option. But let’s keep things simple for now.
- Represent missing values in a special way – e.g. add an extra boolean column that’s
True
when the data is missing and False
when the data exists. This would work for some kinds of models but not for others.
But we’re getting ahead of ourselves here. Let’s look closer at the data first, to see which attributes are useful in predicting the passenger’s survival.
Predictors vs Target
We’re going to look at the predictors one by one (PassengerId
, Pclass
, Name
, Sex
, Age
, SibSp
, Parch
, Ticket
, Fare
, Cabin
, Embarked
) and see how much information they give us about the target (Survived
).
1. Pclass
It would make sense if passengers with a higher class were more likely to survive, but let’s see what the data says:
print(
data_train.groupby(['Pclass'])['Survived']
.mean().sort_values(ascending=False)
)
>>
Pclass
1 0.629630
2 0.472826
3 0.242363
Our assumption was correct! Class does predict survival probability.
2. Sex
Running the same code on the Sex
column gives us:
print(
data_train.groupby(['Sex'])['Survived']
.mean().sort_values(ascending=False)
)
>>
Sex
female 0.742038
male 0.188908
High correlation! We should definitely include the Sex
column in our predictors then!
Actually, let’s experiment with using Sex
as the single predictor. More specifically, we’re going to predict that every female will survive and every male will not. To compute the accuracy of that model on the training dataset, we do the following:
pd.concat([
# Prediction
data_train['Sex'].apply(
# if female, survive
lambda a: 1 if a == 'female' else 0
),
# Actual
data_train['Survived'],
], axis=1)\
.apply(lambda r: r[0] == r[1], axis=1).mean()
>>
0.786756453423
78.68%! This should be our lower bound for accuracy then. If we find a more complex model, it should give us a higher accuracy.
3. Age
Let’s see what age ranges are more likely to survive:
data_train['AgeGroup'] = data_train['Age']\
.round(decimals=-1)
print(pd.concat([
data_train.groupby(['AgeGroup'])['AgeGroup'].count(),
data_train.groupby(['AgeGroup'])['Survived'].mean(),
], axis=1))
>>
AgeGroup Survived
AgeGroup
0.0 44 0.704545
10.0 34 0.411765
20.0 223 0.354260
30.0 178 0.404494
40.0 132 0.424242
50.0 61 0.409836
60.0 34 0.352941
70.0 7 0.000000
80.0 1 1.000000
We can see a pattern where children are more likely to survive, then comes older passengers, followed by middle-aged ones.
Let’s bucket the passengers based on those finding. We can also make a special N/A
bucket for people that don’t have their age on record:
data_train['AgeGroup'] = data_train['Age'].apply(
lambda a:
'N/A' if a != a else
'0-15' if a <= 15 else
'16-30' if a <= 30 else '30+' ) print(pd.concat([ data_train.groupby(['AgeGroup'])['AgeGroup'].count(), data_train.groupby(['AgeGroup'])['Survived'].mean(), ], axis=1).sort_values(by='Survived', ascending=False)) >>
AgeGroup Survived
AgeGroup
0-15 83 0.590361
30+ 305 0.406557
16-30 326 0.358896
N/A 177 0.293785
Looking good!
4. SpSib+Parch
Let’s add the “sibling spouse” column together with the “parent child” column to make a family column and see what we get:
data_train['Relatives'] = data_train[['SibSp','Parch']]\
.sum(axis=1)
print(pd.concat([
data_train.groupby(['Relatives'])['Relatives'].count(),
data_train.groupby(['Relatives'])['Survived'].mean(),
], axis=1))
>>
Relatives Survived
Relatives
0 537 0.303538
1 161 0.552795
2 102 0.578431
3 29 0.724138
4 15 0.200000
5 22 0.136364
6 12 0.333333
7 6 0.000000
10 7 0.000000
Looks like people who are on their own are less likely to make it. Then people who have a family size of 3 to 5 are more likely to survive. Families larger than that are more likely not to survive. Let’s do some bucketing:
data_train['Relatives'] = data_train[['SibSp','Parch']]\
.sum(axis=1).apply(
lambda r:
'0' if r <= 0 else
'1-3' if r <= 3 else '4+' ) print( data_train.groupby(['Relatives'])['Survived'] .mean().sort_values(ascending=False) ) >>
Relatives
1-3 0.578767
0 0.303538
4+ 0.161290
5. Name
At a first glance, you might think it’s safe to completely ignore this column, but upon further inspection of the values:
print(data_train['Name'])
>>
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
...
we can see that titles such as “Mr.”, “Mrs.”, and “Miss.” may be useful. Let’s extract the title and see how many distinct titles we have:
data_train['Title'] = data_train['Name'] \
.str.extract(', ([A-Za-z]+)\.', expand=False)
print(
data_train.groupby(['Title'])['Title']
.count().sort_values(ascending=False)
)
>>
Title
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
Sir 1
Ms 1
Mme 1
Lady 1
Jonkheer 1
Don 1
Capt 1
Let’s only consider the more frequent titles (Mr, Miss, Mrs, and Master) and see the survival probability. Any other title not in this list will be dropped into an “Others” fallback bucket.
data_train['Title'] = data_train['Name'] \
.str.extract(', ([A-Za-z]+)\.', expand=False) \
.apply(
lambda t:
t if t in ['Master', 'Mr', 'Miss', 'Mrs'] else
'Other'
)
print(data_train.groupby(['Title'])['Survived']
.mean().sort_values(ascending=False))
>>
Title
Mrs 0.792000
Miss 0.697802
Master 0.575000
Other 0.444444
Mr 0.156673
With what we know about the Pclass
, Sex
, SibSp+Parch
, and Age
columns, this completely makes sense. Females and higher-class men are more likely to survive. Also older (married) females are more likely to survive than single (younger) ones.
6. Fare
Let’s bucket the fares and see the effect:
data_train['FareGroup'] = data_train['Fare'].apply(
lambda f:
'0-10' if f <= 10 else
'11-55' if f <= 55 else '55+' ) print(data_train.groupby(['FareGroup'])['Survived'] .mean().sort_values(ascending=False)) >>
FareGroup
55+ 0.690647
11-55 0.430288
0-10 0.199405
The more you paid, the more likely you survive.
7. Cabin
Looking at the values:
print(data_train['Cabin'])
>>
0 NaN
1 C85
2 NaN
3 C123
4 NaN
we can see that the value is very sparse, but the cabin name can give us the section of the ship the passanger is in. Let’s group passengers by their cabin name first character and see what happens:
print(
data_train.groupby(
data_train['Cabin'].str[0:1]
)['Survived'].count().sort_values(ascending=False)
)
>>
Cabin
C 59
B 47
D 33
E 32
A 15
F 13
G 4
T 1
But we did not take missing values into account. So we’ll make an extra bin for those. In that bin we can put the less frequent cabin names as well:
data_train['Section'] = data_train['Cabin'].apply(
lambda c: 'Other'
if c != c or c[0] not in ['C', 'B', 'D', 'E']
else c[0]
)
print(data_train.groupby('Section')['Survived']
.mean().sort_values(ascending=False))
>>
Section
D 0.757576
E 0.750000
B 0.744681
C 0.593220
Other 0.309722
It seems that people in cabins are more likely to survive. Let’s have two bins then: Cabin and NoCabin:
data_train['Section'] = data_train['Cabin'].apply(
lambda c: 'NoCabin' if c != c else 'Cabin'
)
print(data_train.groupby('Section')['Survived']
.mean().sort_values(ascending=False))
>>
Section
Cabin 0.666667
NoCabin 0.299854
8. Embarked
Let’s look at the embarkment port distribution:
data_train['EmbarkedClean'] = data_train['Embarked']\
.replace(np.nan, 'X')
print(
data_train.groupby(['EmbarkedClean'])['EmbarkedClean']
.count()
)
>>
EmbarkedClean
C 168
Q 77
S 644
X 2
It seems that most people embarked from 'S'
. We can use that value to fill in the two missing values.
Then let’s see how the port correlates to survival probability:
data_train['EmbarkedFilled'] = data_train['Embarked']\
.replace(np.nan, 'S')
print(
data_train.groupby(['EmbarkedFilled'])['Survived']
.mean().sort_values(ascending=False)
)
>>
EmbarkedFilled
C 0.553571
Q 0.389610
S 0.339009
This is strange. The embarkment port on its own should not affect the prediction. Here’s the Titanic route, we can see that most of the survivors are the ones who embarked from Cherbourg. Nothing special stands out about that port location to justify the number of survivors. We’ll keep it for now, but we must stay skeptical.
9. Ticket
This is probably random and probably is safe to ignore.
Next
With the insight we gained here, we can start transforming the test and training data to be able to build our model!