Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In the current Project, a dataset containing demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic is examined. The obvious question that will be answered by the end of the Project is which factors made people more likely to survive.

About the Data

The Dataset is a highly structured dataset consisting of the following attributes:

VARIABLE DESCRIPTIONS:

Variable Description
survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES: Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
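
As an illustration of the two Age conventions above, the following minimal sketch flags ages that look estimated. The helper name `is_estimated_age` is hypothetical and not part of the original analysis; it assumes that fractional ages below one year belong to infants and should not be treated as estimates.

```python
def is_estimated_age(age):
    """Return True for ages of the form xx.5 that are at least one year."""
    return age >= 1 and age % 1 == 0.5

# Hypothetical sample values, chosen only to exercise both conventions
sample_ages = [22.0, 0.42, 28.5, 70.5]
flags = [is_estimated_age(a) for a in sample_ages]
print(flags)
```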

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
  • Parent: Mother or Father of Passenger Aboard Titanic
  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children traveled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

Loading the Dataset and the necessary libraries

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import math
import numpy as np
from IPython.display import Image
from scipy.stats import norm
from scipy import stats
#Loading the data to a dataframe
#"titanic_original" will be the initial dataframe.
#All following dataframes will be alterations of "titanic_original"
filename = "titanic_data.csv"
titanic_original = pd.read_csv(filename)

Preparing the Data

Before trying any data cleaning, let’s visualize the Dataset and check the Data Types of its fields.

#Previewing the data
titanic_original.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

We will rename the “Embarked” column to “Port” because we want to reserve “Embarked” as a variable name for the number of passengers from a specific group that embarked (got on board) on the ship.
We will create a new DataFrame named “titanic_df” and we will use this for the rest of the analysis.

titanic_df=titanic_original.rename(columns = {'Embarked':'Port'})

Checking Data Types

#Print the data types of each column
titanic_df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Port            object
dtype: object

The Data Types are as expected, so there is no need for any corrections at this level.

Checking Completeness

Next, we will check the Dataset for any missing values.

#Count the number of values on each column
titanic_df.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Port           889
dtype: int64

There are some missing values in the Age, Cabin and Port of Embarkation variables.

We will not drop any records right now. Any drops will take place when we examine the specific factors.
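
To put numbers on the gaps, the non-null counts printed above can be turned into missing counts and shares. This is a small sketch that simply reuses the figures from the count() output; the variable names are illustrative.

```python
# Non-null counts copied from the count() output above
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Port": 889}

# Missing values per column and their share of all records
missing = {col: total_rows - n for col, n in non_null.items()}
missing_share = {col: round(m / total_rows, 3) for col, m in missing.items()}

print(missing)
print(missing_share)
```

Roughly one fifth of the Age values and more than three quarters of the Cabin values are missing, while Port lacks only two entries.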

Investigating Data Problems

Next, we will check for any surprising values in our Dataset. The expected values for each variable are listed in the following table:

Variable Expected Data
PassengerId Consecutive integers, starting from “1” and ending at “891”
Survived Integer of values “0” or “1”
Pclass Integer of values “1”, “2”, or “3”
Name (Nothing to check here)
Sex “male” or “female”
Age Min and Max values should make sense
SibSp Min and Max values should make sense
Parch Min and Max values should make sense
Ticket (Nothing to check here)
Fare Min and Max values should make sense
Cabin There should be one value per record and the values should be in the format DeckCabin# (e.g C128)
Embarked The values should be either “C”, “Q”, or “S”

PassengerId

#Calculating the min/max values, the # of values and the existence of duplicates
min_val = titanic_df["PassengerId"].min()
max_val = titanic_df["PassengerId"].max()
num_val = titanic_df["PassengerId"].count()
dup_val = titanic_df.duplicated(subset=["PassengerId"]).sum()

d = [min_val, max_val, num_val, dup_val]
i = ["Min Value", "Max Value", "Number of values", "Duplicate values"]

df = pd.DataFrame({"PassengerId":d}, index=i)
df
PassengerId
Min Value 1
Max Value 891
Number of values 891
Duplicate values 0

Since the minimum value is 1, the maximum is 891, and there are 891 entries with no duplicates, the PassengerId values are consecutive integers from 1 to 891.
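
The same conclusion can also be checked directly, without reasoning from the four summary numbers. The helper below is a hypothetical sketch (its name is not from the original notebook) that accepts any iterable of ids.

```python
def is_consecutive_from_one(ids):
    """Return True when the ids are exactly 1, 2, ..., n in some order."""
    values = sorted(ids)
    return values == list(range(1, len(values) + 1))

print(is_consecutive_from_one([3, 1, 2]))   # complete run
print(is_consecutive_from_one([1, 2, 4]))   # gap at 3
print(is_consecutive_from_one([1, 2, 2]))   # duplicate
```

With the real data, the call would presumably be is_consecutive_from_one(titanic_df["PassengerId"]).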


Survived

#Finding unique values in "Survived" column
titanic_df["Survived"].unique()
array([0, 1])

The Survived column contains the expected values.

It may sound like a good idea to turn this variable into a boolean, but leaving it as an integer will help us calculate the survival rates later.
More specifically, since “0” indicates non-survival and “1” survival, the average “Survived” of a sample (e.g. a group of passengers) equals the Survival Rate of the sample.
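
A tiny worked example of this property, using a made-up group of five passengers (the values are hypothetical, not taken from the dataset):

```python
# 1 = survived, 0 = did not survive (hypothetical group)
survived = [0, 1, 1, 0, 1]

# The mean of a 0/1 variable is the share of ones
survival_rate = sum(survived) / len(survived)
print(survival_rate)
```

Three of the five passengers survived, so the mean of the column is 3/5.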


Pclass

#Finding unique values in "Pclass" column
titanic_df["Pclass"].unique()
array([3, 1, 2])

The Pclass column contains the expected values.


Sex

#Finding unique values in "Sex" column
titanic_df["Sex"].unique()
array(['male', 'female'], dtype=object)

The Sex column contains the expected values.


Age

#Finding Min/Max values in "Age" column
min_val = titanic_df["Age"].min()
max_val = titanic_df["Age"].max()

d = [min_val, max_val]
i=["Min Value", "Max Value"]

df = pd.DataFrame({"Age":d}, index=i)
df
Age
Min Value 0.42
Max Value 80.00

The Age column contains no surprising values.


SibSp

#Finding Min/Max values in "SibSp" column
min_val = titanic_df["SibSp"].min()
max_val = titanic_df["SibSp"].max()

d = [min_val, max_val]
i=["Min Value", "Max Value"]

df = pd.DataFrame({"SibSp":d}, index=i)
df
SibSp
Min Value 0
Max Value 8

The SibSp column contains no surprising values.


Parch

#Finding Min/Max values in "Parch" column
min_val = titanic_df["Parch"].min()
max_val = titanic_df["Parch"].max()

d = [min_val, max_val]
i=["Min Value", "Max Value"]

df = pd.DataFrame({"Parch":d}, index=i)
df
Parch
Min Value 0
Max Value 6

The Parch column contains no surprising values.


Fare

#Finding Min/Max values in "Fare" column
min_val = titanic_df["Fare"].min()
max_val = titanic_df["Fare"].max()

d = [min_val, max_val]
i=["Min Value", "Max Value"]

df = pd.DataFrame({"Fare":d}, index=i)
df
Fare
Min Value 0.0000
Max Value 512.3292
#Finding the # of "0" fare records
(titanic_df["Fare"] == 0).astype(int).sum()
15

We can see that there are 15 “0” fares, which looks strange.
Let’s take a closer look at these records:

#Return all records with "0" fare
titanic_df[titanic_df["Fare"] == 0]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Port
179 180 0 3 Leonard, Mr. Lionel male 36.0 0 0 LINE 0.0 NaN S
263 264 0 1 Harrison, Mr. William male 40.0 0 0 112059 0.0 B94 S
271 272 1 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0 NaN S
277 278 0 2 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0 NaN S
302 303 0 3 Johnson, Mr. William Cahoone Jr male 19.0 0 0 LINE 0.0 NaN S
413 414 0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0 NaN S
466 467 0 2 Campbell, Mr. William male NaN 0 0 239853 0.0 NaN S
481 482 0 2 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0 NaN S
597 598 0 3 Johnson, Mr. Alfred male 49.0 0 0 LINE 0.0 NaN S
633 634 0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 NaN S
674 675 0 2 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0 NaN S
732 733 0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0 NaN S
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A36 S
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0 B102 S
822 823 0 1 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 NaN S

There are some obvious similarities.
All of them were male, embarked at Southampton, and only one survived.
These facts make them look like crew members, but further investigation in Encyclopedia Titanica reveals that they were not.
Instead, it seems that they had some relation with the White Star Line, owner of RMS Titanic.

For example,

Mr Francis Parkes was a member of Harland & Wolff : Titanic Guarantee Group, a Belfast team sent by shipbuilders Harland & Wolff to accompany the Titanic on her maiden voyage.

Research on Leonard, Mr. Lionel led me to the following reference:
“It is believed Shannon worked for the American Line and possibly held US citizenship, using the name Lionel Leonard for reasons unknown. By 1912 he was quartermaster of the SS Philadelphia but the coal strike caused scheduling problems and Philadelphia’s westbound voyage was canceled, with Andrew and several other shipmates (August Johnson, William Cahoone Jr. Johnson, Alfred John Carver, Thomas Storey and William Henry Törnquist) forced to travel aboard Titanic as passengers.” (https://www.encyclopedia-titanica.org/titanic-victim/lionel-leonard.html)

The above facts lead to the conclusion that the “zero fare” passengers had some relation with the ship owner company and were traveling for free.
(A detailed investigation of all the above names would be outside the scope of this project.)


Cabin

#Counting the number of cabins in each entry
titanic_df["Cabin"].str.split(" ", expand=True).count().rename(lambda x: x+1)
1    204
2     24
3      8
4      2
dtype: int64

We can see that (each row counts the passengers with at least that many cabins):
2 passengers have 4 registered cabins
6 passengers have 3 registered cabins (8-2)
16 passengers have 2 registered cabins (24-8)
180 passengers have 1 registered cabin (204-24)
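
Because count() is applied column by column after the split, the k-th figure is the number of passengers with at least k cabins. A short sketch of the subtraction, using the four numbers printed above, gives the number of passengers with exactly k cabins:

```python
# Passengers with at least 1, 2, 3 and 4 cabins (from the output above)
at_least = [204, 24, 8, 2]

# Exactly k cabins = (at least k) - (at least k + 1)
exactly = [a - b for a, b in zip(at_least, at_least[1:] + [0])]

print(exactly)        # passengers with exactly 1, 2, 3, 4 cabins
print(sum(exactly))   # should equal the number of non-null Cabin entries
```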

To further examine the Cabin data, we will extract them from the DataFrame.

#Extracting, removing empty and splitting entries
cabin = titanic_df["Cabin"]
cabin = cabin.dropna()
cabin = cabin.str.split(" ", expand=True)

#As an example, print the entries that have at least 3 cabins.
cabin.dropna(subset=[1,2])
0 1 2 3
27 C23 C25 C27 None
88 C23 C25 C27 None
311 B57 B59 B63 B66
341 C23 C25 C27 None
438 C23 C25 C27 None
679 B51 B53 B55 None
742 B57 B59 B63 B66
872 B51 B53 B55 None

The passengers at index 27, 88, 341 and 438 (PassengerIds 28, 89, 342 and 439) appear to occupy the same cabins.

titanic_df.loc[[27, 88, 341, 438]]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Port
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0 C23 C25 C27 S
88 89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.0 C23 C25 C27 S
341 342 1 1 Fortune, Miss. Alice Elizabeth female 24.0 3 2 19950 263.0 C23 C25 C27 S
438 439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0 C23 C25 C27 S

It seems that some families had booked more than one adjacent cabin.
We assume that there is nothing wrong with the data.


Embarked

titanic_df["Port"].unique()
array(['S', 'C', 'Q', nan], dtype=object)

We were expecting some null values, so everything looks good here.


According to the above findings, no problematic data were found, so there are no wrangling actions to perform.

Data Exploration

In the current section we will investigate the correlation of several factors with the Survival Rate.
An initial investigation can be made between the non-categorical data by computing a pairwise correlation of the columns.

#Only numeric columns can be correlated, so select them explicitly
titanic_df.select_dtypes(include="number").corr()
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000

A moderate correlation of Survived with Pclass (-0.34) and with Fare (0.26) can be seen; these are the largest values in the Survived row.
It seems that the higher the socio-economic status of the passenger, the higher the probability of surviving the accident.

Further investigation will take place.

There are also some other “secondary” correlations, not directly relevant with the answers we are looking for:

  • Negative correlation between the Age and the Passenger’s Class (the younger passengers could not afford an expensive class ticket)
  • Negative correlation between the Passenger’s Class and the Fare (the “higher” the class, the more expensive the ticket)
  • Negative correlation between the Passenger’s Age and the number of Siblings/Spouses (the older the passenger, the fewer siblings on board)
  • Positive correlation between the number of Siblings/Spouses and the number of Parents/Children (large families on board included siblings and spouses as well as parents and children)

Also, it would be useful to calculate the Survival Rate for the whole sample as a baseline.

titanic_df["Survived"].mean()
0.3838383838383838

In the analysis, we will need to group several dataframes and rename some of their columns.
The following function takes an original_df as input (titanic_df by default), drops the rows that have NaN in the given column and returns a new DataFrame (df_name) grouped by that column. It also renames some of the columns.

def grouped(column,original_df=titanic_df):
    
    a = original_df.dropna(subset=[column]).groupby(column).agg({"PassengerId" : "count", "Survived" : "sum"})
    a = a.rename(columns = {"PassengerId":"Embarked", "Survived":"Survived"})
    b = original_df.dropna(subset=[column]).groupby(column).agg({"Survived" : "mean"})
    b = b.rename(columns = {"Survived" : "Survival Rate"})
    df_name = pd.concat([a,b], axis=1)
        
    return df_name
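
For comparison, the same three columns can be produced in a single pass with pandas named aggregation. This is only a sketch: the name `grouped_compact` and the toy rows are hypothetical, and the function takes the DataFrame explicitly instead of defaulting to titanic_df.

```python
import pandas as pd

def grouped_compact(column, original_df):
    """Passenger count, survivor count and survival rate per value of `column`."""
    return (
        original_df.dropna(subset=[column])
        .groupby(column)
        .agg(**{
            "Embarked": ("PassengerId", "count"),
            "Survived": ("Survived", "sum"),
            "Survival Rate": ("Survived", "mean"),
        })
    )

# Hypothetical toy rows, including one missing value in "Sex"
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Sex": ["male", "female", "female", "male", None, "male"],
    "Survived": [0, 1, 1, 0, 1, 1],
})

summary = grouped_compact("Sex", toy)
print(summary)
```

The row with the missing Sex is dropped before grouping, which mirrors the dropna(subset=[column]) step of the function above.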

Survival Rate per Passenger’s Class

By calculating the average value of the Survived variable for each Class, we are calculating the Survival Rate of each Class.

#Create a grouped by "Pclass" DataFrame with the average "Survived"
#No need to dropna() because there are not NaN on "Pclass" or "Survived" variables
pclass_df = grouped("Pclass")
pclass_df
Embarked Survived Survival Rate
Pclass
1 216 136 0.629630
2 184 87 0.472826
3 491 119 0.242363

The passengers of the 1st and 2nd Class had a higher-than-average Survival Rate.

#putting the plotting code in a function so we can call it again in the conclusions
def conclusion1():
    
    plt.subplots(figsize = (14, 5))
    
    #Plotting the passengers distribution per Class
    plt.subplot(121)
    
    N = len(pclass_df.index)
    ind = np.arange(N)  # the x locations for the groups
    width = 0.35       # the width of the bars
    
    bar1=plt.bar(ind, pclass_df["Embarked"], width, color="#5975A4", label="Embarked")
    bar2=plt.bar(ind + width, pclass_df["Survived"], width, color='#5F9E6E', label="Survived")

    plt.xlabel("Passenger Class", fontsize=12)
    plt.ylabel("Number of Passengers", fontsize=12)
    plt.title("Passengers' Distributions per Class", fontsize=14)
    #bars are centered on their x position by default, so the group center is ind + width / 2
    plt.xticks(ind + width / 2, pclass_df.index.values)

    plt.legend(loc=2)
        
    #Plotting the Survival Rate per Class
    plt.subplot(122)

    bar3 = sns.barplot(x="Pclass", y="Survival Rate", data=pclass_df.reset_index(),color='#5975A4')

    #Adding the average Survival Rate
    plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

    plt.xlabel("Class", fontsize=12)
    plt.ylabel("Survival Rate", fontsize=12)
    plt.title("Survival Rate per Passenger's Class", fontsize=14)
    
    
conclusion1()
plt.show()

[Figure: Passengers' Distributions per Class (left); Survival Rate per Passenger's Class (right)]

It is obvious that the “higher” (smaller number) the Passenger’s Class, the higher the Survival Rate.

Survival Rate per Passenger’s Gender

“Women and children first” is a code of conduct whereby the lives of women and children are to be saved first in a life-threatening situation, typically abandoning ship, when survival resources such as lifeboats were limited. (Source: Wikipedia)

Let’s see if the women on Titanic had a higher Survival Rate than the men.

#Create a grouped by "Sex" DataFrame with the average "Survived"
#No need to dropna() because there are not NaN on "Sex" or "Survived" variables
sex_df = grouped("Sex")
sex_df
Embarked Survived Survival Rate
Sex
female 314 233 0.742038
male 577 109 0.188908
#putting the plotting code in a function so we can call it again in the conclusions
def conclusion2():
    plt.subplots(figsize = (14, 5))
    
    #Plotting the passengers distribution per Gender
    plt.subplot(121)
    
    N = len(sex_df.index)
    ind = np.arange(N)  # the x locations for the groups
    width = 0.35       # the width of the bars
    
    bar1=plt.bar(ind, sex_df["Embarked"], width, color="#5975A4", label="Embarked")
    bar2=plt.bar(ind + width, sex_df["Survived"], width, color='#5F9E6E', label="Survived")

    plt.xlabel("Gender", fontsize=12)
    plt.ylabel("Number of Passengers", fontsize=12)
    plt.title("Passengers per Gender", fontsize=14)
    plt.xticks(ind + width / 2, sex_df.index.values)

    plt.legend(loc=2)
    
    
    #Plotting the Survival Rate per Gender
    plt.subplot(122)

    bar3 = sns.barplot(x="Sex", y="Survival Rate", data=sex_df.reset_index(),color='#5975A4')

    #Adding the average Survival Rate
    plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

    plt.xlabel("Gender", fontsize=12)
    plt.ylabel("Survival Rate", fontsize=12)
    plt.title("Survival Rate per Gender", fontsize=14)

conclusion2()
plt.show()

[Figure: Passengers per Gender (left); Survival Rate per Gender (right)]

The women had almost 4 times the Survival Rate of the men (74.2% versus 18.9%).
So far, this is the most crucial factor of the Survival Rate.
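
The size of the gap can be worked out from the counts in the table above. The sketch below only reuses those four numbers:

```python
# Counts copied from the grouped-by-Sex table above
female_embarked, female_survived = 314, 233
male_embarked, male_survived = 577, 109

female_rate = female_survived / female_embarked
male_rate = male_survived / male_embarked
ratio = female_rate / male_rate

print(round(female_rate, 6), round(male_rate, 6), round(ratio, 2))
```

The ratio comes out slightly below 4.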

Survival Rate per Passenger’s Age

Investigating the second group of the “Women and children first” code of conduct, we will analyze Age as a survival factor.

Since the Age variable takes almost continuous values, grouping the DataFrame by Age would have very little practical value. A better approach is to group the passengers into “Decades”, so that each passenger is “moved” to the nearest decade.

The subsets that will be created will be (0,5),[5,15),[15,25),[25,35),[35,45),[45,55),[55,65),[65,75),[75,85).
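
The same intervals can also be expressed explicitly with pandas.cut, which avoids any dependence on rounding rules. This is an alternative sketch, not the approach used in this analysis; the sample ages are hypothetical, and the first bin is written as [0,5), which is equivalent here because no passenger has age 0.

```python
import pandas as pd

# Hypothetical ages, chosen to sit on and around the bin edges
ages = pd.Series([0.42, 4.0, 5.0, 14.9, 15.0, 24.0, 65.0, 80.0])

edges = [0, 5, 15, 25, 35, 45, 55, 65, 75, 85]
labels = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# right=False makes every bin closed on the left: [5,15), [15,25), ...
decade = pd.cut(ages, bins=edges, right=False, labels=labels).astype(int)
print(decade.tolist())
```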

#Drop the NaN values
age_df=titanic_df.dropna(subset = ["Age"])

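#A function that rounds the age to the nearest decade
#Halves are rounded up (15 -> 20, 25 -> 30) to match the intervals listed above;
#Python 3's built-in round() would instead send 25 to 20
def decade(age):
    return int(math.floor(age / 10 + 0.5) * 10)

#Applying the "decade" function to the "Age" column
Decade = age_df["Age"].apply(decade).rename("Decade")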

#Concatenate the new column to the "age_df" DataSet
dec_df = pd.concat([age_df, Decade], axis = 1)
#Create a grouped by "Decade" DataFrame with the average "Survived"
#No need to dropna() because we have already dropped the null values during the creation of the DataFrame

decade_df = grouped("Decade",dec_df)
decade_df
Embarked Survived Survival Rate
Decade
0 40 27 0.675000
10 38 18 0.473684
20 200 73 0.365000
30 201 78 0.388060
40 120 51 0.425000
50 73 30 0.410959
60 31 12 0.387097
70 10 0 0.000000
80 1 1 1.000000
#Plotting the resulting DataFrame
fig = plt.subplots(figsize = (14, 10))

plt.subplot(221)

N = len(decade_df.index)
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars
    
bar1=plt.bar(ind, decade_df["Embarked"], width, color="#5975A4", label="Embarked")
bar2=plt.bar(ind + width, decade_df["Survived"], width, color='#5F9E6E', label="Survived")

plt.xlabel("Age", fontsize=12)
plt.ylabel("Number of Passengers", fontsize=12)
plt.title("Passengers per Age", fontsize=14)
plt.xticks(ind + width / 2, decade_df.index.values)

plt.legend(loc=2)

#Survival Rate per Age
plt.subplot(222)

p = sns.barplot(x="Decade", y="Survival Rate", color='#5975A4', data=decade_df.reset_index())

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

p.set_xlabel("Age", fontsize=12)
p.set_ylabel("Survival Rate", fontsize=12)
p.set_title("Survival Rate per Passenger's Age", fontsize=14)

#Polynomial Regression
plt.subplot(223)

age_df=titanic_df.dropna(subset = ["Age"])

#An order of "4" has been selected so that the regression model will follow the histogram's trend
sns.regplot(x="Age", y="Survived", data=dec_df, order=4, y_jitter=0.01, scatter_kws={"s": 80});

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

plt.xlabel("Age", fontsize=12)
plt.ylabel("Survival Rate", fontsize=12)
plt.title("Survival Rate per Passenger's Age (regression)", fontsize=14)

plt.tight_layout()
plt.show()

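[Figure: Passengers per Age; Survival Rate per Passenger's Age; Survival Rate per Passenger's Age (regression)]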

In the above graphs, we can notice some extreme values at the right end of the scale (70: 0%, 80: 100%). Let’s dig a little further by having a closer look at the passengers of these two subsets.

#Return all the rows with "Decade" 70 or more.
dec_df.loc[dec_df["Decade"] >= 70]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Port Decade
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S 70
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C 70
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C 70
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q 70
280 281 0 3 Duane, Mr. Frank male 65.0 0 0 336439 7.7500 NaN Q 70
456 457 0 1 Millet, Mr. Francis Davis male 65.0 0 0 13509 26.5500 E38 S 70
493 494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 NaN C 70
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0 27042 30.0000 A23 S 80
672 673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 0 C.A. 24580 10.5000 NaN S 70
745 746 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 B22 S 70
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S 70

There were 10 passengers in Decade 70, none of whom survived, and only one in Decade 80, who survived.
The latter can be considered an outlier and removed from the sample.
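
An alternative to removing a single record by hand is to ignore groups that are too small to yield a meaningful rate. The sketch below reuses the counts from the Decade table above; the threshold `MIN_GROUP_SIZE` is a hypothetical choice for illustration, not a value used in this analysis.

```python
# Decade -> (embarked, survived), copied from the table above
decade_counts = {
    0: (40, 27), 10: (38, 18), 20: (200, 73), 30: (201, 78), 40: (120, 51),
    50: (73, 30), 60: (31, 12), 70: (10, 0), 80: (1, 1),
}

MIN_GROUP_SIZE = 5  # hypothetical cut-off

# Keep the survival rate only for groups with enough passengers
rates = {
    dec: survived / embarked
    for dec, (embarked, survived) in decade_counts.items()
    if embarked >= MIN_GROUP_SIZE
}
print(sorted(rates))
```

With this cut-off the one-passenger Decade 80 group is excluded, while Decade 70 (10 passengers) is kept.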

dec_df = dec_df.drop(630)
decade_df = grouped("Decade",dec_df)
decade_df
Embarked Survived Survival Rate
Decade
0 40 27 0.675000
10 38 18 0.473684
20 200 73 0.365000
30 201 78 0.388060
40 120 51 0.425000
50 73 30 0.410959
60 31 12 0.387097
70 10 0 0.000000
#putting the plotting code in a function so we can call it again in the conclusions
def conclusion3():

    fig = plt.subplots(figsize = (14, 10))

    plt.subplot(221)

    N = len(decade_df.index)
    ind = np.arange(N)  # the x locations for the groups
    width = 0.35       # the width of the bars

    bar1=plt.bar(ind, decade_df["Embarked"], width, color="#5975A4", label="Embarked")
    bar2=plt.bar(ind + width, decade_df["Survived"], width, color='#5F9E6E', label="Survived")

    plt.xlabel("Age", fontsize=12)
    plt.ylabel("Number of Passengers", fontsize=12)
    plt.title("Passengers per Age", fontsize=14)
    plt.xticks(ind + width / 2, decade_df.index.values)

    plt.legend(loc=2)

    #Survival Rate per Age
    plt.subplot(222)

    p = sns.barplot(x="Decade", y="Survival Rate", color='#5975A4', data=decade_df.reset_index())

    #Adding the average Survival Rate
    plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

    p.set_xlabel("Age", fontsize=12)
    p.set_ylabel("Survival Rate", fontsize=12)
    p.set_title("Survival Rate per Passenger's Age", fontsize=14)

    #Polynomial Regression
    plt.subplot(223)

    age_df=titanic_df.dropna(subset = ["Age"])

    #An order of "3" has been selected so that the regression model will follow the histogram's trend
    sns.regplot(x="Age", y="Survived", data=dec_df, order=3, y_jitter=0.01, scatter_kws={"s": 80});

    #Adding the average Survival Rate
    plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

    plt.xlabel("Age", fontsize=12)
    plt.ylabel("Survival Rate", fontsize=12)
    plt.title("Survival Rate per Passenger's Age", fontsize=14)

    plt.tight_layout()

conclusion3()
plt.show()

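[Figure: Passengers per Age; Survival Rate per Passenger's Age (bar chart and regression), with the age-80 passenger removed]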

From the above results, we can conclude that Age was a crucial factor for the survival of the passengers, with children under 15 having the greatest probability of surviving.

The rest of the variables are not obviously expected to affect the Survival Rate, but let’s continue the analysis in case there are connections our intuition cannot spot.

Survival Rate per Number of Siblings/Spouses

#Create a grouped by "SibSp" DataFrame with the average "Survived"
#No need to dropna() because there are not NaN on "SibSp" or "Survived" variables
SibSp_df = grouped("SibSp")

#Plotting the resulting DataFrame
fig = plt.subplots(figsize = (14, 10))

plt.subplot(221)

N = len(SibSp_df.index)
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars

bar1=plt.bar(ind, SibSp_df["Embarked"], width, color="#5975A4", label="Embarked")
bar2=plt.bar(ind + width, SibSp_df["Survived"], width, color='#5F9E6E', label="Survived")

plt.xlabel("Number of Siblings/Spouses", fontsize=12)
plt.ylabel("Number of Passengers", fontsize=12)
plt.title("Passengers' per Number of Siblings/Spouses", fontsize=14)
plt.xticks(ind + width / 2, SibSp_df.index.values)

plt.legend(loc=2)

plt.subplot(222)

p = sns.barplot(x="SibSp", y="Survival Rate", ci=None, color='#5975A4', data=SibSp_df.reset_index())

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

p.set_xlabel("Number of Siblings/Spouses", fontsize=12)
p.set_ylabel("Survival Rate", fontsize=12)
p.set_title("Survival Rate per Number of Siblings/Spouses", fontsize=14)

plt.subplot(223)

sns.regplot(x="SibSp", y="Survived", data=titanic_df, order=2, scatter_kws={"s": 80});

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

plt.xlabel("Number of Siblings/Spouses", fontsize=12)
plt.ylabel("Survival Rate", fontsize=12)
plt.title("Survival Rate per Number of Siblings/Spouses", fontsize=14)

plt.tight_layout()
plt.show()

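[Figure: Passengers per Number of Siblings/Spouses; Survival Rate per Number of Siblings/Spouses (bar chart and regression)]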

There is a pattern in the plot but, as we saw earlier, there is a negative correlation between Number of Siblings/Spouses and Age.
Since it doesn’t make much sense for the Number of Siblings/Spouses to affect the Survival Rate directly, we can assume that Age is a Common Cause for both the Survival Rate and the Number of Siblings/Spouses.
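
One way to probe this assumption is to look at the Survival Rate per Number of Siblings/Spouses separately for children and adults: if Age is the common cause, the pattern should weaken within each age group. The sketch below shows the mechanics only. The function name, the 15-year limit used as the child/adult boundary and the toy rows are all hypothetical; the real check would pass titanic_df.

```python
import pandas as pd

def rate_by_sibsp_and_age_group(df, child_limit=15):
    """Survival rate per SibSp value, split into child and adult columns."""
    data = df.dropna(subset=["Age"]).copy()
    is_child = data["Age"].lt(child_limit)
    data["AgeGroup"] = is_child.map({True: "child", False: "adult"})
    return data.pivot_table(
        index="SibSp", columns="AgeGroup", values="Survived", aggfunc="mean"
    )

# Hypothetical toy rows, used only to demonstrate the call
toy = pd.DataFrame({
    "Age":      [4, 8, 30, 45, 2, 35, 50, 9],
    "SibSp":    [3, 3, 0, 0, 0, 3, 0, 0],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1],
})

table = rate_by_sibsp_and_age_group(toy)
print(table)
```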

fig = plt.subplots(figsize = (14, 10))

plt.figure(1)

plt.subplot(221)

p = sns.barplot(x="Decade", y="Survived", ci=None, color='#5975A4', data=dec_df.reset_index())

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

p.set_xlabel("Age", fontsize=12)
p.set_ylabel("Survival Rate", fontsize=12)
p.set_title("Survival Rate per Passenger's Age", fontsize=14)

plt.subplot(222)

p = sns.barplot(x="SibSp", y="Survival Rate", ci=None, color='#5975A4', data=SibSp_df.reset_index())

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

p.set_xlabel("Number of Siblings/Spouses", fontsize=12)
p.set_ylabel("Survival Rate", fontsize=12)
p.set_title("Survival Rate per Number of Siblings/Spouses", fontsize=14)

plt.subplot(223)
sns.regplot(x="Age", y="Survived", data=dec_df, order=3, y_jitter=0.01, scatter_kws={"s": 80});

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

plt.xlabel("Age", fontsize=12)
plt.ylabel("Survival Rate", fontsize=12)
plt.title("Survival Rate per Passenger's Age", fontsize=14)

plt.subplot(224)

sns.regplot(x="SibSp", y="Survived", data=titanic_df, order=2, y_jitter=0.01, scatter_kws={"s": 80});

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

plt.xlabel("Number of Siblings/Spouses", fontsize=12)
plt.ylabel("Survival Rate", fontsize=12)
plt.title("Survival Rate per Number of Siblings/Spouses", fontsize=14)

plt.tight_layout()
plt.show()

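[Figure: Survival Rate per Passenger's Age and per Number of Siblings/Spouses, side by side (bar charts and regressions)]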

This negative correlation between the Passenger’s Age and the Number of Siblings/Spouses can be further highlighted in the following plot.

#Keep only the rows where both "Age" and "SibSp" are present
a = titanic_df.dropna(subset=["Age", "SibSp"])
sns.lmplot(x="SibSp", y="Age", data=a)

plt.xlabel("Number of Siblings/Spouses", fontsize=12)
plt.ylabel("Age", fontsize=12)
plt.title("Correlation between Number of Siblings/Spouses & Age", fontsize=14)

#plt.tight_layout()
plt.show()

[Figure: Correlation between Number of Siblings/Spouses & Age]

Survival Rate per Number of Parents/Children Aboard

#Create a grouped by "Parch" DataFrame with the average "Survived"
#No need to dropna() because there are not NaN on "Parch" or "Survived" variables
parch_df = grouped("Parch")
plt.subplots(figsize = (14, 5))
    
#Plotting the passengers distribution per Number of Parents/Children
plt.subplot(121)

N = len(parch_df.index)
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars

bar1=plt.bar(ind, parch_df["Embarked"], width, color="#5975A4", label="Embarked")
bar2=plt.bar(ind + width, parch_df["Survived"], width, color='#5F9E6E', label="Survived")

plt.xlabel("Number of Parents/Children", fontsize=12)
plt.ylabel("Number of Passengers", fontsize=12)
plt.title("Passengers' Distributions per Number of Parents/Children", fontsize=14)
plt.xticks(ind + width / 2, parch_df.index.values)

plt.legend(loc=1)

#Plotting the resulting DataFrame
plt.subplot(122)

p = sns.barplot(x="Parch", y="Survival Rate", ci=None, color='#5975A4', data=parch_df.reset_index())

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

p.set_xlabel("Number of Parents/Children", fontsize=12)
p.set_ylabel("Survival Rate", fontsize=12)
p.set_title("Survival Rate per Number of Parents/Children", fontsize=14)

plt.show()

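[Figure: Passengers' Distributions per Number of Parents/Children (left); Survival Rate per Number of Parents/Children (right)]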

The above diagram does not give a clear picture of a correlation between the Number of Parents/Children and the Survival Rate. We can say, though, that the passengers traveling with 1 to 3 Parents/Children had a higher Survival Rate.

Survival Rate per Fare

We know that the “higher” the Passenger’s Class the higher the fare, so we expect a positive correlation between the Fare and the Survival Rate, since we have already concluded that the Passenger’s Class was a Critical Factor.

fig = plt.subplots(figsize = (14, 5))

plt.figure(1)

plt.subplot(121)

sns.regplot(x="Fare", y="Survived", data=titanic_df, order=1, scatter_kws={"s": 80});

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

plt.xlabel("Fare", fontsize=12)
plt.ylabel("Survival Rate", fontsize=12)
plt.title("Survival Rate per Fare", fontsize=14)

plt.subplot(122)

sns.distplot(titanic_df["Fare"], kde=False)

plt.xlabel("Fare", fontsize=12)
plt.ylabel("Number of Passengers", fontsize=12)
plt.title("Number of Passengers per Fare", fontsize=14)

plt.show()

And if we remove the outliers with a Fare above 300:

plt.figure(figsize=(14, 5))

#Keep only the passengers with a Fare below 300
d = titanic_df[titanic_df['Fare'] < 300]

plt.subplot(121)

sns.regplot(x="Fare", y="Survived", data=d, order=1, truncate=True, scatter_kws={"s": 80});

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

plt.xlabel("Fare", fontsize=12)
plt.ylabel("Survival Rate", fontsize=12)
plt.title("Survival Rate per Fare", fontsize=14)

plt.subplot(122)

sns.distplot(d["Fare"], kde=False)

plt.xlabel("Fare", fontsize=12)
plt.ylabel("Number of Passengers", fontsize=12)
plt.title("Number of Passengers per Fare", fontsize=14)

plt.show()

As we expected, the Survival Rate increases with the Fare, so the Fare was also a critical Survival factor.
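
The regression line above summarizes the relation between a continuous variable (Fare) and a binary one (Survived). One way to put a number on it is the point-biserial correlation, which is simply Pearson's r computed with a 0/1 variable. The sketch below is only an illustration: `pearson_r` is a hypothetical helper, and the toy values are made up rather than taken from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Made-up toy values, NOT taken from the Titanic dataset
toy_fares = [7.25, 8.05, 13.0, 26.0, 52.0, 80.0]
toy_survived = [0, 0, 1, 0, 1, 1]

r = pearson_r(toy_fares, toy_survived)
print(f"point-biserial r = {r:.3f}")
```

In the notebook, the same function could be applied to `titanic_df["Fare"]` and `titanic_df["Survived"]`, with and without the fares above 300, to see how much the outliers influence the result.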

Survival Rate per Port of Embarkation

Finally, let’s visualize the Survival Rate per Port of Embarkation to find out if the passengers from the three ports had the same Survival Rates.

embarked_df = grouped("Port")
embarked_df = embarked_df.set_index([['Cherbourg' , 'Queenstown', 'Southampton']])
plt.figure(figsize=(14, 5))

#Plotting the passengers' distribution per Port of Embarkation
plt.subplot(121)

N = len(embarked_df.index)
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars

bar1=plt.bar(ind, embarked_df["Embarked"], width, color="#5975A4", label="Embarked")
bar2=plt.bar(ind + width, embarked_df["Survived"], width, color='#5F9E6E', label="Survived")

plt.xlabel("Port of Embarkation", fontsize=12)
plt.ylabel("Number of Passengers", fontsize=12)
plt.title("Passengers' Distributions per Port of Embarkation", fontsize=14)
plt.xticks(ind + width, embarked_df.index.values)

plt.legend(loc=2)

plt.subplot(122)

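#Plot against the port names (the index); the "Embarked" column holds passenger counts, not ports
p = sns.barplot(x="Port", y="Survival Rate", ci=None, color='#5975A4', data=embarked_df.rename_axis("Port").reset_index())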

#Adding the average Survival Rate
plt.axhline(y=0.3838383838383838, ls='dashed', color='#0B559F', alpha=0.6)

p.set_xlabel("Port of Embarkation", fontsize=12)
p.set_ylabel("Survival Rate", fontsize=12)
p.set_title("Survival Rate per Port of Embarkation", fontsize=14)
p.set_xticklabels(embarked_df.index.values)

plt.show()

The Survival Rates vary considerably between the three ports.
Let’s explore the distribution of Gender and Passenger’s Class at each port.

plt.figure(figsize=(14, 5))

plt.subplot(121)

a = titanic_df.groupby(['Port', 'Sex'])
b = a['PassengerId'].count().reset_index()

p = sns.barplot(x='Port', y='PassengerId', hue='Sex', data=b)

p.set_xlabel("Port of Embarkation", fontsize=12)
p.set_ylabel("Number of Passengers", fontsize=12)
p.set_title("Number of Passengers per Port", fontsize=14)
p.set_xticklabels(['Cherbourg' , 'Queenstown', 'Southampton'])
p.legend(title="Gender", loc=2)

plt.subplot(122)

c = titanic_df.groupby(['Port', 'Pclass'])
d = c['PassengerId'].count().reset_index()

p = sns.barplot(x='Port', y='PassengerId', hue='Pclass', data=d)

p.set_xlabel("Port of Embarkation", fontsize=12)
p.set_ylabel("Number of Passengers", fontsize=12)
p.set_title("Number of Passengers per Port of Embarkation", fontsize=14)
p.set_xticklabels(['Cherbourg' , 'Queenstown', 'Southampton'])
p.legend(title="Passenger's Class", loc=2)

plt.show()

The port with the highest Survival Rate is the one with the highest ratio of “prestigious” passengers and a good female/male ratio, while the port with the lowest Survival Rate has the “worst” ratio in both categories. This helps explain the differences between the three ports.
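
The comparison above is easier to make with shares than with raw counts, because the ports differ a lot in size. A minimal sketch using `pandas.crosstab` with row normalization follows. The column names "Port", "Sex" and "Pclass" follow the code above, but the rows of the toy frame are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up toy rows, NOT taken from the Titanic dataset
toy = pd.DataFrame({
    "Port":   ["C", "C", "C", "Q", "Q", "S", "S", "S", "S"],
    "Sex":    ["female", "female", "male", "female", "male",
               "male", "male", "female", "male"],
    "Pclass": [1, 1, 3, 3, 3, 3, 2, 3, 1],
})

# normalize="index" divides each row by its total, giving shares per port
class_share = pd.crosstab(toy["Port"], toy["Pclass"], normalize="index")
sex_share = pd.crosstab(toy["Port"], toy["Sex"], normalize="index")

print(class_share.round(2))
print(sex_share.round(2))
```

Applied to `titanic_df`, the two tables would show directly which port has the largest share of First Class and of female passengers.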

Conclusions

Following the above analysis we can conclude that the most critical factors for the survival of the passengers were:

  • Gender
  • Age
  • Socio-economic Status

More specifically, women had nearly 4 times the Survival Rate of men (74.2% against 18.9%)…

sex_df
        Embarked  Survived  Survival Rate
Sex
female       314       233       0.742038
male         577       109       0.188908
conclusion2()
plt.show()

…the “Upper Class” had about 2.6 times the Survival Rate of the “Lower Class” (63.0% against 24.2%)…

pclass_df
        Embarked  Survived  Survival Rate
Pclass
1            216       136       0.629630
2            184        87       0.472826
3            491       119       0.242363
conclusion1()
plt.show()
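
The ratios quoted in this section can be recomputed from the counts in the two tables above. This is a minimal sketch; `rate_ratio` is a hypothetical helper and not part of the notebook.

```python
def rate_ratio(survived_a, total_a, survived_b, total_b):
    """Ratio of group A's Survival Rate to group B's Survival Rate."""
    return (survived_a / total_a) / (survived_b / total_b)

# Counts copied from the sex_df and pclass_df tables
women_vs_men = rate_ratio(233, 314, 109, 577)
first_vs_third = rate_ratio(136, 216, 119, 491)

print(f"women vs men: {women_vs_men:.2f}x")
print(f"1st vs 3rd class: {first_vs_third:.2f}x")
```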

…and coming to the age factor, the most privileged were the young children (ages under 5), with a Survival Rate of 67.5%, about 1.8 times the average.

decade_df
        Embarked  Survived  Survival Rate
Decade
0             40        27       0.675000
10            38        18       0.473684
20           200        73       0.365000
30           201        78       0.388060
40           120        51       0.425000
50            73        30       0.410959
60            31        12       0.387097
70            10         0       0.000000
conclusion3()
plt.show()
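
Some of the age groups are small (40 passengers in the first group, only 10 in the last), so their Survival Rates are less certain than those of the larger groups. One common way to express this is a confidence interval for each rate. The sketch below uses the Wilson score interval at the 95% level; the choice of interval and the helper `wilson_interval` are assumptions made for illustration and are not part of the original analysis.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# Counts copied from the decade_df table: (survived, embarked)
groups = {"0": (27, 40), "70": (0, 10)}
for label, (survived, embarked) in groups.items():
    low, high = wilson_interval(survived, embarked)
    print(f"Decade {label}: {survived}/{embarked}, interval {low:.3f} to {high:.3f}")
```

An interval that lies entirely above or below the overall Survival Rate of about 0.384 gives more support to a difference than the point estimate alone.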

The above conclusions are tentative, and further statistical analysis is required to establish their validity.
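
As one example of such an analysis, a Pearson chi-square test of independence can be run on the Gender counts from the sex_df table. The sketch below is written in plain Python, without continuity correction, and is only an illustration: `chi_square_2x2` is a hypothetical helper, and the test checks for an association between Gender and Survival without controlling for Age or Passenger’s Class.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Rows: female, male; columns: survived, did not survive (from sex_df)
sex_table = [[233, 314 - 233], [109, 577 - 109]]
chi2, p_value = chi_square_2x2(sex_table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

A statistic far above 3.841, the critical value for 1 degree of freedom at the 5% level, indicates that the difference between the two Survival Rates is very unlikely to be due to chance alone.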

References

Encyclopedia Titanica
Wikipedia

Dataset: Kaggle - Titanic: Machine Learning from Disaster
