Python ‘GroupBy’ for SQL lovers — sailing with the Titanic

Shashank R
5 min readApr 2, 2021

Who would have guessed that the Titanic would be an immense source to master Python and building classification models? But in a macabre sort of way (and modelling Survivor rate at that!), it lends great data for the newbies in the Data Science world.

Like most of the folks getting into Data Analysis in the 2000s, I was brought up on a steady dose of SQL and was quite comfortable dishing out complex criteria with all kinds of joins with window functions partitioning over different variables.

As I started my journey onto Python, I found the ‘groupby’ function to be close to the SQL ‘group by’ function — and became my go-to tool in my initial foray into Python.

This blog primarily focuses on some simple ‘groupby’ levers to whip together high level views of a Pandas DataFrame.

Here we have the famous (and macabre) ‘Titanic’ dataset from Kaggle (https://www.kaggle.com/c/titanic) where folks continue to model survival rates. The data dictionary is available here.

The key dependent variable here is ‘Survived’ with various predictors such as Passenger Class (Pclass), Sex, Age, Siblings (Sibsp), Fare — all of which were drivers of Survival rate on the Titanic.

#Reading sample rows from the Titanic dataset
titanic.head(3)

Let’s say we wanted to compute the average of key numeric fields by Sex, the SQL code would be to group by Sex while the Python script is a shorter version of that. Note that average of PassengerId and Parch doesn’t make much sense, however, it gives you the Survival rate and average age for each gender.

#Using groupby to get information grouped by one variable for all numeric columns
titanic.groupby('Sex').mean()
# SQL equivalent of above Python script
select Sex, avg(PassengerId), avg(Survived), avg(Pclass), avg(Age), avg(SibSp), avg(Parch), avg(Fare)
from titanic
group by Sex

Males had a paltry 19% survival rate while Females had a relatively robust 74% survival rate. The null model does sort of validate okay-ish with Jack and Rose’s story with their 0% and 100% survival rate!

If we were not interested in all the metrics but only one metric such as the Survival rate, then we can use the following Python query to by calling out what column we want to target for the metric.

#Using group by for a specific target column like 'Survived'
titanic.groupby('Sex')['Survived'].mean()
####
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64

Now saw we wanted the results to be ordered by survival rate, one can accomplish this using ‘order by’ in SQL and would need ‘sort_values()’ in Python.

#results grouped by Pclass and sorted by selected variable ('Survived' here)
titanic.groupby('Pclass')['Survived'].mean().sort_values()
#SQL equivalent
select Pclass, avg(Survived) as survivor_rate
from titanic
group by Pclass
order by survivor_rate
#####
Pclass
3 0.242363
2 0.472826
1 0.629630
Name: Survived, dtype: float64

If we wanted not just the average, but also a view into count by key variable as well as the median, this can be easily accomplished using the ‘agg’ function in Python to request more aggregate values. Counts and Averages are easier in SQL while median needs more effort and I will leave that to the reader.

#results grouped by Pclass and get count, average age and median age
titanic.groupby('Pclass')['Age'].agg(['count', 'mean', 'median'])
#SQL equivalent - Median is harder in regular SQL and would need Windows function
select Pclass, count(*), avg(Age)
from titanic
group by Pclass

If we want to get the view by more dimensions, it is quite easy in either languages by adding more variables to the ‘groupby’. See below for an example grouping by Sex and Pclass.

#Grouping by multiple columns
titanic.groupby(['Sex', 'Pclass'])['Survived'].agg(['count', 'mean'])
#SQL equivalent
select Sex, Pclass, count(*), avg(Survived)
from titanic
group by 1, 2

If one wants to reduce the number of digits after the decimal to a fewer number, it is quite easy in Python by adding the round() function to the above script. I will leave the SQL equivalent to the reader.

#Grouping by multiple columns and rounding to 2 digits after decimal point
titanic.groupby(['Sex', 'Pclass'])['Survived'].mean().round(2)
######
Sex Pclass
female 1 0.97
2 0.92
3 0.50
male 1 0.37
2 0.16
3 0.14
Name: Survived, dtype: float64

Now if one wanted to add more criteria before grouping, one can accomplish this using a combination of ‘query’ and ‘groupby’ in Python and ‘where’ clause and ‘group by’ in SQL. See below for an example to look at Passenger Classes for passengers below 12 years old and their survival rate.

#Query to apply criteria, and then groupby and get mean
titanic.query('Age < 12').groupby('Pclass')['Survived'].agg(['count', 'mean'])
#SQL equivalent
select Pclass, avg(Survived)
from titanic
where Age < 12
group by Pclass

This should give a good set of sample scripts to get started to leverage ‘group by’ in Python especially for those users transitioning from SQL.

And to close out the macabre Titanic dataset, one thing I didn’t see on there was whether someone was a musician playing on the deck or not (maybe it’s buried in the name). If Cameron’s depiction of the Titanic was right, I would assume one can feature engineer those musicians - who would probably have a low survival rate.

--

--

Shashank R

Seasoned Expert across the full spectrum of Financial Services / FinTech— experienced in Risk, Credit, Payments, Fraud, Collections, Ops strategies