The AiEdge Newsletter

The AiEdge Newsletter

Share this post

The AiEdge Newsletter
The AiEdge Newsletter
Advanced Data Manipulation with Pandas
Copy link
Facebook
Email
Notes
More

Advanced Data Manipulation with Pandas

Data Science Fundamentals

Damien Benveniste's avatar
Damien Benveniste
Mar 10, 2023
∙ Paid
10

Share this post

The AiEdge Newsletter
The AiEdge Newsletter
Advanced Data Manipulation with Pandas
Copy link
Facebook
Email
Notes
More
Share

Pandas is a very powerful data manipulation library based on the development logic in SQL. We are going to look at:

  • Group By: split-apply-combine

  • The Multi-Index

  • Concatenate

  • Merge and Join

  • Reshaping and Pivot Tables


This lecture is part of the Data Science Fundamentals series.

  • Previous: Exploratory Data Analysis with Pandas

  • Next: Scrapping the Web for NLP applications

Group By: split-apply-combine

We use again the Titanic data as a toy dataset to demonstrate the different functionalities.

# We load the libraries
import pandas as pd
import seaborn as sns
import numpy as np

# we load the data
titanic_df = sns.load_dataset("titanic")
titanic_df

The groupby method works very much like in SQL. We can create a grouped object by columns

grouped = titanic_df.groupby(["survived", "sex"])

and we can apply methods on that object. For example we can sum by groups

# We sum by groups
grouped.sum()

Or we can count by groups

# we count by groups 
grouped.count()

We can look at the first row of each group

# we look at the first row
grouped.first()

We can look at the index of the different groups

grouped.groups

We can also iterate through the different groups

for name, group in grouped:
    print(name)
    print(group)

We can use the function agg (or aggregate) to apply function to the groups

# we sum per each group
grouped.agg(np.sum)

Which is equivalent to

grouped.sum()

We can apply many functions at once

# we compute the mean and the standard deviation
grouped.agg([np.mean, np.std])

We can use the transform method to transform the different groups

# we construct a function to normalize
normalize = lambda x: (x - x.mean()) / x.std() 

# and we apply it to each group
grouped[["pclass", "age", "sibsp"]].transform(normalize)

or equivalently

def normalize(x):
    return (x - x.mean()) / x.std()

grouped[["pclass", "age", "sibsp"]].transform(normalize)

The function apply can be used for more general use cases

# we can apply any function to each group
grouped[["pclass", "age", "sibsp"]].apply(lambda x: x.mean())

We can also plot by groups (which is kind of cool!)

grouped.plot()

The Multi-Index

The multi-index is one of most exciting perks of Pandas. It is often very useful to index the data with the existing columns. Here we create a multi-index

Keep reading with a 7-day free trial

Subscribe to The AiEdge Newsletter to keep reading this post and get 7 days of free access to the full post archives.

Already a paid subscriber? Sign in
© 2025 AiEdge
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share

Copy link
Facebook
Email
Notes
More