PyByte
Posts
How to analyze data by grouping it 🏘️

How to analyze data by grouping it 🏘️

And the SQL equivalent!

Joel Johnson
November 16, 2023

from PyByte import spotlight_story, deep_dive, get_inspired, byte_break

# Welcome to the PyByte!

This week’s newsletter will be short & concise as I spent most of this week finishing Python for Professionals course! woohoo! 🥳 It’s now back to full price, but don’t worry, keep an eye out for a crazy Black Friday deal 👀

spotlight_story()

This story is aimed at the 15% of my followers who are women. Sorry guys! 🤷‍♀️

Thanks Instagram

The FIRST EVER PyLadies conference is coming up from December 1st to the 3rd of this year!

Gif by forgotenlove13 on Giphy

PyLadies is a global group that aims to involve more women in the Python open-source community.

There are many local PyLadies chapters around the world, hosting local workshops, meetups, and other events.

From December 1 to 3, the first PyLadies conference will be held online. There will be talks in English, Spanish, and Portuguese, including keynotes by Sophia Yang, Jimena Bermúdez, Marlene Mhangami, and Thais Viana.

Registering for the conference is free, and you can do so on the conference website. You can also preview some of the scheduled talks.

deep_dive()

Let’s group data in a DataFrame, and see what the SQL equivalent looks like!

As someone who was way more comfortable with SQL than I was with Python, this was a helpful method for me when I was first learning Pandas.

I’ve got a simple table with sales information for fruit that includes the ID, product description, number sold, and price.

Let’s first count how many sales of each fruit there are.

In SQL, we would do something like this:

SELECT Product, COUNT(*) AS Count
FROM sales_data
GROUP BY Product;

The equivalent in Python using the groupby() method would be this:

df.groupby('Product').size().reset_index(name='Count')

.size() counts the number of rows that we grouped, and .reset_index() formats the table and renames the new column as Count.

Awesome! But not terribly useful information. Let’s instead sum the total number of each fruit sold, instead of counting the sales of that fruit.

In SQL we could do this:

SELECT Product, SUM(Quantity) AS Total_Sold
FROM sales_data
GROUP BY Product;

But in Python we can get the same result with:

df.groupby('Product')['Quantity'].sum().reset_index(name='Total_Sold')

This is just a taste of how powerful groupby() is in Pandas!

Hint: Use .agg() with .groupby() to get even more interesting statistical analysis. Maybe a newsletter for another day?

get_inspired()

"You have to live with your mistakes, but you don't have to compound them.

To follow up an error with a foolish reaction is to lose twice. Given the reality of where you are right now, what is the best next action?

A wise response may not erase a dumb mistake, but it can redeem it."

James Clear

byte_break()

See you in 2 weeks!

Happy Thanksgiving (if you’re in the USA)

Joel