Sometimes you need to perform operations on subsets of data. Your rows might have attributes in common or somehow form logical groups based on other properties. Common operations like finding the average, maximum, count, or standard deviation of values from groups of data is a really common task, and Pandas makes this really easy to accomplish.
In this article, we will learn how to use the groupby
function and study some of the built-in aggregations you can run on groups. This will give you another valuable tool for data analysis, and I hope it'll help you accomplish your tasks in a much simpler way.
Great, let's get started!
Loading our sample data
We will use data from a CSV file I created with info about 16 Pokemon. It contains attributes like the Name, Color (Green, Blue, Red, Yellow), and other stats like HP, Attack, Defense, and Speed.
We are interested in calculating some common aggregations over groups of Pokemon with different colors.
import pandas as pd
pdata = pd.read_csv('./sample_data/poke_colors.csv')
pdata
Name | Color | Evolves | HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|---|---|---|
0 | Caterpie | Green | True | 45 | 30 | 35 | 20 | 20 | 45 |
1 | Metapod | Green | True | 50 | 20 | 55 | 25 | 25 | 30 |
2 | Scyther | Green | False | 70 | 110 | 80 | 55 | 80 | 105 |
3 | Bulbasaur | Green | True | 45 | 49 | 49 | 65 | 65 | 45 |
4 | Dratini | Blue | True | 41 | 64 | 45 | 50 | 50 | 50 |
5 | Squirtle | Blue | True | 44 | 48 | 65 | 50 | 64 | 43 |
6 | Poliwag | Blue | True | 40 | 50 | 40 | 40 | 40 | 90 |
7 | Poliwhirl | Blue | True | 65 | 65 | 65 | 50 | 50 | 90 |
8 | Charmander | Red | True | 39 | 52 | 43 | 60 | 50 | 65 |
9 | Magmar | Red | False | 65 | 95 | 57 | 100 | 85 | 93 |
10 | Paras | Red | True | 35 | 70 | 55 | 45 | 55 | 25 |
11 | Parasect | Red | False | 60 | 95 | 80 | 60 | 80 | 30 |
12 | Pikachu | Yellow | True | 35 | 55 | 40 | 50 | 50 | 90 |
13 | Abra | Yellow | True | 25 | 20 | 15 | 105 | 55 | 90 |
14 | Psyduck | Yellow | True | 50 | 52 | 48 | 65 | 50 | 55 |
15 | Kadabra | Yellow | True | 40 | 35 | 30 | 120 | 70 | 10 |
GroupBy
Pandas function groupby
is used to create GroupBy objects. These objects can perform lots of useful built-in aggregations with just a single function call. groupby
receives as argument a list of keys that decide how the grouping is performed. In our first example we will group the Pokemon by color:
pg = pdata.groupby('Color')
pg
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff848e80f28>
As you might have seen from the message on top, we got an object of type GroupBy. If you want to get a bit more information about it, use the size
method:
pg.size()
Color
Blue 4
Green 4
Red 4
Yellow 4
dtype: int64
Another way of getting some basic information about the group is using the groups
attribute (it's a dictionary):
pg.groups
{'Blue': Int64Index([4, 5, 6, 7], dtype='int64'),
'Green': Int64Index([0, 1, 2, 3], dtype='int64'),
'Red': Int64Index([8, 9, 10, 11], dtype='int64'),
'Yellow': Int64Index([12, 13, 14, 15], dtype='int64')}
Great!
Now that you know how to create a basic group, let's check two important properties:
# The first one is that you can iterate over GroupBy objects.
for color, group in pg:
print(color)
print(group)
print('\n\n')
Blue
Name Color Evolves HP Attack Defense SpAtk SpDef Speed
4 Dratini Blue True 41 64 45 50 50 50
5 Squirtle Blue True 44 48 65 50 64 43
6 Poliwag Blue True 40 50 40 40 40 90
7 Poliwhirl Blue True 65 65 65 50 50 90
Green
Name Color Evolves HP Attack Defense SpAtk SpDef Speed
0 Caterpie Green True 45 30 35 20 20 45
1 Metapod Green True 50 20 55 25 25 30
2 Scyther Green False 70 110 80 55 80 105
3 Bulbasaur Green True 45 49 49 65 65 45
Red
Name Color Evolves HP Attack Defense SpAtk SpDef Speed
8 Charmander Red True 39 52 43 60 50 65
9 Magmar Red False 65 95 57 100 85 93
10 Paras Red True 35 70 55 45 55 25
11 Parasect Red False 60 95 80 60 80 30
Yellow
Name Color Evolves HP Attack Defense SpAtk SpDef Speed
12 Pikachu Yellow True 35 55 40 50 50 90
13 Abra Yellow True 25 20 15 105 55 90
14 Psyduck Yellow True 50 52 48 65 50 55
15 Kadabra Yellow True 40 35 30 120 70 10
# The second is that you can access the subgroups by providing the right key to the get_group method
pg.get_group('Blue')
Name | Color | Evolves | HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|---|---|---|
4 | Dratini | Blue | True | 41 | 64 | 45 | 50 | 50 | 50 |
5 | Squirtle | Blue | True | 44 | 48 | 65 | 50 | 64 | 43 |
6 | Poliwag | Blue | True | 40 | 50 | 40 | 40 | 40 | 90 |
7 | Poliwhirl | Blue | True | 65 | 65 | 65 | 50 | 50 | 90 |
Groups are interesting because they let you calculate aggregations on user-defined subsections of your data. Let's calculate the mean of every stat for the color groups:
#Evolves is boolean, so True will be treated as a 1 and False as 0 when calculating the mean
pg.mean()
Evolves | HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|---|
Color | |||||||
Blue | 1.00 | 47.50 | 56.75 | 53.75 | 47.50 | 51.00 | 68.25 |
Green | 0.75 | 52.50 | 52.25 | 54.75 | 41.25 | 47.50 | 56.25 |
Red | 0.50 | 49.75 | 78.00 | 58.75 | 66.25 | 67.50 | 53.25 |
Yellow | 1.00 | 37.50 | 40.50 | 33.25 | 85.00 | 56.25 | 61.25 |
Grouping just a subset of columns
Sometimes what you want is to create groups with just one or two columns of interest. If this is the case, you can select them (and the column you want to group by) on the dataframe and then just call groupby
:
# Let's create groups with just the HP values and then calculate the mean
pdata[['HP', 'Color']].groupby('Color').mean()
HP | |
---|---|
Color | |
Blue | 47.50 |
Green | 52.50 |
Red | 49.75 |
Yellow | 37.50 |
# Let's do the same for Attack, Defense and Speed
pdata[['Color', 'Attack', 'Defense', 'Speed']].groupby('Color').mean()
Attack | Defense | Speed | |
---|---|---|---|
Color | |||
Blue | 56.75 | 53.75 | 68.25 |
Green | 52.25 | 54.75 | 56.25 |
Red | 78.00 | 58.75 | 53.25 |
Yellow | 40.50 | 33.25 | 61.25 |
Groups from multiple keys
You can create groups using more than one column as key. If you pass a list of keys to the groupby
method, you will create a hierarchical grouping. Let's group by both Color and Evolves attributes:
opg = pdata.groupby(['Color', 'Evolves'])
opg.size()
Color Evolves
Blue True 4
Green False 1
True 3
Red False 2
True 2
Yellow True 4
dtype: int64
opg.groups
{('Blue', True): Int64Index([4, 5, 6, 7], dtype='int64'),
('Green', False): Int64Index([2], dtype='int64'),
('Green', True): Int64Index([0, 1, 3], dtype='int64'),
('Red', False): Int64Index([9, 11], dtype='int64'),
('Red', True): Int64Index([8, 10], dtype='int64'),
('Yellow', True): Int64Index([12, 13, 14, 15], dtype='int64')}
Common aggregations:
Some built-in aggregations already come implemented in Pandas, and you can also define your own if you need it. In the following section, we will run the following functions to get an idea of Pandas' default behavior. All these functions take into consideration only non-NA values:
- count: Number of elements in each group.
- sum: Sum of values in each group.
- mean: Arithmetic mean of each group.
- max: Maximum value of each group
- min: Minimum value of each group
- std: Standard deviation of each group.
- var: Variance of each group.
Also, it's important to know that you don't need to assign the group to a variable. Very often, you will just use it as an intermediate value to call an aggregation. So, instead of writing this:
pg = pdata.groupby('Color')
pg.mean()
You can write this:
pdata.groupby('Color').mean()
# Let's drop the Name and Evolves properties to create a new dataframe
spdata = pdata.drop(['Name', 'Evolves'], axis=1)
spdata
Color | HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|---|
0 | Green | 45 | 30 | 35 | 20 | 20 | 45 |
1 | Green | 50 | 20 | 55 | 25 | 25 | 30 |
2 | Green | 70 | 110 | 80 | 55 | 80 | 105 |
3 | Green | 45 | 49 | 49 | 65 | 65 | 45 |
4 | Blue | 41 | 64 | 45 | 50 | 50 | 50 |
5 | Blue | 44 | 48 | 65 | 50 | 64 | 43 |
6 | Blue | 40 | 50 | 40 | 40 | 40 | 90 |
7 | Blue | 65 | 65 | 65 | 50 | 50 | 90 |
8 | Red | 39 | 52 | 43 | 60 | 50 | 65 |
9 | Red | 65 | 95 | 57 | 100 | 85 | 93 |
10 | Red | 35 | 70 | 55 | 45 | 55 | 25 |
11 | Red | 60 | 95 | 80 | 60 | 80 | 30 |
12 | Yellow | 35 | 55 | 40 | 50 | 50 | 90 |
13 | Yellow | 25 | 20 | 15 | 105 | 55 | 90 |
14 | Yellow | 50 | 52 | 48 | 65 | 50 | 55 |
15 | Yellow | 40 | 35 | 30 | 120 | 70 | 10 |
# Count the elements in each group
spdata.groupby('Color').count()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 4 | 4 | 4 | 4 | 4 | 4 |
Green | 4 | 4 | 4 | 4 | 4 | 4 |
Red | 4 | 4 | 4 | 4 | 4 | 4 |
Yellow | 4 | 4 | 4 | 4 | 4 | 4 |
# Calculate the sum in each group
spdata.groupby('Color').sum()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 190 | 227 | 215 | 190 | 204 | 273 |
Green | 210 | 209 | 219 | 165 | 190 | 225 |
Red | 199 | 312 | 235 | 265 | 270 | 213 |
Yellow | 150 | 162 | 133 | 340 | 225 | 245 |
# Calculate the mean in each group
spdata.groupby('Color').mean()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 47.50 | 56.75 | 53.75 | 47.50 | 51.00 | 68.25 |
Green | 52.50 | 52.25 | 54.75 | 41.25 | 47.50 | 56.25 |
Red | 49.75 | 78.00 | 58.75 | 66.25 | 67.50 | 53.25 |
Yellow | 37.50 | 40.50 | 33.25 | 85.00 | 56.25 | 61.25 |
# Find the maximum of each stat for every group
spdata.groupby('Color').max()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 65 | 65 | 65 | 50 | 64 | 90 |
Green | 70 | 110 | 80 | 65 | 80 | 105 |
Red | 65 | 95 | 80 | 100 | 85 | 93 |
Yellow | 50 | 55 | 48 | 120 | 70 | 90 |
# Find the minimum of each stat for every group
spdata.groupby('Color').min()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 40 | 48 | 40 | 40 | 40 | 43 |
Green | 45 | 20 | 35 | 20 | 20 | 30 |
Red | 35 | 52 | 43 | 45 | 50 | 25 |
Yellow | 25 | 20 | 15 | 50 | 50 | 10 |
# Calculate the standard deviation of each stat for every group
spdata.groupby('Color').std()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 11.789826 | 8.995369 | 13.149778 | 5.000000 | 9.865766 | 25.276801 |
Green | 11.902381 | 40.335055 | 18.803812 | 22.126530 | 29.580399 | 33.260337 |
Red | 14.952703 | 20.960280 | 15.456929 | 23.584953 | 17.559423 | 31.920474 |
Yellow | 10.408330 | 16.258331 | 14.221463 | 32.914029 | 9.464847 | 37.941841 |
# Calculate the variance of each stat for every subgroup
spdata.groupby('Color').var()
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 139.000000 | 80.916667 | 172.916667 | 25.000000 | 97.333333 | 638.916667 |
Green | 141.666667 | 1626.916667 | 353.583333 | 489.583333 | 875.000000 | 1106.250000 |
Red | 223.583333 | 439.333333 | 238.916667 | 556.250000 | 308.333333 | 1018.916667 |
Yellow | 108.333333 | 264.333333 | 202.250000 | 1083.333333 | 89.583333 | 1439.583333 |
Custom-defined aggregations
Pandas has a method called agg
that lets you define and run your own aggregations over groups. In the next example, we will define a function that calculates the sum of the squares of the stats in every group.
def sum_of_squares(arr):
sos = 0
for element in arr:
sos += element**2
return sos
spdata.groupby('Color').agg(sum_of_squares)
HP | Attack | Defense | SpAtk | SpDef | Speed | |
---|---|---|---|---|---|---|
Color | ||||||
Blue | 9442 | 13125 | 12075 | 9100 | 10696 | 20549 |
Green | 11450 | 15801 | 13051 | 8275 | 11650 | 15975 |
Red | 10571 | 25654 | 14523 | 19225 | 19150 | 14399 |
Yellow | 5950 | 7354 | 5029 | 32150 | 12925 | 19325 |
Aggregations are powerful!
Grouping data for analysis using Pandas is efficient, elegant, and powerful. Traditional relational databases owe part of their great popularity to their capacity to perform this sort of functionality, and now you can get much of the same work done using Python.
This last function (agg
) might be one of the most useful capabilities Pandas provides. Custom-defined aggregations are incredibly powerful. They let you calculate any crazy function you can come up with, in a repeatable and consistent way.
This feature is so powerful that we will dedicate the next article in expanding the topic: We will talk about the powerful apply
method.
Thank you for reading!
What to do next
- Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
- You can find the source code for this series in this repo.
- This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
- Send me an email with questions, comments or suggestions (it's in the About Me page)