Hands-on Pandas(10): Group Operations using groupby

Sometimes you need to perform operations on subsets of data. Your rows might have attributes in common or somehow form logical groups based on other properties. Common operations like finding the average, maximum, count, or standard deviation of values from groups of data is a really common task, and Pandas makes this really easy to accomplish.

In this article, we will learn how to use the groupby function and study some of the built-in aggregations you can run on groups. This will give you another valuable tool for data analysis, and I hope it'll help you accomplish your tasks in a much simpler way.

Great, let's get started!

Loading our sample data

We will use data from a CSV file I created with info about 16 Pokemon. It contains attributes like the Name, Color (Green, Blue, Red, Yellow), and other stats like HP, Attack, Defense, and Speed.

We are interested in calculating some common aggregations over groups of Pokemon with different colors.

import pandas as pd

pdata = pd.read_csv('./sample_data/poke_colors.csv')
pdata

	Name	Color	Evolves	HP	Attack	Defense	SpAtk	SpDef	Speed
0	Caterpie	Green	True	45	30	35	20	20	45
1	Metapod	Green	True	50	20	55	25	25	30
2	Scyther	Green	False	70	110	80	55	80	105
3	Bulbasaur	Green	True	45	49	49	65	65	45
4	Dratini	Blue	True	41	64	45	50	50	50
5	Squirtle	Blue	True	44	48	65	50	64	43
6	Poliwag	Blue	True	40	50	40	40	40	90
7	Poliwhirl	Blue	True	65	65	65	50	50	90
8	Charmander	Red	True	39	52	43	60	50	65
9	Magmar	Red	False	65	95	57	100	85	93
10	Paras	Red	True	35	70	55	45	55	25
11	Parasect	Red	False	60	95	80	60	80	30
12	Pikachu	Yellow	True	35	55	40	50	50	90
13	Abra	Yellow	True	25	20	15	105	55	90
14	Psyduck	Yellow	True	50	52	48	65	50	55
15	Kadabra	Yellow	True	40	35	30	120	70	10

GroupBy

Pandas function groupby is used to create GroupBy objects. These objects can perform lots of useful built-in aggregations with just a single function call. groupby receives as argument a list of keys that decide how the grouping is performed. In our first example we will group the Pokemon by color:

pg = pdata.groupby('Color')
pg

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff848e80f28>

As you might have seen from the message on top, we got an object of type GroupBy. If you want to get a bit more information about it, use the size method:

pg.size()

Color
Blue      4
Green     4
Red       4
Yellow    4
dtype: int64

Another way of getting some basic information about the group is using the groups attribute (it's a dictionary):

pg.groups

{'Blue': Int64Index([4, 5, 6, 7], dtype='int64'),
 'Green': Int64Index([0, 1, 2, 3], dtype='int64'),
 'Red': Int64Index([8, 9, 10, 11], dtype='int64'),
 'Yellow': Int64Index([12, 13, 14, 15], dtype='int64')}

Great!

Now that you know how to create a basic group, let's check two important properties:

# The first one is that you can iterate over GroupBy objects.

for color, group in pg:
    print(color)
    print(group)
    print('\n\n')

Blue
        Name Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
4    Dratini  Blue     True  41      64       45     50     50     50
5   Squirtle  Blue     True  44      48       65     50     64     43
6    Poliwag  Blue     True  40      50       40     40     40     90
7  Poliwhirl  Blue     True  65      65       65     50     50     90



Green
        Name  Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
0   Caterpie  Green     True  45      30       35     20     20     45
1    Metapod  Green     True  50      20       55     25     25     30
2    Scyther  Green    False  70     110       80     55     80    105
3  Bulbasaur  Green     True  45      49       49     65     65     45



Red
          Name Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
8   Charmander   Red     True  39      52       43     60     50     65
9       Magmar   Red    False  65      95       57    100     85     93
10       Paras   Red     True  35      70       55     45     55     25
11    Parasect   Red    False  60      95       80     60     80     30



Yellow
       Name   Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
12  Pikachu  Yellow     True  35      55       40     50     50     90
13     Abra  Yellow     True  25      20       15    105     55     90
14  Psyduck  Yellow     True  50      52       48     65     50     55
15  Kadabra  Yellow     True  40      35       30    120     70     10

# The second is that you can access the subgroups by providing the right key to the get_group method
pg.get_group('Blue')

	Name	Color	Evolves	HP	Attack	Defense	SpAtk	SpDef	Speed
4	Dratini	Blue	True	41	64	45	50	50	50
5	Squirtle	Blue	True	44	48	65	50	64	43
6	Poliwag	Blue	True	40	50	40	40	40	90
7	Poliwhirl	Blue	True	65	65	65	50	50	90

Groups are interesting because they let you calculate aggregations on user-defined subsections of your data. Let's calculate the mean of every stat for the color groups:

#Evolves is boolean, so True will be treated as a 1 and False as 0 when calculating the mean
pg.mean()

	Evolves	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	1.00	47.50	56.75	53.75	47.50	51.00	68.25
Green	0.75	52.50	52.25	54.75	41.25	47.50	56.25
Red	0.50	49.75	78.00	58.75	66.25	67.50	53.25
Yellow	1.00	37.50	40.50	33.25	85.00	56.25	61.25

Grouping just a subset of columns

Sometimes what you want is to create groups with just one or two columns of interest. If this is the case, you can select them (and the column you want to group by) on the dataframe and then just call groupby:

# Let's create groups with just the HP values and then calculate the mean
pdata[['HP', 'Color']].groupby('Color').mean()

	HP
Color
Blue	47.50
Green	52.50
Red	49.75
Yellow	37.50

# Let's do the same for Attack, Defense and Speed
pdata[['Color', 'Attack', 'Defense', 'Speed']].groupby('Color').mean()

	Attack	Defense	Speed
Color
Blue	56.75	53.75	68.25
Green	52.25	54.75	56.25
Red	78.00	58.75	53.25
Yellow	40.50	33.25	61.25

Groups from multiple keys

You can create groups using more than one column as key. If you pass a list of keys to the groupby method, you will create a hierarchical grouping. Let's group by both Color and Evolves attributes:

opg = pdata.groupby(['Color', 'Evolves'])
opg.size()

Color   Evolves
Blue    True       4
Green   False      1
        True       3
Red     False      2
        True       2
Yellow  True       4
dtype: int64

opg.groups

{('Blue', True): Int64Index([4, 5, 6, 7], dtype='int64'),
 ('Green', False): Int64Index([2], dtype='int64'),
 ('Green', True): Int64Index([0, 1, 3], dtype='int64'),
 ('Red', False): Int64Index([9, 11], dtype='int64'),
 ('Red', True): Int64Index([8, 10], dtype='int64'),
 ('Yellow', True): Int64Index([12, 13, 14, 15], dtype='int64')}

Common aggregations:

Some built-in aggregations already come implemented in Pandas, and you can also define your own if you need it. In the following section, we will run the following functions to get an idea of Pandas' default behavior. All these functions take into consideration only non-NA values:

count: Number of elements in each group.
sum: Sum of values in each group.
mean: Arithmetic mean of each group.
max: Maximum value of each group
min: Minimum value of each group
std: Standard deviation of each group.
var: Variance of each group.

Also, it's important to know that you don't need to assign the group to a variable. Very often, you will just use it as an intermediate value to call an aggregation. So, instead of writing this:

pg = pdata.groupby('Color')
pg.mean()

You can write this:

pdata.groupby('Color').mean()

# Let's drop the Name and Evolves properties to create a new dataframe
spdata = pdata.drop(['Name', 'Evolves'], axis=1)
spdata

	Color	HP	Attack	Defense	SpAtk	SpDef	Speed
0	Green	45	30	35	20	20	45
1	Green	50	20	55	25	25	30
2	Green	70	110	80	55	80	105
3	Green	45	49	49	65	65	45
4	Blue	41	64	45	50	50	50
5	Blue	44	48	65	50	64	43
6	Blue	40	50	40	40	40	90
7	Blue	65	65	65	50	50	90
8	Red	39	52	43	60	50	65
9	Red	65	95	57	100	85	93
10	Red	35	70	55	45	55	25
11	Red	60	95	80	60	80	30
12	Yellow	35	55	40	50	50	90
13	Yellow	25	20	15	105	55	90
14	Yellow	50	52	48	65	50	55
15	Yellow	40	35	30	120	70	10

# Count the elements in each group
spdata.groupby('Color').count()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	4	4	4	4	4	4
Green	4	4	4	4	4	4
Red	4	4	4	4	4	4
Yellow	4	4	4	4	4	4

# Calculate the sum in each group
spdata.groupby('Color').sum()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	190	227	215	190	204	273
Green	210	209	219	165	190	225
Red	199	312	235	265	270	213
Yellow	150	162	133	340	225	245

# Calculate the mean in each group
spdata.groupby('Color').mean()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	47.50	56.75	53.75	47.50	51.00	68.25
Green	52.50	52.25	54.75	41.25	47.50	56.25
Red	49.75	78.00	58.75	66.25	67.50	53.25
Yellow	37.50	40.50	33.25	85.00	56.25	61.25

# Find the maximum of each stat for every group
spdata.groupby('Color').max()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	65	65	65	50	64	90
Green	70	110	80	65	80	105
Red	65	95	80	100	85	93
Yellow	50	55	48	120	70	90

# Find the minimum of each stat for every group
spdata.groupby('Color').min()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	40	48	40	40	40	43
Green	45	20	35	20	20	30
Red	35	52	43	45	50	25
Yellow	25	20	15	50	50	10

# Calculate the standard deviation of each stat for every group
spdata.groupby('Color').std()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	11.789826	8.995369	13.149778	5.000000	9.865766	25.276801
Green	11.902381	40.335055	18.803812	22.126530	29.580399	33.260337
Red	14.952703	20.960280	15.456929	23.584953	17.559423	31.920474
Yellow	10.408330	16.258331	14.221463	32.914029	9.464847	37.941841

# Calculate the variance of each stat for every subgroup
spdata.groupby('Color').var()

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	139.000000	80.916667	172.916667	25.000000	97.333333	638.916667
Green	141.666667	1626.916667	353.583333	489.583333	875.000000	1106.250000
Red	223.583333	439.333333	238.916667	556.250000	308.333333	1018.916667
Yellow	108.333333	264.333333	202.250000	1083.333333	89.583333	1439.583333

Custom-defined aggregations

Pandas has a method called agg that lets you define and run your own aggregations over groups. In the next example, we will define a function that calculates the sum of the squares of the stats in every group.

def sum_of_squares(arr):
    sos = 0
    for element in arr:
        sos += element**2
    return sos

spdata.groupby('Color').agg(sum_of_squares)

	HP	Attack	Defense	SpAtk	SpDef	Speed
Color
Blue	9442	13125	12075	9100	10696	20549
Green	11450	15801	13051	8275	11650	15975
Red	10571	25654	14523	19225	19150	14399
Yellow	5950	7354	5029	32150	12925	19325

Aggregations are powerful!

Grouping data for analysis using Pandas is efficient, elegant, and powerful. Traditional relational databases owe part of their great popularity to their capacity to perform this sort of functionality, and now you can get much of the same work done using Python.

This last function (agg) might be one of the most useful capabilities Pandas provides. Custom-defined aggregations are incredibly powerful. They let you calculate any crazy function you can come up with, in a repeatable and consistent way.

This feature is so powerful that we will dedicate the next article in expanding the topic: We will talk about the powerful apply method.

Thank you for reading!

What to do next

Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
You can find the source code for this series in this repo.
This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
Send me an email with questions, comments or suggestions (it's in the About Me page)