BrainsToBytes

Hands-on Pandas(10): Group Operations using groupby

Sometimes you need to perform operations on subsets of data. Your rows might have attributes in common or somehow form logical groups based on other properties. Common operations like finding the average, maximum, count, or standard deviation of values from groups of data is a really common task, and Pandas makes this really easy to accomplish.

In this article, we will learn how to use the groupby function and study some of the built-in aggregations you can run on groups. This will give you another valuable tool for data analysis, and I hope it'll help you accomplish your tasks in a much simpler way.

Great, let's get started!

Loading our sample data

We will use data from a CSV file I created with info about 16 Pokemon. It contains attributes like the Name, Color (Green, Blue, Red, Yellow), and other stats like HP, Attack, Defense, and Speed.

We are interested in calculating some common aggregations over groups of Pokemon with different colors.

import pandas as pd

pdata = pd.read_csv('./sample_data/poke_colors.csv')
pdata
Name Color Evolves HP Attack Defense SpAtk SpDef Speed
0 Caterpie Green True 45 30 35 20 20 45
1 Metapod Green True 50 20 55 25 25 30
2 Scyther Green False 70 110 80 55 80 105
3 Bulbasaur Green True 45 49 49 65 65 45
4 Dratini Blue True 41 64 45 50 50 50
5 Squirtle Blue True 44 48 65 50 64 43
6 Poliwag Blue True 40 50 40 40 40 90
7 Poliwhirl Blue True 65 65 65 50 50 90
8 Charmander Red True 39 52 43 60 50 65
9 Magmar Red False 65 95 57 100 85 93
10 Paras Red True 35 70 55 45 55 25
11 Parasect Red False 60 95 80 60 80 30
12 Pikachu Yellow True 35 55 40 50 50 90
13 Abra Yellow True 25 20 15 105 55 90
14 Psyduck Yellow True 50 52 48 65 50 55
15 Kadabra Yellow True 40 35 30 120 70 10

GroupBy

Pandas function groupby is used to create GroupBy objects. These objects can perform lots of useful built-in aggregations with just a single function call. groupby receives as argument a list of keys that decide how the grouping is performed. In our first example we will group the Pokemon by color:

pg = pdata.groupby('Color')
pg
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff848e80f28>

As you might have seen from the message on top, we got an object of type GroupBy. If you want to get a bit more information about it, use the size method:

pg.size()
Color
Blue      4
Green     4
Red       4
Yellow    4
dtype: int64

Another way of getting some basic information about the group is using the groups attribute (it's a dictionary):

pg.groups
{'Blue': Int64Index([4, 5, 6, 7], dtype='int64'),
 'Green': Int64Index([0, 1, 2, 3], dtype='int64'),
 'Red': Int64Index([8, 9, 10, 11], dtype='int64'),
 'Yellow': Int64Index([12, 13, 14, 15], dtype='int64')}

Great!

Now that you know how to create a basic group, let's check two important properties:

# The first one is that you can iterate over GroupBy objects.

for color, group in pg:
    print(color)
    print(group)
    print('\n\n')
Blue
        Name Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
4    Dratini  Blue     True  41      64       45     50     50     50
5   Squirtle  Blue     True  44      48       65     50     64     43
6    Poliwag  Blue     True  40      50       40     40     40     90
7  Poliwhirl  Blue     True  65      65       65     50     50     90



Green
        Name  Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
0   Caterpie  Green     True  45      30       35     20     20     45
1    Metapod  Green     True  50      20       55     25     25     30
2    Scyther  Green    False  70     110       80     55     80    105
3  Bulbasaur  Green     True  45      49       49     65     65     45



Red
          Name Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
8   Charmander   Red     True  39      52       43     60     50     65
9       Magmar   Red    False  65      95       57    100     85     93
10       Paras   Red     True  35      70       55     45     55     25
11    Parasect   Red    False  60      95       80     60     80     30



Yellow
       Name   Color  Evolves  HP  Attack  Defense  SpAtk  SpDef  Speed
12  Pikachu  Yellow     True  35      55       40     50     50     90
13     Abra  Yellow     True  25      20       15    105     55     90
14  Psyduck  Yellow     True  50      52       48     65     50     55
15  Kadabra  Yellow     True  40      35       30    120     70     10
# The second is that you can access the subgroups by providing the right key to the get_group method
pg.get_group('Blue')
Name Color Evolves HP Attack Defense SpAtk SpDef Speed
4 Dratini Blue True 41 64 45 50 50 50
5 Squirtle Blue True 44 48 65 50 64 43
6 Poliwag Blue True 40 50 40 40 40 90
7 Poliwhirl Blue True 65 65 65 50 50 90

Groups are interesting because they let you calculate aggregations on user-defined subsections of your data. Let's calculate the mean of every stat for the color groups:

#Evolves is boolean, so True will be treated as a 1 and False as 0 when calculating the mean
pg.mean()
Evolves HP Attack Defense SpAtk SpDef Speed
Color
Blue 1.00 47.50 56.75 53.75 47.50 51.00 68.25
Green 0.75 52.50 52.25 54.75 41.25 47.50 56.25
Red 0.50 49.75 78.00 58.75 66.25 67.50 53.25
Yellow 1.00 37.50 40.50 33.25 85.00 56.25 61.25

Grouping just a subset of columns

Sometimes what you want is to create groups with just one or two columns of interest. If this is the case, you can select them (and the column you want to group by) on the dataframe and then just call groupby:

# Let's create groups with just the HP values and then calculate the mean
pdata[['HP', 'Color']].groupby('Color').mean()
HP
Color
Blue 47.50
Green 52.50
Red 49.75
Yellow 37.50
# Let's do the same for Attack, Defense and Speed
pdata[['Color', 'Attack', 'Defense', 'Speed']].groupby('Color').mean()
Attack Defense Speed
Color
Blue 56.75 53.75 68.25
Green 52.25 54.75 56.25
Red 78.00 58.75 53.25
Yellow 40.50 33.25 61.25

Groups from multiple keys

You can create groups using more than one column as key. If you pass a list of keys to the groupby method, you will create a hierarchical grouping. Let's group by both Color and Evolves attributes:

opg = pdata.groupby(['Color', 'Evolves'])
opg.size()
Color   Evolves
Blue    True       4
Green   False      1
        True       3
Red     False      2
        True       2
Yellow  True       4
dtype: int64
opg.groups
{('Blue', True): Int64Index([4, 5, 6, 7], dtype='int64'),
 ('Green', False): Int64Index([2], dtype='int64'),
 ('Green', True): Int64Index([0, 1, 3], dtype='int64'),
 ('Red', False): Int64Index([9, 11], dtype='int64'),
 ('Red', True): Int64Index([8, 10], dtype='int64'),
 ('Yellow', True): Int64Index([12, 13, 14, 15], dtype='int64')}

Common aggregations:

Some built-in aggregations already come implemented in Pandas, and you can also define your own if you need it. In the following section, we will run the following functions to get an idea of Pandas' default behavior. All these functions take into consideration only non-NA values:

  • count: Number of elements in each group.
  • sum: Sum of values in each group.
  • mean: Arithmetic mean of each group.
  • max: Maximum value of each group
  • min: Minimum value of each group
  • std: Standard deviation of each group.
  • var: Variance of each group.

Also, it's important to know that you don't need to assign the group to a variable. Very often, you will just use it as an intermediate value to call an aggregation. So, instead of writing this:

pg = pdata.groupby('Color')
pg.mean()

You can write this:

pdata.groupby('Color').mean()
# Let's drop the Name and Evolves properties to create a new dataframe
spdata = pdata.drop(['Name', 'Evolves'], axis=1)
spdata
Color HP Attack Defense SpAtk SpDef Speed
0 Green 45 30 35 20 20 45
1 Green 50 20 55 25 25 30
2 Green 70 110 80 55 80 105
3 Green 45 49 49 65 65 45
4 Blue 41 64 45 50 50 50
5 Blue 44 48 65 50 64 43
6 Blue 40 50 40 40 40 90
7 Blue 65 65 65 50 50 90
8 Red 39 52 43 60 50 65
9 Red 65 95 57 100 85 93
10 Red 35 70 55 45 55 25
11 Red 60 95 80 60 80 30
12 Yellow 35 55 40 50 50 90
13 Yellow 25 20 15 105 55 90
14 Yellow 50 52 48 65 50 55
15 Yellow 40 35 30 120 70 10
# Count the elements in each group
spdata.groupby('Color').count()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 4 4 4 4 4 4
Green 4 4 4 4 4 4
Red 4 4 4 4 4 4
Yellow 4 4 4 4 4 4
# Calculate the sum in each group
spdata.groupby('Color').sum()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 190 227 215 190 204 273
Green 210 209 219 165 190 225
Red 199 312 235 265 270 213
Yellow 150 162 133 340 225 245
# Calculate the mean in each group
spdata.groupby('Color').mean()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 47.50 56.75 53.75 47.50 51.00 68.25
Green 52.50 52.25 54.75 41.25 47.50 56.25
Red 49.75 78.00 58.75 66.25 67.50 53.25
Yellow 37.50 40.50 33.25 85.00 56.25 61.25
# Find the maximum of each stat for every group
spdata.groupby('Color').max()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 65 65 65 50 64 90
Green 70 110 80 65 80 105
Red 65 95 80 100 85 93
Yellow 50 55 48 120 70 90
# Find the minimum of each stat for every group
spdata.groupby('Color').min()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 40 48 40 40 40 43
Green 45 20 35 20 20 30
Red 35 52 43 45 50 25
Yellow 25 20 15 50 50 10
# Calculate the standard deviation of each stat for every group
spdata.groupby('Color').std()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 11.789826 8.995369 13.149778 5.000000 9.865766 25.276801
Green 11.902381 40.335055 18.803812 22.126530 29.580399 33.260337
Red 14.952703 20.960280 15.456929 23.584953 17.559423 31.920474
Yellow 10.408330 16.258331 14.221463 32.914029 9.464847 37.941841
# Calculate the variance of each stat for every subgroup
spdata.groupby('Color').var()
HP Attack Defense SpAtk SpDef Speed
Color
Blue 139.000000 80.916667 172.916667 25.000000 97.333333 638.916667
Green 141.666667 1626.916667 353.583333 489.583333 875.000000 1106.250000
Red 223.583333 439.333333 238.916667 556.250000 308.333333 1018.916667
Yellow 108.333333 264.333333 202.250000 1083.333333 89.583333 1439.583333

Custom-defined aggregations

Pandas has a method called agg that lets you define and run your own aggregations over groups. In the next example, we will define a function that calculates the sum of the squares of the stats in every group.

def sum_of_squares(arr):
    sos = 0
    for element in arr:
        sos += element**2
    return sos

spdata.groupby('Color').agg(sum_of_squares)
HP Attack Defense SpAtk SpDef Speed
Color
Blue 9442 13125 12075 9100 10696 20549
Green 11450 15801 13051 8275 11650 15975
Red 10571 25654 14523 19225 19150 14399
Yellow 5950 7354 5029 32150 12925 19325

Aggregations are powerful!

Grouping data for analysis using Pandas is efficient, elegant, and powerful. Traditional relational databases owe part of their great popularity to their capacity to perform this sort of functionality, and now you can get much of the same work done using Python.

This last function (agg) might be one of the most useful capabilities Pandas provides. Custom-defined aggregations are incredibly powerful. They let you calculate any crazy function you can come up with, in a repeatable and consistent way.

This feature is so powerful that we will dedicate the next article in expanding the topic: We will talk about the powerful apply method.

Thank you for reading!

What to do next

Author image
Budapest, Hungary
Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!