BrainsToBytes

Hands-on Pandas(6): Descriptive Statistics

Pandas provides many options for calculating descriptive statistics and other reduction operations with a single function call. You might want to calculate these values as part of an ML or data analysis pipeline, or just because you want to get a better understanding of the data you are dealing with.

Most of these operations are similar to NumPy reductions: applied to a Series they return a single value, and applied to a dataframe they return a structure with the same or fewer dimensions than the original (usually a Series with one value per column or row).

In this article, we will explore some of the most used functions and see some examples. Great, let's get started!

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.rand(4,5),
                     index=['A', 'B', 'C', 'D'],
                     columns=['One', 'Two', 'Three', 'Four', 'Five'])
frame
        One       Two     Three      Four      Five
A  0.973939  0.427195  0.790004  0.027722  0.686339
B  0.190250  0.891813  0.238110  0.636394  0.104428
C  0.951482  0.207945  0.081066  0.815889  0.785882
D  0.699541  0.154921  0.752932  0.066052  0.825628

The first thing we will learn is how to perform sums. By default, the sum function adds the values down the rows axis and returns one total per column. Pass axis='columns' as an additional parameter to sum along the columns axis instead, which returns one total per row:

frame.sum()
One      2.815213
Two      1.681874
Three    1.862111
Four     1.546057
Five     2.402276
dtype: float64

frame.sum(axis='columns')
A    2.905198
B    2.060995
C    2.842264
D    2.499073
dtype: float64

Pandas also lets you calculate the minimum and maximum values in a dataframe's columns or rows with the functions min and max. Like before, you can specify the axis:

frame.max()
One      0.973939
Two      0.891813
Three    0.790004
Four     0.815889
Five     0.825628
dtype: float64

frame.min(axis='columns')
A    0.027722
B    0.104428
C    0.081066
D    0.066052
dtype: float64

If you are instead interested in the index labels where the minimum and maximum values occur, just use idxmin and idxmax:

frame.idxmax() # Row label of the maximum value in each column
One      A
Two      B
Three    A
Four     C
Five     D
dtype: object

frame.idxmin(axis='columns') # Column label of the minimum value in each row
A     Four
B     Five
C    Three
D     Four
dtype: object

Pandas also has functions for calculating (among many others) the mean, median, standard deviation and variance:

frame.mean()
One      0.703803
Two      0.420468
Three    0.465528
Four     0.386514
Five     0.600569
dtype: float64

frame.median(axis='columns')
A    0.686339
B    0.238110
C    0.785882
D    0.699541
dtype: float64

frame.var()
One      0.132691
Two      0.112631
Three    0.129139
Four     0.159410
Five     0.112835
dtype: float64

frame.var(axis='columns')
A    0.134738
B    0.113646
C    0.155681
D    0.129304
dtype: float64

frame.std()
One      0.364268
Two      0.335605
Three    0.359358
Four     0.399262
Five     0.335909
dtype: float64

frame.std(axis='columns')
A    0.367067
B    0.337114
C    0.394564
D    0.359588
dtype: float64

Pandas also has an incredibly useful function called describe. It will calculate a battery of standard reductions and show you the summary:

frame.describe()
            One       Two     Three      Four      Five
count  4.000000  4.000000  4.000000  4.000000  4.000000
mean   0.703803  0.420468  0.465528  0.386514  0.600569
std    0.364268  0.335605  0.359358  0.399262  0.335909
min    0.190250  0.154921  0.081066  0.027722  0.104428
25%    0.572218  0.194689  0.198849  0.056470  0.540861
50%    0.825512  0.317570  0.495521  0.351223  0.736110
75%    0.957096  0.543349  0.762200  0.681268  0.795818
max    0.973939  0.891813  0.790004  0.815889  0.825628
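
The rows of the summary are the same reductions we called one by one earlier, and the quantiles shown can be changed with the `percentiles` parameter. Here is a minimal sketch on a made-up `data` frame (hypothetical values) that requests the 10th, 50th and 90th percentiles and checks one row against the standalone function.

```python
import pandas as pd

# Hypothetical data, for illustration only
data = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                     'b': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Ask for custom percentiles instead of the default 25%, 50% and 75%
summary = data.describe(percentiles=[0.1, 0.5, 0.9])

# Each row of the summary is addressable by its label
print(summary)
print(summary.loc['std'].equals(data.std()))
```

Since the result is itself a dataframe, you can select from it with `.loc`, for example `summary.loc['mean', 'a']`.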

The last thing we will learn about is correlation. You can use the corr method to calculate the correlation between two columns (or rows) of a dataframe. This is something you will probably do often if you are into data exploration/analysis:

# Calculate the correlation between the columns One and Five
frame['One'].corr(frame['Five'])

0.879646855332041

You can pass the optional method parameter to choose the correlation method. The options are:

  • pearson : Standard correlation coefficient (the default)
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

frame['One'].corr(frame['Five'], method='spearman')
0.19999999999999998
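
To see where those two numbers come from, the sketch below recomputes them by hand. It copies the values of columns One and Five from the frame printed at the top of the article (rounded to six decimals, so the results match only approximately). Pearson is the covariance divided by the product of the standard deviations, and Spearman is the Pearson correlation of the ranks. The variable names are my own.

```python
import pandas as pd

# Columns One and Five, copied from the frame printed at the top of the article
one = pd.Series([0.973939, 0.190250, 0.951482, 0.699541], index=list('ABCD'))
five = pd.Series([0.686339, 0.104428, 0.785882, 0.825628], index=list('ABCD'))

# Pearson by hand: covariance over the product of the standard deviations
pearson_manual = one.cov(five) / (one.std() * five.std())

# Spearman by hand: Pearson correlation computed on the ranks
spearman_manual = one.rank().corr(five.rank())

print(pearson_manual, one.corr(five))  # both close to 0.8796
print(spearman_manual)                 # close to 0.2
```

The large gap between the two coefficients is a reminder that, with only four points, a single pair of values can dominate the Pearson result.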

Alternatively, you can calculate the correlation matrix of the dataframe by just calling the corr method:

frame.corr()
             One        Two      Three       Four       Five
One     1.000000  -0.795497   0.275002  -0.269384   0.879647
Two    -0.795497   1.000000  -0.275345   0.271683  -0.982924
Three   0.275002  -0.275345   1.000000  -0.999982   0.370300
Four   -0.269384   0.271683  -0.999982   1.000000  -0.366111
Five    0.879647  -0.982924   0.370300  -0.366111   1.000000

Understanding your data usually starts with a call to describe or corr

Calculating a few values from your data can grant you a better understanding of the phenomenon that generated it.

One of the first things you will do when selecting features for an ML algorithm is plotting the correlation matrix. This will give you an idea of which features have a better shot at predicting labels if you intend to train a supervised model.
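
As a rough, non-graphical version of that idea, you can take the label's column from the correlation matrix and sort the features by absolute correlation. The sketch below uses synthetic data: the column names, the coefficients and the seed are all assumptions made for illustration. Keep in mind that Pearson correlation only captures linear relationships, so treat the ranking as a first hint and not as a verdict.

```python
import numpy as np
import pandas as pd

# Synthetic data with a fixed seed (everything here is hypothetical)
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)

data = pd.DataFrame({
    'strong_feature': strong,
    'weak_feature': weak,
    'noise_feature': noise,
    # The label depends mostly on strong_feature, a little on weak_feature
    'label': 3.0 * strong + 0.5 * weak + rng.normal(size=n),
})

# Correlation of every feature with the label, excluding the label itself
label_corr = data.corr()['label'].drop('label')

# Rank features by the strength of the linear relationship
ranking = label_corr.abs().sort_values(ascending=False)
print(ranking)
```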

This, again, is just an example of the many applications of statistical analysis, and these are just some basic functions to aid you in the process.

Now that we have learned the basics, we need to talk about pulling data into Pandas. In the next article, we will learn how to create dataframes from common file formats.

Thank you for reading!

Budapest, Hungary
Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!