Pandas provides many options for calculating descriptive statistics and other reduction operations with a single function call. You might want to calculate these values as part of an ML or data analysis pipeline, or just because you want to get a better understanding of the data you are dealing with.
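Most of these operations are similar to NumPy reductions: called on a Series they compute and return a single value, and called on a DataFrame they return one value per column (or per row). In some cases, they return a structure with the same number of dimensions as the original, or fewer.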
In this article, we will explore some of the most used functions and see some examples. Great, let's get started!
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.rand(4, 5),
                     index=['A', 'B', 'C', 'D'],
                     columns=['One', 'Two', 'Three', 'Four', 'Five'])
frame
| | One | Two | Three | Four | Five |
|---|---|---|---|---|---|
| A | 0.973939 | 0.427195 | 0.790004 | 0.027722 | 0.686339 |
| B | 0.190250 | 0.891813 | 0.238110 | 0.636394 | 0.104428 |
| C | 0.951482 | 0.207945 | 0.081066 | 0.815889 | 0.785882 |
| D | 0.699541 | 0.154921 | 0.752932 | 0.066052 | 0.825628 |
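Because `np.random.rand` draws new values on every run, your numbers will differ from the ones printed in this article. If you want a frame you can reproduce, one option is to draw the values from a seeded NumPy generator. This is only a sketch: the helper `make_frame`, the name `seeded_frame` and the seed `42` are arbitrary choices made for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

def make_frame(seed=42):
    """Build a 4x5 frame of uniform random values from a seeded generator."""
    rng = np.random.default_rng(seed)  # hypothetical seed, pick any integer
    return pd.DataFrame(rng.random((4, 5)),
                        index=['A', 'B', 'C', 'D'],
                        columns=['One', 'Two', 'Three', 'Four', 'Five'])

seeded_frame = make_frame()

# Building the frame again with the same seed gives identical values.
print(seeded_frame.equals(make_frame()))
```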
The first thing we will learn is how to perform sums. By default, the `sum` function adds the values down the rows, returning one total per column. You can pass `axis='columns'` as an additional parameter to sum across the columns instead, returning one total per row:
frame.sum()
One 2.815213
Two 1.681874
Three 1.862111
Four 1.546057
Five 2.402276
dtype: float64
frame.sum(axis='columns')
A 2.905198
B 2.060995
C 2.842264
D 2.499073
dtype: float64
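The axis argument is easier to check on values you can add in your head. The sketch below uses a small made-up frame (`small` and `with_gap` are illustrative names, not from the example above) and also shows that `sum` skips missing values unless you pass `skipna=False`.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]},
                     index=['a', 'b', 'c'])

col_totals = small.sum()                # one total per column: x=6, y=60
row_totals = small.sum(axis='columns')  # one total per row: a=11, b=22, c=33
print(col_totals)
print(row_totals)

# Missing values are skipped by default.
with_gap = pd.Series([1.0, np.nan, 3.0])
print(with_gap.sum())               # 4.0
print(with_gap.sum(skipna=False))   # nan
```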
Pandas also lets you calculate the minimum and maximum values in a dataframe's columns or rows. For this, it provides the functions `min` and `max`. Like before, you can specify the axis:
frame.max()
One 0.973939
Two 0.891813
Three 0.790004
Four 0.815889
Five 0.825628
dtype: float64
frame.min(axis='columns')
A 0.027722
B 0.104428
C 0.081066
D 0.066052
dtype: float64
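If you need both extremes at once, one possible approach is to pass a list of reduction names to `agg`, or to subtract the two results to get the spread of each column. The frame `temps` and its values below are invented for illustration.

```python
import pandas as pd

temps = pd.DataFrame({'mon': [12, 18, 15], 'tue': [11, 21, 14]},
                     index=['north', 'south', 'east'])

# One row per reduction, one column per original column.
extremes = temps.agg(['min', 'max'])
print(extremes)

# Spread (max - min) of each column: mon=6, tue=10.
spread = temps.max() - temps.min()
print(spread)
```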
If, instead, you are interested in the index labels where the minimum and maximum values are located, just use `idxmin` and `idxmax`:
frame.idxmax() # Row label of the maximum value in each column
One A
Two B
Three A
Four C
Five D
dtype: object
frame.idxmin(axis='columns') # Column label of the minimum value in each row
A Four
B Five
C Three
D Four
dtype: object
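The labels returned by `idxmax` and `idxmin` can be fed straight back into `.loc` to retrieve the matching values. A small sketch with a made-up frame (`scores` and its contents are hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({'math': [70, 95, 80], 'art': [88, 60, 91]},
                      index=['ann', 'bob', 'cy'])

best_rows = scores.idxmax()                # math -> bob, art -> cy
best_cols = scores.idxmax(axis='columns')  # ann -> art, bob -> math, cy -> art

# Each label points at the same value that max() reports for that column.
for column, row_label in best_rows.items():
    print(column, row_label, scores.loc[row_label, column])
```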
Pandas also has functions for calculating (among many others) the mean, median, standard deviation and variance:
frame.mean()
One 0.703803
Two 0.420468
Three 0.465528
Four 0.386514
Five 0.600569
dtype: float64
frame.median(axis='columns')
A 0.686339
B 0.238110
C 0.785882
D 0.699541
dtype: float64
frame.var()
One 0.132691
Two 0.112631
Three 0.129139
Four 0.159410
Five 0.112835
dtype: float64
frame.var(axis='columns')
A 0.134738
B 0.113646
C 0.155681
D 0.129304
dtype: float64
frame.std()
One 0.364268
Two 0.335605
Three 0.359358
Four 0.399262
Five 0.335909
dtype: float64
frame.std(axis='columns')
A 0.367067
B 0.337114
C 0.394564
D 0.359588
dtype: float64
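One detail worth knowing when comparing these results with NumPy: `var` and `std` in Pandas default to the sample statistic (`ddof=1`, dividing by n - 1), while `np.var` and `np.std` on a plain array default to the population statistic (`ddof=0`). The values below are made up so that the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()            # ddof=1: 32 / 7
population_var = values.var(ddof=0)  # ddof=0: 32 / 8 = 4.0
numpy_var = np.var(values.to_numpy())  # NumPy default matches ddof=0

print(sample_var, population_var, numpy_var)
```

Pass `ddof=0` to the Pandas functions when you need results that match NumPy's defaults.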
Pandas also has an incredibly useful function called `describe`. It calculates a battery of standard reductions and shows you the summary:
frame.describe()
| | One | Two | Three | Four | Five |
|---|---|---|---|---|---|
| count | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
| mean | 0.703803 | 0.420468 | 0.465528 | 0.386514 | 0.600569 |
| std | 0.364268 | 0.335605 | 0.359358 | 0.399262 | 0.335909 |
| min | 0.190250 | 0.154921 | 0.081066 | 0.027722 | 0.104428 |
| 25% | 0.572218 | 0.194689 | 0.198849 | 0.056470 | 0.540861 |
| 50% | 0.825512 | 0.317570 | 0.495521 | 0.351223 | 0.736110 |
| 75% | 0.957096 | 0.543349 | 0.762200 | 0.681268 | 0.795818 |
| max | 0.973939 | 0.891813 | 0.790004 | 0.815889 | 0.825628 |
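Each row of that summary is one of the reductions covered earlier, and the `percentiles` parameter lets you ask for different quantiles. The sketch below uses a small invented frame (`data`) to check both points.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'y': [10.0, 20.0, 30.0, 40.0]})

summary = data.describe()
# The 'mean' and 'std' rows agree with calling mean() and std() directly.
print(np.allclose(summary.loc['mean'], data.mean()))
print(np.allclose(summary.loc['std'], data.std()))

# Ask for the 10th and 90th percentiles instead of the default quartiles.
custom = data.describe(percentiles=[0.1, 0.9])
print(custom.loc[['10%', '90%']])
```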
The last thing we will learn about is correlation. You can use the `corr` method to calculate the correlation between two columns (or rows) of a dataframe. This is something you will probably do often if you are into data exploration and analysis:
# Calculate the correlation between the columns One and Five
frame['One'].corr(frame['Five'])
0.879646855332041
You can provide an additional parameter, `method`, to specify the correlation method used. The options are:
- pearson : Standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
frame['One'].corr(frame['Five'], method='spearman')
0.19999999999999998
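The gap between the two results (about 0.88 for Pearson, 0.2 for Spearman) comes from what each method looks at: Spearman is the Pearson correlation of the ranks of the values rather than of the values themselves. The sketch below reuses the One and Five columns printed earlier and computes the rank-based correlation both ways; `pair` and `ranks` are illustrative names.

```python
import numpy as np
import pandas as pd

# Columns One and Five from the frame shown at the start of the article.
pair = pd.DataFrame({
    'One':  [0.973939, 0.190250, 0.951482, 0.699541],
    'Five': [0.686339, 0.104428, 0.785882, 0.825628],
}, index=['A', 'B', 'C', 'D'])

# Replace every value by its rank within its column (1 = smallest).
ranks = pair.rank()

pearson = pair['One'].corr(pair['Five'])             # uses the raw values
pearson_on_ranks = ranks['One'].corr(ranks['Five'])  # uses the ranks
spearman = pair.corr(method='spearman').loc['One', 'Five']

print(pearson, pearson_on_ranks, spearman)
```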
Alternatively, you can calculate the correlation matrix of the whole dataframe by calling the `corr` method on it:
frame.corr()
| | One | Two | Three | Four | Five |
|---|---|---|---|---|---|
| One | 1.000000 | -0.795497 | 0.275002 | -0.269384 | 0.879647 |
| Two | -0.795497 | 1.000000 | -0.275345 | 0.271683 | -0.982924 |
| Three | 0.275002 | -0.275345 | 1.000000 | -0.999982 | 0.370300 |
| Four | -0.269384 | 0.271683 | -0.999982 | 1.000000 | -0.366111 |
| Five | 0.879647 | -0.982924 | 0.370300 | -0.366111 | 1.000000 |
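The matrix is symmetric and its diagonal is always 1, so only the entries above the diagonal carry information. One way to list each pair of columns once, strongest first, is to mask the upper triangle and flatten it. This is a sketch on invented data; `data`, `pairs` and `strongest` are hypothetical names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = data.corr()

# True strictly above the diagonal, False elsewhere.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep the upper triangle, flatten to (row, column) pairs, drop the blanks.
pairs = corr.where(upper).stack().dropna()

# Order the pairs by the absolute value of their correlation.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
strongest = pairs.index[0]
print(pairs)
```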
Understanding your data usually starts with a call to describe or corr
Calculating a few values from your data can grant you a better understanding of the phenomenon that generated it.
One of the first things you will do when selecting features for an ML algorithm is plot the correlation matrix. This will give you an idea of which features have a better shot at predicting the labels if you intend to train a supervised model.
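As a rough illustration of that idea, the sketch below ranks a few candidate features by the absolute value of their correlation with a label, using `corrwith`. The column names and values are invented, and Pearson correlation only captures linear relationships, so treat the ranking as a first look rather than a selection rule.

```python
import pandas as pd

# Hypothetical data: three candidate features and a numeric label.
samples = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature_b': [5.0, 3.0, 4.0, 1.0, 2.0],
    'feature_c': [2.0, 1.0, 2.0, 1.0, 2.0],
    'label':     [1.1, 1.9, 3.2, 3.9, 5.1],
})

features = samples.drop(columns='label')

# Correlation of every feature column with the label, strongest first.
strength = (features.corrwith(samples['label'])
                    .abs()
                    .sort_values(ascending=False))
print(strength)
```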
This, again, is just an example of the many applications of statistical analysis, and these are just some basic functions to aid you in the process.
Now that we have learned the basics, we need to talk about pulling data into Pandas. In the next article, we will learn how to create dataframes from common file formats.
Thank you for reading!
What to do next
- Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
- You can find the source code for this series in this repo.
- This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
- Send me an email with questions, comments or suggestions (the address is on the About Me page).