Pandas provides many options for calculating descriptive statistics and other reduction operations with a simple function call. You might want to calculate these values as part of an ML or data analysis pipeline, or just because you want a better understanding of the data you are dealing with.
Most of these operations are similar to NumPy reductions, as they compute and return a single value. In some cases, they return a structure with the same number of dimensions as the original, or fewer.
In this article, we will explore some of the most used functions and see some examples. Great, let's get started!
import pandas as pd
import numpy as np

# 4x5 frame of random values in [0, 1); the numbers change on every run
frame = pd.DataFrame(np.random.rand(4, 5),
                     index=['A', 'B', 'C', 'D'],
                     columns=['One', 'Two', 'Three', 'Four', 'Five'])
frame
The first thing we will learn is how to perform sums. By default, the sum function adds the values along the rows axis, which means it returns the sum of every column. You can pass axis='columns' as an additional parameter to add the values along the columns axis instead, which returns the sum of every row:
One      2.815213
Two      1.681874
Three    1.862111
Four     1.546057
Five     2.402276
dtype: float64

A    2.905198
B    2.060995
C    2.842264
D    2.499073
dtype: float64
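Because the frame above is random, the totals are hard to check by eye. The following is a minimal sketch on a small, made-up frame (the name small and its values are assumptions for illustration) showing what each axis choice collapses:

```python
import pandas as pd

# Small fixed frame (hypothetical values) so the totals can be verified by hand.
small = pd.DataFrame(
    {"One": [1.0, 2.0, 3.0], "Two": [10.0, 20.0, 30.0]},
    index=["A", "B", "C"],
)

# Default axis: the rows are collapsed, leaving one total per column.
column_totals = small.sum()

# axis='columns': the columns are collapsed, leaving one total per row.
row_totals = small.sum(axis="columns")

print(column_totals)  # One -> 6.0, Two -> 60.0
print(row_totals)     # A -> 11.0, B -> 22.0, C -> 33.0
```

A handy way to remember the parameter: the axis you name is the one that disappears from the result.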
Pandas also lets you calculate the minimum and maximum values of a dataframe's columns or rows. For this, it provides the functions min and max. Like before, you can specify the axis:
One      0.973939
Two      0.891813
Three    0.790004
Four     0.815889
Five     0.825628
dtype: float64

A    0.027722
B    0.104428
C    0.081066
D    0.066052
dtype: float64
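As a quick illustration, here is a sketch with a fixed frame (the name scores and its values are made up) that takes the maximum of each column and the minimum of each row:

```python
import pandas as pd

# Hypothetical fixed values so the results are easy to confirm by inspection.
scores = pd.DataFrame(
    {"One": [0.2, 0.9, 0.5], "Two": [0.7, 0.1, 0.4]},
    index=["A", "B", "C"],
)

col_max = scores.max()                # largest value in each column
row_min = scores.min(axis="columns")  # smallest value in each row

print(col_max)  # One -> 0.9, Two -> 0.7
print(row_min)  # A -> 0.2, B -> 0.1, C -> 0.4
```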
If, instead, you are interested in the index labels where the minimum and maximum values are located, just use idxmin and idxmax:
frame.idxmax() # Row label of the maximum value in each column
One      A
Two      B
Three    A
Four     C
Five     D
dtype: object
frame.idxmin(axis='columns') # Column label of the minimum value in each row
A     Four
B     Five
C    Three
D     Four
dtype: object
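The labels returned by these functions can be fed straight back into loc to retrieve the extreme value itself. A small sketch, again on a made-up frame named scores:

```python
import pandas as pd

# Hypothetical fixed values for illustration.
scores = pd.DataFrame(
    {"One": [0.2, 0.9, 0.5], "Two": [0.7, 0.1, 0.4]},
    index=["A", "B", "C"],
)

# Row label holding the maximum of each column.
max_rows = scores.idxmax()                # One -> 'B', Two -> 'A'

# Column label holding the minimum of each row.
min_cols = scores.idxmin(axis="columns")  # A -> 'One', B -> 'Two', C -> 'Two'

# Looking the label up again returns the same value that max() reports.
best_one = scores.loc[max_rows["One"], "One"]
print(best_one == scores["One"].max())    # True
```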
Pandas also has functions for calculating (among many others) the mean, median, standard deviation and variance:
One      0.703803
Two      0.420468
Three    0.465528
Four     0.386514
Five     0.600569
dtype: float64

A    0.686339
B    0.238110
C    0.785882
D    0.699541
dtype: float64

One      0.132691
Two      0.112631
Three    0.129139
Four     0.159410
Five     0.112835
dtype: float64

A    0.134738
B    0.113646
C    0.155681
D    0.129304
dtype: float64

One      0.364268
Two      0.335605
Three    0.359358
Four     0.399262
Five     0.335909
dtype: float64

A    0.367067
B    0.337114
C    0.394564
D    0.359588
dtype: float64
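One detail worth knowing when comparing results with NumPy: pandas computes the sample variance and standard deviation by default (ddof=1), while NumPy's var and std default to the population versions (ddof=0). The sketch below uses a made-up frame named data to show the four reductions and that difference:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so each statistic can be checked by hand.
data = pd.DataFrame(
    {"One": [1.0, 2.0, 3.0, 4.0], "Two": [2.0, 4.0, 6.0, 8.0]},
    index=["A", "B", "C", "D"],
)

col_mean = data.mean()                    # One -> 2.5, Two -> 5.0
row_median = data.median(axis="columns")  # A -> 1.5, B -> 3.0, C -> 4.5, D -> 6.0

sample_var = data.var()            # divides by n - 1 (ddof=1 is the pandas default)
sample_std = data.std()            # square root of the sample variance
population_var = data.var(ddof=0)  # divides by n, like NumPy's default

# NumPy's default matches the ddof=0 result, not the pandas default.
numpy_var = np.var(data["One"].to_numpy())
print(sample_var["One"], population_var["One"], numpy_var)
```

If you need pandas and NumPy to agree, pass ddof explicitly on one side.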
Pandas also has an incredibly useful function called describe. It will calculate a battery of standard reductions and show you the summary:
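For a numeric dataframe, the summary has one row per statistic and one column per original column. A minimal sketch on a made-up frame named data:

```python
import pandas as pd

# Hypothetical fixed values for illustration.
data = pd.DataFrame(
    {"One": [1.0, 2.0, 3.0, 4.0], "Two": [2.0, 4.0, 6.0, 8.0]},
    index=["A", "B", "C", "D"],
)

summary = data.describe()
print(summary)

# The result is itself a dataframe, so individual statistics can be looked up.
print(summary.loc["mean", "One"])  # 2.5
print(summary.loc["50%", "One"])   # 2.5 (the 50% row is the median)
```

Because the summary is a regular dataframe, you can slice it, transpose it, or save it like any other.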
The last thing we will learn about is correlation. You can use the corr method to calculate the correlation between two columns (or rows) of a dataframe. This is something you will probably do often if you are into data exploration/analysis:
# Calculate the correlation between the columns One and Five
frame['One'].corr(frame['Five'])
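With random data the coefficient can land anywhere between -1 and 1, so a sketch with made-up, perfectly related columns makes the two extremes visible:

```python
import pandas as pd

# Hypothetical columns: Two rises in step with One, Three falls in step with it.
data = pd.DataFrame(
    {
        "One": [1.0, 2.0, 3.0, 4.0],
        "Two": [2.0, 4.0, 6.0, 8.0],
        "Three": [4.0, 3.0, 2.0, 1.0],
    }
)

positive = data["One"].corr(data["Two"])    # close to +1: perfect increasing line
negative = data["One"].corr(data["Three"])  # close to -1: perfect decreasing line

print(round(positive, 6), round(negative, 6))
```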
You can provide an additional parameter, method, to specify the correlation method used. The options are:
- pearson : Standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
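To see why the choice matters, consider a relationship that always increases but is not a straight line. Pearson measures linear association, while Spearman is Pearson applied to the ranks of the values. The sketch below computes Spearman that way by hand, on made-up series x and y. Passing method='spearman' should give the same number, but as far as I know some pandas versions need SciPy installed for the rank-based methods on a Series, so the sketch avoids relying on it:

```python
import pandas as pd

# Hypothetical data: y = x ** 3 is strictly increasing but not linear in x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson (the default) falls short of 1 because the points do not lie on a line.
pearson = x.corr(y)

# Spearman is the Pearson correlation of the ranks; the ranks here are identical.
spearman_by_hand = x.rank().corr(y.rank())

print(round(pearson, 3))           # below 1
print(round(spearman_by_hand, 3))  # 1.0
```

Rank-based methods are usually the safer choice when you only care about whether two variables move in the same direction.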
Alternatively, you can calculate the correlation matrix of the whole dataframe by calling the corr method on the dataframe itself.
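Calling corr on a dataframe returns a square dataframe with one row and one column per original column. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical columns for illustration.
data = pd.DataFrame(
    {
        "One": [1.0, 2.0, 3.0, 4.0],
        "Two": [2.0, 4.0, 6.0, 8.0],
        "Three": [4.0, 3.0, 2.0, 1.0],
    }
)

# Pairwise Pearson correlation between every pair of columns.
matrix = data.corr()
print(matrix)

# Each entry is addressed by the two column names it relates.
print(round(matrix.loc["One", "Three"], 6))  # -1.0
```

The matrix is symmetric and its diagonal is 1, since every column is perfectly correlated with itself.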
Understanding your data usually starts with a call to describe or corr
Calculating a few values from your data can grant you a better understanding of the phenomenon that generated it.
One of the first things you will do when selecting features for an ML algorithm is plot the correlation matrix. This will give you an idea of which features have a better shot at predicting labels if you intend to train a supervised model.
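Before plotting anything, you can already rank features numerically. The following is only a sketch under assumptions: the column names feature_a, feature_b, feature_c and label are hypothetical, the values are made up, and Pearson correlation only captures linear relationships, so a low score does not prove a feature is useless.

```python
import pandas as pd

# Hypothetical dataset: three candidate features and a numeric label.
data = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "feature_b": [2.0, 1.0, 4.0, 3.0, 5.0],
        "feature_c": [5.0, 1.0, 4.0, 2.0, 3.0],
        "label": [1.1, 1.9, 3.2, 3.9, 5.1],
    }
)

# Take the label column of the correlation matrix and drop its self-correlation.
with_label = data.corr()["label"].drop("label")

# Rank by strength of the relationship, ignoring its sign.
ranking = with_label.abs().sort_values(ascending=False)
print(ranking)
```

Here feature_a comes out on top because the label was built to follow it closely.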
This, again, is just an example of the many applications of statistical analysis, and these are just some basic functions to aid you in the process.
Now that we have learned the basics, we need to talk about pulling data into Pandas. In the next article, we will learn how to create dataframes from common file formats.
Thank you for reading!
What to do next
- You can find the source code for this series in this repo.
- This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.