Hands-on NumPy (V): Reductions/Aggregations

Reductions (or aggregations) are a family of NumPy functions that operate over an array and return a result with fewer dimensions.

Many of these functions perform typical statistical operations on arrays, while others are used to reduce an array's dimensionality.

In this article, we will learn about some of the most common aggregations, but before we get started we will create a couple of arrays to illustrate the functionality.

Let's do that and start using reductions!

# Let's create a small 1-d array to check some basic functions
import numpy as np 

arr = np.random.randint(-10, 10, 10)
print(arr)

[-9  9  8  7  3  6  3  3 -7  9]

# We will also use a 2-d array to show the dimension-reduction capabilities of aggregations
arr2 = np.random.randint(-10, 10, 16).reshape(4,4)
print(arr2)

[[ -7  -8  -1  -7]
 [  1   0  -6   3]
 [  7 -10  -8  -2]
 [  0  -7   3  -8]]

1-dimension, many dimensions

Many reductions accept an axis argument that tells NumPy the direction in which to apply the operation. This is much easier to understand with an example.

We will use the sum function, which, as you already know, adds up the elements of an array.

# The 1-dimensional scenario has the most obvious behavior: 
# It returns a single number with the value of the sum of every element in the array:
result = arr.sum()
print(result)

# Now, if we use it in an array with more dimensions, it will still return a single number
# with the value of the sum of every element
result = arr2.sum()
print(result)

32
-50

Things change once we provide an axis. This parameter specifies the axis (or axes) along which the operation is performed. The default value (None) sums all the elements, but we can also sum along rows, columns, or higher-dimensional axes.

# Sum along axis 0: add the rows together, giving one total per column
result = arr2.sum(axis=0)
print(result)

[  1 -25 -12 -14]

# Sum along axis 1: add the columns together, giving one total per row
result = arr2.sum(axis=1)
print(result)

[-23  -2 -13 -12]
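If it helps, here is a small sketch that looks at the shapes of those results. It hard-codes the same values as the 4x4 array printed above so it is reproducible; the variable names (col_sums, row_sums, kept) are mine, and keepdims is an optional argument of sum that I am bringing in only for illustration.

```python
import numpy as np

# Same values as the 4x4 example array printed above (hard-coded for reproducibility)
arr2 = np.array([[-7,  -8, -1, -7],
                 [ 1,   0, -6,  3],
                 [ 7, -10, -8, -2],
                 [ 0,  -7,  3, -8]])

col_sums = arr2.sum(axis=0)  # axis 0 disappears: one total per column
row_sums = arr2.sum(axis=1)  # axis 1 disappears: one total per row

# The reduced axis is dropped, so a (4, 4) array becomes a (4,) array
print(arr2.shape, col_sums.shape, row_sums.shape)

# keepdims=True keeps the reduced axis around with length 1
kept = arr2.sum(axis=1, keepdims=True)
print(kept.shape)  # (4, 1)
```

Either way, the axis you name is the one that gets collapsed in the result.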

It took me a while to understand what it means to perform an operation along an axis. The easiest way I found to explain it is like this:

Hold the position on every other axis fixed, grab all the elements along the axis you are reducing, and perform the operation on them.

In the first case (axis=0), this means that you sum the nth element of every row. You can see it in code by putting the : symbol in the row selector, fixing the column, and looking at the results:

# The first entry of sum(axis=0) is produced by grabbing this
sub_arr = arr2[:,0]
print(sub_arr)

# And calling sum on it
print(sub_arr.sum())

[-7  1  7  0]
1

# The second entry of sum(axis=0) is produced by grabbing this
sub_arr = arr2[:,1] # We are summing along rows, so keep that as : and change the column
print(sub_arr)

# And calling sum on it
print(sub_arr.sum())

[ -8   0 -10  -7]
-25

Now you understand why the result is [ 1 -25 -12 -14]. If you are still having problems understanding what performing an operation along an axis means, repeat the example we just wrote, but put the : symbol on the column selector. If you got it, try a 3-dimensional array and see how this works. As a bonus, you can pass a tuple of axes to reduce over several dimensions at once!
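A possible take on that 3-dimensional exercise is sketched below. The array arr3 is a made-up example (it is not one of the arrays used in this article), built with np.arange so the numbers are easy to check by hand.

```python
import numpy as np

# Hypothetical 3-d array: 2 blocks, each with 3 rows and 4 columns
arr3 = np.arange(24).reshape(2, 3, 4)

# Reducing one axis removes it from the shape
print(arr3.sum(axis=0).shape)  # (3, 4)
print(arr3.sum(axis=2).shape)  # (2, 3)

# A tuple of axes reduces several dimensions at once; only axis 1 survives
both = arr3.sum(axis=(0, 2))
print(both.shape)  # (3,)

# Same trick as before: put : on the reduced axes and fix the remaining one
first_entry = arr3[:, 0, :].sum()
print(first_entry == both[0])  # True
```

The slicing check at the end is the same idea as the column selector above, just with two : symbols instead of one.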

Now that we understand this axis business we can move on to other functions!

# mean computes the arithmetic mean along a specified axis
result = np.mean(arr)
print(result)

3.2

# median calculates the median along the specified axis
result = np.median(arr)
print(result)

4.5

# average calculates the weighted average along a specified axis (with no weights it equals the mean)
result = np.average(arr)
print(result)

3.2
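Without weights, average gives the same answer as mean, so here is a rough sketch of what the weights argument does. The array repeats the values printed at the top of the article; the weights are purely hypothetical and chosen only to make later elements count more.

```python
import numpy as np

# Same values as the 1-d example array printed earlier
arr = np.array([-9, 9, 8, 7, 3, 6, 3, 3, -7, 9])

# Hypothetical weights: 1 for the first element up to 10 for the last
weights = np.arange(1, 11)

weighted = np.average(arr, weights=weights)

# The same thing written out by hand: sum(w * x) / sum(w)
manual = (arr * weights).sum() / weights.sum()

print(weighted, manual)
print(np.isclose(np.average(arr), np.mean(arr)))  # True: no weights means plain mean
```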

# std calculates the standard deviation (yes, yes, axis axis axis)
result = np.std(arr)
print(result)

6.046486583132388

# var calculates the variance, and yes, along a specified axis
result = np.var(arr)
print(result)

36.55999999999999
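To see how var and std relate, here is a small sketch that recomputes both by hand on the same 1-d values. It assumes the default behavior of np.var and np.std (dividing by the number of elements); the names manual_var and manual_std are just for illustration.

```python
import numpy as np

# Same values as the 1-d example array printed earlier
arr = np.array([-9, 9, 8, 7, 3, 6, 3, 3, -7, 9])

# Variance by hand: the mean of the squared deviations from the mean
manual_var = ((arr - arr.mean()) ** 2).mean()

# Standard deviation is the square root of the variance
manual_std = np.sqrt(manual_var)

print(np.isclose(manual_var, np.var(arr)))  # True
print(np.isclose(manual_std, np.std(arr)))  # True
```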

# amin returns the minimum of an array or minimum along an axis.
result = np.amin(arr)
print(result)

-9

# amax returns the maximum of an array or maximum along an axis.
result = np.amax(arr)
print(result)

9
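All the comments above keep mentioning the axis, so here is a quick sketch of a few of these functions applied along an axis of the 2-d array. The values are hard-coded from the 4x4 array printed earlier, and I use np.min and np.max, which are the names that amin and amax alias in recent NumPy releases.

```python
import numpy as np

# Same values as the 4x4 example array printed earlier
arr2 = np.array([[-7,  -8, -1, -7],
                 [ 1,   0, -6,  3],
                 [ 7, -10, -8, -2],
                 [ 0,  -7,  3, -8]])

mean_per_col = np.mean(arr2, axis=0)  # values: 0.25, -6.25, -3.0, -3.5
min_per_row = np.min(arr2, axis=1)    # values: -8, -6, -10, -8
max_per_col = np.max(arr2, axis=0)    # values: 7, 0, 3, 3

print(mean_per_col)
print(min_per_row)
print(max_per_col)
```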

Most of these methods also have variants that ignore NaN values in the array. For example, amin has nanmin, mean has nanmean, and so on.
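Here is a minimal sketch of the difference, using a small made-up array (data is hypothetical, not one of the arrays from this article) that contains a single NaN:

```python
import numpy as np

# Hypothetical float array with one missing value
data = np.array([1.0, np.nan, 3.0, 5.0])

# The regular reductions propagate the NaN
regular_mean = np.mean(data)
print(regular_mean)  # nan

# The nan* variants skip it
clean_mean = np.nanmean(data)  # (1 + 3 + 5) / 3
clean_min = np.nanmin(data)
clean_sum = np.nansum(data)
print(clean_mean, clean_min, clean_sum)  # 3.0 1.0 9.0
```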

Fast, concise, clean

NumPy's reductions are among its most useful features. Knowing how to use them is part of the foundation for more advanced data analysis and dimensionality reduction. This article only outlines some common functions, but there are lots of cool things already implemented for you. You can find more statistical and mathematical functions in the NumPy reference documentation.

Take a look at the documentation and play around with the functions.

Cool, we are almost done with NumPy basics. The next article will be the last one, and it will cover NumPy's linear algebra utilities. See you there!

Thank you for reading!

What to do next

Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
You can find the source code for this series in this repo.
This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
Send me an email with questions, comments, or suggestions (it's in the About Me page).
