BrainsToBytes

Hands-on Pandas(2): Selection, Filtering, loc and iloc

In the last article, we learned about the two basic pandas data structures: Series and DataFrames. We also built a couple of them on our own and learned the basics of indexing and selection.

Today we will learn a bit more about selecting and filtering elements from Pandas data structures. This might seem like an incredibly basic topic, but it's very useful. That's why it's important to understand it well before tackling more advanced topics.

Knowing how to wrangle data is one of the most important skills for anyone working on data science and machine learning, and the foundation of those skills is data selection and filtering.

Good, let's get started!

Playing with Series

Selecting elements from a Series object is pretty straightforward, the next are examples of different ways of selecting elements from a small 8-element series

import pandas as pd
import numpy as np 

ser = pd.Series(np.arange(8), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(ser)
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64
# You can select elements from a series using its index
ser['d']
3
# You can also pass a list of index elements if you need to retrieve more than one element
ser[['a', 'd', 'g']]
a    0
d    3
g    6
dtype: int64

Pandas is so cool that it even supports selection with index-based-slices! There is an important distinction between this and regular slices: The last element of the slice is included.

# Select all elements from b to g (both edges included)
ser['b':'g']
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

The fact that you are not using the default index does not mean that position-based selection is not permitted. You can still select elements from a Series using integers.

# Select the third (index 2, remember? 0-indexed) from our series
ser[2]
2
# Now, select the elements at indexes 2, 3 and 6
ser[[2,3,6]]
c    2
d    3
g    6
dtype: int64
# And finally, slice selection is still supported (but in this case, the last element is excluded as usual)
ser[2:8]
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64

Playing with DataFrames

Because of an extra dimension, selecting elements from DataFrames is richer than from Series. We will start with the most basic scenario: Selecting whole columns.

pokedata = {'Name': ['Abra', 'Koffing', 'Milcery', 'Pikachu', 'Shellder', 'Vulpix'],
            'Type': ['Psychic', 'Poison', 'Fairy', 'Electric', 'Water', 'Fire'],
            'HP': [25, 40, 45, 35, 30, 38],
            'Speed': [90, 35, 34, 90, 40, 65],
            'Color': ['Yellow', 'Purple', 'White', 'Yellow', 'Purple', 'Red'],
            'FirstGen': [True, True, False, True, True, True]}

# We will use the Name column as index
pframe = pd.DataFrame(pokedata).set_index('Name')
pframe
Type HP Speed Color FirstGen
Name
Abra Psychic 25 90 Yellow True
Koffing Poison 40 35 Purple True
Milcery Fairy 45 34 White False
Pikachu Electric 35 90 Yellow True
Shellder Water 30 40 Purple True
Vulpix Fire 38 65 Red True
# You can select a column from the frame by passing the name between brackets
pframe['Type']
Name
Abra         Psychic
Koffing       Poison
Milcery        Fairy
Pikachu     Electric
Shellder       Water
Vulpix          Fire
Name: Type, dtype: object
# If you pass a list of column names you will retrieve them in that order
pframe[['FirstGen', 'HP', 'Color']]
FirstGen HP Color
Name
Abra True 25 Yellow
Koffing True 40 Purple
Milcery False 45 White
Pikachu True 35 Yellow
Shellder True 30 Purple
Vulpix True 38 Red

Square brackets also support selection based on content. Let's select rows that satisfy specific criteria to see how it works.

# Select all Pokemon with speed lower than 50
pframe[pframe['Speed'] < 50]
Type HP Speed Color FirstGen
Name
Koffing Poison 40 35 Purple True
Milcery Fairy 45 34 White False
Shellder Water 30 40 Purple True
# Select all yelloe Pokemon
pframe[pframe['Color'] == 'Yellow']
Type HP Speed Color FirstGen
Name
Abra Psychic 25 90 Yellow True
Pikachu Electric 35 90 Yellow True
# Select all first generation Pokemon with HP greater than 37
pframe[(pframe['FirstGen'] == True) & (pframe['HP'] > 37)]
Type HP Speed Color FirstGen
Name
Koffing Poison 40 35 Purple True
Vulpix Fire 38 65 Red True

You can go as specific as you want with this form of filtering. Selecting subsets of rows is a very useful skill, so play a bit selecting based on your own conditions.

Good, I think we are good when it comes to selecting based on column tags, now let's select specific rows based on the index. For this, Pandas offers you two very valuable functions: loc and iloc.

loc lets you select based on axis labels, whereas iloc lets you select based on integers that represent the position of the row. Again it's easier to understand with examples:

# Select the row with index Shellder
pframe.loc['Shellder']
Type         Water
HP              30
Speed           40
Color       Purple
FirstGen      True
Name: Shellder, dtype: object
# You can pass a list of index values and get the rows in the specified order
pframe.loc[['Shellder', 'Abra', 'Pikachu']]
Type HP Speed Color FirstGen
Name
Shellder Water 30 40 Purple True
Abra Psychic 25 90 Yellow True
Pikachu Electric 35 90 Yellow True
# It's also possible to get only a subset of columns using loc
# Let's get data for Shellder, but only the Type and Color
pframe.loc['Shellder', ['Type', 'Color']]
Type      Water
Color    Purple
Name: Shellder, dtype: object
# If instead, you need to select elements based on order, you can use iloc
# For example, the following line selects the third row (index 2, because 0-indexed)
pframe.iloc[2]
Type        Fairy
HP             45
Speed          34
Color       White
FirstGen    False
Name: Milcery, dtype: object
# Just like loc, you can pass a list of indexes and it will return a dataframe with rows in that order
pframe.iloc[[2,4,0]]
Type HP Speed Color FirstGen
Name
Milcery Fairy 45 34 White False
Shellder Water 30 40 Purple True
Abra Psychic 25 90 Yellow True
# Remember that little trick for selecting just a subset of columns? It also works for iloc
# This selects the third row, and only the Type (column at position 0) and HP (column at position 1)
pframe.iloc[2, [0, 1]]
Type    Fairy
HP         45
Name: Milcery, dtype: object

A word on numeric indexes

loc and iloc are pretty straightforward, but it's important to understand the difference between them. This is especially true when dealing with numeric indexes. A dataframe with numeric indexes that are not in order, starting at 0 and without interruption will behave weird unless you remember how those function differ. Take the following dataframe as example:

frame = pd.DataFrame(np.arange(36).reshape(6,6), 
                     columns = ['a', 'b', 'c', 'd', 'e', 'f' ],
                     index = [5, 3, 1, 4, 2, 0])
frame
a b c d e f
5 0 1 2 3 4 5
3 6 7 8 9 10 11
1 12 13 14 15 16 17
4 18 19 20 21 22 23
2 24 25 26 27 28 29
0 30 31 32 33 34 35
# Now, let's check what loc[2] and iloc[2] return
frame.loc[2]
a    24
b    25
c    26
d    27
e    28
f    29
Name: 2, dtype: int64
frame.iloc[2]
a    12
b    13
c    14
d    15
e    16
f    17
Name: 1, dtype: int64

Can you see they return different rows? This happens because loc[2] looks for a row with an index with a value of two, in this case, the penultimate row. On the other hand, iloc[2] just looks for the third row, the one with positional index 2, starting from 0. If you remember this, you will have no problem dealing with dataframes with numeric indexes!

Selection is a rich topic

One of the great things about Pandas is how easy it makes selecting only the data you need. As you may already know, almost every advanced application rests on this foundation, and know you know how to use it!

Now that we can select data and understand how indexes work, we can deal with two interesting topics: Reindexing and deletion of entries. The next article will talk about these topics, so make sure to come back to check it.

Thanks for reading!

What to do next

Author image
Budapest, Hungary
Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!