Hands-on Pandas(2): Selection, Filtering, loc and iloc

In the last article, we learned about the two basic pandas data structures: Series and DataFrames. We also built a couple of them on our own and learned the basics of indexing and selection.

Today we will learn a bit more about selecting and filtering elements from Pandas data structures. This might seem like an incredibly basic topic, but it's very useful. That's why it's important to understand it well before tackling more advanced topics.

Knowing how to wrangle data is one of the most important skills for anyone working on data science and machine learning, and the foundation of those skills is data selection and filtering.

Good, let's get started!

Playing with Series

Selecting elements from a Series object is pretty straightforward, the next are examples of different ways of selecting elements from a small 8-element series

import pandas as pd
import numpy as np 

ser = pd.Series(np.arange(8), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(ser)

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64

# You can select elements from a series using its index
ser['d']

# You can also pass a list of index elements if you need to retrieve more than one element
ser[['a', 'd', 'g']]

a    0
d    3
g    6
dtype: int64

Pandas is so cool that it even supports selection with index-based-slices! There is an important distinction between this and regular slices: The last element of the slice is included.

# Select all elements from b to g (both edges included)
ser['b':'g']

b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

The fact that you are not using the default index does not mean that position-based selection is not permitted. You can still select elements from a Series using integers.

# Select the third (index 2, remember? 0-indexed) from our series
ser[2]

# Now, select the elements at indexes 2, 3 and 6
ser[[2,3,6]]

c    2
d    3
g    6
dtype: int64

# And finally, slice selection is still supported (but in this case, the last element is excluded as usual)
ser[2:8]

c    2
d    3
e    4
f    5
g    6
h    7
dtype: int64

Playing with DataFrames

Because of an extra dimension, selecting elements from DataFrames is richer than from Series. We will start with the most basic scenario: Selecting whole columns.

pokedata = {'Name': ['Abra', 'Koffing', 'Milcery', 'Pikachu', 'Shellder', 'Vulpix'],
            'Type': ['Psychic', 'Poison', 'Fairy', 'Electric', 'Water', 'Fire'],
            'HP': [25, 40, 45, 35, 30, 38],
            'Speed': [90, 35, 34, 90, 40, 65],
            'Color': ['Yellow', 'Purple', 'White', 'Yellow', 'Purple', 'Red'],
            'FirstGen': [True, True, False, True, True, True]}

# We will use the Name column as index
pframe = pd.DataFrame(pokedata).set_index('Name')
pframe

	Type	HP	Speed	Color	FirstGen
Name
Abra	Psychic	25	90	Yellow	True
Koffing	Poison	40	35	Purple	True
Milcery	Fairy	45	34	White	False
Pikachu	Electric	35	90	Yellow	True
Shellder	Water	30	40	Purple	True
Vulpix	Fire	38	65	Red	True

# You can select a column from the frame by passing the name between brackets
pframe['Type']

Name
Abra         Psychic
Koffing       Poison
Milcery        Fairy
Pikachu     Electric
Shellder       Water
Vulpix          Fire
Name: Type, dtype: object

# If you pass a list of column names you will retrieve them in that order
pframe[['FirstGen', 'HP', 'Color']]

	FirstGen	HP	Color
Name
Abra	True	25	Yellow
Koffing	True	40	Purple
Milcery	False	45	White
Pikachu	True	35	Yellow
Shellder	True	30	Purple
Vulpix	True	38	Red

Square brackets also support selection based on content. Let's select rows that satisfy specific criteria to see how it works.

# Select all Pokemon with speed lower than 50
pframe[pframe['Speed'] < 50]

	Type	HP	Speed	Color	FirstGen
Name
Koffing	Poison	40	35	Purple	True
Milcery	Fairy	45	34	White	False
Shellder	Water	30	40	Purple	True

# Select all yelloe Pokemon
pframe[pframe['Color'] == 'Yellow']

	Type	HP	Speed	Color	FirstGen
Name
Abra	Psychic	25	90	Yellow	True
Pikachu	Electric	35	90	Yellow	True

# Select all first generation Pokemon with HP greater than 37
pframe[(pframe['FirstGen'] == True) & (pframe['HP'] > 37)]

	Type	HP	Speed	Color	FirstGen
Name
Koffing	Poison	40	35	Purple	True
Vulpix	Fire	38	65	Red	True

You can go as specific as you want with this form of filtering. Selecting subsets of rows is a very useful skill, so play a bit selecting based on your own conditions.

Good, I think we are good when it comes to selecting based on column tags, now let's select specific rows based on the index. For this, Pandas offers you two very valuable functions: loc and iloc.

loc lets you select based on axis labels, whereas iloc lets you select based on integers that represent the position of the row. Again it's easier to understand with examples:

# Select the row with index Shellder
pframe.loc['Shellder']

Type         Water
HP              30
Speed           40
Color       Purple
FirstGen      True
Name: Shellder, dtype: object

# You can pass a list of index values and get the rows in the specified order
pframe.loc[['Shellder', 'Abra', 'Pikachu']]

	Type	HP	Speed	Color	FirstGen
Name
Shellder	Water	30	40	Purple	True
Abra	Psychic	25	90	Yellow	True
Pikachu	Electric	35	90	Yellow	True

# It's also possible to get only a subset of columns using loc
# Let's get data for Shellder, but only the Type and Color
pframe.loc['Shellder', ['Type', 'Color']]

Type      Water
Color    Purple
Name: Shellder, dtype: object

# If instead, you need to select elements based on order, you can use iloc
# For example, the following line selects the third row (index 2, because 0-indexed)
pframe.iloc[2]

Type        Fairy
HP             45
Speed          34
Color       White
FirstGen    False
Name: Milcery, dtype: object

# Just like loc, you can pass a list of indexes and it will return a dataframe with rows in that order
pframe.iloc[[2,4,0]]

	Type	HP	Speed	Color	FirstGen
Name
Milcery	Fairy	45	34	White	False
Shellder	Water	30	40	Purple	True
Abra	Psychic	25	90	Yellow	True

# Remember that little trick for selecting just a subset of columns? It also works for iloc
# This selects the third row, and only the Type (column at position 0) and HP (column at position 1)
pframe.iloc[2, [0, 1]]

Type    Fairy
HP         45
Name: Milcery, dtype: object

A word on numeric indexes

loc and iloc are pretty straightforward, but it's important to understand the difference between them. This is especially true when dealing with numeric indexes. A dataframe with numeric indexes that are not in order, starting at 0 and without interruption will behave weird unless you remember how those function differ. Take the following dataframe as example:

frame = pd.DataFrame(np.arange(36).reshape(6,6), 
                     columns = ['a', 'b', 'c', 'd', 'e', 'f' ],
                     index = [5, 3, 1, 4, 2, 0])
frame

	a	b	c	d	e	f
5	0	1	2	3	4	5
3	6	7	8	9	10	11
1	12	13	14	15	16	17
4	18	19	20	21	22	23
2	24	25	26	27	28	29
0	30	31	32	33	34	35

# Now, let's check what loc[2] and iloc[2] return
frame.loc[2]

a    24
b    25
c    26
d    27
e    28
f    29
Name: 2, dtype: int64

frame.iloc[2]

a    12
b    13
c    14
d    15
e    16
f    17
Name: 1, dtype: int64

Can you see they return different rows? This happens because loc[2] looks for a row with an index with a value of two, in this case, the penultimate row. On the other hand, iloc[2] just looks for the third row, the one with positional index 2, starting from 0. If you remember this, you will have no problem dealing with dataframes with numeric indexes!

Selection is a rich topic

One of the great things about Pandas is how easy it makes selecting only the data you need. As you may already know, almost every advanced application rests on this foundation, and know you know how to use it!

Now that we can select data and understand how indexes work, we can deal with two interesting topics: Reindexing and deletion of entries. The next article will talk about these topics, so make sure to come back to check it.

Thanks for reading!

What to do next

Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
You can find the source code for this series in this repo.
This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
Send me an email with questions, comments or suggestions (it's in the About Me page)

Hands-on Pandas(2): Selection, Filtering, loc and iloc

Playing with Series

Playing with DataFrames

A word on numeric indexes

Selection is a rich topic

What to do next

Newsletter

Recent Post

Categories

	a	b	c	d	e	f
5	0	1	2	3	4	5
3	6	7	8	9	10	11
1	12	13	14	15	16	17
4	18	19	20	21	22	23
2	24	25	26	27	28	29
0	30	31	32	33	34	35

	a	b	c	d	e	f
5	0	1	2	3	4	5
3	6	7	8	9	10	11
1	12	13	14	15	16	17
4	18	19	20	21	22	23
2	24	25	26	27	28	29
0	30	31	32	33	34	35

	a	b	c	d	e	f
5	0	1	2	3	4	5
3	6	7	8	9	10	11
1	12	13	14	15	16	17
4	18	19	20	21	22	23
2	24	25	26	27	28	29
0	30	31	32	33	34	35