In an ideal world, all the data you need is available in the right format and with complete content.
In the real world, you will probably need to scrape data from lots of different and incomplete sources. That's why it's important to learn how to clean your data before analyzing it or feeding it into a machine learning algorithm.
Data cleaning might not be the most glamorous part of your work, but it's a crucial part of developing data-based products. Not only is it important for the whole pipeline, it's also one of the most time-consuming tasks in a project (some estimates put it at around 80% of a project's time).
In this article, we will learn some basic data cleaning techniques that will let you handle the most common situations.
Great, let's get started!
Dealing with missing values
Dealing with missing values is a basic (and extremely useful) technique. The data you have access to will probably be missing attributes for some entries, and it's better to spend some time explicitly handling each case.
Imagine you have a CSV file with the following content:
Name,Type,Color,Evolves,HP
Abra,Psychic,,True,
Pikachu,Electric,,True,35
Ekans,,Purple,,35
Dratini,,Blue,,41
Ditto,Normal,Pink,False,48
As you can see, several rows have missing entries. If you load it using read_csv, the resulting dataframe will look like this:
import pandas as pd
frame = pd.read_csv('./sample_data/pokes_missing.csv')
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | NaN | True | NaN |
| 1 | Pikachu | Electric | NaN | True | 35.0 |
| 2 | Ekans | NaN | Purple | NaN | 35.0 |
| 3 | Dratini | NaN | Blue | NaN | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
Pandas fills every entry without a value with the default NaN sentinel. In many situations, you just want to remove the rows with missing values and keep only the ones whose fields are all non-NaN. For this, you can use the dropna function:
# dropna does not affect the original dataframe, it returns a new one.
frame.dropna()
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 4 | Ditto | Normal | Pink | False | 48.0 |
# If you want to perform the dropna in place, pass the inplace argument
frame.dropna(inplace=True)
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 4 | Ditto | Normal | Pink | False | 48.0 |
# Let's reload the data, since the in-place dropna removed most rows:
frame = pd.read_csv('./sample_data/pokes_missing.csv')
dropna can also be used differently: you can tell it to drop only the rows that are composed exclusively of NaN values. This way you can do away with entries that don't provide any information at all. To do this, pass the optional parameter how='all' when calling dropna.
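A minimal sketch of that call looks like this (our sample file has no completely empty rows, so it returns all five rows unchanged):
# Drop only the rows where every single field is NaN;
# rows with at least one non-missing value are kept.
frame.dropna(how='all')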
Dropping all rows with NaN values might not always be the right choice, though. Very often, a better approach is to fill in the missing entries with values you provide or with values calculated from the dataframe itself. Pandas provides a function called fillna that lets you handle missing values:
# Let's fill every missing entry with the value 10
frame.fillna(10)
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | 10 | True | 10.0 |
| 1 | Pikachu | Electric | 10 | True | 35.0 |
| 2 | Ekans | 10 | Purple | 10 | 35.0 |
| 3 | Dratini | 10 | Blue | 10 | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
Do you notice anything wrong with the dataframe? Well, for starters, the value 10 might work reasonably well for HP, but it doesn't make any sense for Type, Color or Evolves. fillna
lets you specify how to fill NaN values for every column if you pass a dictionary instead of a single value:
frame.fillna({'Type': 'Unknown',
'Color': 'Yellow',
'Evolves': True,
'HP': 30})
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 30.0 |
| 1 | Pikachu | Electric | Yellow | True | 35.0 |
| 2 | Ekans | Unknown | Purple | True | 35.0 |
| 3 | Dratini | Unknown | Blue | True | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
Now the dataframe makes a bit more sense! As with dropna, fillna doesn't modify the original dataframe; it returns a new dataframe with the new values. If you want to change the original dataframe, provide the inplace argument:
frame.fillna({'Type': 'Unknown',
'Color': 'Yellow',
'Evolves': True,
'HP': 30},
inplace=True)
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 30.0 |
| 1 | Pikachu | Electric | Yellow | True | 35.0 |
| 2 | Ekans | Unknown | Purple | True | 35.0 |
| 3 | Dratini | Unknown | Blue | True | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
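As mentioned above, you can also fill missing entries with values calculated from the dataframe itself. Here is a minimal sketch (on a freshly loaded frame) that fills the missing HP values with the mean of the known ones:
# Reload the data so the NaN values are back (the previous fillna was in place)
frame = pd.read_csv('./sample_data/pokes_missing.csv')
# Fill the missing HP values with the mean of the non-missing HP values
frame['HP'] = frame['HP'].fillna(frame['HP'].mean())
Filling with a statistic like the mean (or the median) keeps the column's distribution closer to the original data than an arbitrary constant would.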
Dealing with duplicates
Sometimes your data will contain duplicate rows, that is, entries whose fields all have the same values. Take a look at this dataframe:
frame = pd.read_csv('./sample_data/pokes_duplicates.csv')
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 1 | Pikachu | Electric | Yellow | True | 35 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
| 5 | Abra | Psychic | Yellow | True | 25 |
| 6 | Dratini | Dragon | Blue | True | 41 |
The frame has two Dratinis and two Abras, so let's get rid of the duplicates using drop_duplicates:
frame.drop_duplicates()
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 1 | Pikachu | Electric | Yellow | True | 35 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
Sometimes you want to remove duplicates based on the values of specific columns. In this case, you can provide an extra argument (a column name, or a list of column names if you want to use several of them) to specify which columns to take into consideration. For example, the following call ensures we keep only rows with unique colors:
# Abra and Pikachu are both Yellow; drop_duplicates keeps the first occurrence, so Pikachu is the one that goes
frame.drop_duplicates('Color')
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
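The multi-column variant works the same way. As a quick sketch, deduplicating on Type and Color together keeps both Yellow Pokémon, because Abra and Pikachu have different types:
# Treat a row as a duplicate only if both Type and Color match an earlier row
frame.drop_duplicates(['Type', 'Color'])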
Mappings and other transformations
In this section, we will learn about some other techniques for modifying your data. The first technique is to use the map function to alter a column in a dataframe:
frame = pd.read_csv('./sample_data/pokes.csv')
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 1 | Pikachu | Electric | Yellow | True | 35 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
# Let's transform the type into an all-uppercase string
frame['Type'] = frame['Type'].map(lambda x: x.upper())
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | PSYCHIC | Yellow | True | 25 |
| 1 | Pikachu | ELECTRIC | Yellow | True | 35 |
| 2 | Ekans | POISON | Purple | True | 35 |
| 3 | Dratini | DRAGON | Blue | True | 41 |
| 4 | Ditto | NORMAL | Pink | False | 48 |
The next thing we can do is use the replace function to change values. In the following example, we will replace the boolean values of the Evolves column with the strings 'Yes' and 'No':
frame['Evolves'] = frame['Evolves'].replace([True, False], ['Yes','No'])
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | PSYCHIC | Yellow | Yes | 25 |
| 1 | Pikachu | ELECTRIC | Yellow | Yes | 35 |
| 2 | Ekans | POISON | Purple | Yes | 35 |
| 3 | Dratini | DRAGON | Blue | Yes | 41 |
| 4 | Ditto | NORMAL | Pink | No | 48 |
We can also use conditional selection to get just the data we need. Suppose we are interested only in the rows with HP under 40:
frame[frame['HP'] < 40]
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | PSYCHIC | Yellow | Yes | 25 |
| 1 | Pikachu | ELECTRIC | Yellow | Yes | 35 |
| 2 | Ekans | POISON | Purple | Yes | 35 |
The last thing we will deal with is transforming categorical data into one-hot encoded columns. If you want a more in-depth explanation of one-hot encoding, you can read this article:
# get_dummies transforms categorical data into a one-hot encoded dataframe
# we will concatenate the original dataframe and the one-hot encoded type columns
pd.concat([frame, pd.get_dummies(frame['Type'])], axis=1)
|   | Name | Type | Color | Evolves | HP | DRAGON | ELECTRIC | NORMAL | POISON | PSYCHIC |
|---|------|------|-------|---------|----|--------|----------|--------|--------|---------|
| 0 | Abra | PSYCHIC | Yellow | Yes | 25 | 0 | 0 | 0 | 0 | 1 |
| 1 | Pikachu | ELECTRIC | Yellow | Yes | 35 | 0 | 1 | 0 | 0 | 0 |
| 2 | Ekans | POISON | Purple | Yes | 35 | 0 | 0 | 0 | 1 | 0 |
| 3 | Dratini | DRAGON | Blue | Yes | 41 | 1 | 0 | 0 | 0 | 0 |
| 4 | Ditto | NORMAL | Pink | No | 48 | 0 | 0 | 1 | 0 | 0 |
Keep it clean
Getting perfect data at the beginning of the project is extremely unlikely. In reality, you will probably need to put together a dataset from many sources, and in the process, there might be some malformed, erroneous, or missing data.
That's ok as long as you know how to clean your data and get it in proper shape for your analysis/algorithms. It is often said that you will probably spend 80% of your time cleaning and wrangling data, so it's a good idea to learn one or two tricks.
In the next article, we will learn how to put together data from different sources into a single collection.
Thank you for reading!
What to do next
- Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
- You can find the source code for this series in this repo.
- This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
- Send me an email with questions, comments or suggestions (you'll find it on the About Me page).