BrainsToBytes

Hands-on Pandas(8): Cleaning Data

In an ideal world, all the data you need is available in the right format and with complete content.

In the real world, you will probably need to scrape data from lots of different and incomplete sources. That's why it's important to learn how to clean your data before analyzing it or feeding it into a ML algorithm.

Data cleaning might not the most glamorous part of your work, but it's a crucial part of the development of data-based products. Not only it's important for the whole pipeline, but it's also one of the most time-consuming tasks in a project (Some estimate it to consume around 80% of a project's time).

In this article, we will learn some basic data cleaning techniques that will let you handle the most common situations.

Great, let's get started!

Dealing with missing values

Dealing with missing values is a basic (and extremely useful) technique. The data you'll have access to will probably miss the attributes of some entries, and it's better to spend some time explicitly handling each case.

Imagine you have a csv file with the following content:

Name,Type,Color,Evolves,HP
Abra,Psychic,,True,
Pikachu,Electric,,True,35
Ekans,,Purple,,35
Dratini,,Blue,,41
Ditto,Normal,Pink,False,48

As you can see, it's missing entries for several rows. If you load it using read_csv, the resulting dataframe will look like this:

import pandas as pd
frame = pd.read_csv('./sample_data/pokes_missing.csv')
frame
Name Type Color Evolves HP
0 Abra Psychic NaN True NaN
1 Pikachu Electric NaN True 35.0
2 Ekans NaN Purple NaN 35.0
3 Dratini NaN Blue NaN 41.0
4 Ditto Normal Pink False 48.0

Pandas fills every entry without value with the default NaN sentinel value. In many situations, you just want to remove the rows with missing values and leave the ones that have exclusively non-NaN fields, for this, you can use the dropna function:

# dropna does not affect the original dataframe, it returns a new one.
frame.dropna()
Name Type Color Evolves HP
4 Ditto Normal Pink False 48.0
#If you want to perform the dropna in place, specify the inplace attribute
frame.dropna(inplace=True)
frame
Name Type Color Evolves HP
4 Ditto Normal Pink False 48.0
# Let's reload the data:
frame = pd.read_csv('./sample_data/pokes_missing.csv')

Dropna can be used differently: You can tell the function to drop only the rows that are composed of NaN values exclusively. This way you can do away with entries that don't provide any information. For doing this, provide the optional parameter * how='all' when calling dropna.

Dropping all rows with NaN values might not be the best option. Very often, the best option is to fill-in the missing entries with values you provide or with values calculated from the dataframe itself. Pandas provides a function called fillna that lets you handle missing values:

# Let's fill every missing entry with the value 0
frame.fillna(10)
Name Type Color Evolves HP
0 Abra Psychic 10 True 10.0
1 Pikachu Electric 10 True 35.0
2 Ekans 10 Purple 10 35.0
3 Dratini 10 Blue 10 41.0
4 Ditto Normal Pink False 48.0

Do you notice anything wrong with the dataframe? Well, for starters, the value 10 might work reasonably well for HP, but it doesn't make any sense for Type, Color or Evolves. fillna lets you specify how to fill NaN values for every column if you pass a dictionary instead of a single value:

frame.fillna({'Type': 'Unknown',
              'Color': 'Yellow',
              'Evolves': True,
              'HP': 30})
Name Type Color Evolves HP
0 Abra Psychic Yellow True 30.0
1 Pikachu Electric Yellow True 35.0
2 Ekans Unknown Purple True 35.0
3 Dratini Unknown Blue True 41.0
4 Ditto Normal Pink False 48.0

Now the dataframe makes a bit more sense! As with dropna, the change performed by fillna doesn't happen on the original dataframe, it just creates a new dataframe with new values. If you want to change the original dataframe provide the inplace argument:

frame.fillna({'Type': 'Unknown',
              'Color': 'Yellow',
              'Evolves': True,
              'HP': 30},
             inplace=True)
frame
Name Type Color Evolves HP
0 Abra Psychic Yellow True 30.0
1 Pikachu Electric Yellow True 35.0
2 Ekans Unknown Purple True 35.0
3 Dratini Unknown Blue True 41.0
4 Ditto Normal Pink False 48.0

Dealing with duplicates

Sometimes your data will contain duplicate rows or values with the same fields. Take a look at this dataframe:

frame = pd.read_csv('./sample_data/pokes_duplicates.csv')
frame
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48
5 Abra Psychic Yellow True 25
6 Dratini Dragon Blue True 41

The frame has two Dratinis and two Abras, so let's get rid of them using drop_duplicates:

frame.drop_duplicates()
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48

Sometimes you want to remove duplicates based on the value of an attribute. In this case, you can provide an extra argument (or list of arguments if you want to use multiple attributes) to specify the column to take into consideration. For example, the following call ensures we have only rows with unique colors:

# Abra and Pikachu are both yellow, so it's time to go for Pikachu
frame.drop_duplicates('Color')
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48

Mappings and other transformations

In this section, we will learn about some other techniques for modifying your data. The first technique is to use the map function to alter a column in a dataframe:

frame = frame = pd.read_csv('./sample_data/pokes.csv')
frame
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48
# Let's transform the type into an all-uppercase string
frame['Type'] = frame['Type'].map(lambda x: x.upper())
frame
Name Type Color Evolves HP
0 Abra PSYCHIC Yellow True 25
1 Pikachu ELECTRIC Yellow True 35
2 Ekans POISON Purple True 35
3 Dratini DRAGON Blue True 41
4 Ditto NORMAL Pink False 48

The next thing we can do is to use the replace function to change values. In the following example we will replace the Evolves' column boolean values for 'Yes' and 'No' strings:

frame['Evolves'] = frame['Evolves'].replace([True, False], ['Yes','No'])
frame
Name Type Color Evolves HP
0 Abra PSYCHIC Yellow Yes 25
1 Pikachu ELECTRIC Yellow Yes 35
2 Ekans POISON Purple Yes 35
3 Dratini DRAGON Blue Yes 41
4 Ditto NORMAL Pink No 48

We can also use conditional selection to get just the data we need. Suppose we are interested only on the elements with HP under 40:

frame[frame['HP'] < 40]
Name Type Color Evolves HP
0 Abra PSYCHIC Yellow Yes 25
1 Pikachu ELECTRIC Yellow Yes 35
2 Ekans POISON Purple Yes 35

The last thing we will deal with is transforming categorical data into one-hot encoded columns. If you want a more in-depth explanation of one-hot encoding you can read this article:

# get_dummies transforms categorical data into a one-hot encoded dataframe
# we will concatenate the original dataframe and the one-hot encoded type columns

pd.concat([frame, pd.get_dummies(frame['Type'])], axis=1)
Name Type Color Evolves HP DRAGON ELECTRIC NORMAL POISON PSYCHIC
0 Abra PSYCHIC Yellow Yes 25 0 0 0 0 1
1 Pikachu ELECTRIC Yellow Yes 35 0 1 0 0 0
2 Ekans POISON Purple Yes 35 0 0 0 1 0
3 Dratini DRAGON Blue Yes 41 1 0 0 0 0
4 Ditto NORMAL Pink No 48 0 0 1 0 0

Keep it clean

Getting perfect data at the beginning of the project is extremely unlikely. In reality, you will probably need to put together a dataset from many sources, and in the process, there might be some malformed, erroneous, or missing data.

That's ok as long as you know how to clean your data and get it in proper shape for your analysis/algorithms. It is often said that you will probably spend 80% of your time cleaning and wrangling data, so it's a good idea to learn one or two tricks.

In the next article, we will learn how to put together data from different sources into a single collection.

Thank you for reading!

What to do next

Author image
Budapest, Hungary
Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!