In an ideal world, all the data you need is available in the right format and with complete content.
In the real world, you will probably need to scrape data from lots of different and incomplete sources. That's why it's important to learn how to clean your data before analyzing it or feeding it into a machine learning algorithm.
Data cleaning might not be the most glamorous part of your work, but it's a crucial part of developing data-based products. Not only is it important for the whole pipeline, it's also one of the most time-consuming tasks in a project (some estimates put it at around 80% of a project's time).
In this article, we will learn some basic data cleaning techniques that will let you handle the most common situations.
Great, let's get started!
Dealing with missing values
Dealing with missing values is a basic (and extremely useful) technique. The data you have access to will probably be missing attributes for some entries, and it's better to spend some time explicitly handling each case.
Imagine you have a CSV file with the following content:
Name,Type,Color,Evolves,HP
Abra,Psychic,,True,
Pikachu,Electric,,True,35
Ekans,,Purple,,35
Dratini,,Blue,,41
Ditto,Normal,Pink,False,48
As you can see, several rows have missing entries. If you load it using read_csv, the resulting dataframe will look like this:
import pandas as pd
frame = pd.read_csv('./sample_data/pokes_missing.csv')
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | NaN | True | NaN |
| 1 | Pikachu | Electric | NaN | True | 35.0 |
| 2 | Ekans | NaN | Purple | NaN | 35.0 |
| 3 | Dratini | NaN | Blue | NaN | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
Pandas fills every entry without a value with the default NaN sentinel. In many situations, you just want to remove the rows with missing values and keep only the ones whose fields are all non-NaN. For this, you can use the dropna function:
# dropna does not affect the original dataframe, it returns a new one.
frame.dropna()
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 4 | Ditto | Normal | Pink | False | 48.0 |
# If you want to perform the dropna in place, pass the inplace argument
frame.dropna(inplace=True)
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 4 | Ditto | Normal | Pink | False | 48.0 |
# Let's reload the data, since the in-place dropna removed most rows:
frame = pd.read_csv('./sample_data/pokes_missing.csv')
dropna can also be used differently: you can tell it to drop only the rows that are composed exclusively of NaN values. This way you can do away with entries that don't provide any information at all. To do this, pass the optional parameter how='all' when calling dropna.
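A minimal sketch of that call looks like this (our sample file has no completely empty rows, so it returns all five rows unchanged):
# Drop only the rows where every single field is NaN;
# rows with at least one non-missing value are kept.
frame.dropna(how='all')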
Dropping all rows with NaN values might not always be the right choice, though. Very often, a better approach is to fill in the missing entries with values you provide or with values calculated from the dataframe itself. Pandas provides a function called fillna that lets you handle missing values:
# Let's fill every missing entry with the value 10
frame.fillna(10)
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | 10 | True | 10.0 |
| 1 | Pikachu | Electric | 10 | True | 35.0 |
| 2 | Ekans | 10 | Purple | 10 | 35.0 |
| 3 | Dratini | 10 | Blue | 10 | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
Do you notice anything wrong with the dataframe? Well, for starters, the value 10 might work reasonably well for HP, but it doesn't make any sense for Type, Color or Evolves. fillna
lets you specify how to fill NaN values for every column if you pass a dictionary instead of a single value:
frame.fillna({'Type': 'Unknown',
'Color': 'Yellow',
'Evolves': True,
'HP': 30})
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 30.0 |
| 1 | Pikachu | Electric | Yellow | True | 35.0 |
| 2 | Ekans | Unknown | Purple | True | 35.0 |
| 3 | Dratini | Unknown | Blue | True | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
Now the dataframe makes a bit more sense! As with dropna, fillna doesn't modify the original dataframe; it returns a new dataframe with the new values. If you want to change the original dataframe, provide the inplace argument:
frame.fillna({'Type': 'Unknown',
'Color': 'Yellow',
'Evolves': True,
'HP': 30},
inplace=True)
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 30.0 |
| 1 | Pikachu | Electric | Yellow | True | 35.0 |
| 2 | Ekans | Unknown | Purple | True | 35.0 |
| 3 | Dratini | Unknown | Blue | True | 41.0 |
| 4 | Ditto | Normal | Pink | False | 48.0 |
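As mentioned above, you can also fill missing entries with values calculated from the dataframe itself. Here is a minimal sketch (on a freshly loaded frame) that fills the missing HP values with the mean of the known ones:
# Reload the data so the NaN values are back (the previous fillna was in place)
frame = pd.read_csv('./sample_data/pokes_missing.csv')
# Fill the missing HP values with the mean of the non-missing HP values
frame['HP'] = frame['HP'].fillna(frame['HP'].mean())
Filling with a statistic like the mean (or the median) keeps the column's distribution closer to the original data than an arbitrary constant would.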
Dealing with duplicates
Sometimes your data will contain duplicate rows, that is, entries whose fields all have the same values. Take a look at this dataframe:
frame = pd.read_csv('./sample_data/pokes_duplicates.csv')
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 1 | Pikachu | Electric | Yellow | True | 35 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
| 5 | Abra | Psychic | Yellow | True | 25 |
| 6 | Dratini | Dragon | Blue | True | 41 |
The frame has two Dratinis and two Abras, so let's get rid of the duplicates using drop_duplicates:
frame.drop_duplicates()
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 1 | Pikachu | Electric | Yellow | True | 35 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
Sometimes you want to remove duplicates based on the values of specific columns. In this case, you can provide an extra argument (a column name, or a list of column names if you want to use several of them) to specify which columns to take into consideration. For example, the following call ensures we keep only rows with unique colors:
# Abra and Pikachu are both Yellow; drop_duplicates keeps the first occurrence, so Pikachu is the one that goes
frame.drop_duplicates('Color')
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
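The multi-column variant works the same way. As a quick sketch, deduplicating on Type and Color together keeps both Yellow Pokémon, because Abra and Pikachu have different types:
# Treat a row as a duplicate only if both Type and Color match an earlier row
frame.drop_duplicates(['Type', 'Color'])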
Mappings and other transformations
In this section, we will learn about some other techniques for modifying your data. The first technique is to use the map function to alter a column in a dataframe:
frame = pd.read_csv('./sample_data/pokes.csv')
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | Psychic | Yellow | True | 25 |
| 1 | Pikachu | Electric | Yellow | True | 35 |
| 2 | Ekans | Poison | Purple | True | 35 |
| 3 | Dratini | Dragon | Blue | True | 41 |
| 4 | Ditto | Normal | Pink | False | 48 |
# Let's transform the type into an all-uppercase string
frame['Type'] = frame['Type'].map(lambda x: x.upper())
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | PSYCHIC | Yellow | True | 25 |
| 1 | Pikachu | ELECTRIC | Yellow | True | 35 |
| 2 | Ekans | POISON | Purple | True | 35 |
| 3 | Dratini | DRAGON | Blue | True | 41 |
| 4 | Ditto | NORMAL | Pink | False | 48 |
The next thing we can do is use the replace function to change values. In the following example, we will replace the boolean values of the Evolves column with the strings 'Yes' and 'No':
frame['Evolves'] = frame['Evolves'].replace([True, False], ['Yes','No'])
frame
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | PSYCHIC | Yellow | Yes | 25 |
| 1 | Pikachu | ELECTRIC | Yellow | Yes | 35 |
| 2 | Ekans | POISON | Purple | Yes | 35 |
| 3 | Dratini | DRAGON | Blue | Yes | 41 |
| 4 | Ditto | NORMAL | Pink | No | 48 |
We can also use conditional selection to get just the data we need. Suppose we are interested only in the rows with HP under 40:
frame[frame['HP'] < 40]
|   | Name | Type | Color | Evolves | HP |
|---|------|------|-------|---------|----|
| 0 | Abra | PSYCHIC | Yellow | Yes | 25 |
| 1 | Pikachu | ELECTRIC | Yellow | Yes | 35 |
| 2 | Ekans | POISON | Purple | Yes | 35 |
The last thing we will deal with is transforming categorical data into one-hot encoded columns. If you want a more in-depth explanation of one-hot encoding, you can read this article:
# get_dummies transforms categorical data into a one-hot encoded dataframe
# we will concatenate the original dataframe and the one-hot encoded type columns
pd.concat([frame, pd.get_dummies(frame['Type'])], axis=1)
|   | Name | Type | Color | Evolves | HP | DRAGON | ELECTRIC | NORMAL | POISON | PSYCHIC |
|---|------|------|-------|---------|----|--------|----------|--------|--------|---------|
| 0 | Abra | PSYCHIC | Yellow | Yes | 25 | 0 | 0 | 0 | 0 | 1 |
| 1 | Pikachu | ELECTRIC | Yellow | Yes | 35 | 0 | 1 | 0 | 0 | 0 |
| 2 | Ekans | POISON | Purple | Yes | 35 | 0 | 0 | 0 | 1 | 0 |
| 3 | Dratini | DRAGON | Blue | Yes | 41 | 1 | 0 | 0 | 0 | 0 |
| 4 | Ditto | NORMAL | Pink | No | 48 | 0 | 0 | 1 | 0 | 0 |
Keep it clean
Getting perfect data at the beginning of the project is extremely unlikely. In reality, you will probably need to put together a dataset from many sources, and in the process, there might be some malformed, erroneous, or missing data.
That's ok as long as you know how to clean your data and get it in proper shape for your analysis/algorithms. It is often said that you will probably spend 80% of your time cleaning and wrangling data, so it's a good idea to learn one or two tricks.
In the next article, we will learn how to put together data from different sources into a single collection.
Thank you for reading!
What to do next
- Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
- You can find the source code for this series in this repo.
- This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
- Send me an email with questions, comments or suggestions (you'll find it on the About Me page).