BrainsToBytes

Hands-on Pandas(9): Merging Dataframes

Merge/join operations in Pandas let you gather information from many tables into a single dataframe for further processing or analysis. This is another important skill that you will probably use a lot when working with data.

If you have some experience with relational databases you can recognize the analogous behavior with table joins. In this article, we will demo some of the most important behavior offered by Pandas' merge function. It will probably be more than enough to keep you going before you need to consult more documentation.

Great, let's get started!

Loading the data for the examples

For the demos, we will use data from two different files. One of them contains some non-numeric attributes of a few Pokemon, the other contains base stats like HP, Attack, and Speed.

import pandas as pd

attribs = pd.read_csv('./sample_data/poke_attributes.csv')
attribs
Name Type Color Evolves
0 Abra Psychic Yellow True
1 Ekans Poison Purple True
2 Ditto Normal Pink False
3 Dratini Dragon Blue True
4 Pikachu Electric Yellow True
stats = pd.read_csv('./sample_data/poke_stats.csv')
stats
Name HP Attack Speed
0 Ditto 48 48 48
1 Dratini 41 64 50
2 Pikachu 35 55 90
3 Caterpie 45 30 45
4 Vulpix 38 41 65

Notice that Ditto, Dratini, and Pikachu are on both tables, but Abra and Ekans appear only on the first one while Caterpie and Vulpix appear only on the second one.

Now let's go back to the programming bits: Pandas' merge function links the rows of two dataframes using one or more keys. In the following examples, we will use the Name column as key to show different types of merge/join.

Inner Joins

An inner join is performed on the intersection of the keys of the two dataframes. In this case, it will find the Names that are in both dataframes (Ditto, Dratini, and Pikachu) and create a new dataframe with all the columns on both original dataframes.

We will call the merge function specifying two additional arguments:

  • how: Tells merge which type of join to perform.
  • on: Tells merge which columns to use as key.
innerjoin = pd.merge(attribs, stats, on='Name', how='inner')
innerjoin
Name Type Color Evolves HP Attack Speed
0 Ditto Normal Pink False 48 48 48
1 Dratini Dragon Blue True 41 64 50
2 Pikachu Electric Yellow True 35 55 90

If you want to perform an inner join you can skip the how attribute: The default behavior of the merge function is an inner join:

innerjoin = pd.merge(attribs, stats, on='Name')
innerjoin
Name Type Color Evolves HP Attack Speed
0 Ditto Normal Pink False 48 48 48
1 Dratini Dragon Blue True 41 64 50
2 Pikachu Electric Yellow True 35 55 90

Outer joins

Outer joins are performed on the union of the keys of the two dataframes. In this case, it will use every single name in the original dataframes and fill the missing fields with NaN.

outerjoin = pd.merge(attribs, stats, on='Name', how='outer')
outerjoin
Name Type Color Evolves HP Attack Speed
0 Abra Psychic Yellow True NaN NaN NaN
1 Ekans Poison Purple True NaN NaN NaN
2 Ditto Normal Pink False 48.0 48.0 48.0
3 Dratini Dragon Blue True 41.0 64.0 50.0
4 Pikachu Electric Yellow True 35.0 55.0 90.0
5 Caterpie NaN NaN NaN 45.0 30.0 45.0
6 Vulpix NaN NaN NaN 38.0 41.0 65.0

Left joins

Left joins take every single element on the left dataframes and fill-in with the keys in common on the right dataframe. This might be a bit easier to understand with an example:

leftjoin = pd.merge(attribs, stats, on='Name', how='left')
leftjoin
Name Type Color Evolves HP Attack Speed
0 Abra Psychic Yellow True NaN NaN NaN
1 Ekans Poison Purple True NaN NaN NaN
2 Ditto Normal Pink False 48.0 48.0 48.0
3 Dratini Dragon Blue True 41.0 64.0 50.0
4 Pikachu Electric Yellow True 35.0 55.0 90.0

Notice how Abra and Ekans don't have values in HP, Attack, and Speed. This happens because the right table does not contain values for these particular Pokemon.

Right joins

Right joins take every single element on the right dataframes and fill-in with the keys in common on the left dataframe. This might be a bit easier to understand with an example:

rightjoin = pd.merge(attribs, stats, on='Name', how='right')
rightjoin
Name Type Color Evolves HP Attack Speed
0 Ditto Normal Pink False 48 48 48
1 Dratini Dragon Blue True 41 64 50
2 Pikachu Electric Yellow True 35 55 90
3 Caterpie NaN NaN NaN 45 30 45
4 Vulpix NaN NaN NaN 38 41 65

Notice how Caterpie and Vulpix don't have values in Type, Color, and Evolves. This happens because the left table does not contain values for these particular Pokemon.

Other considerations

Joins with key columns of different names

If the names of the columns are different, you can specify them with left_on and right_on

pd.merge(leftdataframe, rightdataframe, left_on='left_key', right_on='right_key')

Joining on indexes

Sometimes you want to use the index of the dataframes as key column to perform the join. For this to work, you just need to pass two additional parameters set to true: left_index=True, right_index=True

Remembering which join type you need

If you are having trouble recalling which type of join is which, take a look at the following image:

joins

Multi-source data

In the previous article, we started talking about the reality of working with real data: You'd be really lucky to find a clean dataset in a single repository. In reality, you will probably need to select and aggregate data from lots of different sources.

Merge is a relatively simple function at a basic level, but it's an incredibly useful and rich tool. You just learned how to use the basics, but it's also a good idea to read the documentation to get an idea of the full capabilities of this function.

In the next article, we will learn how to perform group operations in Pandas data structures.

Thank you for reading!

What to do next

Author image
Budapest, Hungary
Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!