Hands-on Pandas(9): Merging Dataframes

Merge/join operations in Pandas let you gather information from many tables into a single dataframe for further processing or analysis. This is another important skill that you will probably use a lot when working with data.

If you have some experience with relational databases you can recognize the analogous behavior with table joins. In this article, we will demo some of the most important behavior offered by Pandas' merge function. It will probably be more than enough to keep you going before you need to consult more documentation.

Great, let's get started!

Loading the data for the examples

For the demos, we will use data from two different files. One of them contains some non-numeric attributes of a few Pokemon, the other contains base stats like HP, Attack, and Speed.

import pandas as pd

attribs = pd.read_csv('./sample_data/poke_attributes.csv')
attribs

	Name	Type	Color	Evolves
0	Abra	Psychic	Yellow	True
1	Ekans	Poison	Purple	True
2	Ditto	Normal	Pink	False
3	Dratini	Dragon	Blue	True
4	Pikachu	Electric	Yellow	True

stats = pd.read_csv('./sample_data/poke_stats.csv')
stats

	Name	HP	Attack	Speed
0	Ditto	48	48	48
1	Dratini	41	64	50
2	Pikachu	35	55	90
3	Caterpie	45	30	45
4	Vulpix	38	41	65

Notice that Ditto, Dratini, and Pikachu are on both tables, but Abra and Ekans appear only on the first one while Caterpie and Vulpix appear only on the second one.

Now let's go back to the programming bits: Pandas' merge function links the rows of two dataframes using one or more keys. In the following examples, we will use the Name column as key to show different types of merge/join.

Inner Joins

An inner join is performed on the intersection of the keys of the two dataframes. In this case, it will find the Names that are in both dataframes (Ditto, Dratini, and Pikachu) and create a new dataframe with all the columns on both original dataframes.

We will call the merge function specifying two additional arguments:

how: Tells merge which type of join to perform.
on: Tells merge which columns to use as key.

innerjoin = pd.merge(attribs, stats, on='Name', how='inner')
innerjoin

	Name	Type	Color	Evolves	HP	Attack	Speed
0	Ditto	Normal	Pink	False	48	48	48
1	Dratini	Dragon	Blue	True	41	64	50
2	Pikachu	Electric	Yellow	True	35	55	90

If you want to perform an inner join you can skip the how attribute: The default behavior of the merge function is an inner join:

innerjoin = pd.merge(attribs, stats, on='Name')
innerjoin

	Name	Type	Color	Evolves	HP	Attack	Speed
0	Ditto	Normal	Pink	False	48	48	48
1	Dratini	Dragon	Blue	True	41	64	50
2	Pikachu	Electric	Yellow	True	35	55	90

Outer joins

Outer joins are performed on the union of the keys of the two dataframes. In this case, it will use every single name in the original dataframes and fill the missing fields with NaN.

outerjoin = pd.merge(attribs, stats, on='Name', how='outer')
outerjoin

	Name	Type	Color	Evolves	HP	Attack	Speed
0	Abra	Psychic	Yellow	True	NaN	NaN	NaN
1	Ekans	Poison	Purple	True	NaN	NaN	NaN
2	Ditto	Normal	Pink	False	48.0	48.0	48.0
3	Dratini	Dragon	Blue	True	41.0	64.0	50.0
4	Pikachu	Electric	Yellow	True	35.0	55.0	90.0
5	Caterpie	NaN	NaN	NaN	45.0	30.0	45.0
6	Vulpix	NaN	NaN	NaN	38.0	41.0	65.0

Left joins

Left joins take every single element on the left dataframes and fill-in with the keys in common on the right dataframe. This might be a bit easier to understand with an example:

leftjoin = pd.merge(attribs, stats, on='Name', how='left')
leftjoin

	Name	Type	Color	Evolves	HP	Attack	Speed
0	Abra	Psychic	Yellow	True	NaN	NaN	NaN
1	Ekans	Poison	Purple	True	NaN	NaN	NaN
2	Ditto	Normal	Pink	False	48.0	48.0	48.0
3	Dratini	Dragon	Blue	True	41.0	64.0	50.0
4	Pikachu	Electric	Yellow	True	35.0	55.0	90.0

Notice how Abra and Ekans don't have values in HP, Attack, and Speed. This happens because the right table does not contain values for these particular Pokemon.

Right joins

Right joins take every single element on the right dataframes and fill-in with the keys in common on the left dataframe. This might be a bit easier to understand with an example:

rightjoin = pd.merge(attribs, stats, on='Name', how='right')
rightjoin

	Name	Type	Color	Evolves	HP	Attack	Speed
0	Ditto	Normal	Pink	False	48	48	48
1	Dratini	Dragon	Blue	True	41	64	50
2	Pikachu	Electric	Yellow	True	35	55	90
3	Caterpie	NaN	NaN	NaN	45	30	45
4	Vulpix	NaN	NaN	NaN	38	41	65

Notice how Caterpie and Vulpix don't have values in Type, Color, and Evolves. This happens because the left table does not contain values for these particular Pokemon.

Other considerations

Joins with key columns of different names

If the names of the columns are different, you can specify them with left_on and right_on

pd.merge(leftdataframe, rightdataframe, left_on='left_key', right_on='right_key')

Joining on indexes

Sometimes you want to use the index of the dataframes as key column to perform the join. For this to work, you just need to pass two additional parameters set to true: left_index=True, right_index=True

Remembering which join type you need

If you are having trouble recalling which type of join is which, take a look at the following image:

joins

Multi-source data

In the previous article, we started talking about the reality of working with real data: You'd be really lucky to find a clean dataset in a single repository. In reality, you will probably need to select and aggregate data from lots of different sources.

Merge is a relatively simple function at a basic level, but it's an incredibly useful and rich tool. You just learned how to use the basics, but it's also a good idea to read the documentation to get an idea of the full capabilities of this function.

In the next article, we will learn how to perform group operations in Pandas data structures.

Thank you for reading!

What to do next

Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
You can find the source code for this series in this repo.
This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
Send me an email with questions, comments or suggestions (it's in the About Me page)