Merge/join operations in Pandas let you gather information from many tables into a single dataframe for further processing or analysis. This is another important skill that you will probably use a lot when working with data.
If you have some experience with relational databases you can recognize the analogous behavior with table joins. In this article, we will demo some of the most important behavior offered by Pandas' merge
function. It will probably be more than enough to keep you going before you need to consult more documentation.
Great, let's get started!
Loading the data for the examples
For the demos, we will use data from two different files. One of them contains some non-numeric attributes of a few Pokemon, the other contains base stats like HP, Attack, and Speed.
import pandas as pd
attribs = pd.read_csv('./sample_data/poke_attributes.csv')
attribs
Name | Type | Color | Evolves | |
---|---|---|---|---|
0 | Abra | Psychic | Yellow | True |
1 | Ekans | Poison | Purple | True |
2 | Ditto | Normal | Pink | False |
3 | Dratini | Dragon | Blue | True |
4 | Pikachu | Electric | Yellow | True |
stats = pd.read_csv('./sample_data/poke_stats.csv')
stats
Name | HP | Attack | Speed | |
---|---|---|---|---|
0 | Ditto | 48 | 48 | 48 |
1 | Dratini | 41 | 64 | 50 |
2 | Pikachu | 35 | 55 | 90 |
3 | Caterpie | 45 | 30 | 45 |
4 | Vulpix | 38 | 41 | 65 |
Notice that Ditto, Dratini, and Pikachu are on both tables, but Abra and Ekans appear only on the first one while Caterpie and Vulpix appear only on the second one.
Now let's go back to the programming bits: Pandas' merge function links the rows of two dataframes using one or more keys. In the following examples, we will use the Name column as key to show different types of merge/join.
Inner Joins
An inner join is performed on the intersection of the keys of the two dataframes. In this case, it will find the Names that are in both dataframes (Ditto, Dratini, and Pikachu) and create a new dataframe with all the columns on both original dataframes.
We will call the merge function specifying two additional arguments:
- how: Tells merge which type of join to perform.
- on: Tells merge which columns to use as key.
innerjoin = pd.merge(attribs, stats, on='Name', how='inner')
innerjoin
Name | Type | Color | Evolves | HP | Attack | Speed | |
---|---|---|---|---|---|---|---|
0 | Ditto | Normal | Pink | False | 48 | 48 | 48 |
1 | Dratini | Dragon | Blue | True | 41 | 64 | 50 |
2 | Pikachu | Electric | Yellow | True | 35 | 55 | 90 |
If you want to perform an inner join you can skip the how attribute: The default behavior of the merge function is an inner join:
innerjoin = pd.merge(attribs, stats, on='Name')
innerjoin
Name | Type | Color | Evolves | HP | Attack | Speed | |
---|---|---|---|---|---|---|---|
0 | Ditto | Normal | Pink | False | 48 | 48 | 48 |
1 | Dratini | Dragon | Blue | True | 41 | 64 | 50 |
2 | Pikachu | Electric | Yellow | True | 35 | 55 | 90 |
Outer joins
Outer joins are performed on the union of the keys of the two dataframes. In this case, it will use every single name in the original dataframes and fill the missing fields with NaN.
outerjoin = pd.merge(attribs, stats, on='Name', how='outer')
outerjoin
Name | Type | Color | Evolves | HP | Attack | Speed | |
---|---|---|---|---|---|---|---|
0 | Abra | Psychic | Yellow | True | NaN | NaN | NaN |
1 | Ekans | Poison | Purple | True | NaN | NaN | NaN |
2 | Ditto | Normal | Pink | False | 48.0 | 48.0 | 48.0 |
3 | Dratini | Dragon | Blue | True | 41.0 | 64.0 | 50.0 |
4 | Pikachu | Electric | Yellow | True | 35.0 | 55.0 | 90.0 |
5 | Caterpie | NaN | NaN | NaN | 45.0 | 30.0 | 45.0 |
6 | Vulpix | NaN | NaN | NaN | 38.0 | 41.0 | 65.0 |
Left joins
Left joins take every single element on the left dataframes and fill-in with the keys in common on the right dataframe. This might be a bit easier to understand with an example:
leftjoin = pd.merge(attribs, stats, on='Name', how='left')
leftjoin
Name | Type | Color | Evolves | HP | Attack | Speed | |
---|---|---|---|---|---|---|---|
0 | Abra | Psychic | Yellow | True | NaN | NaN | NaN |
1 | Ekans | Poison | Purple | True | NaN | NaN | NaN |
2 | Ditto | Normal | Pink | False | 48.0 | 48.0 | 48.0 |
3 | Dratini | Dragon | Blue | True | 41.0 | 64.0 | 50.0 |
4 | Pikachu | Electric | Yellow | True | 35.0 | 55.0 | 90.0 |
Notice how Abra and Ekans don't have values in HP, Attack, and Speed. This happens because the right table does not contain values for these particular Pokemon.
Right joins
Right joins take every single element on the right dataframes and fill-in with the keys in common on the left dataframe. This might be a bit easier to understand with an example:
rightjoin = pd.merge(attribs, stats, on='Name', how='right')
rightjoin
Name | Type | Color | Evolves | HP | Attack | Speed | |
---|---|---|---|---|---|---|---|
0 | Ditto | Normal | Pink | False | 48 | 48 | 48 |
1 | Dratini | Dragon | Blue | True | 41 | 64 | 50 |
2 | Pikachu | Electric | Yellow | True | 35 | 55 | 90 |
3 | Caterpie | NaN | NaN | NaN | 45 | 30 | 45 |
4 | Vulpix | NaN | NaN | NaN | 38 | 41 | 65 |
Notice how Caterpie and Vulpix don't have values in Type, Color, and Evolves. This happens because the left table does not contain values for these particular Pokemon.
Other considerations
Joins with key columns of different names
If the names of the columns are different, you can specify them with left_on and right_on
pd.merge(leftdataframe, rightdataframe, left_on='left_key', right_on='right_key')
Joining on indexes
Sometimes you want to use the index of the dataframes as key column to perform the join. For this to work, you just need to pass two additional parameters set to true: left_index=True, right_index=True
Remembering which join type you need
If you are having trouble recalling which type of join is which, take a look at the following image:
Multi-source data
In the previous article, we started talking about the reality of working with real data: You'd be really lucky to find a clean dataset in a single repository. In reality, you will probably need to select and aggregate data from lots of different sources.
Merge is a relatively simple function at a basic level, but it's an incredibly useful and rich tool. You just learned how to use the basics, but it's also a good idea to read the documentation to get an idea of the full capabilities of this function.
In the next article, we will learn how to perform group operations in Pandas data structures.
Thank you for reading!
What to do next
- Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
- You can find the source code for this series in this repo.
- This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
- Send me an email with questions, comments or suggestions (it's in the About Me page)