Hands-on Pandas (1): Series and DataFrames

In a previous series we covered the fundamentals of NumPy. Now it's time to deal with another important tool frequently used in data analysis: Pandas.

Pandas is a library for data manipulation and analysis that lets you manipulate heterogeneous data in tabular form (in contrast to NumPy, designed to work with homogeneous numerical data in array form). It includes data structures and data manipulation features that make cleaning and analyzing data a quick and easy task.

It was originally developed by Wes McKinney (2008), but over the years it has gained a hugely supportive community that continually invests in improving the tool.

This article will be the introduction to a series where we'll learn how to work with Pandas using a hands-on approach. If you are just starting, I recommend running a Jupyter notebook alongside and trying the examples.

Great, let's start our journey by talking about the two most important data structures in Pandas: Series and DataFrames.

# Before proceeding, we need to import pandas. The most common alias for it is pd
import pandas as pd


A Series is an array-like collection containing a sequence of values and an associated set of labels (called the index).

# You can create a series by feeding a list to pd.Series
simple_series = pd.Series([2,4,6,8,10])
0     2
1     4
2     6
3     8
4    10
dtype: int64

The column on the left is the index of our series, and the column on the right contains the values. We didn't provide an index during creation, so Pandas created a default index consisting of integers starting at 0. You can pass an index parameter during creation like this:

series = pd.Series([2,4,6,8,10], index=['dos', 'quatre', 'six', 'oito', 'dieci'])
dos        2
quatre     4
six        6
oito       8
dieci     10
dtype: int64
# You can retrieve the index using the .index attribute of the series
series.index
Index(['dos', 'quatre', 'six', 'oito', 'dieci'], dtype='object')
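
As a small sketch of the two cases above, the following compares the default index with a custom one. The variable names (default_series, labeled_series and so on) are illustrative and not part of the original examples. It also uses the .values attribute, which exposes the data of a series as a NumPy array without the labels.

```python
import pandas as pd

# Without an index argument, a default integer index starting at 0 is built
default_series = pd.Series([2, 4, 6, 8, 10])
default_labels = list(default_series.index)

# With an index argument, the labels you pass are used instead
labeled_series = pd.Series([2, 4, 6, 8, 10],
                           index=['dos', 'quatre', 'six', 'oito', 'dieci'])
custom_labels = list(labeled_series.index)

# .values holds the data itself, without the labels
raw_values = labeled_series.values.tolist()

print(default_labels)   # [0, 1, 2, 3, 4]
print(custom_labels)    # ['dos', 'quatre', 'six', 'oito', 'dieci']
print(raw_values)       # [2, 4, 6, 8, 10]
```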

You can retrieve elements from the series using the respective index value as if you were getting elements from a standard Python dictionary:

# Let's get the element whose index is 'oito'
element = series['oito']
# We can retrieve more than one element if we provide a list of index labels
elements = series[['dieci', 'dos', 'six']]
dieci    10
dos       2
six       6
dtype: int64
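
The dictionary analogy goes a bit further than square brackets. The sketch below (variable names are illustrative) checks whether a label exists with the in operator, and then fetches the same element in two explicit ways: by label with .loc and by integer position with .iloc.

```python
import pandas as pd

series = pd.Series([2, 4, 6, 8, 10],
                   index=['dos', 'quatre', 'six', 'oito', 'dieci'])

# Like a dictionary, a series supports membership tests on its labels
has_oito = 'oito' in series
has_doce = 'doce' in series

# .loc selects by label, .iloc selects by integer position
by_label = series.loc['oito']
by_position = series.iloc[3]

print(has_oito, has_doce)      # True False
print(by_label, by_position)   # 8 8
```

Being explicit with .loc and .iloc is mostly useful when the labels are themselves integers, because then plain square brackets can be ambiguous to the reader.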

An alternative way of creating Series is by providing a dictionary as argument during creation:

num_data = {'dos': 2,
            'quatre': 4,
            'six': 6,
            'oito': 8,
            'dieci': 10}

other_series = pd.Series(num_data)
dos        2
quatre     4
six        6
oito       8
dieci     10
dtype: int64
# You can alter the order by providing an additional index argument, just be careful:
# any label in the index list that is not a key of the dictionary gets the value NaN,
# and the integer values are converted to floats (NaN is a floating-point value)
other_series = pd.Series(num_data, index=['doce', 'dieci', 'oito', 'six', 'quatre', 'dos'])
doce       NaN
dieci     10.0
oito       8.0
six        6.0
quatre     4.0
dos        2.0
dtype: float64
# As a last detail, you can assign names to both the values and the index of a series
other_series.name = 'Numbers'
other_series.index.name = 'Names'
Names
doce       NaN
dieci     10.0
oito       8.0
six        6.0
quatre     4.0
dos        2.0
Name: Numbers, dtype: float64
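
Since a label that is missing from the dictionary ends up as NaN, it helps to know how to find and remove those entries. Here is a minimal sketch using a shortened version of the series above; isna() and dropna() are standard Series methods, while the variable names are just for illustration.

```python
import pandas as pd

num_data = {'dos': 2, 'quatre': 4, 'six': 6, 'oito': 8, 'dieci': 10}

# 'doce' is not a key of the dictionary, so its value becomes NaN
other_series = pd.Series(num_data, index=['doce', 'dieci', 'oito'])

# isna() returns a boolean series marking the missing entries
missing_mask = other_series.isna()
missing_labels = other_series[missing_mask].index.tolist()

# dropna() returns a new series without the missing entries
clean_series = other_series.dropna()
remaining_labels = clean_series.index.tolist()

print(missing_labels)      # ['doce']
print(remaining_labels)    # ['dieci', 'oito']
print(other_series.dtype)  # float64
```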


A DataFrame represents a rectangular table of data organized in columns. Each column can have a different data type: some might represent numeric data like temperatures or ages, and others might contain strings or boolean entries. DataFrames have both a column index and a row index.

# DataFrames can also be created from dictionaries

poke_data = {'Name': ['Abra', 'Koffing', 'Ditto', 'Pikachu'],
             'Type': ['Psychic', 'Poison', 'Normal', 'Electric'],
             'Base speed': [90, 35, 48, 90],
             'Learns transform': [False, False, True, False]}

poke_frame = pd.DataFrame(poke_data)
      Name      Type  Base speed  Learns transform
0     Abra   Psychic          90             False
1  Koffing    Poison          35             False
2    Ditto    Normal          48              True
3  Pikachu  Electric          90             False
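
To make the "column index and row index" idea concrete, here is a short sketch that inspects the frame we just built. The .columns, .index, .shape and .dtypes attributes are part of Pandas; the variable names are illustrative.

```python
import pandas as pd

poke_data = {'Name': ['Abra', 'Koffing', 'Ditto', 'Pikachu'],
             'Type': ['Psychic', 'Poison', 'Normal', 'Electric'],
             'Base speed': [90, 35, 48, 90],
             'Learns transform': [False, False, True, False]}
poke_frame = pd.DataFrame(poke_data)

# The column index holds the dictionary keys
column_labels = list(poke_frame.columns)

# The row index is the default one: integers starting at 0
row_labels = list(poke_frame.index)

# shape is (number of rows, number of columns)
frame_shape = poke_frame.shape

# Each column keeps its own data type
speed_dtype = str(poke_frame['Base speed'].dtype)
transform_dtype = str(poke_frame['Learns transform'].dtype)

print(column_labels)
print(row_labels)    # [0, 1, 2, 3]
print(frame_shape)   # (4, 4)
print(poke_frame.dtypes)
```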
# You can retrieve dataframe columns (as a Series object) using either dictionary-like syntax or by attribute

poke_speeds = poke_frame['Base speed']
0    90
1    35
2    48
3    90
Name: Base speed, dtype: int64
poke_types = poke_frame.Type
0     Psychic
1      Poison
2      Normal
3    Electric
Name: Type, dtype: object
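
One caveat about attribute access, shown as a hedged sketch: it only works when the column name can be written as a Python identifier (so poke_frame.Base speed is a syntax error), and it does not work when the name clashes with an existing DataFrame attribute or method. The 'count' column below is a made-up example chosen because DataFrame already has a count method.

```python
import pandas as pd

# 'count' is a hypothetical column name that clashes with DataFrame.count
frame = pd.DataFrame({'Type': ['Psychic', 'Poison'],
                      'count': [3, 5]})

# For a regular name, both syntaxes return the same column
type_matches = frame['Type'].equals(frame.Type)

# frame.count is the DataFrame method, not our column
attribute_is_method = callable(frame.count)

# Dictionary-like syntax always returns the column
count_values = frame['count'].tolist()

print(type_matches)         # True
print(attribute_is_method)  # True
print(count_values)         # [3, 5]
```

Because of this, the dictionary-like syntax is the safer default, and it is the only option for a name such as 'Base speed'.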
# Rows can be retrieved with the loc indexer by providing the right index label

abra_data = poke_frame.loc[0]
Name                   Abra
Type                Psychic
Base speed               90
Learns transform      False
Name: 0, dtype: object
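
In the example above the row label happens to be the integer 0, which hides the difference between labels and positions. The sketch below assumes a variation of our frame in which the names are used as the row index (an illustrative choice, not something done in the original example), so that .loc (by label) and .iloc (by position) are easier to tell apart.

```python
import pandas as pd

# Assumption for illustration: the Pokémon names are used as row labels
poke_frame = pd.DataFrame(
    {'Type': ['Psychic', 'Poison', 'Normal', 'Electric'],
     'Base speed': [90, 35, 48, 90]},
    index=['Abra', 'Koffing', 'Ditto', 'Pikachu'])

# .loc takes a row label, .iloc takes an integer position
ditto_by_label = poke_frame.loc['Ditto']
ditto_by_position = poke_frame.iloc[2]
same_row = ditto_by_label.equals(ditto_by_position)

# A row label plus a column label returns a single value
ditto_speed = poke_frame.loc['Ditto', 'Base speed']

# A list of labels returns a smaller DataFrame
subset_labels = list(poke_frame.loc[['Abra', 'Pikachu']].index)

print(same_row)       # True
print(ditto_speed)    # 48
print(subset_labels)  # ['Abra', 'Pikachu']
```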
# You can update the values of a column using standard assignment
# Let's update the names with 'cute' versions

poke_frame['Name'] = ['Cute Abra', 'Cute Koffing', 'Cute Ditto', 'Cute Pikachu']
           Name      Type  Base speed  Learns transform
0     Cute Abra   Psychic          90             False
1  Cute Koffing    Poison          35             False
2    Cute Ditto    Normal          48              True
3  Cute Pikachu  Electric          90             False
# If you perform the assignment to a column that doesn't exist yet, a new one will be created
poke_frame['Yellow'] = [True, False, False, True]
           Name      Type  Base speed  Learns transform  Yellow
0     Cute Abra   Psychic          90             False    True
1  Cute Koffing    Poison          35             False   False
2    Cute Ditto    Normal          48              True   False
3  Cute Pikachu  Electric          90             False    True
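
The value on the right side of the assignment doesn't have to be a list. As a small sketch, a single value is repeated for every row, and a new column can also be computed from an existing one. The 'Generation' and 'Fast' columns and the threshold of 50 are made up for this illustration.

```python
import pandas as pd

poke_frame = pd.DataFrame({'Name': ['Abra', 'Koffing', 'Ditto', 'Pikachu'],
                           'Base speed': [90, 35, 48, 90]})

# A single value is repeated for every row
poke_frame['Generation'] = 1

# A comparison on a column produces a boolean Series we can store as a new column
poke_frame['Fast'] = poke_frame['Base speed'] > 50

generation_values = poke_frame['Generation'].tolist()
fast_values = poke_frame['Fast'].tolist()

print(generation_values)  # [1, 1, 1, 1]
print(fast_values)        # [True, False, False, True]
```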
# And finally, if you want to get rid of a specific column you can use del
del poke_frame['Yellow']
           Name      Type  Base speed  Learns transform
0     Cute Abra   Psychic          90             False
1  Cute Koffing    Poison          35             False
2    Cute Ditto    Normal          48              True
3  Cute Pikachu  Electric          90             False
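
Keep in mind that del modifies the dataframe in place. If you'd rather keep the original untouched, one alternative is the drop method, which returns a new dataframe. A minimal sketch with a reduced version of our frame (variable names are illustrative):

```python
import pandas as pd

poke_frame = pd.DataFrame({'Name': ['Abra', 'Koffing', 'Ditto', 'Pikachu'],
                           'Base speed': [90, 35, 48, 90],
                           'Yellow': [True, False, False, True]})

# drop returns a new dataframe and leaves the original as it was
trimmed_frame = poke_frame.drop(columns=['Yellow'])
original_columns = list(poke_frame.columns)
trimmed_columns = list(trimmed_frame.columns)

# del removes the column from the original dataframe itself
del poke_frame['Yellow']
columns_after_del = list(poke_frame.columns)

print(original_columns)   # ['Name', 'Base speed', 'Yellow']
print(trimmed_columns)    # ['Name', 'Base speed']
print(columns_after_del)  # ['Name', 'Base speed']
```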

These are just the first steps

I think that is enough for an introduction. All you need to remember is that DataFrames represent tabular data and Series represent just a row (or column) of data at a time.

In the next articles we will learn some useful techniques for manipulating and analyzing data. All that knowledge is built on top of the foundations we just learned, so feel free to experiment a bit on your own to solidify your understanding. Create a couple of dataframes and series with data you know: groceries, pets, fundamental particles.

Thank you for reading!

Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!