Hands-on Pandas(7): Loading data from files

Data analysis usually starts by loading data into the structures of your library/tools of choice. Almost always this data will either come from a database, the web, or a collection of files.

The files that contain your data can come in many different formats: Comma-separated values in a text file, JSON files, excel files, or files with values separated by custom characters. In this article, we will learn how to read data from some of the common file formats using Pandas' built-in functions.

Great, let's get started!

Reading data from CSV files

CSV files are an incredibly common file format, and many tutorials and small file repos use them as default data format. Imagine you have a file with this contents at ./sample_data/pokes.csv:

Name,Type,Color,Evolves,HP
Abra,Psychic,Yellow,True,25
Pikachu,Electric,Yellow,True,35
Ekans,Poison,Purple,True,35
Dratini,Dragon,Blue,True,41
Ditto,Normal,Pink,False,48

You can easily put this data in a dataframe using read_csv:

import pandas as pd

poke_data = pd.read_csv('./sample_data/pokes.csv')
poke_data

	Name	Type	Color	Evolves	HP
0	Abra	Psychic	Yellow	True	25
1	Pikachu	Electric	Yellow	True	35
2	Ekans	Poison	Purple	True	35
3	Dratini	Dragon	Blue	True	41
4	Ditto	Normal	Pink	False	48

Note that read_csv automatically assigns the first line as column index and adds its own default row index. You can specify which column to use as index with the index_col attribute:

poke_data = pd.read_csv('./sample_data/pokes.csv', index_col='Name')
poke_data

	Type	Color	Evolves	HP
Name
Abra	Psychic	Yellow	True	25
Pikachu	Electric	Yellow	True	35
Ekans	Poison	Purple	True	35
Dratini	Dragon	Blue	True	41
Ditto	Normal	Pink	False	48

Sometimes files don't have column names on the first row. In this case, you can provide an additional parameter to read_csv to assign column names:

poke_data = pd.read_csv('./sample_data/pokes_no_header.csv',
                        names=['Name', 'Type', 'Color', 'Evolves', 'HP'],
                        index_col='Name')
poke_data

	Type	Color	Evolves	HP
Name
Abra	Psychic	Yellow	True	25
Pikachu	Electric	Yellow	True	35
Ekans	Poison	Purple	True	35
Dratini	Dragon	Blue	True	41
Ditto	Normal	Pink	False	48

Reading data from tables

Pandas has a function called read_table that lets you load files with data in tabular form. This function receives a parameter sep that specifies the separator character in the table. Imagine you have a file with the following contents:

Name|Type|Color|Evolves|HP
Abra|Psychic|Yellow|True|25
Pikachu|Electric|Yellow|True|35
Ekans|Poison|Purple|True|35
Dratini|Dragon|Blue|True|41
Ditto|Normal|Pink|False|48

This table uses the '|' character as separator. We can load the data into pandas with the following call:

poke_data = pd.read_table('./sample_data/pokes_table', sep='|')
poke_data

	Name	Type	Color	Evolves	HP
0	Abra	Psychic	Yellow	True	25
1	Pikachu	Electric	Yellow	True	35
2	Ekans	Poison	Purple	True	35
3	Dratini	Dragon	Blue	True	41
4	Ditto	Normal	Pink	False	48

# You can emulate the behavior of read_csv by passing sep=',' as attribute:
poke_data = pd.read_table('./sample_data/pokes.csv', sep=',')
poke_data

	Name	Type	Color	Evolves	HP
0	Abra	Psychic	Yellow	True	25
1	Pikachu	Electric	Yellow	True	35
2	Ekans	Poison	Purple	True	35
3	Dratini	Dragon	Blue	True	41
4	Ditto	Normal	Pink	False	48

pd.read_table is very useful if you have a file with entries that are separated by a varying number of spaces or tabs. By passing the regular expression sep='\s+' you will be able to load such files. Let's load a file with the following contents:

Name  Type Color   Evolves      HP
Abra Psychic           Yellow  True  25
Pikachu  Electric  Yellow  True  35
Ekans   Poison       Purple  True  35
Dratini                 Dragon              Blue        True              41
Ditto    Normal  Pink  False  48

poke_data = pd.read_table('./sample_data/pokes_varspace', sep='\s+')
poke_data

	Name	Type	Color	Evolves	HP
0	Abra	Psychic	Yellow	True	25
1	Pikachu	Electric	Yellow	True	35
2	Ekans	Poison	Purple	True	35
3	Dratini	Dragon	Blue	True	41
4	Ditto	Normal	Pink	False	48

Reading other formats

The two other common formats you will probably find in practice are Excel-like files and JSON files. Pandas has the functions read_excel and read_json to aid you when working with those files. You can read the official documentation here:

Files are not everything, but it's a great start

While databases are probably the most common form of data storage you will deal with, knowing how to read data from files is a fundamental skill when working with Pandas. Lots of online repositories and tutorials offer huge libraries of data in several different file formats.

With the material covered in this article, and with access to Pandas' excellent documentation you will have no problem working with files.

Now that we can get the data into dataframes, the next step is learning how to clean it. In the next article, we will learn some techniques for preparing data for processing.

Thank you for reading!

What to do next

Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
You can find the source code for this series in this repo.
This article is based on Python for Data Analysis. These and other very helpful books can be found in the recommended reading list.
Send me an email with questions, comments or suggestions (it's in the About Me page)

Hands-on Pandas(7): Loading data from files

Reading data from CSV files

Reading data from tables

Reading other formats

Files are not everything, but it's a great start

What to do next

Newsletter

Recent Post

Categories