BrainsToBytes

Hands-on Pandas(7): Loading data from files

Data analysis usually starts by loading data into the structures of your library/tools of choice. Almost always this data will either come from a database, the web, or a collection of files.

The files that contain your data can come in many different formats: Comma-separated values in a text file, JSON files, excel files, or files with values separated by custom characters. In this article, we will learn how to read data from some of the common file formats using Pandas' built-in functions.

Great, let's get started!

Reading data from CSV files

CSV files are an incredibly common file format, and many tutorials and small file repos use them as default data format. Imagine you have a file with this contents at ./sample_data/pokes.csv:

Name,Type,Color,Evolves,HP
Abra,Psychic,Yellow,True,25
Pikachu,Electric,Yellow,True,35
Ekans,Poison,Purple,True,35
Dratini,Dragon,Blue,True,41
Ditto,Normal,Pink,False,48

You can easily put this data in a dataframe using read_csv:

import pandas as pd

poke_data = pd.read_csv('./sample_data/pokes.csv')
poke_data
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48

Note that read_csv automatically assigns the first line as column index and adds its own default row index. You can specify which column to use as index with the index_col attribute:

poke_data = pd.read_csv('./sample_data/pokes.csv', index_col='Name')
poke_data
Type Color Evolves HP
Name
Abra Psychic Yellow True 25
Pikachu Electric Yellow True 35
Ekans Poison Purple True 35
Dratini Dragon Blue True 41
Ditto Normal Pink False 48

Sometimes files don't have column names on the first row. In this case, you can provide an additional parameter to read_csv to assign column names:

poke_data = pd.read_csv('./sample_data/pokes_no_header.csv',
                        names=['Name', 'Type', 'Color', 'Evolves', 'HP'],
                        index_col='Name')
poke_data
Type Color Evolves HP
Name
Abra Psychic Yellow True 25
Pikachu Electric Yellow True 35
Ekans Poison Purple True 35
Dratini Dragon Blue True 41
Ditto Normal Pink False 48

Reading data from tables

Pandas has a function called read_table that lets you load files with data in tabular form. This function receives a parameter sep that specifies the separator character in the table. Imagine you have a file with the following contents:

Name|Type|Color|Evolves|HP
Abra|Psychic|Yellow|True|25
Pikachu|Electric|Yellow|True|35
Ekans|Poison|Purple|True|35
Dratini|Dragon|Blue|True|41
Ditto|Normal|Pink|False|48

This table uses the '|' character as separator. We can load the data into pandas with the following call:

poke_data = pd.read_table('./sample_data/pokes_table', sep='|')
poke_data
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48
# You can emulate the behavior of read_csv by passing sep=',' as attribute:
poke_data = pd.read_table('./sample_data/pokes.csv', sep=',')
poke_data
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48

pd.read_table is very useful if you have a file with entries that are separated by a varying number of spaces or tabs. By passing the regular expression sep='\s+' you will be able to load such files. Let's load a file with the following contents:

Name  Type Color   Evolves      HP
Abra Psychic           Yellow  True  25
Pikachu  Electric  Yellow  True  35
Ekans   Poison       Purple  True  35
Dratini                 Dragon              Blue        True              41
Ditto    Normal  Pink  False  48
poke_data = pd.read_table('./sample_data/pokes_varspace', sep='\s+')
poke_data
Name Type Color Evolves HP
0 Abra Psychic Yellow True 25
1 Pikachu Electric Yellow True 35
2 Ekans Poison Purple True 35
3 Dratini Dragon Blue True 41
4 Ditto Normal Pink False 48

Reading other formats

The two other common formats you will probably find in practice are Excel-like files and JSON files. Pandas has the functions read_excel and read_json to aid you when working with those files. You can read the official documentation here:

Files are not everything, but it's a great start

While databases are probably the most common form of data storage you will deal with, knowing how to read data from files is a fundamental skill when working with Pandas. Lots of online repositories and tutorials offer huge libraries of data in several different file formats.

With the material covered in this article, and with access to Pandas' excellent documentation you will have no problem working with files.

Now that we can get the data into dataframes, the next step is learning how to clean it. In the next article, we will learn some techniques for preparing data for processing.

Thank you for reading!

What to do next

Author image
Budapest, Hungary
Hey there, I'm Juan. A programmer currently living in Budapest. I believe in well-engineered solutions, clean code and sharing knowledge. Thanks for reading, I hope you find my articles useful!