Pandas Basics I: Series and DataFrames

What is Pandas?

Pandas is a free software (software libre) data analysis library for the Python programming language. The library provides analysts and programmers data structures optimized for working with large data sets, and methods for examining and manipulating that data. It uses another free software library, NumPy, for underlying data structures, and Pyplot to generate plots – graphs, histograms, etc.

Using Pandas

The pandas library is part of the standard Python distribution, so using it is as simple as importing it into your project. Likewise, NumPy and pyplot can easily be imported.

import numpy
import pandas
import matplotlib.pyplot

Series

A pandas Series is a one-dimensional array (actually a Numpy ndarray) with labels for each item in the Series. Series are great for data points over time. For example, here’s the annual personal savings rate for Americans from 1967 to 2017 (source: Federal Reserve Bank of St. Louis)

DATE	Rate
1/1/1967	12.2
1/1/1968	11.2
1/1/1969	10.7
1/1/1970	12.6
1/1/1971	13.3
1/1/1972	12.1
1/1/1973	13.1
1/1/1974	12.9
1/1/1975	13
1/1/1976	11.1
1/1/1977	10.2
1/1/1978	10.2
1/1/1979	9.8
1/1/1980	10.6
1/1/1981	11.2
1/1/1982	11.5
1/1/1983	9.5
1/1/1984	10.7
1/1/1985	8.6
1/1/1986	8.2
1/1/1987	7.3
1/1/1988	7.8
1/1/1989	7.8
1/1/1990	7.8
1/1/1991	8.2
1/1/1992	8.9
1/1/1993	7.3
1/1/1994	6.3
1/1/1995	6.4
1/1/1996	5.9
1/1/1997	5.7
1/1/1998	6.2
1/1/1999	4.4
1/1/2000	4.2
1/1/2001	4.3
1/1/2002	5
1/1/2003	4.8
1/1/2004	4.6
1/1/2005	2.6
1/1/2006	3.3
1/1/2007	3
1/1/2008	4.9
1/1/2009	6.1
1/1/2010	5.6
1/1/2011	6
1/1/2012	7.6
1/1/2013	5
1/1/2014	5.7
1/1/2015	6.1
1/1/2016	4.9
1/1/2017	3.4

To represent this data in a pandas Series we call the Series constructor and assign the result to a variable.

import pandas as pd

savings_rate = pd.Series([12.2,11.2,10.7,12.6,13.3,12.1,13.1,12.9,13,11.1,10.2,10.2,9.8,10.6,11.2,11.5,9.5,10.7,8.6,8.2,7.3,7.8,7.8,7.8,8.2,8.9,7.3,6.3,6.4,5.9,5.7,6.2,4.4,4.2,4.3,5,4.8,4.6,2.6,3.3,3,4.9,6.1,5.6,6,7.6,5,5.7,6.1,4.9,3.4])

Now our series representing the data is stored in the savings_rate variable, and each data point is indexed 0 through n-1 where n is the number of items in our series. That means to get the value of the first data point in the Series, we reference the 0th (zeroth) item in the Series. We do this with the square bracket notation…

>>> savings_rate[0]
12.2

There are 51 data points (51 years [1967, 2017]) in our Series. To find the last data point (n-1 = 50) in the series…

>>> savings_rate[50]
3.4

Now to make this Series a time series, we change the index (from 0 to n-1) to a DatetimeIndex. The DatetimeIndex method will convert our textual dates (strings) into a datetime64 data type. After the conversion, when working with the Series we can perform operations in a temporal context.

savings_rate.index = pd.DatetimeIndex(['1/1/1967','1/1/1968','1/1/1969','1/1/1970','1/1/1971','1/1/1972','1/1/1973','1/1/1974','1/1/1975','1/1/1976','1/1/1977','1/1/1978','1/1/1979','1/1/1980','1/1/1981','1/1/1982','1/1/1983','1/1/1984','1/1/1985','1/1/1986','1/1/1987','1/1/1988','1/1/1989','1/1/1990','1/1/1991','1/1/1992','1/1/1993','1/1/1994','1/1/1995','1/1/1996','1/1/1997','1/1/1998','1/1/1999','1/1/2000','1/1/2001','1/1/2002','1/1/2003','1/1/2004','1/1/2005','1/1/2006','1/1/2007','1/1/2008','1/1/2009','1/1/2010','1/1/2011','1/1/2012','1/1/2013','1/1/2014','1/1/2015','1/1/2016','1/1/2017'])

Note: we could also set the index to be other values besides datetime64 values. Index values can such as names (strings) such as, ['Rate in 1967', 'Rate in 1968', ..., 'Rate in 2017']); or other numerical values besides 0 to n-1.

We could have done this all in the constructor. Here’s an example of that which outputs our desired time series data.

import pandas as pd
savings_rate = pd.Series(
[12.2,11.2,10.7,12.6,13.3,12.1,13.1,12.9,13,11.1,10.2,10.2,9.8,10.6,11.2,11.5,9.5,10.7,8.6,8.2,7.3,7.8,7.8,7.8,8.2,8.9,7.3,6.3,6.4,5.9,5.7,6.2,4.4,4.2,4.3,5,4.8,4.6,2.6,3.3,3,4.9,6.1,5.6,6,7.6,5,5.7,6.1,4.9,3.4],
index = pd.DatetimeIndex(['1/1/1967','1/1/1968','1/1/1969','1/1/1970','1/1/1971','1/1/1972','1/1/1973','1/1/1974','1/1/1975','1/1/1976','1/1/1977','1/1/1978','1/1/1979','1/1/1980','1/1/1981','1/1/1982','1/1/1983','1/1/1984','1/1/1985','1/1/1986','1/1/1987','1/1/1988','1/1/1989','1/1/1990','1/1/1991','1/1/1992','1/1/1993','1/1/1994','1/1/1995','1/1/1996','1/1/1997','1/1/1998','1/1/1999','1/1/2000','1/1/2001','1/1/2002','1/1/2003','1/1/2004','1/1/2005','1/1/2006','1/1/2007','1/1/2008','1/1/2009','1/1/2010','1/1/2011','1/1/2012','1/1/2013','1/1/2014','1/1/2015','1/1/2016','1/1/2017'])
)

print(savings_rate)

1967-01-01    12.2
1968-01-01    11.2
1969-01-01    10.7
1970-01-01    12.6
1971-01-01    13.3
1972-01-01    12.1
1973-01-01    13.1
1974-01-01    12.9
1975-01-01    13.0
1976-01-01    11.1
1977-01-01    10.2
1978-01-01    10.2
1979-01-01     9.8
1980-01-01    10.6
1981-01-01    11.2
1982-01-01    11.5
1983-01-01     9.5
1984-01-01    10.7
1985-01-01     8.6
1986-01-01     8.2
1987-01-01     7.3
1988-01-01     7.8
1989-01-01     7.8
1990-01-01     7.8
1991-01-01     8.2
1992-01-01     8.9
1993-01-01     7.3
1994-01-01     6.3
1995-01-01     6.4
1996-01-01     5.9
1997-01-01     5.7
1998-01-01     6.2
1999-01-01     4.4
2000-01-01     4.2
2001-01-01     4.3
2002-01-01     5.0
2003-01-01     4.8
2004-01-01     4.6
2005-01-01     2.6
2006-01-01     3.3
2007-01-01     3.0
2008-01-01     4.9
2009-01-01     6.1
2010-01-01     5.6
2011-01-01     6.0
2012-01-01     7.6
2013-01-01     5.0
2014-01-01     5.7
2015-01-01     6.1
2016-01-01     4.9
2017-01-01     3.4
dtype: float64

DataFrame

A pandas DataFrame is rectangular data – a matrix, like a spreadsheet or database table with rows and columns. This table of cars is an example.

Maker	Year	Model	Package	City MPG	Highway MPG	MSRP
Ford	2018	Fiesta	S Sedan	27	35	14115
Ford	2018	Fusion	S FWD	21	32	22120
Ford	2018	Mustang	GT Fastback	15	24	35095
Honda	2018	Fit	LX Manual	29	36	16190
Honda	2018	Civic	LX Manual	28	40	18840
Honda	2018	Accord	LX CVT	30	38	23570
Hyundai	2018	Accent	SE Manual	28	37	14995
Hyundai	2018	Elantra	SE 2.0L Manual	26	36	16950
Hyundai	2018	Sonata	SE 2.4L	25	36	22050

To create a DataFrame we provide a dictionary of iterables (in this case lists) to the DataFrame constructor. We can also, and often do, use Series as the dictionary values.

import pandas as pd

cars = pd.DataFrame({
 'Maker': ['Ford', 'Ford', 'Ford', 'Honda', 'Honda', 'Honda', 'Hyundai', 'Hyundai', 'Hyundai'],
 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
 'Model': ['Fiesta', 'Fusion', 'Mustang', 'Fit', 'Civic', 'Accord', 'Accent', 'Elantra', 'Sonata'],
 'Package': ['S Sedan', 'S FWD', 'GT Fastback', 'LX Manual', 'LX Manual', 'LX CVT', 'SE Manual', 'SE 2.0L Manual', 'SE 2.4L'],
 'MPG_City': [27, 21, 15, 29, 28, 30, 28, 26, 25],
 'MPG_HWY': [35, 32, 24, 36, 40, 38, 37, 36, 36],
 'MSRP': [14115, 22120, 35095, 16190, 18840, 23570, 14995, 16950, 22050],
 'Cylinders': [4, 4, 8, 4, 4, 4, 4, 4, 4],
 'Bore': [1.6, 2.5, 5, 1.5, 2, 1.5, 1.6, 2, 2.4],
 'Basic_Warranty_Years': [3, 3, 3, 3, 3, 3, 5, 5, 5],
 'Basic_Warranty_Miles': [36000, 36000, 36000, 36000, 36000, 36000, 60000, 60000, 60000],
 'Drivetrain_Warranty_Years': [5, 5, 5, 5, 5, 5, 10, 10, 10],
 'Drivetrain_Warranty_Miles': [60000, 60000, 60000, 60000, 60000, 60000, 100000, 100000, 100000]
})

The resulting DataFrame will automatically index the rows from 0 to n-1 where n is the the number of rows in the DataFrame. To get the first row of data, use the DataFrame’s loc method.

>>> cars.loc[0]

Basic_Warranty_Miles           36000
Basic_Warranty_Years               3
Bore                             1.6
Cylinders                          4
Drivetrain_Warranty_Miles      60000
Drivetrain_Warranty_Years          5
MPG_City                          27
MPG_HWY                           35
MSRP                           14115
Maker                           Ford
Model                         Fiesta
Package                      S Sedan
Year                            2018
Name: 0, dtype: object

To retrieve the values in a given column, simply provide the column name in angle brackets, and if your frame’s columns are numerically valued, provide the column number (e.g. frame[0]).

>>> cars['MSRP']

0    14115
1    22120
2    35095
3    16190
4    18840
5    23570
6    14995
7    16950
8    22050

You can also provide a value to use as the index. With cars, we often use year, make, and model to uniquely represent a specific car. In our data set, it’d probably be better to use those values, as well the car’s package – the optional manufacturer extras selected for each model. If the data shown above were complete, we would probably leave out the year, as it’s the same across all the cars. Actually, if this were all the data, we could probably drop that from the data.

To set the index value to something meaningful to our dataset is a bit tricky because each car is represented by a composite of Year, Maker, and Model. The simple way to assign a new index, for example to MSRP (which wouldn’t really work too well with this data – as the MSRP 1) doesn’t identify the elements in the rows, 2) two cars can have the same MSRP) would be to use the set_index method.

cars = cars.set_index('MSRP');

To set the index to Year, Maker, and Model is a little complicated. I’m using a list comprehension and zip – which interleaves the items of multiple list items into one list – to build a list that will be used as the index.

A list comprehension is essentially a list builder notion in Python. If that’s interesting to you, I encourage you to look it up. They’re quite convenient; as you can see here. In this one line the list comprehension is creating a list where each entry in the list is a concatenation of the year make and model of each car in the DataFrame. Then we’re assigning that list to be the index for the DataFrame.

cars = pd.DataFrame(car_data, index = [str(year)+' '+make+' '+model for year,make,model in zip(car_data['Year'],car_data['Maker'],car_data['Model'])] )

Here’s what our DataFrame looks like. The \ is showing that the output is continued on the next line.

                      Cylinders  Drivetrain_Warranty_Miles  \
2018 Ford Fiesta              4                      60000   
2018 Ford Fusion              4                      60000   
2018 Ford Mustang             8                      60000   
2018 Honda Fit                4                      60000   
2018 Honda Civic              4                      60000   
2018 Honda Accord             4                      60000   
2018 Hyundai Accent           4                     100000   
2018 Hyundai Elantra          4                     100000   
2018 Hyundai Sonata           4                     100000   

                      Drivetrain_Warranty_Years  MPG_City  MPG_HWY   MSRP  \
2018 Ford Fiesta                              5        27       35  14115   
2018 Ford Fusion                              5        21       32  22120   
2018 Ford Mustang                             5        15       24  35095   
2018 Honda Fit                                5        29       36  16190   
2018 Honda Civic                              5        28       40  18840   
2018 Honda Accord                             5        30       38  23570   
2018 Hyundai Accent                          10        28       37  14995   
2018 Hyundai Elantra                         10        26       36  16950   
2018 Hyundai Sonata                          10        25       36  22050   

                        Maker    Model         Package  Year  
2018 Ford Fiesta         Ford   Fiesta         S Sedan  2018  
2018 Ford Fusion         Ford   Fusion           S FWD  2018  
2018 Ford Mustang        Ford  Mustang     GT Fastback  2018  
2018 Honda Fit          Honda      Fit       LX Manual  2018  
2018 Honda Civic        Honda    Civic       LX Manual  2018  
2018 Honda Accord       Honda   Accord          LX CVT  2018  
2018 Hyundai Accent   Hyundai   Accent       SE Manual  2018  
2018 Hyundai Elantra  Hyundai  Elantra  SE 2.0L Manual  2018  
2018 Hyundai Sonata   Hyundai   Sonata         SE 2.4L  2018

And if we wanted to get the information on one specific car, we can use the loc method with the index value.

>>> cars.loc['2018 Ford Mustang']

Basic_Warranty_Miles               36000
Basic_Warranty_Years                   3
Bore                                   5
Cylinders                              8
Drivetrain_Warranty_Miles          60000
Drivetrain_Warranty_Years              5
MPG_City                              15
MPG_HWY                               24
MSRP                               35095
Maker                               Ford
Model                            Mustang
Package                      GT Fastback
Year                                2018
Name: 2018 Ford Mustang, dtype: object

DataFrame from CSV

A common format for representing datasets is CSV (Character (often comma) Separated Values). Most spreadsheet applications, like Microsoft Excel, LibreOffice Calc, and Google Sheets, can open CSV files. If your file is in one of these applications’ file formats (such as .xls and .xlsx of Excel) you can easily convert the file to CSV. Note: CSV files cannot contain formatting – such as font sizes and styles – nor calculations that other spreadsheet formats can, but the data in a CSV file can be used by many different applications.

In our case, we can take a CSV file and use pandas to create a DataFrame. To create the same DataFrame as above, we use the pandas read_csv method. We must tell pandas where the file is as well as set some other parameters.

import pandas as pd

cars = pd.read_csv('car_data.csv', index_col=None,parse_dates=False);

By default, the read_csv method treats the first row of the csv file as a header (as it is in our example table and in the CSV used). If your CSV file doesn’t have a header row set header=None.

Setting index_col to None gives us the 0 to n-1 indexing that we got when we created the DataFrame from a dictionary. If we don’t set parse_dates to False, our dates will be converted from 2018 to 2018-01-01 format (from strings to datetime64 data type).

We can set the index of this cars DataFrame to the same as the first cars DataFrame.

cars.index = [str(year)+' '+make+' '+model for year,make,model in zip(cars['Year'],cars['Maker'],cars['Model'])]

Now this cars DataFrame is identical to the one we created from the dictionary.

Materials

The car_data CSV.

Jason Favrod

Adventures in Information