Pandas Basics I: Series and DataFrames

What is Pandas?

Pandas is a free software (software libre) data analysis library for the Python programming language. The library provides analysts and programmers data structures optimized for working with large data sets, and methods for examining and manipulating that data. It uses another free software library, NumPy, for underlying data structures, and Pyplot to generate plots – graphs, histograms, etc.

Using Pandas

The pandas library is part of the standard Python distribution, so using it is as simple as importing it into your project. Likewise, NumPy and pyplot can easily be imported.

import numpy
import pandas
import matplotlib.pyplot

Series

A pandas Series is a one-dimensional array (actually a Numpy ndarray) with labels for each item in the Series. Series are great for data points over time. For example, here’s the annual personal savings rate for Americans from 1967 to 2017 (source: Federal Reserve Bank of St. Louis)

DATE Rate
1/1/1967 12.2
1/1/1968 11.2
1/1/1969 10.7
1/1/1970 12.6
1/1/1971 13.3
1/1/1972 12.1
1/1/1973 13.1
1/1/1974 12.9
1/1/1975 13
1/1/1976 11.1
1/1/1977 10.2
1/1/1978 10.2
1/1/1979 9.8
1/1/1980 10.6
1/1/1981 11.2
1/1/1982 11.5
1/1/1983 9.5
1/1/1984 10.7
1/1/1985 8.6
1/1/1986 8.2
1/1/1987 7.3
1/1/1988 7.8
1/1/1989 7.8
1/1/1990 7.8
1/1/1991 8.2
1/1/1992 8.9
1/1/1993 7.3
1/1/1994 6.3
1/1/1995 6.4
1/1/1996 5.9
1/1/1997 5.7
1/1/1998 6.2
1/1/1999 4.4
1/1/2000 4.2
1/1/2001 4.3
1/1/2002 5
1/1/2003 4.8
1/1/2004 4.6
1/1/2005 2.6
1/1/2006 3.3
1/1/2007 3
1/1/2008 4.9
1/1/2009 6.1
1/1/2010 5.6
1/1/2011 6
1/1/2012 7.6
1/1/2013 5
1/1/2014 5.7
1/1/2015 6.1
1/1/2016 4.9
1/1/2017 3.4

To represent this data in a pandas Series we call the Series constructor and assign the result to a variable.

import pandas as pd

savings_rate = pd.Series([12.2,11.2,10.7,12.6,13.3,12.1,13.1,12.9,13,11.1,10.2,10.2,9.8,10.6,11.2,11.5,9.5,10.7,8.6,8.2,7.3,7.8,7.8,7.8,8.2,8.9,7.3,6.3,6.4,5.9,5.7,6.2,4.4,4.2,4.3,5,4.8,4.6,2.6,3.3,3,4.9,6.1,5.6,6,7.6,5,5.7,6.1,4.9,3.4])

Now our series representing the data is stored in the savings_rate variable, and each data point is indexed 0 through n-1 where n is the number of items in our series. That means to get the value of the first data point in the Series, we reference the 0th (zeroth) item in the Series. We do this with the square bracket notation…

>>> savings_rate[0]
12.2

There are 51 data points (51 years [1967, 2017]) in our Series. To find the last data point (n-1 = 50) in the series…

>>> savings_rate[50]
3.4

Now to make this Series a time series, we change the index (from 0 to n-1) to a DatetimeIndex. The DatetimeIndex method will convert our textual dates (strings) into a datetime64 data type. After the conversion, when working with the Series we can perform operations in a temporal context.

savings_rate.index = pd.DatetimeIndex(['1/1/1967','1/1/1968','1/1/1969','1/1/1970','1/1/1971','1/1/1972','1/1/1973','1/1/1974','1/1/1975','1/1/1976','1/1/1977','1/1/1978','1/1/1979','1/1/1980','1/1/1981','1/1/1982','1/1/1983','1/1/1984','1/1/1985','1/1/1986','1/1/1987','1/1/1988','1/1/1989','1/1/1990','1/1/1991','1/1/1992','1/1/1993','1/1/1994','1/1/1995','1/1/1996','1/1/1997','1/1/1998','1/1/1999','1/1/2000','1/1/2001','1/1/2002','1/1/2003','1/1/2004','1/1/2005','1/1/2006','1/1/2007','1/1/2008','1/1/2009','1/1/2010','1/1/2011','1/1/2012','1/1/2013','1/1/2014','1/1/2015','1/1/2016','1/1/2017'])

Note: we could also set the index to be other values besides datetime64 values. Index values can such as names (strings) such as, ['Rate in 1967', 'Rate in 1968', ..., 'Rate in 2017']); or other numerical values besides 0 to n-1.

We could have done this all in the constructor. Here’s an example of that which outputs our desired time series data.

import pandas as pd
savings_rate = pd.Series(
[12.2,11.2,10.7,12.6,13.3,12.1,13.1,12.9,13,11.1,10.2,10.2,9.8,10.6,11.2,11.5,9.5,10.7,8.6,8.2,7.3,7.8,7.8,7.8,8.2,8.9,7.3,6.3,6.4,5.9,5.7,6.2,4.4,4.2,4.3,5,4.8,4.6,2.6,3.3,3,4.9,6.1,5.6,6,7.6,5,5.7,6.1,4.9,3.4],
index = pd.DatetimeIndex(['1/1/1967','1/1/1968','1/1/1969','1/1/1970','1/1/1971','1/1/1972','1/1/1973','1/1/1974','1/1/1975','1/1/1976','1/1/1977','1/1/1978','1/1/1979','1/1/1980','1/1/1981','1/1/1982','1/1/1983','1/1/1984','1/1/1985','1/1/1986','1/1/1987','1/1/1988','1/1/1989','1/1/1990','1/1/1991','1/1/1992','1/1/1993','1/1/1994','1/1/1995','1/1/1996','1/1/1997','1/1/1998','1/1/1999','1/1/2000','1/1/2001','1/1/2002','1/1/2003','1/1/2004','1/1/2005','1/1/2006','1/1/2007','1/1/2008','1/1/2009','1/1/2010','1/1/2011','1/1/2012','1/1/2013','1/1/2014','1/1/2015','1/1/2016','1/1/2017'])
)

print(savings_rate)
1967-01-01    12.2
1968-01-01    11.2
1969-01-01    10.7
1970-01-01    12.6
1971-01-01    13.3
1972-01-01    12.1
1973-01-01    13.1
1974-01-01    12.9
1975-01-01    13.0
1976-01-01    11.1
1977-01-01    10.2
1978-01-01    10.2
1979-01-01     9.8
1980-01-01    10.6
1981-01-01    11.2
1982-01-01    11.5
1983-01-01     9.5
1984-01-01    10.7
1985-01-01     8.6
1986-01-01     8.2
1987-01-01     7.3
1988-01-01     7.8
1989-01-01     7.8
1990-01-01     7.8
1991-01-01     8.2
1992-01-01     8.9
1993-01-01     7.3
1994-01-01     6.3
1995-01-01     6.4
1996-01-01     5.9
1997-01-01     5.7
1998-01-01     6.2
1999-01-01     4.4
2000-01-01     4.2
2001-01-01     4.3
2002-01-01     5.0
2003-01-01     4.8
2004-01-01     4.6
2005-01-01     2.6
2006-01-01     3.3
2007-01-01     3.0
2008-01-01     4.9
2009-01-01     6.1
2010-01-01     5.6
2011-01-01     6.0
2012-01-01     7.6
2013-01-01     5.0
2014-01-01     5.7
2015-01-01     6.1
2016-01-01     4.9
2017-01-01     3.4
dtype: float64

DataFrame

A pandas DataFrame is rectangular data – a matrix, like a spreadsheet or database table with rows and columns. This table of cars is an example.

 

Maker Year Model Package City MPG Highway MPG MSRP
Ford 2018 Fiesta S Sedan 27 35 14115
Ford 2018 Fusion S FWD 21 32 22120
Ford 2018 Mustang GT Fastback 15 24 35095
Honda 2018 Fit LX Manual 29 36 16190
Honda 2018 Civic LX Manual 28 40 18840
Honda 2018 Accord LX CVT 30 38 23570
Hyundai 2018 Accent SE Manual 28 37 14995
Hyundai 2018 Elantra SE 2.0L Manual 26 36 16950
Hyundai 2018 Sonata SE 2.4L 25 36 22050

To create a DataFrame we provide a dictionary of iterables (in this case lists) to the DataFrame constructor. We can also, and often do, use Series as the dictionary values.

import pandas as pd

cars = pd.DataFrame({
 'Maker': ['Ford', 'Ford', 'Ford', 'Honda', 'Honda', 'Honda', 'Hyundai', 'Hyundai', 'Hyundai'],
 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
 'Model': ['Fiesta', 'Fusion', 'Mustang', 'Fit', 'Civic', 'Accord', 'Accent', 'Elantra', 'Sonata'],
 'Package': ['S Sedan', 'S FWD', 'GT Fastback', 'LX Manual', 'LX Manual', 'LX CVT', 'SE Manual', 'SE 2.0L Manual', 'SE 2.4L'],
 'MPG_City': [27, 21, 15, 29, 28, 30, 28, 26, 25],
 'MPG_HWY': [35, 32, 24, 36, 40, 38, 37, 36, 36],
 'MSRP': [14115, 22120, 35095, 16190, 18840, 23570, 14995, 16950, 22050],
 'Cylinders': [4, 4, 8, 4, 4, 4, 4, 4, 4],
 'Bore': [1.6, 2.5, 5, 1.5, 2, 1.5, 1.6, 2, 2.4],
 'Basic_Warranty_Years': [3, 3, 3, 3, 3, 3, 5, 5, 5],
 'Basic_Warranty_Miles': [36000, 36000, 36000, 36000, 36000, 36000, 60000, 60000, 60000],
 'Drivetrain_Warranty_Years': [5, 5, 5, 5, 5, 5, 10, 10, 10],
 'Drivetrain_Warranty_Miles': [60000, 60000, 60000, 60000, 60000, 60000, 100000, 100000, 100000]
})

The resulting DataFrame will automatically index the rows from 0 to n-1 where n is the the number of rows in the DataFrame. To get the first row of data, use the DataFrame’s loc method.

>>> cars.loc[0]
Basic_Warranty_Miles           36000
Basic_Warranty_Years               3
Bore                             1.6
Cylinders                          4
Drivetrain_Warranty_Miles      60000
Drivetrain_Warranty_Years          5
MPG_City                          27
MPG_HWY                           35
MSRP                           14115
Maker                           Ford
Model                         Fiesta
Package                      S Sedan
Year                            2018
Name: 0, dtype: object

To retrieve the values in a given column, simply provide the column name in angle brackets, and if your frame’s columns are numerically valued, provide the column number (e.g. frame[0]).

>>> cars['MSRP']

0    14115
1    22120
2    35095
3    16190
4    18840
5    23570
6    14995
7    16950
8    22050

You can also provide a value to use as the index. With cars, we often use year, make, and model to uniquely represent a specific car. In our data set, it’d probably be better to use those values, as well the car’s package – the optional manufacturer extras selected for each model. If the data shown above were complete, we would probably leave out the year, as it’s the same across all the cars. Actually, if this were all the data, we could probably drop that from the data.

Here’s how to set the index for the data set. This is a little complicated. I’m using a list comprehension and zip – which interleaves the items of multiple list items into one list – to build a list that will be used as the index.

A list comprehension is essentially a list builder notion in Python. If that’s interesting to you, I encourage you to look it up. They’re quite convenient; as you can see here. In this one line the list comprehension is creating a list where each entry in the list is a concatenation of the year make and model of each car in the DataFrame. Then we’re assigning that list to be the index for the DataFrame.

cars = pd.DataFrame(car_data, index = [str(year)+' '+make+' '+model for year,make,model in zip(car_data['Year'],car_data['Maker'],car_data['Model'])] )

Here’s what our DataFrame looks like. The \ is showing that the output is continued on the next line.

                      Cylinders  Drivetrain_Warranty_Miles  \
2018 Ford Fiesta              4                      60000   
2018 Ford Fusion              4                      60000   
2018 Ford Mustang             8                      60000   
2018 Honda Fit                4                      60000   
2018 Honda Civic              4                      60000   
2018 Honda Accord             4                      60000   
2018 Hyundai Accent           4                     100000   
2018 Hyundai Elantra          4                     100000   
2018 Hyundai Sonata           4                     100000   

                      Drivetrain_Warranty_Years  MPG_City  MPG_HWY   MSRP  \
2018 Ford Fiesta                              5        27       35  14115   
2018 Ford Fusion                              5        21       32  22120   
2018 Ford Mustang                             5        15       24  35095   
2018 Honda Fit                                5        29       36  16190   
2018 Honda Civic                              5        28       40  18840   
2018 Honda Accord                             5        30       38  23570   
2018 Hyundai Accent                          10        28       37  14995   
2018 Hyundai Elantra                         10        26       36  16950   
2018 Hyundai Sonata                          10        25       36  22050   

                        Maker    Model         Package  Year  
2018 Ford Fiesta         Ford   Fiesta         S Sedan  2018  
2018 Ford Fusion         Ford   Fusion           S FWD  2018  
2018 Ford Mustang        Ford  Mustang     GT Fastback  2018  
2018 Honda Fit          Honda      Fit       LX Manual  2018  
2018 Honda Civic        Honda    Civic       LX Manual  2018  
2018 Honda Accord       Honda   Accord          LX CVT  2018  
2018 Hyundai Accent   Hyundai   Accent       SE Manual  2018  
2018 Hyundai Elantra  Hyundai  Elantra  SE 2.0L Manual  2018  
2018 Hyundai Sonata   Hyundai   Sonata         SE 2.4L  2018  

And if we wanted to get the information on one specific car, we can use the loc method with the index value.

>>> cars.loc['2018 Ford Mustang']
Basic_Warranty_Miles               36000
Basic_Warranty_Years                   3
Bore                                   5
Cylinders                              8
Drivetrain_Warranty_Miles          60000
Drivetrain_Warranty_Years              5
MPG_City                              15
MPG_HWY                               24
MSRP                               35095
Maker                               Ford
Model                            Mustang
Package                      GT Fastback
Year                                2018
Name: 2018 Ford Mustang, dtype: object

DataFrame from CSV

A common format for representing datasets is CSV (Character (often comma) Separated Values). Most spreadsheet applications, like Microsoft Excel, LibreOffice Calc, and Google Sheets, can open CSV files. If your file is in one of these applications’ file formats (such as .xls and .xlsx of Excel) you can easily convert the file to CSV.  Note: CSV files cannot contain formatting – such as font sizes and styles – nor calculations that other spreadsheet formats can, but the data in a CSV file can be used by many different applications.

In our case, we can take a CSV file and use pandas to create a DataFrame. To create the same DataFrame as above, we use the pandas read_csv method. We must tell pandas where the file is as well as set some other parameters.

import pandas as pd

cars = pd.read_csv('car_data.csv', index_col=None,parse_dates=False);

By default, the read_csv method treats the first row of the csv file as a header (as it is in our example table and in the CSV used). If your CSV file doesn’t have a header row set header=None.

Setting index_col to None gives us the 0 to n-1 indexing that we got when we created the DataFrame from a dictionary. If we don’t set parse_dates to False, our dates will be converted from 2018 to 2018-01-01 format (from strings to datetime64 data type).

We can set the index of this cars DataFrame to the same as the first cars DataFrame.

cars.index = [str(year)+' '+make+' '+model for year,make,model in zip(cars['Year'],cars['Maker'],cars['Model'])]

Now this cars DataFrame is identical to the one we created from the dictionary.

Materials