What is Pandas?
Pandas is a free software (software libre) data analysis library for the Python programming language. The library provides analysts and programmers data structures optimized for working with large data sets, and methods for examining and manipulating that data. It uses another free software library, NumPy, for underlying data structures, and Pyplot to generate plots – graphs, histograms, etc.
Using Pandas
The pandas library is part of the standard Python distribution, so using it is as simple as importing it into your project. Likewise, NumPy and pyplot can easily be imported.
import numpy import pandas import matplotlib.pyplot
Series
A pandas Series is a one-dimensional array (actually a Numpy ndarray) with labels for each item in the Series. Series are great for data points over time. For example, here’s the annual personal savings rate for Americans from 1967 to 2017 (source: Federal Reserve Bank of St. Louis)
DATE | Rate |
1/1/1967 | 12.2 |
1/1/1968 | 11.2 |
1/1/1969 | 10.7 |
1/1/1970 | 12.6 |
1/1/1971 | 13.3 |
1/1/1972 | 12.1 |
1/1/1973 | 13.1 |
1/1/1974 | 12.9 |
1/1/1975 | 13 |
1/1/1976 | 11.1 |
1/1/1977 | 10.2 |
1/1/1978 | 10.2 |
1/1/1979 | 9.8 |
1/1/1980 | 10.6 |
1/1/1981 | 11.2 |
1/1/1982 | 11.5 |
1/1/1983 | 9.5 |
1/1/1984 | 10.7 |
1/1/1985 | 8.6 |
1/1/1986 | 8.2 |
1/1/1987 | 7.3 |
1/1/1988 | 7.8 |
1/1/1989 | 7.8 |
1/1/1990 | 7.8 |
1/1/1991 | 8.2 |
1/1/1992 | 8.9 |
1/1/1993 | 7.3 |
1/1/1994 | 6.3 |
1/1/1995 | 6.4 |
1/1/1996 | 5.9 |
1/1/1997 | 5.7 |
1/1/1998 | 6.2 |
1/1/1999 | 4.4 |
1/1/2000 | 4.2 |
1/1/2001 | 4.3 |
1/1/2002 | 5 |
1/1/2003 | 4.8 |
1/1/2004 | 4.6 |
1/1/2005 | 2.6 |
1/1/2006 | 3.3 |
1/1/2007 | 3 |
1/1/2008 | 4.9 |
1/1/2009 | 6.1 |
1/1/2010 | 5.6 |
1/1/2011 | 6 |
1/1/2012 | 7.6 |
1/1/2013 | 5 |
1/1/2014 | 5.7 |
1/1/2015 | 6.1 |
1/1/2016 | 4.9 |
1/1/2017 | 3.4 |
To represent this data in a pandas Series we call the Series constructor and assign the result to a variable.
import pandas as pd savings_rate = pd.Series([12.2,11.2,10.7,12.6,13.3,12.1,13.1,12.9,13,11.1,10.2,10.2,9.8,10.6,11.2,11.5,9.5,10.7,8.6,8.2,7.3,7.8,7.8,7.8,8.2,8.9,7.3,6.3,6.4,5.9,5.7,6.2,4.4,4.2,4.3,5,4.8,4.6,2.6,3.3,3,4.9,6.1,5.6,6,7.6,5,5.7,6.1,4.9,3.4])
Now our series representing the data is stored in the savings_rate
variable, and each data point is indexed 0 through n-1 where n is the number of items in our series. That means to get the value of the first data point in the Series, we reference the 0th (zeroth) item in the Series. We do this with the square bracket notation…
>>> savings_rate[0] 12.2
There are 51 data points (51 years [1967, 2017]) in our Series. To find the last data point (n-1 = 50) in the series…
>>> savings_rate[50] 3.4
Now to make this Series a time series, we change the index
(from 0 to n-1) to a DatetimeIndex
. The DatetimeIndex
method will convert our textual dates (strings) into a datetime64 data type. After the conversion, when working with the Series we can perform operations in a temporal context.
savings_rate.index = pd.DatetimeIndex(['1/1/1967','1/1/1968','1/1/1969','1/1/1970','1/1/1971','1/1/1972','1/1/1973','1/1/1974','1/1/1975','1/1/1976','1/1/1977','1/1/1978','1/1/1979','1/1/1980','1/1/1981','1/1/1982','1/1/1983','1/1/1984','1/1/1985','1/1/1986','1/1/1987','1/1/1988','1/1/1989','1/1/1990','1/1/1991','1/1/1992','1/1/1993','1/1/1994','1/1/1995','1/1/1996','1/1/1997','1/1/1998','1/1/1999','1/1/2000','1/1/2001','1/1/2002','1/1/2003','1/1/2004','1/1/2005','1/1/2006','1/1/2007','1/1/2008','1/1/2009','1/1/2010','1/1/2011','1/1/2012','1/1/2013','1/1/2014','1/1/2015','1/1/2016','1/1/2017'])
Note: we could also set the index to be other values besides datetime64 values. Index values can such as names (strings) such as, ['Rate in 1967', 'Rate in 1968', ..., 'Rate in 2017']
); or other numerical values besides 0 to n-1.
We could have done this all in the constructor. Here’s an example of that which outputs our desired time series data.
import pandas as pd savings_rate = pd.Series( [12.2,11.2,10.7,12.6,13.3,12.1,13.1,12.9,13,11.1,10.2,10.2,9.8,10.6,11.2,11.5,9.5,10.7,8.6,8.2,7.3,7.8,7.8,7.8,8.2,8.9,7.3,6.3,6.4,5.9,5.7,6.2,4.4,4.2,4.3,5,4.8,4.6,2.6,3.3,3,4.9,6.1,5.6,6,7.6,5,5.7,6.1,4.9,3.4], index = pd.DatetimeIndex(['1/1/1967','1/1/1968','1/1/1969','1/1/1970','1/1/1971','1/1/1972','1/1/1973','1/1/1974','1/1/1975','1/1/1976','1/1/1977','1/1/1978','1/1/1979','1/1/1980','1/1/1981','1/1/1982','1/1/1983','1/1/1984','1/1/1985','1/1/1986','1/1/1987','1/1/1988','1/1/1989','1/1/1990','1/1/1991','1/1/1992','1/1/1993','1/1/1994','1/1/1995','1/1/1996','1/1/1997','1/1/1998','1/1/1999','1/1/2000','1/1/2001','1/1/2002','1/1/2003','1/1/2004','1/1/2005','1/1/2006','1/1/2007','1/1/2008','1/1/2009','1/1/2010','1/1/2011','1/1/2012','1/1/2013','1/1/2014','1/1/2015','1/1/2016','1/1/2017']) ) print(savings_rate)
1967-01-01 12.2 1968-01-01 11.2 1969-01-01 10.7 1970-01-01 12.6 1971-01-01 13.3 1972-01-01 12.1 1973-01-01 13.1 1974-01-01 12.9 1975-01-01 13.0 1976-01-01 11.1 1977-01-01 10.2 1978-01-01 10.2 1979-01-01 9.8 1980-01-01 10.6 1981-01-01 11.2 1982-01-01 11.5 1983-01-01 9.5 1984-01-01 10.7 1985-01-01 8.6 1986-01-01 8.2 1987-01-01 7.3 1988-01-01 7.8 1989-01-01 7.8 1990-01-01 7.8 1991-01-01 8.2 1992-01-01 8.9 1993-01-01 7.3 1994-01-01 6.3 1995-01-01 6.4 1996-01-01 5.9 1997-01-01 5.7 1998-01-01 6.2 1999-01-01 4.4 2000-01-01 4.2 2001-01-01 4.3 2002-01-01 5.0 2003-01-01 4.8 2004-01-01 4.6 2005-01-01 2.6 2006-01-01 3.3 2007-01-01 3.0 2008-01-01 4.9 2009-01-01 6.1 2010-01-01 5.6 2011-01-01 6.0 2012-01-01 7.6 2013-01-01 5.0 2014-01-01 5.7 2015-01-01 6.1 2016-01-01 4.9 2017-01-01 3.4 dtype: float64
DataFrame
A pandas DataFrame is rectangular data – a matrix, like a spreadsheet or database table with rows and columns. This table of cars is an example.
Maker | Year | Model | Package | City MPG | Highway MPG | MSRP |
Ford | 2018 | Fiesta | S Sedan | 27 | 35 | 14115 |
Ford | 2018 | Fusion | S FWD | 21 | 32 | 22120 |
Ford | 2018 | Mustang | GT Fastback | 15 | 24 | 35095 |
Honda | 2018 | Fit | LX Manual | 29 | 36 | 16190 |
Honda | 2018 | Civic | LX Manual | 28 | 40 | 18840 |
Honda | 2018 | Accord | LX CVT | 30 | 38 | 23570 |
Hyundai | 2018 | Accent | SE Manual | 28 | 37 | 14995 |
Hyundai | 2018 | Elantra | SE 2.0L Manual | 26 | 36 | 16950 |
Hyundai | 2018 | Sonata | SE 2.4L | 25 | 36 | 22050 |
To create a DataFrame we provide a dictionary of iterables (in this case lists) to the DataFrame
constructor. We can also, and often do, use Series as the dictionary values.
import pandas as pd cars = pd.DataFrame({ 'Maker': ['Ford', 'Ford', 'Ford', 'Honda', 'Honda', 'Honda', 'Hyundai', 'Hyundai', 'Hyundai'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018], 'Model': ['Fiesta', 'Fusion', 'Mustang', 'Fit', 'Civic', 'Accord', 'Accent', 'Elantra', 'Sonata'], 'Package': ['S Sedan', 'S FWD', 'GT Fastback', 'LX Manual', 'LX Manual', 'LX CVT', 'SE Manual', 'SE 2.0L Manual', 'SE 2.4L'], 'MPG_City': [27, 21, 15, 29, 28, 30, 28, 26, 25], 'MPG_HWY': [35, 32, 24, 36, 40, 38, 37, 36, 36], 'MSRP': [14115, 22120, 35095, 16190, 18840, 23570, 14995, 16950, 22050], 'Cylinders': [4, 4, 8, 4, 4, 4, 4, 4, 4], 'Bore': [1.6, 2.5, 5, 1.5, 2, 1.5, 1.6, 2, 2.4], 'Basic_Warranty_Years': [3, 3, 3, 3, 3, 3, 5, 5, 5], 'Basic_Warranty_Miles': [36000, 36000, 36000, 36000, 36000, 36000, 60000, 60000, 60000], 'Drivetrain_Warranty_Years': [5, 5, 5, 5, 5, 5, 10, 10, 10], 'Drivetrain_Warranty_Miles': [60000, 60000, 60000, 60000, 60000, 60000, 100000, 100000, 100000] })
The resulting DataFrame will automatically index the rows from 0 to n-1 where n is the the number of rows in the DataFrame. To get the first row of data, use the DataFrame’s loc
method.
>>> cars.loc[0]
Basic_Warranty_Miles 36000 Basic_Warranty_Years 3 Bore 1.6 Cylinders 4 Drivetrain_Warranty_Miles 60000 Drivetrain_Warranty_Years 5 MPG_City 27 MPG_HWY 35 MSRP 14115 Maker Ford Model Fiesta Package S Sedan Year 2018 Name: 0, dtype: object
To retrieve the values in a given column, simply provide the column name in angle brackets, and if your frame’s columns are numerically valued, provide the column number (e.g. frame[0]
).
>>> cars['MSRP'] 0 14115 1 22120 2 35095 3 16190 4 18840 5 23570 6 14995 7 16950 8 22050
You can also provide a value to use as the index. With cars, we often use year, make, and model to uniquely represent a specific car. In our data set, it’d probably be better to use those values, as well the car’s package – the optional manufacturer extras selected for each model. If the data shown above were complete, we would probably leave out the year, as it’s the same across all the cars. Actually, if this were all the data, we could probably drop that from the data.
To set the index value to something meaningful to our dataset is a bit tricky because each car is represented by a composite of Year, Maker, and Model. The simple way to assign a new index, for example to MSRP (which wouldn’t really work too well with this data – as the MSRP 1) doesn’t identify the elements in the rows, 2) two cars can have the same MSRP) would be to use the set_index
method.
cars = cars.set_index('MSRP');
To set the index to Year, Maker, and Model is a little complicated. I’m using a list comprehension and zip – which interleaves the items of multiple list items into one list – to build a list that will be used as the index.
A list comprehension is essentially a list builder notion in Python. If that’s interesting to you, I encourage you to look it up. They’re quite convenient; as you can see here. In this one line the list comprehension is creating a list where each entry in the list is a concatenation of the year make and model of each car in the DataFrame. Then we’re assigning that list to be the index for the DataFrame.
cars = pd.DataFrame(car_data, index = [str(year)+' '+make+' '+model for year,make,model in zip(car_data['Year'],car_data['Maker'],car_data['Model'])] )
Here’s what our DataFrame looks like. The \
is showing that the output is continued on the next line.
Cylinders Drivetrain_Warranty_Miles \ 2018 Ford Fiesta 4 60000 2018 Ford Fusion 4 60000 2018 Ford Mustang 8 60000 2018 Honda Fit 4 60000 2018 Honda Civic 4 60000 2018 Honda Accord 4 60000 2018 Hyundai Accent 4 100000 2018 Hyundai Elantra 4 100000 2018 Hyundai Sonata 4 100000 Drivetrain_Warranty_Years MPG_City MPG_HWY MSRP \ 2018 Ford Fiesta 5 27 35 14115 2018 Ford Fusion 5 21 32 22120 2018 Ford Mustang 5 15 24 35095 2018 Honda Fit 5 29 36 16190 2018 Honda Civic 5 28 40 18840 2018 Honda Accord 5 30 38 23570 2018 Hyundai Accent 10 28 37 14995 2018 Hyundai Elantra 10 26 36 16950 2018 Hyundai Sonata 10 25 36 22050 Maker Model Package Year 2018 Ford Fiesta Ford Fiesta S Sedan 2018 2018 Ford Fusion Ford Fusion S FWD 2018 2018 Ford Mustang Ford Mustang GT Fastback 2018 2018 Honda Fit Honda Fit LX Manual 2018 2018 Honda Civic Honda Civic LX Manual 2018 2018 Honda Accord Honda Accord LX CVT 2018 2018 Hyundai Accent Hyundai Accent SE Manual 2018 2018 Hyundai Elantra Hyundai Elantra SE 2.0L Manual 2018 2018 Hyundai Sonata Hyundai Sonata SE 2.4L 2018
And if we wanted to get the information on one specific car, we can use the loc
method with the index value.
>>> cars.loc['2018 Ford Mustang']
Basic_Warranty_Miles 36000 Basic_Warranty_Years 3 Bore 5 Cylinders 8 Drivetrain_Warranty_Miles 60000 Drivetrain_Warranty_Years 5 MPG_City 15 MPG_HWY 24 MSRP 35095 Maker Ford Model Mustang Package GT Fastback Year 2018 Name: 2018 Ford Mustang, dtype: object
DataFrame from CSV
A common format for representing datasets is CSV (Character (often comma) Separated Values). Most spreadsheet applications, like Microsoft Excel, LibreOffice Calc, and Google Sheets, can open CSV files. If your file is in one of these applications’ file formats (such as .xls and .xlsx of Excel) you can easily convert the file to CSV. Note: CSV files cannot contain formatting – such as font sizes and styles – nor calculations that other spreadsheet formats can, but the data in a CSV file can be used by many different applications.
In our case, we can take a CSV file and use pandas to create a DataFrame. To create the same DataFrame as above, we use the pandas read_csv
method. We must tell pandas where the file is as well as set some other parameters.
import pandas as pd cars = pd.read_csv('car_data.csv', index_col=None,parse_dates=False);
By default, the read_csv
method treats the first row of the csv file as a header (as it is in our example table and in the CSV used). If your CSV file doesn’t have a header row set header=None
.
Setting index_col
to None
gives us the 0 to n-1 indexing that we got when we created the DataFrame from a dictionary. If we don’t set parse_dates
to False
, our dates will be converted from 2018 to 2018-01-01 format (from strings to datetime64 data type).
We can set the index of this cars
DataFrame to the same as the first cars
DataFrame.
cars.index = [str(year)+' '+make+' '+model for year,make,model in zip(cars['Year'],cars['Maker'],cars['Model'])]
Now this cars DataFrame is identical to the one we created from the dictionary.
Materials
- The car_data CSV.