Data analysis with Pandas

pandas is probably the most popular Python library for data analysis. It is a high-level abstraction over the lower-level NumPy (which is largely written in C). I use pandas on a daily basis and really enjoy its eloquent syntax and rich functionality. I hope this note helps you dive into the data analysis realm with Python and pandas.

DataFrame and Series

To master pandas you have to start with its two main data structures: DataFrame and Series. If you don't understand them well, you won't understand pandas.

Series

Series is an object similar to the Python built-in list, but with one key difference: each element has an associated label, the so-called index. This distinctive feature makes a Series look like an associative array or dictionary (a hashmap).

>>> import pandas as pd
>>> my_series = pd.Series([5, 6, 7, 8, 9, 10])
>>> my_series
0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64

Take a look at the output above: the index is on the left and the values are on the right. If an index is not provided explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the total number of elements. Moreover, each Series object has a data type (dtype); in our case it is int64.

Series has attributes to extract its values and index:

>>> my_series.index
RangeIndex(start=0, stop=6, step=1)
>>> my_series.values
array([ 5,  6,  7,  8,  9, 10], dtype=int64) 

You can retrieve elements by their index number:

>>> my_series[4]
9

You can provide index (labels) explicitly:

>>> my_series2 = pd.Series([5, 6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd', 'e', 'f'])
>>> my_series2['f']
10

It is easy to retrieve several elements by index or to perform a group assignment:

>>> my_series2[['a', 'b', 'f']]
a     5
b     6
f    10
dtype: int64
>>> my_series2[['a', 'b', 'f']] = 0
>>> my_series2
a    0
b    0
c    7
d    8
e    9
f    0
dtype: int64

Filtering and math operations are easy as well:

>>> my_series2[my_series2 > 0]
c    7
d    8
e    9
dtype: int64
>>> my_series2[my_series2 > 0] * 2
c    14
d    16
e    18
dtype: int64

Because a Series is very similar to a dictionary, where the index acts as keys and the elements as values, we can do this:

>>> my_series3 = pd.Series({'a': 5, 'b': 6, 'c': 7, 'd': 8})
>>> my_series3
a    5
b    6
c    7
d    8
dtype: int64
>>> 'd' in my_series3
True

A Series object and its index also have name attributes, so you can label them:

>>> my_series3.name = 'numbers'
>>> my_series3.index.name = 'letters'
>>> my_series3
letters
a    5
b    6
c    7
d    8
Name: numbers, dtype: int64

The index can be changed "on the fly" by assigning a list to the index attribute:

>>> my_series3.index = ['A', 'B', 'C', 'D']
>>> my_series3
A    5
B    6
C    7
D    8
Name: numbers, dtype: int64

But bear in mind that the length of the list must equal the number of elements in the Series. Labels are not strictly required to be unique, but keeping them unique makes label-based lookups unambiguous.
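One related behavior worth knowing (not shown above, but standard pandas): arithmetic between two Series aligns elements by index label, not by position, and labels present in only one operand produce NaN. A minimal sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition matches elements by label: 'a' and 'd' have no
# counterpart in the other Series, so they become NaN
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d    33.0
# dtype: float64
```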

DataFrame

Simply put, a DataFrame is a table: it has rows and columns. Each column of a DataFrame is a Series object, and each row consists of one element from each of those Series.

DataFrame can be constructed using built-in Python dicts:

>>> df = pd.DataFrame({
...     'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
...     'population': [17.04, 143.5, 9.5, 45.5],
...     'square': [2724902, 17125191, 207600, 603628]
... })
>>> df
      country  population    square
0  Kazakhstan       17.04   2724902
1      Russia      143.50  17125191
2     Belarus        9.50    207600
3     Ukraine       45.50    603628

In order to make sure that each column is a Series object let's do this:

>>> df['country']
0    Kazakhstan
1        Russia
2       Belarus
3       Ukraine
Name: country, dtype: object
>>> type(df['country'])
<class 'pandas.core.series.Series'>

A DataFrame object has two indexes: a column index and a row index. If you do not provide a row index explicitly, pandas creates a RangeIndex from 0 to N-1, where N is the number of rows in the DataFrame.

>>> df.columns
Index(['country', 'population', 'square'], dtype='object')
>>> df.index
RangeIndex(start=0, stop=4, step=1)

Our table/DataFrame has 4 rows, indexed from 0 to 3.

Accessing elements by index

There are numerous ways to provide a row index explicitly; for example, you can pass an index when creating a DataFrame or assign it "on the fly" at runtime:

>>> df = pd.DataFrame({
...     'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
...     'population': [17.04, 143.5, 9.5, 45.5],
...     'square': [2724902, 17125191, 207600, 603628]
... }, index=['KZ', 'RU', 'BY', 'UA'])
>>> df
       country  population    square
KZ  Kazakhstan       17.04   2724902
RU      Russia      143.50  17125191
BY     Belarus        9.50    207600
UA     Ukraine       45.50    603628
>>> df.index = ['KZ', 'RU', 'BY', 'UA']
>>> df.index.name = 'Country Code'
>>> df
                 country  population    square
Country Code                                  
KZ            Kazakhstan       17.04   2724902
RU                Russia      143.50  17125191
BY               Belarus        9.50    207600
UA               Ukraine       45.50    603628

Besides creating an index, I have also named it Country Code. A Series object extracted from the DataFrame will have the same index as the DataFrame:

>>> df['country']
Country Code
KZ    Kazakhstan
RU        Russia
BY       Belarus
UA       Ukraine
Name: country, dtype: object

Row access using index can be performed in several ways:

  1. using .loc and providing index label
  2. using .iloc and providing index number
>>> df.loc['KZ']
country       Kazakhstan
population         17.04
square           2724902
Name: KZ, dtype: object
>>> df.iloc[0]
country       Kazakhstan
population         17.04
square           2724902
Name: KZ, dtype: object

Selection of particular rows and columns can be performed this way:

>>> df.loc[['KZ', 'RU'], 'population']
Country Code
KZ     17.04
RU    143.50
Name: population, dtype: float64

.loc takes two arguments: an index list and a column list; slicing is supported as well:

>>> df.loc['KZ':'BY', :]
                 country  population    square
Country Code                                  
KZ            Kazakhstan       17.04   2724902
RU                Russia      143.50  17125191
BY               Belarus        9.50    207600

Filtering is performed using so-called Boolean arrays:

>>> df[df.population > 10][['country', 'square']]
                 country    square
Country Code                      
KZ            Kazakhstan   2724902
RU                Russia  17125191
UA               Ukraine    603628

By the way, columns can be accessed using attribute or Python dict notation; for example, df.population and df['population'] are the same operation.

If you want to reset index, no problem:

>>> df.reset_index()
  Country Code     country  population    square
0           KZ  Kazakhstan       17.04   2724902
1           RU      Russia      143.50  17125191
2           BY     Belarus        9.50    207600
3           UA     Ukraine       45.50    603628

By default, when you manipulate a DataFrame, pandas returns a new instance (the old one is not affected).
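For instance, drop returns a modified copy while the original DataFrame keeps all its columns — a tiny sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# drop() returns a new DataFrame; df itself is left untouched
smaller = df.drop(['b'], axis='columns')

print(list(df.columns))       # ['a', 'b']  -- original unchanged
print(list(smaller.columns))  # ['a']
```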

Let’s add a new column with population density:

>>> df['density'] = df['population'] / df['square'] * 1000000
>>> df
                 country  population    square    density
Country Code                                             
KZ            Kazakhstan       17.04   2724902   6.253436
RU                Russia      143.50  17125191   8.379469
BY               Belarus        9.50    207600  45.761079
UA               Ukraine       45.50    603628  75.377550

Don’t like new column? OK, let’s delete it:

>>> df.drop(['density'], axis='columns')
                 country  population    square
Country Code                                  
KZ            Kazakhstan       17.04   2724902
RU                Russia      143.50  17125191
BY               Belarus        9.50    207600
UA               Ukraine       45.50    603628

If you are too lazy to type drop... you can use del df['density'] (note that del, unlike drop, modifies the DataFrame in place).

Rename it!

>>> df = df.rename(columns={'Country Code': 'country_code'})
>>> df
  country_code     country  population    square
0           KZ  Kazakhstan       17.04   2724902
1           RU      Russia      143.50  17125191
2           BY     Belarus        9.50    207600
3           UA     Ukraine       45.50    603628

In our renaming example, make sure you dropped the index first (with reset_index); otherwise invoking the rename method has no effect, because Country Code would still be the index rather than a column.

Reading and Writing

pandas supports many popular file formats, including CSV, XML, HTML, Excel, SQL, JSON, and many more (check out the official docs).

But usually (in my practice) I use CSV files. For example, if you want to save our previous DataFrame, run this:

>>> df.to_csv('filename.csv')

The to_csv method takes many arguments (for example, the separator character); if you want to know more, make sure to check the official documentation.
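As a small sketch of those options: when no path is given, to_csv returns the CSV text as a string (handy for inspection), sep changes the delimiter, and index=False omits the row index.

```python
import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia'],
                   'population': [17.04, 143.5]})

# sep changes the delimiter; index=False drops the row index column
csv_text = df.to_csv(sep=';', index=False)
print(csv_text)
# country;population
# Kazakhstan;17.04
# Russia;143.5
```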

We have saved our DataFrame, but what about reading data? No problem:

>>> df = pd.read_csv('filename.csv', sep=',')

The named argument sep specifies the separator character used in the CSV file filename.csv. There are many different ways to construct a DataFrame from external sources; for example, using the read_sql method, pandas can perform an SQL query and store the results in a new DataFrame instance.
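A minimal read_sql sketch using Python's built-in sqlite3 module (the table name and data here are made up for illustration):

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database with a throwaway table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE countries (name TEXT, population REAL)')
conn.executemany('INSERT INTO countries VALUES (?, ?)',
                 [('Kazakhstan', 17.04), ('Russia', 143.5)])

# read_sql runs the query and returns the result set as a DataFrame
df = pd.read_sql('SELECT * FROM countries', conn)
print(df)
```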

Aggregating and Grouping in Pandas

Grouping is probably one of the most popular operations in data analysis. To group data in pandas you use the .groupby method. To demonstrate aggregation and grouping in pandas, I decided to use the popular Titanic dataset, which you can download using this link.

Let’s load data and see how it goes:

>>> titanic_df = pd.read_csv('titanic.csv')
>>> print(titanic_df.head())
   PassengerID                                           Name PClass    Age  \
0            1                   Allen, Miss Elisabeth Walton    1st  29.00   
1            2                    Allison, Miss Helen Loraine    1st   2.00   
2            3            Allison, Mr Hudson Joshua Creighton    1st  30.00   
3            4  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.00   
4            5                  Allison, Master Hudson Trevor    1st   0.92   
      Sex  Survived  SexCode  
0  female         1        1  
1  female         0        1  
2    male         0        0  
3  female         0        1  
4    male         1        0  

Let’s calculate how many passengers (women and men) survived and how many did not; we will use .groupby as stated above.

>>> print(titanic_df.groupby(['Sex', 'Survived'])['PassengerID'].count())
Sex     Survived
female  0           154
        1           308
male    0           709
        1           142
Name: PassengerID, dtype: int64

Now let’s analyze the same data by cabin class:

>>> print(titanic_df.groupby(['PClass', 'Survived'])['PassengerID'].count())
PClass  Survived
*       0             1
1st     0           129
        1           193
2nd     0           160
        1           119
3rd     0           573
        1           138
Name: PassengerID, dtype: int64
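A related trick worth knowing: calling unstack() on such a grouped result moves the inner index level (Survived) into columns, giving a side-by-side table. A self-contained sketch with the sex/survival counts from above hard-coded as sample data:

```python
import pandas as pd

# Survival counts by sex, hard-coded from the groupby output above
counts = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male'],
    'Survived': [0, 1, 0, 1],
    'Count': [154, 308, 709, 142],
})

grouped = counts.groupby(['Sex', 'Survived'])['Count'].sum()

# unstack() pivots the Survived level of the index into columns
table = grouped.unstack()
print(table)
```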

Pivoting tables in pandas

The term “pivot table” is familiar to those who know tools like Microsoft Excel or other spreadsheet software. To pivot a table in pandas you use the .pivot_table method. I will demonstrate how to use it on our Titanic dataset. Let’s assume that we have to calculate how many women and men were in each cabin class.

>>> titanic_df = pd.read_csv('titanic.csv')
>>> pvt = titanic_df.pivot_table(index=['Sex'], columns=['PClass'], values='Name', aggfunc='count')

Now the DataFrame index holds the Sex column values, the new columns are represented by the PClass values, and the aggregation function is count, which counts the number of passengers by the Name column.

>>> print(pvt.loc['female', ['1st', '2nd', '3rd']])
PClass
1st    143.0
2nd    107.0
3rd    212.0
Name: female, dtype: float64

Very easy, isn’t it?

Time Series analysis using pandas

pandas was originally created to analyze time series data. To illustrate how easy it is, I prepared a sample dataset with Apple stock prices (a 5-year period). You can download it here.

>>> import pandas as pd
>>> df = pd.read_csv('apple.csv', index_col='Date', parse_dates=True)
>>> df = df.sort_index()
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1258 entries, 2012-02-23 to 2017-02-22
Data columns (total 6 columns):
Open         1258 non-null float64
High         1258 non-null float64
Low          1258 non-null float64
Close        1258 non-null float64
Volume       1258 non-null int64
Adj Close    1258 non-null float64
dtypes: float64(5), int64(1)
memory usage: 68.8 KB

We have created a DataFrame with a DatetimeIndex from the Date column and then sorted it. If your datetime column is not in ISO 8601 format, you have to parse it with the built-in pandas function pandas.to_datetime.
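A quick sketch of pandas.to_datetime with an explicit format string (the day-first date strings below are invented for illustration):

```python
import pandas as pd

# Day-first dates; an explicit format string avoids ambiguous parsing
raw = pd.Series(['23/02/2012', '24/02/2012'])
dates = pd.to_datetime(raw, format='%d/%m/%Y')
print(dates.dt.day.tolist())  # [23, 24]
```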

Let’s now calculate mean closing price:

>>> df.loc['2012-Feb', 'Close'].mean()
528.4820021999999

But what about a specific time period?

>>> df.loc['2012-Feb':'2015-Feb', 'Close'].mean()
430.43968317018414

Do you want to know the mean closing price by week? No problem.

>>> df.resample('W')['Close'].mean()
Date
2012-02-26    519.399979
2012-03-04    538.652008
2012-03-11    536.254004
2012-03-18    576.161993
2012-03-25    600.990001
2012-04-01    609.698003
2012-04-08    626.484993
2012-04-15    623.773999
2012-04-22    591.718002
2012-04-29    590.536005
2012-05-06    579.831995
2012-05-13    568.814001
2012-05-20    543.593996
2012-05-27    563.283995
2012-06-03    572.539994
2012-06-10    570.124002
2012-06-17    573.029991
2012-06-24    583.739993
2012-07-01    574.070004
2012-07-08    601.937489
2012-07-15    606.080008
2012-07-22    607.746011
2012-07-29    587.951999
2012-08-05    607.217999
2012-08-12    621.150003
2012-08-19    635.394003
2012-08-26    663.185999
2012-09-02    670.611995
2012-09-09    675.477503
2012-09-16    673.476007
                 ...    
2016-08-07    105.934003
2016-08-14    108.258000
2016-08-21    109.304001
2016-08-28    107.980000
2016-09-04    106.676001
2016-09-11    106.177498
2016-09-18    111.129999
2016-09-25    113.606001
2016-10-02    113.029999
2016-10-09    113.303999
2016-10-16    116.860000
2016-10-23    117.160001
2016-10-30    115.938000
2016-11-06    111.057999
2016-11-13    109.714000
2016-11-20    108.563999
2016-11-27    111.637503
2016-12-04    110.587999
2016-12-11    111.231999
2016-12-18    115.094002
2016-12-25    116.691998
2017-01-01    116.642502
2017-01-08    116.672501
2017-01-15    119.228000
2017-01-22    119.942499
2017-01-29    121.164000
2017-02-05    125.867999
2017-02-12    131.679996
2017-02-19    134.978000
2017-02-26    136.904999
Freq: W-SUN, Name: Close, dtype: float64

Resampling is a very powerful tool when it comes to time series analysis. If you want to know more about resampling go ahead and dive into official docs.
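Since the Apple CSV is not bundled with this note, here is a self-contained resampling sketch with synthetic daily prices, so you can verify the mechanics yourself:

```python
import pandas as pd

# Ten days of made-up closing prices on a DatetimeIndex
idx = pd.date_range('2012-02-20', periods=10, freq='D')
close = pd.Series(range(100, 110), index=idx, name='Close')

# 'W' buckets values into calendar weeks ending on Sunday (W-SUN)
weekly = close.resample('W').mean()
print(weekly)
# 2012-02-26    103.0   <- mean of Feb 20-26 (100..106)
# 2012-03-04    108.0   <- mean of Feb 27-29 (107..109)
```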

Visualization in pandas

At the very beginning of this post I said that pandas is built on NumPy; when it comes to visualization, pandas uses a library called matplotlib. Let’s see how Apple’s stock price changed over time on a graph:

Taking the closing price between Feb 2012 and Feb 2017:

>>> import matplotlib.pyplot as plt
>>> new_sample_df = df.loc['2012-Feb':'2017-Feb', ['Close']]
>>> new_sample_df.plot()
>>> plt.show()

The X-axis values come from the DataFrame’s index (by default, if you do not provide them explicitly), and the Y-axis is the closing price. Take a look at 2014: there is a sharp drop because of the 7-for-1 stock split held by Apple (the Close column is not adjusted for splits, unlike Adj Close).

That’s it for the introduction to pandas. I hope you liked it; if so, please leave a comment below.

Useful links

If you like my articles, please subscribe to stay up to date with new material