Friday, 26 April 2019

Pandas for Data Analysis

Pandas is the most popular Python library for general data analysis. You can use it to load data, change data types, summarise data, calculate new fields/columns, and much, much more.

The purpose of this blog post is to outline and illustrate the most useful Pandas techniques using a single dataset. There is a lot of great material out there, but it tends to show these techniques in isolation without context, or as part of a larger project where only the most relevant techniques appear.

This post uses a simple Kaggle dataset, available at https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/version/1, and applies to it mainly the techniques described in Shiu-Tang Li's excellent post on Towards Data Science: https://towardsdatascience.com/10-python-pandas-tricks-that-make-your-work-more-efficient-2e8e483808ba

This post assumes you, the reader, understand the basics of Python, especially data types, functions and lambda expressions. If you want to run the code, you will need the relevant libraries installed. The code and output below were produced in a Jupyter Notebook, which is available as part of the Anaconda distribution of Python.

Let's get started:



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

read_csv

With read_csv, you can limit the number of rows you read using the nrows argument. This lets you preview what the columns look like without committing to loading the entire dataset:
In [2]:
df = pd.read_csv('Mall_Customers.csv', nrows = 5)
df
Out[2]:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
You can get a list of the columns using the .columns.to_list() method:
In [3]:
col_list = df.columns.to_list()
col_list
Out[3]:
['CustomerID', 'Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
CustomerID is not a very useful column, so we can exclude it when loading the full dataset. We remove 'CustomerID' from col_list and pass the remaining columns to the usecols argument. In addition, we can specify the data types of the columns using the dtype argument.
df.head() shows the first five rows of the dataframe (five is the default).
In [4]:
col_list = col_list[1:]
col_list
Out[4]:
['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
In [5]:
dtype_dict = {'Gender':str,'Age':int,'Annual Income (k$)': float, 'Spending Score (1-100)': int}
In [6]:
df = pd.read_csv('Mall_Customers.csv',usecols = col_list, dtype = dtype_dict)
df.head()
Out[6]:
Gender Age Annual Income (k$) Spending Score (1-100)
0 Male 19 15.0 39
1 Male 21 15.0 81
2 Female 20 16.0 6
3 Female 23 16.0 77
4 Female 31 17.0 40
df.info() can be run to display the number of non-null data points in each column as well as their data types.
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
Gender                    200 non-null object
Age                       200 non-null int32
Annual Income (k$)        200 non-null float64
Spending Score (1-100)    200 non-null int32
dtypes: float64(1), int32(2), object(1)
memory usage: 4.8+ KB

select_dtypes

You can count how many columns there are of each data type using the .dtypes.value_counts() method:
In [8]:
df.dtypes.value_counts()
Out[8]:
int32      2
object     1
float64    1
dtype: int64
You can select a smaller dataframe containing only the data types you want using the .select_dtypes method:
In [9]:
# Selecting a dataframe with only 'float64' and 'int32'
df.select_dtypes(include = ['float64', 'int32'])
Out[9]:
Age Annual Income (k$) Spending Score (1-100)
0 19 15.0 39
1 21 15.0 81
2 20 16.0 6
3 23 16.0 77
4 31 17.0 40
5 22 17.0 76
6 35 18.0 6
7 23 18.0 94
8 64 19.0 3
9 30 19.0 72
10 67 19.0 14
11 35 19.0 99
12 58 20.0 15
13 24 20.0 77
14 37 20.0 13
15 22 20.0 79
16 35 21.0 35
17 20 21.0 66
18 52 23.0 29
19 35 23.0 98
20 35 24.0 35
21 25 24.0 73
22 46 25.0 5
23 31 25.0 73
24 54 28.0 14
25 29 28.0 82
26 45 28.0 32
27 35 28.0 61
28 40 29.0 31
29 23 29.0 87
... ... ... ...
170 40 87.0 13
171 28 87.0 75
172 36 87.0 10
173 36 87.0 92
174 52 88.0 13
175 30 88.0 86
176 58 88.0 15
177 27 88.0 69
178 59 93.0 14
179 35 93.0 90
180 37 97.0 32
181 32 97.0 86
182 46 98.0 15
183 29 98.0 88
184 41 99.0 39
185 30 99.0 97
186 54 101.0 24
187 28 101.0 68
188 41 103.0 17
189 36 103.0 85
190 34 103.0 23
191 32 103.0 69
192 33 113.0 8
193 38 113.0 91
194 47 120.0 16
195 35 120.0 79
196 45 126.0 28
197 32 126.0 74
198 32 137.0 18
199 30 137.0 83
200 rows × 3 columns
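select_dtypes also accepts an exclude argument, which can be handier when it is easier to say what you do not want. A minimal sketch, assuming the same df as above:

# Keep everything except text/object columns - should return the same three numeric columns as above
df.select_dtypes(exclude=['object']).head()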

copy

In pandas, when you write df2 = df, you are not making a copy of df. Both names simply point to the same underlying object, so changes made through one name show up in the other. To make an independent copy, you need to use the .copy() method:
In [10]:
df2 = df.copy()

# axis = 1 to indicate dropping by column; inplace = True to make the dropping permanent

df2.drop('Age', axis = 1, inplace = True)
df2.head()
Out[10]:
Gender Annual Income (k$) Spending Score (1-100)
0 Male 15.0 39
1 Male 15.0 81
2 Female 16.0 6
3 Female 16.0 77
4 Female 17.0 40
Notice that what we have done to df2 does not affect df:
In [11]:
df.head()
Out[11]:
Gender Age Annual Income (k$) Spending Score (1-100)
0 Male 19 15.0 39
1 Male 21 15.0 81
2 Female 20 16.0 6
3 Female 23 16.0 77
4 Female 31 17.0 40
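To see why .copy() matters, here is a minimal sketch (the name df_alias is just for illustration) showing that plain assignment only creates another name for the same dataframe:

df_alias = df               # no copy: both names refer to the same object
df_alias['Dummy'] = 0       # a column added through df_alias...
'Dummy' in df.columns       # ...is visible through df as well (True)
df.drop('Dummy', axis=1, inplace=True)   # tidy up so the rest of the post is unaffected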

map vs get_dummies

When dealing with categorical data, you often need to convert these columns into dummy variables so that machine learning and statistical models/algorithms can interpret them correctly. One way to go about it is to create a dictionary and map the values:
In [12]:
gender_map = {'Female':1, 'Male':0}
df['Female'] = df['Gender'].map(gender_map)
df.head()
Out[12]:
Gender Age Annual Income (k$) Spending Score (1-100) Female
0 Male 19 15.0 39 0
1 Male 21 15.0 81 0
2 Female 20 16.0 6 1
3 Female 23 16.0 77 1
4 Female 31 17.0 40 1
Alternatively, you can use pandas.get_dummies:
In [13]:
df_dummies = pd.get_dummies(df['Gender'])
df_dummies.head()
Out[13]:
Female Male
0 0 1
1 0 1
2 1 0
3 1 0
4 1 0
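Since the two dummy columns are mirror images of each other, one of them is redundant. If you prefer, get_dummies can drop the first level for you; a small sketch using the same Gender column:

# Keeps a single column ('Male' here, since levels are sorted alphabetically);
# 'Female' can be inferred as its complement
pd.get_dummies(df['Gender'], drop_first=True).head()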

merge

You can merge a dataframe to an existing one using the .merge method:
In [14]:
# Setting both left_index and right_index to True to merge on an index
df = df.merge(df_dummies, left_index = True, right_index = True)
In [15]:
df.sample(5)
Out[15]:
Gender Age Annual Income (k$) Spending Score (1-100) Female_x Female_y Male
101 Female 49 62.0 48 1 1 0
176 Male 58 88.0 15 0 0 1
36 Female 42 34.0 17 1 1 0
1 Male 21 15.0 81 0 0 1
43 Female 31 39.0 61 1 1 0
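Because the two dataframes share the same index, pd.concat along axis=1 would glue the columns together in much the same way. A small sketch, assuming df as it was before the merge above (note that, unlike merge, concat keeps duplicate column names rather than adding _x/_y suffixes):

# Column-wise concatenation on a shared index
pd.concat([df, df_dummies], axis=1).head()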

rename

We can rename columns using the .rename method. All you have to do is pass a dictionary mapping old names to new names to the columns argument and set inplace to True.
In [16]:
# Dropping Redundant columns
df.drop(columns =['Female_y','Male', 'Gender'], inplace = True )

# Renaming column
df.rename(columns = {'Female_x':'Female'}, inplace = True)
df.sample(5)
Out[16]:
Age Annual Income (k$) Spending Score (1-100) Female
187 28 101.0 68 0
24 54 28.0 14 1
60 70 46.0 56 0
99 20 61.0 49 0
54 50 43.0 45 1

Using Apply

You can create a new column and fill it with calculated values by using the .apply method, often together with a lambda expression. .apply applies a function to every element of a dataframe column; a lambda expression simply lets you define that function inline. This is a common way to generate new categorical features from our data (feature engineering).
In [17]:
def young_old(i):
    if i>=55:
        return 1
    else:
        return 0
    
df['55 And Over'] = df['Age'].apply(lambda x: young_old(x))
df.head()
Out[17]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over
0 19 15.0 39 0 0
1 21 15.0 81 0 0
2 20 16.0 6 1 0
3 23 16.0 77 1 0
4 31 17.0 40 1 0
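For a simple threshold rule like this one, a vectorised comparison achieves the same result without .apply and is usually faster on large data. A sketch that produces an identical column:

# Boolean comparison gives True/False; astype(int) converts these to 1/0
df['55 And Over'] = (df['Age'] >= 55).astype(int)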

isna, dropna and fillna

Often a dataframe contains missing values, and many data modelling and prediction techniques require the input data to have none. There are two main ways to deal with missing values: drop the rows that contain them, or fill them in.
The method .isna returns True for NaN/missing values and False otherwise. Our dataframe has no missing values, so counting the NaNs in each column returns zeros.
In [18]:
df.apply(lambda x: sum(x.isna()))
Out[18]:
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
Female                    0
55 And Over               0
dtype: int64
The method .dropna drops the rows containing missing values. In our case, no rows would be dropped - you would still have all 200 rows:
In [19]:
# .dropna() returns a new dataframe; assign the result back (or use inplace=True) to keep it
df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Age                       200 non-null int32
Annual Income (k$)        200 non-null float64
Spending Score (1-100)    200 non-null int32
Female                    200 non-null int64
55 And Over               200 non-null int64
dtypes: float64(1), int32(2), int64(2)
memory usage: 6.3 KB
Another common way to deal with missing values is to use the .fillna() method. Typically, the mean is used for floats, the median for integers and the mode for categorical data, to minimise the effect on the statistical properties of the dataset.
In [20]:
# Fill any missing ages with the column median (fillna returns a new Series unless assigned back)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Age'].median()
Out[20]:
36.0
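As a more general sketch (these fills are purely illustrative; our dataset has no NaNs, so nothing actually changes):

# Float column: fill any missing values with the column mean
df['Annual Income (k$)'] = df['Annual Income (k$)'].fillna(df['Annual Income (k$)'].mean())

# Categorical/dummy column: fill with the mode (most frequent value)
df['Female'] = df['Female'].fillna(df['Female'].mode()[0])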

value_counts

value_counts() lets us check the distinct values in a column and the frequency of each. This is often used in conjunction with a bar graph:
In [21]:
df['Age'].value_counts().head()
Out[21]:
32    11
35     9
19     8
31     8
30     7
Name: Age, dtype: int64
In [22]:
plt.figure(figsize =(15,5))
df['Age'].value_counts().plot(kind = 'bar')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2529abbbf28>
There are some more advanced tricks:
A. normalize = True: if you want relative frequencies instead of raw counts.
B. dropna = False: if you also want missing values to be included in the counts (illustrated with a small toy example after the normalize example below).
C. df['c'].value_counts().reset_index(): if you want to convert the result into a pandas dataframe and manipulate it.
D. df['c'].value_counts().reset_index().sort_values(by='index'): shows the counts sorted by the distinct values in column 'c' instead of by frequency.
In [23]:
df['Age'].value_counts(normalize = True).head()
Out[23]:
32    0.055
35    0.045
19    0.040
31    0.040
30    0.035
Name: Age, dtype: float64
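Our dataset has no missing values, so to illustrate Trick B here is a small toy Series (purely illustrative):

# Toy example: with dropna=False, NaN shows up as its own category in the counts
toy = pd.Series([18, 18, np.nan, 20])
toy.value_counts(dropna=False)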
In our above barplot, ages are sorted by frequency. If we want to sort them by age, we would have to use Trick D. Note that Trick D will result in a DataFrame with 'index' as the ages and 'Age' as the frequency. We would rename them to avoid confusion:
In [24]:
age_freq = df['Age'].value_counts().reset_index().sort_values(by='index')
age_freq.head()
Out[24]:
index Age
18 18 4
2 19 8
16 20 5
15 21 5
27 22 3
In [25]:
age_freq.rename(columns = {'index':'age','Age':'count'}, inplace= True)
In [26]:
plt.figure(figsize =(15,5))
sns.barplot(x = age_freq['age'], y = age_freq['count'], color = 'blue')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2529acd99b0>

pivot_table

pivot_table in pandas is similar to GROUP BY in SQL and PivotTables in Excel. It lets you summarise values across more than one dimension.
The aggfunc parameter lets you choose the aggregate function. NumPy functions are typically used here, e.g. aggfunc=np.mean.
In [27]:
pt = pd.pivot_table(data = df, 
                    index = '55 And Over',
                    columns = 'Female',
                    values = 'Spending Score (1-100)',
                    aggfunc=np.mean)
pt.head(10)
Out[27]:
Female 0 1
55 And Over
0 51.414286 52.70
1 37.222222 41.75
In [28]:
pt.plot(kind= 'bar')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x2529b090198>
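aggfunc can also take a list of functions, and margins=True adds row/column totals. A small sketch assuming the same df (pt2 is just an illustrative name):

# Mean and count of spending score for each group, with overall totals in an 'All' row/column
pt2 = pd.pivot_table(data=df,
                     index='55 And Over',
                     columns='Female',
                     values='Spending Score (1-100)',
                     aggfunc=[np.mean, len],
                     margins=True)
pt2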

select specific rows and columns

We can extract subsets of our dataframe by position with .iloc, restricting the rows and columns we want:
In [29]:
df_filtered = df.iloc[5:10,[2,4]]
df_filtered
Out[29]:
Spending Score (1-100) 55 And Over
5 76 0
6 6 0
7 94 0
8 3 1
9 72 0
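.iloc selects by integer position; the label-based counterpart is .loc, which selects by index labels and column names. A sketch selecting the same rows and columns by name:

# Rows with index labels 5 to 9 (inclusive with .loc) and two columns selected by name
df.loc[5:9, ['Spending Score (1-100)', '55 And Over']]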
We can also build a boolean filter from some criteria and use df[filter] to create a subset dataframe containing only the rows that fit those criteria.
In [30]:
filter = df['Age'].isin([18,19,20])
filter
Out[30]:
0       True
1      False
2       True
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
170    False
171    False
172    False
173    False
174    False
175    False
176    False
177    False
178    False
179    False
180    False
181    False
182    False
183    False
184    False
185    False
186    False
187    False
188    False
189    False
190    False
191    False
192    False
193    False
194    False
195    False
196    False
197    False
198    False
199    False
Name: Age, Length: 200, dtype: bool
In [31]:
df_filtered2 = df[filter]
df_filtered2
Out[31]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over
0 19 15.0 39 0 0
2 20 16.0 6 1 0
17 20 21.0 66 0 0
33 18 33.0 92 0 0
39 20 37.0 75 1 0
61 19 46.0 55 0 0
65 18 48.0 59 0 0
68 19 48.0 59 0 0
91 18 59.0 41 0 0
99 20 61.0 49 0 0
111 19 63.0 54 1 0
113 19 64.0 46 0 0
114 18 65.0 48 1 0
115 19 65.0 50 1 0
134 20 73.0 5 0 0
138 19 74.0 10 0 0
162 19 81.0 5 0 0

Percentile Groups using np.percentile and cut:

One way to add labels or engineer new features from your data is to convert continuous variables into categorical ones. Here we break annual income down into a number of income brackets. To achieve this, we first compute a list of percentile cut points with np.percentile, then bin the data with pandas.cut.
Note that group 0 has the lowest incomes and group 3 the highest.
In [32]:
cut_points = [np.percentile(df['Annual Income (k$)'], i) for i in [0,30,70,90,100]]

df['income group'] =pd.cut(df['Annual Income (k$)'], cut_points, labels=False, include_lowest=True)
df.sample(10)
Out[32]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over income group
168 36 87.0 27 1 0 2
100 23 62.0 41 1 0 1
146 48 77.0 36 0 0 2
115 19 65.0 50 1 0 1
15 22 20.0 79 0 0 0
170 40 87.0 13 0 0 2
9 30 19.0 72 1 0 0
119 50 67.0 57 1 0 1
12 58 20.0 15 1 1 0
198 32 137.0 18 0 0 3
You can also use pandas.qcut if you want quantile-based bins, i.e. bins that each contain roughly the same number of observations (rather than bins defined by fixed cut points).
You can also add labels:
In [33]:
labels = ['0 - 25%', '25% - 50%', '50% - 75%', '75% - 100%']
df['income percentile group'] = pd.qcut(x = df['Annual Income (k$)'], q = 4, labels = labels)
df.sample(10)
Out[33]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over income group income percentile group
194 47 120.0 16 1 0 3 75% - 100%
36 42 34.0 17 1 0 0 0 - 25%
73 60 50.0 56 1 1 1 25% - 50%
75 26 54.0 54 0 0 1 25% - 50%
37 30 34.0 73 1 0 0 0 - 25%
47 27 40.0 47 1 0 0 0 - 25%
156 37 78.0 1 0 0 2 50% - 75%
83 46 54.0 44 1 0 1 25% - 50%
30 60 30.0 4 0 1 0 0 - 25%
171 28 87.0 75 0 0 2 75% - 100%
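To sanity-check that qcut produced roughly equal-sized groups, you can count the rows in each bin (a small sketch; with 200 customers and q = 4, each group should hold about 50):

# Frequency of each quartile label, kept in label order rather than sorted by count
df['income percentile group'].value_counts(sort=False)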
