The purpose of this blog post is to outline and illustrate the most useful Pandas techniques using a single dataset. There is a lot of great material out there, but it tends to show these techniques in isolation, without context, or as part of a larger project where only the most relevant Pandas techniques are used.
This post uses a simple Kaggle dataset, available at https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/version/1, and applies to it mainly the techniques described in Shiu-Tang Li's excellent post on Towards Data Science: https://towardsdatascience.com/10-python-pandas-tricks-that-make-your-work-more-efficient-2e8e483808ba
This post assumes you, the reader, understand the basics of Python, especially data types, functions and lambda expressions. If you want to run the code, you will need to have the relevant libraries installed as well. The code and output below were produced in Jupyter Notebook, which is also available as part of the Anaconda distribution of Python.
Let's get started:
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
read_csv
With read_csv, you can limit the number of rows you read. This way you can check what the columns look like without committing to loading the entire dataset:
In [2]:
df = pd.read_csv('Mall_Customers.csv', nrows = 5)
df
Out[2]:
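As a side note, if the file were too large to load at once, read_csv can also return an iterator of smaller dataframes via the chunksize argument; a minimal sketch (the chunk size of 50 is arbitrary, chosen for illustration):
# chunksize makes read_csv return an iterator of smaller DataFrames
for chunk in pd.read_csv('Mall_Customers.csv', chunksize = 50):
    print(chunk.shape)  # each chunk is a regular DataFrame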
You can get a list of the columns using the .columns.to_list() method:
In [3]:
col_list = df.columns.to_list()
col_list
Out[3]:
CustomerID is not a very useful column, so we can exclude it when loading the full dataset. We are going to remove 'CustomerID' from our col_list and use the usecols argument to specify the columns we want to load. In addition, we can specify the data types of the columns using the dtype argument.
df.head() shows the top five rows of the dataframe.
In [4]:
col_list = col_list[1:]
col_list
Out[4]:
In [5]:
dtype_dict = {'Gender':str,'Age':int,'Annual Income (k$)': float, 'Spending Score (1-100)': int}
In [6]:
df = pd.read_csv('Mall_Customers.csv',usecols = col_list, dtype = dtype_dict)
df.head()
Out[6]:
df.info() can be run to display the number of data points in each column as well as their data types.
In [7]:
df.info()
select_dtypes
You can count how many columns you have of each data type using the .dtypes.value_counts() method:
In [8]:
df.dtypes.value_counts()
Out[8]:
You can select a smaller dataframe, filtering for only the data types you want, using the .select_dtypes method:
In [9]:
# Selecting a dataframe with only 'float64' and 'int32'
df.select_dtypes(include = ['float64', 'int32'])
Out[9]:
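select_dtypes also accepts an exclude argument if it is easier to say what you don't want; in this dataframe the following is equivalent:
# Equivalent here: keep everything except the object (string) column
df.select_dtypes(exclude = ['object'])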
copy
In pandas, when you assign df1 = df2, you are not making a copy of df2. Instead, you are merely setting up a second reference to the same underlying object, so changes made through one name affect the other. To make an independent copy, you need to use the .copy() method:
In [10]:
df2 = df.copy()
# axis = 1 to indicate dropping by column; inplace = True to make the dropping permanent
df2.drop('Age', axis = 1, inplace = True)
df2.head()
Out[10]:
Notice that what we have done to df2 does not affect df:
In [11]:
df.head()
Out[11]:
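For contrast, here is a minimal sketch of plain assignment without .copy() (df3 is a hypothetical name used only for illustration):
# Plain assignment creates a second name for the same object, not a copy;
# dropping a column through df3 would therefore also change df
df3 = df
df3 is df  # True - both names refer to the same DataFrame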
map vs get_dummies
When dealing with categorical data, you often need to convert these columns to dummy variables so the data can be correctly interpreted by machine learning/statistical models and algorithms. One way to go about it is to create a dictionary and map the values:
In [12]:
gender_map = {'Female':1, 'Male':0}
df['Female'] = df['Gender'].map(gender_map)
df.head()
Out[12]:
Alternatively, you can use pandas.get_dummies:
In [13]:
df_dummies = pd.get_dummies(df['Gender'])
df_dummies.head()
Out[13]:
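Since either dummy column fully determines the other, pd.get_dummies also accepts a drop_first argument to keep only one of them:
# drop_first=True drops the first category, leaving a single 0/1 column
pd.get_dummies(df['Gender'], drop_first = True).head()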
merge
You can merge one dataframe into an existing one using the .merge method:
In [14]:
# Setting both left_index and right_index to True to merge on an index
df = df.merge(df_dummies, left_index = True, right_index = True)
In [15]:
df.sample(5)
Out[15]:
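Merging on the index is only one option. More commonly, you merge on a shared key column via the on argument; a minimal sketch with two small, hypothetical frames:
# Hypothetical frames sharing a 'CustomerID' key column
left = pd.DataFrame({'CustomerID': [1, 2, 3], 'Age': [19, 21, 20]})
right = pd.DataFrame({'CustomerID': [2, 3, 4], 'Score': [81, 6, 77]})
left.merge(right, on = 'CustomerID', how = 'inner')  # keeps IDs 2 and 3 only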
rename
We can rename columns using the .rename method. All you have to do is pass a dictionary to the columns argument and set inplace to True.
In [16]:
# Dropping Redundant columns
df.drop(columns =['Female_y','Male', 'Gender'], inplace = True )
# Renaming column
df.rename(columns = {'Female_x':'Female'}, inplace = True)
df.sample(5)
Out[16]:
Using Apply
You can create a new column and put calculated values in it by using the .apply method together with a lambda expression. .apply runs a function on every element of a dataframe column; the lambda here simply wraps our predefined function. This is a common way to generate new categorical features for our data (feature engineering).
In [17]:
def young_old(i):
    if i >= 55:
        return 1
    else:
        return 0

df['55 And Over'] = df['Age'].apply(lambda x: young_old(x))
df.head()
Out[17]:
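For a simple threshold like this, a vectorised comparison is an equivalent (and usually faster) alternative to .apply; a minimal sketch:
# Equivalent vectorised version: the boolean mask is cast to 0/1
df['55 And Over'] = (df['Age'] >= 55).astype(int)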
isna, dropna and fillna
Often a dataframe has missing values, yet many data modelling and prediction techniques require that the input data contain none. There are two main ways to deal with missing values: you can drop the row, or you can fill in the missing value. The method .isna returns True for NaN/missing values and False otherwise. In our dataframe, we have no missing values, so code that counts the number of NaNs returns zeros.
In [18]:
df.apply(lambda x: sum(x.isna()))
Out[18]:
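A more direct idiom gives the same per-column counts, since pandas sums booleans as 0s and 1s:
# Equivalent, more idiomatic NaN count per column
df.isna().sum()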
The method .dropna drops missing values. In our case, nothing is dropped - you still have all 200 rows:
In [19]:
df.apply(lambda x: x.dropna())
df.info()
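Note that applying .dropna column by column, as above, is unusual; the idiomatic call drops whole rows in one go:
# Idiomatic row-wise version: drops any row containing at least one NaN
# (pass inplace=True to modify df itself)
df.dropna()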
Another common way to deal with missing values is the .fillna() method. Typically, the mean is used for floats, the median for integers and the mode for categorical data, so as to minimise the effect on the statistical properties of the dataset.
In [20]:
df['Age'].fillna(value = df['Age'].median())
df['Age'].median()
Out[20]:
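As a sketch of the other conventions mentioned above, here is the mean used for the float column; the categorical example is shown commented out, since we dropped Gender earlier:
# Mean for a float column:
df['Annual Income (k$)'].fillna(df['Annual Income (k$)'].mean())
# Mode for a categorical column (Gender, before we dropped it):
# df['Gender'].fillna(df['Gender'].mode()[0])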
value_counts
value_counts() allows us to check what the possible values are and the frequency of each individual value. This is often used in conjunction with a bar graph:
In [21]:
df['Age'].value_counts().head()
Out[21]:
In [22]:
plt.figure(figsize =(15,5))
df['Age'].value_counts().plot(kind = 'bar')
Out[22]:
There are some advanced techniques:
A. normalize = True: if you want to check the frequency instead of counts.
B. dropna = False: if you also want to include missing values in the stats.
C. df['c'].value_counts().reset_index(): if you want to convert the stats table into a pandas dataframe and manipulate it.
D. df['c'].value_counts().reset_index().sort_values(by='index') : show the stats sorted by distinct values in column ‘c’ instead of counts.
In [23]:
df['Age'].value_counts(normalize = True).head()
Out[23]:
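Trick B works the same way; since Age has no missing values in this dataset, including NaNs changes nothing here:
# Trick B: dropna=False also counts missing values (no effect here)
df['Age'].value_counts(dropna = False).head()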
In our above barplot, ages are sorted by frequency. If we want to sort them by age, we would have to use Trick D. Note that Trick D will result in a DataFrame with 'index' as the ages and 'Age' as the frequency. We would rename them to avoid confusion:
In [24]:
age_freq = df['Age'].value_counts().reset_index().sort_values(by='index')
age_freq.head()
Out[24]:
In [25]:
age_freq.rename(columns = {'index':'age','Age':'count'}, inplace= True)
In [26]:
plt.figure(figsize =(15,5))
sns.barplot(x = age_freq['age'], y = age_freq['count'], color = 'blue')
Out[26]:
pivot_table
pivot_table in pandas is similar to GROUP BY in SQL and pivot tables in Excel. You can summarise values by more than one dimension. The aggfunc parameter lets you select the aggregate function you want to use; NumPy functions such as np.mean are commonly passed here.
In [27]:
pt = pd.pivot_table(data = df,
                    index = '55 And Over',
                    columns = 'Female',
                    values = 'Spending Score (1-100)',
                    aggfunc = np.mean)
pt.head(10)
Out[27]:
In [28]:
pt.plot(kind= 'bar')
Out[28]:
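pivot_table can also append row and column totals via margins=True; a minimal sketch:
# margins=True adds an 'All' row/column holding the aggregate over everything
pd.pivot_table(data = df,
               index = '55 And Over',
               columns = 'Female',
               values = 'Spending Score (1-100)',
               aggfunc = np.mean,
               margins = True)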
select specific rows and columns
We can extract subsets of our dataframe by restricting the indexes and columns:
In [29]:
df_filtered = df.iloc[5:10,[2,4]]
df_filtered
Out[29]:
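For comparison, .loc selects by label rather than position; with this dataframe's default integer index, the same subset looks like this (note that .loc slices include both endpoints, so 5:9 matches iloc's 5:10):
# Label-based equivalent of df.iloc[5:10, [2, 4]]
df.loc[5:9, ['Spending Score (1-100)', '55 And Over']]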
We can also create a filter with certain criteria and use the df[filter] technique to create a subset dataframe that fits those criteria.
In [30]:
filter = df['Age'].isin([18,19,20])
filter
Out[30]:
In [31]:
df_filtered2 = df[filter]
df_filtered2
Out[31]:
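Multiple boolean filters can be combined with the element-wise operators & and |, wrapping each condition in parentheses:
# Combining two boolean filters: young customers who are also female
df[(df['Age'].isin([18, 19, 20])) & (df['Female'] == 1)]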
Percentile Groups using np.percentile and cut
One way to add additional labels or engineer features from your data is to convert continuous variables to categorical ones. We can break annual income down into a number of income brackets. To achieve this, we first create a list of cut points, then use them to bin the data in our dataframe with pandas.cut. Note that group 0 has the lowest income and group 3 the highest.
In [32]:
cut_points = [np.percentile(df['Annual Income (k$)'], i) for i in [0,30,70,90,100]]
df['income group'] = pd.cut(df['Annual Income (k$)'], cut_points, labels = False, include_lowest = True)
df.sample(10)
Out[32]:
You can also use pandas.qcut if you want each bin/bracket to contain roughly the same number of observations (quantile-based bins) rather than be equally spaced.
You can also add labels:
In [33]:
labels = ['0 - 25%', '25% - 50%', '50% - 75%', '75% - 100%']
df['income percentile group'] = pd.qcut(x = df['Annual Income (k$)'], q = 4, labels = labels)
df.sample(10)
Out[33]:
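Because qcut bins by quantile, each group should contain roughly a quarter of the 200 customers, which you can verify with value_counts:
# Each quantile bin should hold roughly 200 / 4 = 50 customers
df['income percentile group'].value_counts()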
Data used
https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/version/1
References
https://towardsdatascience.com/10-python-pandas-tricks-that-make-your-work-more-efficient-2e8e483808ba
https://towardsdatascience.com/data-manipulation-for-machine-learning-with-pandas-ab23e79ba5de
https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-is-more-efficient
https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.pivot_table.html
https://www.geeksforgeeks.org/numpy-percentile-in-python/
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html