Friday, 26 April 2019

Pandas for Data Analysis

Pandas is the most popular Python library for general data analysis. You can use it to load data, change data types, summarise data, calculate new fields/columns, and much, much more.

The purpose of this blog post is to outline and illustrate the most useful Pandas techniques using a single dataset. There is a lot of great material out there, but it tends to show these techniques in isolation without context, or as part of a larger project where only the most relevant techniques appear.

This post uses a simple Kaggle dataset, available at https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/version/1, and applies to it mainly the techniques described in Shiu-Tang Li's excellent post on Towards Data Science: https://towardsdatascience.com/10-python-pandas-tricks-that-make-your-work-more-efficient-2e8e483808ba

This post assumes you, the reader, understand the basics of Python, especially data types, functions and lambda expressions. If you want to run the code, you will need the relevant libraries installed. The code and output below were produced in a Jupyter Notebook, which is available as part of the Anaconda distribution of Python.

Let's get started:



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

read_csv

With read_csv, you can limit the number of rows you read using the nrows argument. This lets you preview what the columns look like without committing to loading the entire dataset:
In [2]:
df = pd.read_csv('Mall_Customers.csv', nrows = 5)
df
Out[2]:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
You can get a list of the columns using the .columns.to_list() method:
In [3]:
col_list = df.columns.to_list()
col_list
Out[3]:
['CustomerID', 'Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
CustomerID is not a very useful column, so we can exclude it when loading the full dataset. We remove 'CustomerID' from col_list and pass the remaining columns to the usecols argument. In addition, we can specify the data types of the columns using the dtype argument.
df.head() shows the first five rows of the dataframe (five is the default).
In [4]:
col_list = col_list[1:]
col_list
Out[4]:
['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
In [5]:
dtype_dict = {'Gender':str,'Age':int,'Annual Income (k$)': float, 'Spending Score (1-100)': int}
In [6]:
df = pd.read_csv('Mall_Customers.csv',usecols = col_list, dtype = dtype_dict)
df.head()
Out[6]:
Gender Age Annual Income (k$) Spending Score (1-100)
0 Male 19 15.0 39
1 Male 21 15.0 81
2 Female 20 16.0 6
3 Female 23 16.0 77
4 Female 31 17.0 40
df.info() can be run to display the number of non-null data points in each column as well as their data types.
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
Gender                    200 non-null object
Age                       200 non-null int32
Annual Income (k$)        200 non-null float64
Spending Score (1-100)    200 non-null int32
dtypes: float64(1), int32(2), object(1)
memory usage: 4.8+ KB

select_dtypes

You can count how many columns there are of each data type using the .dtypes.value_counts() method:
In [8]:
df.dtypes.value_counts()
Out[8]:
int32      2
object     1
float64    1
dtype: int64
You can select a smaller dataframe containing only the data types you want using the .select_dtypes method:
In [9]:
# Selecting a dataframe with only 'float64' and 'int32'
df.select_dtypes(include = ['float64', 'int32'])
Out[9]:
Age Annual Income (k$) Spending Score (1-100)
0 19 15.0 39
1 21 15.0 81
2 20 16.0 6
3 23 16.0 77
4 31 17.0 40
5 22 17.0 76
6 35 18.0 6
7 23 18.0 94
8 64 19.0 3
9 30 19.0 72
10 67 19.0 14
11 35 19.0 99
12 58 20.0 15
13 24 20.0 77
14 37 20.0 13
15 22 20.0 79
16 35 21.0 35
17 20 21.0 66
18 52 23.0 29
19 35 23.0 98
20 35 24.0 35
21 25 24.0 73
22 46 25.0 5
23 31 25.0 73
24 54 28.0 14
25 29 28.0 82
26 45 28.0 32
27 35 28.0 61
28 40 29.0 31
29 23 29.0 87
... ... ... ...
170 40 87.0 13
171 28 87.0 75
172 36 87.0 10
173 36 87.0 92
174 52 88.0 13
175 30 88.0 86
176 58 88.0 15
177 27 88.0 69
178 59 93.0 14
179 35 93.0 90
180 37 97.0 32
181 32 97.0 86
182 46 98.0 15
183 29 98.0 88
184 41 99.0 39
185 30 99.0 97
186 54 101.0 24
187 28 101.0 68
188 41 103.0 17
189 36 103.0 85
190 34 103.0 23
191 32 103.0 69
192 33 113.0 8
193 38 113.0 91
194 47 120.0 16
195 35 120.0 79
196 45 126.0 28
197 32 126.0 74
198 32 137.0 18
199 30 137.0 83
200 rows × 3 columns
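select_dtypes also accepts an exclude argument, which can be handier when it is easier to say what you do not want. A minimal sketch, assuming the same df as above:

# Keep everything except text/object columns - should return the same three numeric columns as above
df.select_dtypes(exclude=['object']).head()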

copy

In pandas, when you write df2 = df, you are not making a copy of df. Both names simply point to the same underlying object, so changes made through one name show up in the other. To make an independent copy, you need to use the .copy() method:
In [10]:
df2 = df.copy()

# axis = 1 to indicate dropping by column; inplace = True to make the dropping permanent

df2.drop('Age', axis = 1, inplace = True)
df2.head()
Out[10]:
Gender Annual Income (k$) Spending Score (1-100)
0 Male 15.0 39
1 Male 15.0 81
2 Female 16.0 6
3 Female 16.0 77
4 Female 17.0 40
Notice that what we have done to df2 does not affect df:
In [11]:
df.head()
Out[11]:
Gender Age Annual Income (k$) Spending Score (1-100)
0 Male 19 15.0 39
1 Male 21 15.0 81
2 Female 20 16.0 6
3 Female 23 16.0 77
4 Female 31 17.0 40
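To see why .copy() matters, here is a minimal sketch (the name df_alias is just for illustration) showing that plain assignment only creates another name for the same dataframe:

df_alias = df               # no copy: both names refer to the same object
df_alias['Dummy'] = 0       # a column added through df_alias...
'Dummy' in df.columns       # ...is visible through df as well (True)
df.drop('Dummy', axis=1, inplace=True)   # tidy up so the rest of the post is unaffected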

map vs get_dummies

When dealing with categorical data, you often need to convert these columns into dummy variables so that machine learning and statistical models/algorithms can interpret them correctly. One way to go about it is to create a dictionary and map the values:
In [12]:
gender_map = {'Female':1, 'Male':0}
df['Female'] = df['Gender'].map(gender_map)
df.head()
Out[12]:
Gender Age Annual Income (k$) Spending Score (1-100) Female
0 Male 19 15.0 39 0
1 Male 21 15.0 81 0
2 Female 20 16.0 6 1
3 Female 23 16.0 77 1
4 Female 31 17.0 40 1
Alternatively, you can use pandas.get_dummies:
In [13]:
df_dummies = pd.get_dummies(df['Gender'])
df_dummies.head()
Out[13]:
Female Male
0 0 1
1 0 1
2 1 0
3 1 0
4 1 0
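Since the two dummy columns are mirror images of each other, one of them is redundant. If you prefer, get_dummies can drop the first level for you; a small sketch using the same Gender column:

# Keeps a single column ('Male' here, since levels are sorted alphabetically);
# 'Female' can be inferred as its complement
pd.get_dummies(df['Gender'], drop_first=True).head()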

merge

You can merge a dataframe to an existing one using the .merge method:
In [14]:
# Setting both left_index and right_index to True to merge on an index
df = df.merge(df_dummies, left_index = True, right_index = True)
In [15]:
df.sample(5)
Out[15]:
Gender Age Annual Income (k$) Spending Score (1-100) Female_x Female_y Male
101 Female 49 62.0 48 1 1 0
176 Male 58 88.0 15 0 0 1
36 Female 42 34.0 17 1 1 0
1 Male 21 15.0 81 0 0 1
43 Female 31 39.0 61 1 1 0
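Because the two dataframes share the same index, pd.concat along axis=1 would glue the columns together in much the same way. A small sketch, assuming df as it was before the merge above (note that, unlike merge, concat keeps duplicate column names rather than adding _x/_y suffixes):

# Column-wise concatenation on a shared index
pd.concat([df, df_dummies], axis=1).head()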

rename

We can rename columns using the .rename method. All you have to do is pass a dictionary mapping old names to new names to the columns argument and set inplace to True.
In [16]:
# Dropping Redundant columns
df.drop(columns =['Female_y','Male', 'Gender'], inplace = True )

# Renaming column
df.rename(columns = {'Female_x':'Female'}, inplace = True)
df.sample(5)
Out[16]:
Age Annual Income (k$) Spending Score (1-100) Female
187 28 101.0 68 0
24 54 28.0 14 1
60 70 46.0 56 0
99 20 61.0 49 0
54 50 43.0 45 1

Using Apply

You can create a new column and fill it with calculated values by using the .apply method, often together with a lambda expression. .apply applies a function to every element of a dataframe column; a lambda expression simply lets you define that function inline. This is a common way to generate new categorical features from our data (feature engineering).
In [17]:
def young_old(i):
    if i>=55:
        return 1
    else:
        return 0
    
df['55 And Over'] = df['Age'].apply(lambda x: young_old(x))
df.head()
Out[17]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over
0 19 15.0 39 0 0
1 21 15.0 81 0 0
2 20 16.0 6 1 0
3 23 16.0 77 1 0
4 31 17.0 40 1 0
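For a simple threshold rule like this one, a vectorised comparison achieves the same result without .apply and is usually faster on large data. A sketch that produces an identical column:

# Boolean comparison gives True/False; astype(int) converts these to 1/0
df['55 And Over'] = (df['Age'] >= 55).astype(int)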

isna, dropna and fillna

Often a dataframe contains missing values, and many data modelling and prediction techniques require the input data to have none. There are two main ways to deal with missing values: drop the rows that contain them, or fill them in.
The method .isna returns True for NaN/missing values and False otherwise. Our dataframe has no missing values, so counting the NaNs in each column returns zeros.
In [18]:
df.apply(lambda x: sum(x.isna()))
Out[18]:
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
Female                    0
55 And Over               0
dtype: int64
The method .dropna drops the rows containing missing values. In our case, no rows would be dropped - you would still have all 200 rows:
In [19]:
# .dropna() returns a new dataframe; assign the result back (or use inplace=True) to keep it
df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Age                       200 non-null int32
Annual Income (k$)        200 non-null float64
Spending Score (1-100)    200 non-null int32
Female                    200 non-null int64
55 And Over               200 non-null int64
dtypes: float64(1), int32(2), int64(2)
memory usage: 6.3 KB
Another common way to deal with missing values is to use the .fillna() method. Typically, the mean is used for floats, the median for integers and the mode for categorical data, to minimise the effect on the statistical properties of the dataset.
In [20]:
# Fill any missing ages with the column median (fillna returns a new Series unless assigned back)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Age'].median()
Out[20]:
36.0
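As a more general sketch (these fills are purely illustrative; our dataset has no NaNs, so nothing actually changes):

# Float column: fill any missing values with the column mean
df['Annual Income (k$)'] = df['Annual Income (k$)'].fillna(df['Annual Income (k$)'].mean())

# Categorical/dummy column: fill with the mode (most frequent value)
df['Female'] = df['Female'].fillna(df['Female'].mode()[0])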

value_counts

value_counts() lets us check the distinct values in a column and the frequency of each. This is often used in conjunction with a bar graph:
In [21]:
df['Age'].value_counts().head()
Out[21]:
32    11
35     9
19     8
31     8
30     7
Name: Age, dtype: int64
In [22]:
plt.figure(figsize =(15,5))
df['Age'].value_counts().plot(kind = 'bar')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2529abbbf28>
There are some more advanced tricks:
A. normalize = True: if you want relative frequencies instead of raw counts.
B. dropna = False: if you also want missing values to be included in the counts (illustrated with a small toy example after the normalize example below).
C. df['c'].value_counts().reset_index(): if you want to convert the result into a pandas dataframe and manipulate it.
D. df['c'].value_counts().reset_index().sort_values(by='index'): shows the counts sorted by the distinct values in column 'c' instead of by frequency.
In [23]:
df['Age'].value_counts(normalize = True).head()
Out[23]:
32    0.055
35    0.045
19    0.040
31    0.040
30    0.035
Name: Age, dtype: float64
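Our dataset has no missing values, so to illustrate Trick B here is a small toy Series (purely illustrative):

# Toy example: with dropna=False, NaN shows up as its own category in the counts
toy = pd.Series([18, 18, np.nan, 20])
toy.value_counts(dropna=False)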
In our above barplot, ages are sorted by frequency. If we want to sort them by age, we would have to use Trick D. Note that Trick D will result in a DataFrame with 'index' as the ages and 'Age' as the frequency. We would rename them to avoid confusion:
In [24]:
age_freq = df['Age'].value_counts().reset_index().sort_values(by='index')
age_freq.head()
Out[24]:
index Age
18 18 4
2 19 8
16 20 5
15 21 5
27 22 3
In [25]:
age_freq.rename(columns = {'index':'age','Age':'count'}, inplace= True)
In [26]:
plt.figure(figsize =(15,5))
sns.barplot(x = age_freq['age'], y = age_freq['count'], color = 'blue')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2529acd99b0>

pivot_table

pivot_table in pandas is similar to GROUP BY in SQL and PivotTables in Excel. It lets you summarise values across more than one dimension.
The aggfunc parameter lets you choose the aggregate function. NumPy functions are typically used here, e.g. aggfunc=np.mean.
In [27]:
pt = pd.pivot_table(data = df, 
                    index = '55 And Over',
                    columns = 'Female',
                    values = 'Spending Score (1-100)',
                    aggfunc=np.mean)
pt.head(10)
Out[27]:
Female 0 1
55 And Over
0 51.414286 52.70
1 37.222222 41.75
In [28]:
pt.plot(kind= 'bar')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x2529b090198>
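aggfunc can also take a list of functions, and margins=True adds row/column totals. A small sketch assuming the same df (pt2 is just an illustrative name):

# Mean and count of spending score for each group, with overall totals in an 'All' row/column
pt2 = pd.pivot_table(data=df,
                     index='55 And Over',
                     columns='Female',
                     values='Spending Score (1-100)',
                     aggfunc=[np.mean, len],
                     margins=True)
pt2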

select specific rows and columns

We can extract subsets of our dataframe by position with .iloc, restricting the rows and columns we want:
In [29]:
df_filtered = df.iloc[5:10,[2,4]]
df_filtered
Out[29]:
Spending Score (1-100) 55 And Over
5 76 0
6 6 0
7 94 0
8 3 1
9 72 0
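.iloc selects by integer position; the label-based counterpart is .loc, which selects by index labels and column names. A sketch selecting the same rows and columns by name:

# Rows with index labels 5 to 9 (inclusive with .loc) and two columns selected by name
df.loc[5:9, ['Spending Score (1-100)', '55 And Over']]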
We can also build a boolean filter from some criteria and use df[filter] to create a subset dataframe containing only the rows that fit those criteria.
In [30]:
filter = df['Age'].isin([18,19,20])
filter
Out[30]:
0       True
1      False
2       True
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
170    False
171    False
172    False
173    False
174    False
175    False
176    False
177    False
178    False
179    False
180    False
181    False
182    False
183    False
184    False
185    False
186    False
187    False
188    False
189    False
190    False
191    False
192    False
193    False
194    False
195    False
196    False
197    False
198    False
199    False
Name: Age, Length: 200, dtype: bool
In [31]:
df_filtered2 = df[filter]
df_filtered2
Out[31]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over
0 19 15.0 39 0 0
2 20 16.0 6 1 0
17 20 21.0 66 0 0
33 18 33.0 92 0 0
39 20 37.0 75 1 0
61 19 46.0 55 0 0
65 18 48.0 59 0 0
68 19 48.0 59 0 0
91 18 59.0 41 0 0
99 20 61.0 49 0 0
111 19 63.0 54 1 0
113 19 64.0 46 0 0
114 18 65.0 48 1 0
115 19 65.0 50 1 0
134 20 73.0 5 0 0
138 19 74.0 10 0 0
162 19 81.0 5 0 0

Percentile Groups using np.percentile and cut:

One way to add labels or engineer new features from your data is to convert continuous variables into categorical ones. Here we break annual income down into a number of income brackets. To achieve this, we first compute a list of percentile cut points with np.percentile, then bin the data with pandas.cut.
Note that group 0 has the lowest incomes and group 3 the highest.
In [32]:
cut_points = [np.percentile(df['Annual Income (k$)'], i) for i in [0,30,70,90,100]]

df['income group'] =pd.cut(df['Annual Income (k$)'], cut_points, labels=False, include_lowest=True)
df.sample(10)
Out[32]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over income group
168 36 87.0 27 1 0 2
100 23 62.0 41 1 0 1
146 48 77.0 36 0 0 2
115 19 65.0 50 1 0 1
15 22 20.0 79 0 0 0
170 40 87.0 13 0 0 2
9 30 19.0 72 1 0 0
119 50 67.0 57 1 0 1
12 58 20.0 15 1 1 0
198 32 137.0 18 0 0 3
You can also use pandas.qcut if you want quantile-based bins, i.e. bins that each contain roughly the same number of observations (rather than bins defined by fixed cut points).
You can also add labels:
In [33]:
labels = ['0 - 25%', '25% - 50%', '50% - 75%', '75% - 100%']
df['income percentile group'] = pd.qcut(x = df['Annual Income (k$)'], q = 4, labels = labels)
df.sample(10)
Out[33]:
Age Annual Income (k$) Spending Score (1-100) Female 55 And Over income group income percentile group
194 47 120.0 16 1 0 3 75% - 100%
36 42 34.0 17 1 0 0 0 - 25%
73 60 50.0 56 1 1 1 25% - 50%
75 26 54.0 54 0 0 1 25% - 50%
37 30 34.0 73 1 0 0 0 - 25%
47 27 40.0 47 1 0 0 0 - 25%
156 37 78.0 1 0 0 2 50% - 75%
83 46 54.0 44 1 0 1 25% - 50%
30 60 30.0 4 0 1 0 0 - 25%
171 28 87.0 75 0 0 2 75% - 100%
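To sanity-check that qcut produced roughly equal-sized groups, you can count the rows in each bin (a small sketch; with 200 customers and q = 4, each group should hold about 50):

# Frequency of each quartile label, kept in label order rather than sorted by count
df['income percentile group'].value_counts(sort=False)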
