Saturday, 1 September 2018

Cold Calling for Car Insurance - Part 1: Data Visualisations and Feature Engineering

 Summary:

  • A US bank cold called 4000 customers to sell car insurance. The goal is to analyse customer data and data relating to the calls, and to build models that predict the success of cold calls.
  • No single customer feature strongly predicts whether a customer will buy car insurance.
  • The most significant features are (see Bivariate Correlation Analysis below):
    • 1) whether a customer has household insurance (negative correlation),
    • 2) whether previous attempts were made on a customer (positive correlation),
    • 3) how many days have passed since a customer was last contacted (positive correlation),
    • 4) whether a customer has a car loan (negative correlation).
  • I converted categorical data to dummy variables to make it easier for machine learning in Part 2 (see Feature Engineering - Dummy Variables below).
  • I created new features from existing ones (feature engineering) - call lengths and business hours. These are used to improve model performance and interpretation (see Further Feature Engineering below).

Car Insurance Cold Calls - Visual Analysis and Data Cleaning


The purpose of this project is to explore and analyse what factors affect cold calling success - a classification problem.
The data set is downloaded and available at https://www.kaggle.com/kondla/carinsurance/home (accessed 28/7/2018).
This project mostly follows the steps from the Analytics Vidhya tutorial "A Complete Tutorial to Learn Data Science with Python from Scratch" (link: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/).

Here is a brief description of the data from the data documentation:

Description

This is a dataset from one bank in the United States. Besides usual services, this bank also provides car insurance services. The bank organizes regular campaigns to attract new clients. The bank has potential customers’ data, and bank’s employees call them for advertising available car insurance options. We are provided with general information about clients (age, job, etc.) as well as more specific information about the current insurance sell campaign (communication, last contact day) and previous campaigns (attributes like previous attempts, outcome). You have data about 4000 customers who were contacted during the last campaign and for whom the results of campaign (did the customer buy insurance or not) are known.

Data Summary

Let's begin by importing some libraries. I am using Seaborn for better-looking charts.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Importing the data.
In [2]:
df = pd.read_csv('carInsurance_train.csv')
Taking a look at the top of the data and its general summary statistics:
In [3]:
df.head()
Out[3]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication LastContactDay LastContactMonth NoOfContacts DaysPassed PrevAttempts Outcome CallStart CallEnd CarInsurance
0 1 32 management single tertiary 0 1218 1 0 telephone 28 jan 2 -1 0 NaN 13:45:20 13:46:30 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN 26 may 5 -1 0 NaN 14:49:03 14:52:08 0
2 3 29 management single tertiary 0 637 1 0 cellular 3 jun 1 119 1 failure 16:30:24 16:36:04 1
3 4 25 student single primary 0 373 1 0 cellular 11 may 2 -1 0 NaN 12:06:43 12:20:22 1
4 5 30 management married tertiary 0 2694 0 0 cellular 3 jun 1 -1 0 NaN 14:35:44 14:38:56 0
In [4]:
df.describe()
Out[4]:
Id Age Default Balance HHInsurance CarLoan LastContactDay NoOfContacts DaysPassed PrevAttempts CarInsurance
count 4000.000000 4000.000000 4000.000000 4000.000000 4000.00000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000
mean 2000.500000 41.214750 0.014500 1532.937250 0.49275 0.133000 15.721250 2.607250 48.706500 0.717500 0.401000
std 1154.844867 11.550194 0.119555 3511.452489 0.50001 0.339617 8.425307 3.064204 106.685385 2.078647 0.490162
min 1.000000 18.000000 0.000000 -3058.000000 0.00000 0.000000 1.000000 1.000000 -1.000000 0.000000 0.000000
25% 1000.750000 32.000000 0.000000 111.000000 0.00000 0.000000 8.000000 1.000000 -1.000000 0.000000 0.000000
50% 2000.500000 39.000000 0.000000 551.500000 0.00000 0.000000 16.000000 2.000000 -1.000000 0.000000 0.000000
75% 3000.250000 49.000000 0.000000 1619.000000 1.00000 0.000000 22.000000 3.000000 -1.000000 0.000000 1.000000
max 4000.000000 95.000000 1.000000 98417.000000 1.00000 1.000000 31.000000 43.000000 854.000000 58.000000 1.000000
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 19 columns):
Id                  4000 non-null int64
Age                 4000 non-null int64
Job                 3981 non-null object
Marital             4000 non-null object
Education           3831 non-null object
Default             4000 non-null int64
Balance             4000 non-null int64
HHInsurance         4000 non-null int64
CarLoan             4000 non-null int64
Communication       3098 non-null object
LastContactDay      4000 non-null int64
LastContactMonth    4000 non-null object
NoOfContacts        4000 non-null int64
DaysPassed          4000 non-null int64
PrevAttempts        4000 non-null int64
Outcome             958 non-null object
CallStart           4000 non-null object
CallEnd             4000 non-null object
CarInsurance        4000 non-null int64
dtypes: int64(11), object(8)
memory usage: 593.8+ KB
Using the .describe and .info methods, we can see that there are 4000 rows of data. However, not all fields are complete - several have missing values. Whether these need to be imputed depends on which charts and machine learning models we run the data through.
Also note that several columns are non-numerical (non-null object) types.
Of the 4000 rows, only 958 have a recorded Outcome from a previous campaign.

Distribution Analysis

Let's take a look at the average yearly bank balance of the call recipients.
In [6]:
sns.distplot(df['Balance'].dropna(),kde=False, bins = 20 )

#Add labels
plt.title('Customer Balance Histogram, 20 Bins')
plt.xlabel('Average Yearly Balance, in USD')
plt.ylabel('Customer Count')
Out[6]:
<matplotlib.text.Text at 0x129e870ae48>
As we can see, most customers have a pretty low bank balance, probably because the balance is net of overdraft or credit card debt.
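We can also quantify this asymmetry directly; a quick check using the pandas skew method (a positive value indicates a long right tail):
In [ ]:
#Sample skewness of the balance distribution
print(df['Balance'].skew())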
How about the number of calls the customers were getting?
In [7]:
sns.countplot(df['NoOfContacts'].dropna())

#Add labels
plt.title('Number of Calls Customers Received')
plt.xlabel('Number of Calls')
plt.ylabel('Customer Count')
Out[7]:
<matplotlib.text.Text at 0x129e888ec18>
As we can see, most customers received only a few calls - in fact, half of them received just one or two. This might bias our machine learning results and make the estimated effect of the number of calls statistically insignificant. We should keep this in mind.
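We can confirm the "one or two calls" claim from the quartiles (consistent with df.describe() above):
In [ ]:
#25th, 50th and 75th percentiles of the number of contacts
print(df['NoOfContacts'].quantile([0.25, 0.5, 0.75]))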
I suspect that those who subscribed to car insurance (the successes) have, on average, a higher average yearly balance (and hence more money to spend).
Let's take a look at the distribution of bank balance by car insurance subscription:
In [8]:
#Set figure size
plt.figure(figsize = [5,15])

sns.boxplot(x=df.CarInsurance, y=df.Balance)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x129e8854e48>
The average balance of customers who were successfully sold car insurance is only slightly higher, and their balances show greater variance. It looks like average balance is not as decisive as hypothesised.
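To put numbers on the boxplot, we can compare summary statistics of Balance by outcome (a quick sketch):
In [ ]:
#Mean, spread and quartiles of Balance, split by sale outcome
print(df.groupby('CarInsurance')['Balance'].describe())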
Does the number of contacts have an effect?
In [9]:
#Set figure size
plt.figure(figsize = [5,15])

sns.boxplot(x=df.CarInsurance, y=df.NoOfContacts)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x129eba14438>
It looks like there is almost no effect at all! So what might affect a customer's decision to purchase car insurance?
Notice that in this section we are analysing only one continuous variable at a time.

Categorical Variable Analysis

To analyse the effects of categorical variables such as HHInsurance and CarLoan, we can use pivot tables.
HHInsurance:
In [10]:
pivot_table1 = df.pivot_table(values = 'CarInsurance', index = 'HHInsurance',aggfunc= 'mean')

print('Probability of getting car insurance by household insurance category: \n')
print(pivot_table1)
Probability of getting car insurance by household insurance category: 

             CarInsurance
HHInsurance              
0                0.499261
1                0.299848
So, it looks like people who don't already have household insurance are more likely to purchase car insurance.
CarLoan:
In [11]:
pivot_table2 = df.pivot_table(values = 'CarInsurance', index = 'CarLoan', aggfunc = 'mean')

print('Probability of getting car insurance by car loan status: \n')
print(pivot_table2)
Probability of getting car insurance by car loan status: 

         CarInsurance
CarLoan              
0            0.418685
1            0.285714
Again, we can see that people without a car loan are considerably more likely to purchase car insurance through the cold calling campaign.
It seems we have some clear indications as to which bank customers are most responsive to our cold calling campaign. Let's visualise these pivot tables:
In [12]:
fig, axes = plt.subplots(1,2, figsize = (15,5))

pivot_table1.plot(kind = 'bar',legend=False, ax = axes[0])
axes[0].set_ylabel('Probability')
axes[0].set_title('Probability of Purchasing Car Insurance by Household Insurance Status')
axes[0].set_ylim([0,0.6])

pivot_table2.plot(kind = 'bar',legend=False, ax = axes[1])
axes[1].set_ylabel('Probability')
axes[1].set_title('Probability of Purchasing Car Insurance by Car Loan Status')
axes[1].set_ylim([0,0.6])
Out[12]:
(0, 0.6)
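For reference, pd.crosstab gives the same conversion rates in one call; normalize='index' turns counts into within-row proportions (a sketch):
In [ ]:
#Conversion rate by car loan status; each row sums to 1
print(pd.crosstab(df['CarLoan'], df['CarInsurance'], normalize='index'))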

Bivariate Correlation Analysis

We want to see whether any predictors/features are strongly correlated with each other, and also identify the most important variables. A correlation heatmap kills two birds with one stone.
In [13]:
correlations = df.corr()
correlations = correlations.drop('Id', axis =1).drop('Id', axis =0)
sns.heatmap(correlations, annot = True, linecolor = 'blue')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x129ebe062e8>
The features are not strongly correlated with each other, with the exception of DaysPassed and PrevAttempts, which makes sense: DaysPassed is -1 whenever PrevAttempts is 0, and both take positive values otherwise.
The most significant features with respect to CarInsurance, in order of magnitude, are: HHInsurance, PrevAttempts, DaysPassed and CarLoan.
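To rank the features by their (linear) association with the target, we can sort the relevant column of the correlation matrix computed above:
In [ ]:
#Absolute correlation of each numeric feature with CarInsurance, strongest first
target_corr = correlations['CarInsurance'].drop('CarInsurance')
print(target_corr.abs().sort_values(ascending=False))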

Data Wrangling

Next up, we need to prepare our data for machine learning by taking care of two things:
1) Missing data - estimating values for the NaNs in our dataframe.
2) Extreme values - treating outliers appropriately.
We will also look at the categorical data for further information.

Checking for Missing values

We count the missing ("null") values in each of our columns:
In [14]:
df.apply(lambda x : sum(x.isnull()), axis = 0)
Out[14]:
Id                     0
Age                    0
Job                   19
Marital                0
Education            169
Default                0
Balance                0
HHInsurance            0
CarLoan                0
Communication        902
LastContactDay         0
LastContactMonth       0
NoOfContacts           0
DaysPassed             0
PrevAttempts           0
Outcome             3042
CallStart              0
CallEnd                0
CarInsurance           0
dtype: int64
The data is mostly complete, but four columns have missing values. We don't have the contact method for about a quarter of our customers, and a small number of customers lack job or education data. These gaps may not matter much because the columns are categorical - the absence of a category is unlikely to distort the other categories.
Most customers have no "Outcome" data. This could be due to poor data quality, or because these customers are new to the campaign. Given this ambiguity, we should treat the effect of this variable with suspicion.
No continuous variables have missing data, so we do not need to impute or interpolate anything.
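If we did want to fill the categorical gaps rather than leave them, mode imputation is a common baseline; a sketch for illustration only (not applied here - get_dummies below simply encodes missing values as all-zero rows):
In [ ]:
#Fill each categorical column with its most frequent value
#(done on a copy, so df itself is left unchanged)
imputed = df.copy()
for col in ['Job', 'Education', 'Communication']:
    imputed[col] = imputed[col].fillna(imputed[col].mode()[0])
print(imputed[['Job', 'Education', 'Communication']].isnull().sum())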

How should we treat the extreme values in Balance and NoOfContacts?

As we saw in the Distribution Analysis section, these variables are highly skewed, with quite a few extreme outliers. Instead of dropping the outliers, we can try a log transformation to make the distributions less skewed. However, taking the log of the raw data would produce invalid values (NaN), because Balance contains negative values. Since the minimum balance is -3058, we can add 4000 (for simplicity) to all balances to make them positive.
NoOfContacts, by contrast, has a minimum of 1, so it can be logged directly.

Log (Balance + 4000):

In [15]:
df['BalanceLog'] = np.log(df['Balance']+4000.0)

sns.distplot(df['BalanceLog'],kde=False, bins = 20 )

#Add labels
plt.title('Log of Customer Balance + 4000 Histogram, 20 Bins')
plt.xlabel('Log of Average Yearly Balance + 4000, in USD')
plt.ylabel('Customer Count')
Out[15]:
<matplotlib.text.Text at 0x129ec1e1160>
Even after transformation, our data for Balance is still highly skewed. We should keep this in mind.
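An alternative that avoids the arbitrary +4000 shift is a signed log transform, which compresses both tails while preserving the sign of the balance; a sketch (the experimental column is dropped again at the end):
In [ ]:
#Signed log: log1p of the magnitude, with the original sign restored
df['BalanceSignedLog'] = np.sign(df['Balance']) * np.log1p(df['Balance'].abs())

sns.distplot(df['BalanceSignedLog'], kde=False, bins=20)
plt.title('Signed Log of Customer Balance, 20 Bins')

#Remove the experimental column so the dataframe is unchanged
df = df.drop('BalanceSignedLog', axis=1)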

Log NoOfContacts:

In [16]:
df['NoOfContactsLog'] = np.log(df['NoOfContacts'])

sns.distplot(df['NoOfContactsLog'], kde=False, bins=20)

#Add labels
plt.title('Log of Number of Contacts Histogram, 20 Bins')
plt.xlabel('Log of Number of Contacts')
plt.ylabel('Customer Count')
Out[16]:
<matplotlib.text.Text at 0x129ec1b1e48>
Even after transformation, NoOfContacts is still highly skewed - taking the log does not help much - so I am removing both engineered columns from the dataset.
In [17]:
df = df.drop('NoOfContactsLog', axis =1)
df = df.drop('BalanceLog', axis =1)

Feature Engineering - Dummy Variables

In this dataset, there is a large number of categorical variables. Best practice is to convert them to dummy variables: dummies are easy to interpret, and the machine learning models we will use in Part 2 require numeric inputs.
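For reference, the whole encoding can be done in a single pass; a sketch (note that this version prefixes the dummy names with the source column, e.g. Job_management, unlike the unprefixed columns created step by step below):
In [ ]:
#One-shot alternative: encode all categorical columns at once
df_alt = pd.get_dummies(df, columns=['Outcome', 'Job', 'Marital',
                                     'Education', 'Communication',
                                     'LastContactMonth'])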

Creating dummies for Outcome

In [18]:
#Create a set of dummies for the Outcome variable
df_Outcome = pd.get_dummies(df['Outcome'])
In [19]:
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Outcome], axis = 1)
df.head()
Out[19]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... NoOfContacts DaysPassed PrevAttempts Outcome CallStart CallEnd CarInsurance failure other success
0 1 32 management single tertiary 0 1218 1 0 telephone ... 2 -1 0 NaN 13:45:20 13:46:30 0 0 0 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 5 -1 0 NaN 14:49:03 14:52:08 0 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 1 119 1 failure 16:30:24 16:36:04 1 1 0 0
3 4 25 student single primary 0 373 1 0 cellular ... 2 -1 0 NaN 12:06:43 12:20:22 1 0 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 1 -1 0 NaN 14:35:44 14:38:56 0 0 0 0
5 rows × 22 columns
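Note that rows with a missing Outcome get zeros in all three new columns. We can check that their count matches the 3042 missing values found earlier:
In [ ]:
#Rows where none of the three Outcome dummies is set
print((df[['failure', 'other', 'success']].sum(axis=1) == 0).sum())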

Creating Dummies for Job

In [20]:
#Create a set of dummies for the Job variable
df_Job = pd.get_dummies(df['Job'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Job], axis = 1)
df.head()
Out[20]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... blue-collar entrepreneur housemaid management retired self-employed services student technician unemployed
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 0 1 0 0 0 0 0 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 1 0 0 0 0 0 0 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 1 0 0 0 0 0 0
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 0 0 0 1 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 1 0 0 0 0 0 0
5 rows × 33 columns

Creating Dummies for Marital (Status)

In [21]:
#Create a set of dummies for the Marital variable
df_Marital = pd.get_dummies(df['Marital'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Marital], axis = 1)
df.head()
Out[21]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... management retired self-employed services student technician unemployed divorced married single
0 1 32 management single tertiary 0 1218 1 0 telephone ... 1 0 0 0 0 0 0 0 0 1
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 0 0 0 0 0 1 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 1 0 0 0 0 0 0 0 0 1
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 1 0 0 0 0 1
4 5 30 management married tertiary 0 2694 0 0 cellular ... 1 0 0 0 0 0 0 0 1 0
5 rows × 36 columns

Creating Dummies for Education

In [22]:
#Create a set of dummies for the Education variable
df_Education = pd.get_dummies(df['Education'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Education], axis = 1)
df.head()
Out[22]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... services student technician unemployed divorced married single primary secondary tertiary
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 0 0 0 0 1 0 0 1
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 0 0 1 0 1 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 0 0 0 1 0 0 1
3 4 25 student single primary 0 373 1 0 cellular ... 0 1 0 0 0 0 1 1 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 0 0 1 0 0 0 1
5 rows × 39 columns

Creating Dummies for Communication (method)

In [23]:
#Create a set of dummies for the Communication variable
df_Communication = pd.get_dummies(df['Communication'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Communication], axis = 1)
df.head()
Out[23]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... technician unemployed divorced married single primary secondary tertiary cellular telephone
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 0 0 1 0 0 1 0 1
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 1 0 1 0 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 0 1 0 0 1 1 0
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 1 1 0 0 1 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 1 0 0 0 1 1 0
5 rows × 41 columns

Creating Month Dummies

We also convert the month to dummy variables, since its relationship with the outcome is unlikely to be linear.
In [24]:
#Create a set of dummies for the Month variable
df_Month = pd.get_dummies(df['LastContactMonth'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Month], axis = 1)
df.head()
Out[24]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... dec feb jan jul jun mar may nov oct sep
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 1 0 0 0 0 0 0 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 0 0 0 1 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 0 1 0 0 0 0 0
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 0 0 1 0 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 0 1 0 0 0 0 0
5 rows × 53 columns
We want to drop the columns we have converted to dummies:
In [25]:
df = df.drop(['Outcome', 'Job', 'Marital', 'Education',
              'Communication', 'LastContactMonth'], axis=1)

Further Feature Engineering

We want to convert CallStart and CallEnd to DateTime objects. We also want to calculate the length of each call.

(Note: if we save our dataframe as a csv and read it back, these columns will need to be converted to DateTime objects again - see the reminder at the end.)

In [26]:
df.CallStart = pd.to_datetime(df.CallStart, infer_datetime_format =True )
df.CallEnd = pd.to_datetime(df.CallEnd, infer_datetime_format =True)
df['CallLength'] = df.CallEnd - df.CallStart

df['CallLength'].head()
Out[26]:
0   00:01:10
1   00:03:05
2   00:05:40
3   00:13:39
4   00:03:12
Name: CallLength, dtype: timedelta64[ns]
It might also be useful to create a business-hours category for when the customer received their last call: Business Hour if the call started between 9am and 5pm, Non-Business Hour otherwise.
First, let's convert the call lengths to seconds for easier analysis:
In [27]:
df['CallLength'] = df['CallLength'].dt.total_seconds()
In [28]:
df['CallLength'].head()
Out[28]:
0     70.0
1    185.0
2    340.0
3    819.0
4    192.0
Name: CallLength, dtype: float64
In [29]:
import datetime

def BusinessHour(x):
    #Compare only the time of day, so the (arbitrary) date part is ignored
    if datetime.time(9, 0) <= x.time() <= datetime.time(17, 0):
        return 1
    return 0

#Assign via [] - attribute assignment (df.BusinessHour = ...) would
#create an attribute on the object, not a real column
df['BusinessHour'] = df['CallStart'].apply(BusinessHour)

df['BusinessHour'].sample(20)
Out[29]:
3652    1
549     1
2639    1
245     1
1599    1
3149    1
453     1
2306    1
24      1
3801    1
1660    0
2087    1
2116    1
775     1
248     1
781     1
1302    1
170     1
1166    1
885     0
Name: BusinessHour, dtype: int64
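For reference, the same flag can be computed without apply, using a vectorised comparison on the time component (a sketch):
In [ ]:
from datetime import time

#Vectorised equivalent: 1 if the call started between 09:00 and 17:00
start_times = df['CallStart'].dt.time
df['BusinessHour'] = ((start_times >= time(9, 0)) &
                      (start_times <= time(17, 0))).astype(int)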

Saving the modified dataset

In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 49 columns):
Id                4000 non-null int64
Age               4000 non-null int64
Default           4000 non-null int64
Balance           4000 non-null int64
HHInsurance       4000 non-null int64
CarLoan           4000 non-null int64
LastContactDay    4000 non-null int64
NoOfContacts      4000 non-null int64
DaysPassed        4000 non-null int64
PrevAttempts      4000 non-null int64
CallStart         4000 non-null datetime64[ns]
CallEnd           4000 non-null datetime64[ns]
CarInsurance      4000 non-null int64
failure           4000 non-null uint8
other             4000 non-null uint8
success           4000 non-null uint8
admin.            4000 non-null uint8
blue-collar       4000 non-null uint8
entrepreneur      4000 non-null uint8
housemaid         4000 non-null uint8
management        4000 non-null uint8
retired           4000 non-null uint8
self-employed     4000 non-null uint8
services          4000 non-null uint8
student           4000 non-null uint8
technician        4000 non-null uint8
unemployed        4000 non-null uint8
divorced          4000 non-null uint8
married           4000 non-null uint8
single            4000 non-null uint8
primary           4000 non-null uint8
secondary         4000 non-null uint8
tertiary          4000 non-null uint8
cellular          4000 non-null uint8
telephone         4000 non-null uint8
apr               4000 non-null uint8
aug               4000 non-null uint8
dec               4000 non-null uint8
feb               4000 non-null uint8
jan               4000 non-null uint8
jul               4000 non-null uint8
jun               4000 non-null uint8
mar               4000 non-null uint8
may               4000 non-null uint8
nov               4000 non-null uint8
oct               4000 non-null uint8
sep               4000 non-null uint8
CallLength        4000 non-null float64
BusinessHour      4000 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(12), uint8(34)
memory usage: 601.6 KB
In [31]:
df.to_csv('carInsurance_train_mod.csv')

Remember to convert CallStart and CallEnd back to DateTime objects if we save our dataframe as a csv and read it again:

df.CallStart = pd.to_datetime(df.CallStart, infer_datetime_format=True)
df.CallEnd = pd.to_datetime(df.CallEnd, infer_datetime_format=True)
df['CallLength'] = df.CallEnd - df.CallStart
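Alternatively, the timestamps can be restored in one step when reading the file back (assuming the filename used above):
In [ ]:
#Re-read the saved file, parsing the call timestamps directly
df = pd.read_csv('carInsurance_train_mod.csv', index_col=0,
                 parse_dates=['CallStart', 'CallEnd'])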
