Saturday, 1 September 2018

Cold Calling for Car Insurance - Part 1: Data Visualisations and Feature Engineering

 Summary:

  • A US bank cold called 4000 customers to sell car insurance. The goal is to analyse customer data and data relating to the calls, and to build models that predict the success of cold calls.
  • No single customer feature strongly predicts whether a customer will buy car insurance.
  • The most significant features are (see Bivariate Correlation Analysis below):
    • 1) whether a customer has household insurance (negative correlation),
    • 2) whether previous attempts were made on a customer (positive correlation),
    • 3) how many days have passed since a customer was last contacted (positive correlation),
    • 4) whether a customer has a car loan (negative correlation).
  • I converted categorical data to dummy variables to make it easier for machine learning in Part 2 (see Feature Engineering - Dummy Variables below).
  • I created new features from existing ones (feature engineering) - call lengths and business hours. These are used to improve model performance and interpretation (see Further Feature Engineering below).

Car Insurance Cold Calls - Visual Analysis and Data Cleaning


The purpose of this project is to explore and analyse what factors affect cold calling success - a classification problem.
The data set is downloaded and available at https://www.kaggle.com/kondla/carinsurance/home (accessed 28/7/2018).
This project mostly follows the steps from the Analytics Vidhya tutorial "A Complete Tutorial to Learn Data Science with Python from Scratch" (link: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/).

Here is a brief description of the data from the data documentation:

Description

This is a dataset from one bank in the United States. Besides usual services, this bank also provides car insurance services. The bank organizes regular campaigns to attract new clients. The bank has potential customers’ data, and bank’s employees call them for advertising available car insurance options. We are provided with general information about clients (age, job, etc.) as well as more specific information about the current insurance sell campaign (communication, last contact day) and previous campaigns (attributes like previous attempts, outcome). You have data about 4000 customers who were contacted during the last campaign and for whom the results of campaign (did the customer buy insurance or not) are known.

Data Summary

Let's begin by importing some libraries. I am using Seaborn for better-looking charts.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Importing the data.
In [2]:
df = pd.read_csv('carInsurance_train.csv')
Taking a look at the top of the data and its general summary statistics:
In [3]:
df.head()
Out[3]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication LastContactDay LastContactMonth NoOfContacts DaysPassed PrevAttempts Outcome CallStart CallEnd CarInsurance
0 1 32 management single tertiary 0 1218 1 0 telephone 28 jan 2 -1 0 NaN 13:45:20 13:46:30 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN 26 may 5 -1 0 NaN 14:49:03 14:52:08 0
2 3 29 management single tertiary 0 637 1 0 cellular 3 jun 1 119 1 failure 16:30:24 16:36:04 1
3 4 25 student single primary 0 373 1 0 cellular 11 may 2 -1 0 NaN 12:06:43 12:20:22 1
4 5 30 management married tertiary 0 2694 0 0 cellular 3 jun 1 -1 0 NaN 14:35:44 14:38:56 0
In [4]:
df.describe()
Out[4]:
Id Age Default Balance HHInsurance CarLoan LastContactDay NoOfContacts DaysPassed PrevAttempts CarInsurance
count 4000.000000 4000.000000 4000.000000 4000.000000 4000.00000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000
mean 2000.500000 41.214750 0.014500 1532.937250 0.49275 0.133000 15.721250 2.607250 48.706500 0.717500 0.401000
std 1154.844867 11.550194 0.119555 3511.452489 0.50001 0.339617 8.425307 3.064204 106.685385 2.078647 0.490162
min 1.000000 18.000000 0.000000 -3058.000000 0.00000 0.000000 1.000000 1.000000 -1.000000 0.000000 0.000000
25% 1000.750000 32.000000 0.000000 111.000000 0.00000 0.000000 8.000000 1.000000 -1.000000 0.000000 0.000000
50% 2000.500000 39.000000 0.000000 551.500000 0.00000 0.000000 16.000000 2.000000 -1.000000 0.000000 0.000000
75% 3000.250000 49.000000 0.000000 1619.000000 1.00000 0.000000 22.000000 3.000000 -1.000000 0.000000 1.000000
max 4000.000000 95.000000 1.000000 98417.000000 1.00000 1.000000 31.000000 43.000000 854.000000 58.000000 1.000000
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 19 columns):
Id                  4000 non-null int64
Age                 4000 non-null int64
Job                 3981 non-null object
Marital             4000 non-null object
Education           3831 non-null object
Default             4000 non-null int64
Balance             4000 non-null int64
HHInsurance         4000 non-null int64
CarLoan             4000 non-null int64
Communication       3098 non-null object
LastContactDay      4000 non-null int64
LastContactMonth    4000 non-null object
NoOfContacts        4000 non-null int64
DaysPassed          4000 non-null int64
PrevAttempts        4000 non-null int64
Outcome             958 non-null object
CallStart           4000 non-null object
CallEnd             4000 non-null object
CarInsurance        4000 non-null int64
dtypes: int64(11), object(8)
memory usage: 593.8+ KB
Using the .describe and .info methods, we can see that there are 4000 rows of data. However, not all fields are complete - several have missing values. Whether these need to be imputed depends on which charts and machine learning models we run the data through.
Also note that several columns are non-numerical (non-null object) types.
Of the 4000 rows, only 958 have a recorded Outcome from a previous campaign.

Distribution Analysis

Let's take a look at the average yearly bank balance of the call recipients.
In [6]:
sns.distplot(df['Balance'].dropna(),kde=False, bins = 20 )

#Add labels
plt.title('Customer Balance Histogram, 20 Bins')
plt.xlabel('Average Yearly Balance, in USD')
plt.ylabel('Customer Count')
Out[6]:
<matplotlib.text.Text at 0x129e870ae48>
As we can see, most customers have a pretty low bank balance, probably because the balance is net of overdraft or credit card debt.
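We can also quantify this asymmetry directly; a quick check using the pandas skew method (a positive value indicates a long right tail):
In [ ]:
#Sample skewness of the balance distribution
print(df['Balance'].skew())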
How about the number of calls the customers were getting?
In [7]:
sns.countplot(df['NoOfContacts'].dropna())

#Add labels
plt.title('Number of Calls Customers Received')
plt.xlabel('Number of Calls')
plt.ylabel('Customer Count')
Out[7]:
<matplotlib.text.Text at 0x129e888ec18>
As we can see, most customers received only a few calls - in fact, half of them received just one or two. This might bias our machine learning results and make the estimated effect of the number of calls statistically insignificant. We should keep this in mind.
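We can confirm the "one or two calls" claim from the quartiles (consistent with df.describe() above):
In [ ]:
#25th, 50th and 75th percentiles of the number of contacts
print(df['NoOfContacts'].quantile([0.25, 0.5, 0.75]))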
I suspect that those who subscribed to car insurance (the successes) have, on average, a higher average yearly balance (and hence more money to spend).
Let's take a look at the distribution of bank balance by car insurance subscription:
In [8]:
#Set figure size
plt.figure(figsize = [5,15])

sns.boxplot(x=df.CarInsurance, y=df.Balance)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x129e8854e48>
The average balance of customers who were successfully sold car insurance is only slightly higher, and their balances show greater variance. It looks like average balance is not as decisive as hypothesised.
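To put numbers on the boxplot, we can compare summary statistics of Balance by outcome (a quick sketch):
In [ ]:
#Mean, spread and quartiles of Balance, split by sale outcome
print(df.groupby('CarInsurance')['Balance'].describe())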
Does the number of contacts have an effect?
In [9]:
#Set figure size
plt.figure(figsize = [5,15])

sns.boxplot(x=df.CarInsurance, y=df.NoOfContacts)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x129eba14438>
It looks like there is almost no effect at all! So what might affect a customer's decision to purchase car insurance?
Notice that in this section we are analysing only one continuous variable at a time.

Categorical Variable Analysis

To analyse the effects of categorical variables such as HHInsurance and CarLoan, we can use pivot tables.
HHInsurance:
In [10]:
pivot_table1 = df.pivot_table(values = 'CarInsurance', index = 'HHInsurance',aggfunc= 'mean')

print('Probability of getting car insurance by household insurance category: \n')
print(pivot_table1)
Probability of getting car insurance by household insurance category: 

             CarInsurance
HHInsurance              
0                0.499261
1                0.299848
So, it looks like people who don't already have household insurance are more likely to purchase car insurance.
CarLoan:
In [11]:
pivot_table2 = df.pivot_table(values = 'CarInsurance', index = 'CarLoan', aggfunc = 'mean')

print('Probability of getting car insurance by car loan status: \n')
print(pivot_table2)
Probability of getting car insurance by car loan status: 

         CarInsurance
CarLoan              
0            0.418685
1            0.285714
Again, we can see that people without a car loan are considerably more likely to purchase car insurance through the cold calling campaign.
It seems we have some clear indications as to which bank customers are most responsive to our cold calling campaign. Let's visualise these pivot tables:
In [12]:
fig, axes = plt.subplots(1,2, figsize = (15,5))

pivot_table1.plot(kind = 'bar',legend=False, ax = axes[0])
axes[0].set_ylabel('Probability')
axes[0].set_title('Probability of Purchasing Car Insurance by Household Insurance Status')
axes[0].set_ylim([0,0.6])

pivot_table2.plot(kind = 'bar',legend=False, ax = axes[1])
axes[1].set_ylabel('Probability')
axes[1].set_title('Probability of Purchasing Car Insurance by Car Loan Status')
axes[1].set_ylim([0,0.6])
Out[12]:
(0, 0.6)
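For reference, pd.crosstab gives the same conversion rates in one call; normalize='index' turns counts into within-row proportions (a sketch):
In [ ]:
#Conversion rate by car loan status; each row sums to 1
print(pd.crosstab(df['CarLoan'], df['CarInsurance'], normalize='index'))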

Bivariate Correlation Analysis

We want to see whether any predictors/features are strongly correlated with each other, and also identify the most important variables. A correlation heatmap kills two birds with one stone.
In [13]:
correlations = df.corr()
correlations = correlations.drop('Id', axis =1).drop('Id', axis =0)
sns.heatmap(correlations, annot = True, linecolor = 'blue')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x129ebe062e8>
The features are not strongly correlated with each other, with the exception of DaysPassed and PrevAttempts, which makes sense: DaysPassed is -1 whenever PrevAttempts is 0, and both take positive values otherwise.
The most significant features with respect to CarInsurance, in order of magnitude, are: HHInsurance, PrevAttempts, DaysPassed and CarLoan.
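To rank the features by their (linear) association with the target, we can sort the relevant column of the correlation matrix computed above:
In [ ]:
#Absolute correlation of each numeric feature with CarInsurance, strongest first
target_corr = correlations['CarInsurance'].drop('CarInsurance')
print(target_corr.abs().sort_values(ascending=False))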

Data Wrangling

Next up, we need to prepare our data for machine learning by taking care of two things:
1) Missing data - estimating values for the NaNs in our dataframe.
2) Extreme values - treating outliers appropriately.
We will also look at the categorical data for further information.

Checking for Missing values

We count the missing ("null") values in each of our columns:
In [14]:
df.apply(lambda x : sum(x.isnull()), axis = 0)
Out[14]:
Id                     0
Age                    0
Job                   19
Marital                0
Education            169
Default                0
Balance                0
HHInsurance            0
CarLoan                0
Communication        902
LastContactDay         0
LastContactMonth       0
NoOfContacts           0
DaysPassed             0
PrevAttempts           0
Outcome             3042
CallStart              0
CallEnd                0
CarInsurance           0
dtype: int64
The data is mostly complete, but four columns have missing values. We don't have the contact method for about a quarter of our customers, and a small number of customers lack job or education data. These gaps may not matter much because the columns are categorical - the absence of a category is unlikely to distort the other categories.
Most customers have no "Outcome" data. This could be due to poor data quality, or because these customers are new to the campaign. Given this ambiguity, we should treat the effect of this variable with suspicion.
No continuous variables have missing data, so we do not need to impute or interpolate anything.
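If we did want to fill the categorical gaps rather than leave them, mode imputation is a common baseline; a sketch for illustration only (not applied here - get_dummies below simply encodes missing values as all-zero rows):
In [ ]:
#Fill each categorical column with its most frequent value
#(done on a copy, so df itself is left unchanged)
imputed = df.copy()
for col in ['Job', 'Education', 'Communication']:
    imputed[col] = imputed[col].fillna(imputed[col].mode()[0])
print(imputed[['Job', 'Education', 'Communication']].isnull().sum())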

How should we treat the extreme values in Balance and NoOfContacts?

As we saw in the Distribution Analysis section, these variables are highly skewed, with quite a few extreme outliers. Instead of dropping the outliers, we can try a log transformation to make the distributions less skewed. However, taking the log of the raw data would produce invalid values (NaN), because Balance contains negative values. Since the minimum balance is -3058, we can add 4000 (for simplicity) to all balances to make them positive.
NoOfContacts, by contrast, has a minimum of 1, so it can be logged directly.

Log (Balance + 4000):

In [15]:
df['BalanceLog'] = np.log(df['Balance']+4000.0)

sns.distplot(df['BalanceLog'],kde=False, bins = 20 )

#Add labels
plt.title('Log of Customer Balance + 4000 Histogram, 20 Bins')
plt.xlabel('Log of Average Yearly Balance + 4000, in USD')
plt.ylabel('Customer Count')
Out[15]:
<matplotlib.text.Text at 0x129ec1e1160>
Even after transformation, our data for Balance is still highly skewed. We should keep this in mind.
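An alternative that avoids the arbitrary +4000 shift is a signed log transform, which compresses both tails while preserving the sign of the balance; a sketch (the experimental column is dropped again at the end):
In [ ]:
#Signed log: log1p of the magnitude, with the original sign restored
df['BalanceSignedLog'] = np.sign(df['Balance']) * np.log1p(df['Balance'].abs())

sns.distplot(df['BalanceSignedLog'], kde=False, bins=20)
plt.title('Signed Log of Customer Balance, 20 Bins')

#Remove the experimental column so the dataframe is unchanged
df = df.drop('BalanceSignedLog', axis=1)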

Log NoOfContacts:

In [16]:
df['NoOfContactsLog'] = np.log(df['NoOfContacts'])

sns.distplot(df['NoOfContactsLog'], kde=False, bins=20)

#Add labels
plt.title('Log of Number of Contacts Histogram, 20 Bins')
plt.xlabel('Log of Number of Contacts')
plt.ylabel('Customer Count')
Out[16]:
<matplotlib.text.Text at 0x129ec1b1e48>
Even after transformation, NoOfContacts is still highly skewed - taking the log does not help much - so I am removing both engineered columns from the dataset.
In [17]:
df = df.drop('NoOfContactsLog', axis =1)
df = df.drop('BalanceLog', axis =1)

Feature Engineering - Dummy Variables

In this dataset, there is a large number of categorical variables. Best practice is to convert them to dummy variables: dummies are easy to interpret, and the machine learning models we will use in Part 2 require numeric inputs.
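For reference, the whole encoding can be done in a single pass; a sketch (note that this version prefixes the dummy names with the source column, e.g. Job_management, unlike the unprefixed columns created step by step below):
In [ ]:
#One-shot alternative: encode all categorical columns at once
df_alt = pd.get_dummies(df, columns=['Outcome', 'Job', 'Marital',
                                     'Education', 'Communication',
                                     'LastContactMonth'])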

Creating dummies for Outcome

In [18]:
#Create a set of dummies for the Outcome variable
df_Outcome = pd.get_dummies(df['Outcome'])
In [19]:
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Outcome], axis = 1)
df.head()
Out[19]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... NoOfContacts DaysPassed PrevAttempts Outcome CallStart CallEnd CarInsurance failure other success
0 1 32 management single tertiary 0 1218 1 0 telephone ... 2 -1 0 NaN 13:45:20 13:46:30 0 0 0 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 5 -1 0 NaN 14:49:03 14:52:08 0 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 1 119 1 failure 16:30:24 16:36:04 1 1 0 0
3 4 25 student single primary 0 373 1 0 cellular ... 2 -1 0 NaN 12:06:43 12:20:22 1 0 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 1 -1 0 NaN 14:35:44 14:38:56 0 0 0 0
5 rows × 22 columns
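Note that rows with a missing Outcome get zeros in all three new columns. We can check that their count matches the 3042 missing values found earlier:
In [ ]:
#Rows where none of the three Outcome dummies is set
print((df[['failure', 'other', 'success']].sum(axis=1) == 0).sum())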

Creating Dummies for Job

In [20]:
#Create a set of dummies for the Job variable
df_Job = pd.get_dummies(df['Job'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Job], axis = 1)
df.head()
Out[20]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... blue-collar entrepreneur housemaid management retired self-employed services student technician unemployed
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 0 1 0 0 0 0 0 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 1 0 0 0 0 0 0 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 1 0 0 0 0 0 0
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 0 0 0 1 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 1 0 0 0 0 0 0
5 rows × 33 columns

Creating Dummies for Marital (Status)

In [21]:
#Create a set of dummies for the Marital variable
df_Marital = pd.get_dummies(df['Marital'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Marital], axis = 1)
df.head()
Out[21]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... management retired self-employed services student technician unemployed divorced married single
0 1 32 management single tertiary 0 1218 1 0 telephone ... 1 0 0 0 0 0 0 0 0 1
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 0 0 0 0 0 1 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 1 0 0 0 0 0 0 0 0 1
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 1 0 0 0 0 1
4 5 30 management married tertiary 0 2694 0 0 cellular ... 1 0 0 0 0 0 0 0 1 0
5 rows × 36 columns

Creating Dummies for Education

In [22]:
#Create a set of dummies for the Education variable
df_Education = pd.get_dummies(df['Education'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Education], axis = 1)
df.head()
Out[22]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... services student technician unemployed divorced married single primary secondary tertiary
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 0 0 0 0 1 0 0 1
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 0 0 1 0 1 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 0 0 0 1 0 0 1
3 4 25 student single primary 0 373 1 0 cellular ... 0 1 0 0 0 0 1 1 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 0 0 1 0 0 0 1
5 rows × 39 columns

Creating Dummies for Communication (method)

In [23]:
#Create a set of dummies for the Communication variable
df_Communication = pd.get_dummies(df['Communication'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Communication], axis = 1)
df.head()
Out[23]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... technician unemployed divorced married single primary secondary tertiary cellular telephone
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 0 0 1 0 0 1 0 1
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 1 0 1 0 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 0 1 0 0 1 1 0
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 1 1 0 0 1 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 1 0 0 0 1 1 0
5 rows × 41 columns

Creating Month Dummies

We also convert the month to dummy variables, since its relationship with the outcome is unlikely to be linear.
In [24]:
#Create a set of dummies for the Month variable
df_Month = pd.get_dummies(df['LastContactMonth'])

#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Month], axis = 1)
df.head()
Out[24]:
Id Age Job Marital Education Default Balance HHInsurance CarLoan Communication ... dec feb jan jul jun mar may nov oct sep
0 1 32 management single tertiary 0 1218 1 0 telephone ... 0 0 1 0 0 0 0 0 0 0
1 2 32 blue-collar married primary 0 1156 1 0 NaN ... 0 0 0 0 0 0 1 0 0 0
2 3 29 management single tertiary 0 637 1 0 cellular ... 0 0 0 0 1 0 0 0 0 0
3 4 25 student single primary 0 373 1 0 cellular ... 0 0 0 0 0 0 1 0 0 0
4 5 30 management married tertiary 0 2694 0 0 cellular ... 0 0 0 0 1 0 0 0 0 0
5 rows × 53 columns
We want to drop the columns we have converted to dummies:
In [25]:
df = df.drop(['Outcome', 'Job', 'Marital', 'Education',
              'Communication', 'LastContactMonth'], axis=1)

Further Feature Engineering

We want to convert CallStart and CallEnd to DateTime objects. We also want to calculate the length of each call.

(Note: if we save our dataframe as a csv and read it back, these columns will need to be converted to DateTime objects again - see the reminder at the end.)

In [26]:
df.CallStart = pd.to_datetime(df.CallStart, infer_datetime_format =True )
df.CallEnd = pd.to_datetime(df.CallEnd, infer_datetime_format =True)
df['CallLength'] = df.CallEnd - df.CallStart

df['CallLength'].head()
Out[26]:
0   00:01:10
1   00:03:05
2   00:05:40
3   00:13:39
4   00:03:12
Name: CallLength, dtype: timedelta64[ns]
It might also be useful to create a business-hours category for when the customer received their last call: Business Hour if the call started between 9am and 5pm, Non-Business Hour otherwise.
First, let's convert the call lengths to seconds for easier analysis:
In [27]:
df['CallLength'] = df['CallLength'].dt.total_seconds()
In [28]:
df['CallLength'].head()
Out[28]:
0     70.0
1    185.0
2    340.0
3    819.0
4    192.0
Name: CallLength, dtype: float64
In [29]:
import datetime

def BusinessHour(x):
    #Compare only the time of day, so the (arbitrary) date part is ignored
    if datetime.time(9, 0) <= x.time() <= datetime.time(17, 0):
        return 1
    return 0

#Assign via [] - attribute assignment (df.BusinessHour = ...) would
#create an attribute on the object, not a real column
df['BusinessHour'] = df['CallStart'].apply(BusinessHour)

df['BusinessHour'].sample(20)
Out[29]:
3652    1
549     1
2639    1
245     1
1599    1
3149    1
453     1
2306    1
24      1
3801    1
1660    0
2087    1
2116    1
775     1
248     1
781     1
1302    1
170     1
1166    1
885     0
Name: BusinessHour, dtype: int64
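For reference, the same flag can be computed without apply, using a vectorised comparison on the time component (a sketch):
In [ ]:
from datetime import time

#Vectorised equivalent: 1 if the call started between 09:00 and 17:00
start_times = df['CallStart'].dt.time
df['BusinessHour'] = ((start_times >= time(9, 0)) &
                      (start_times <= time(17, 0))).astype(int)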

Saving the modified dataset

In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 49 columns):
Id                4000 non-null int64
Age               4000 non-null int64
Default           4000 non-null int64
Balance           4000 non-null int64
HHInsurance       4000 non-null int64
CarLoan           4000 non-null int64
LastContactDay    4000 non-null int64
NoOfContacts      4000 non-null int64
DaysPassed        4000 non-null int64
PrevAttempts      4000 non-null int64
CallStart         4000 non-null datetime64[ns]
CallEnd           4000 non-null datetime64[ns]
CarInsurance      4000 non-null int64
failure           4000 non-null uint8
other             4000 non-null uint8
success           4000 non-null uint8
admin.            4000 non-null uint8
blue-collar       4000 non-null uint8
entrepreneur      4000 non-null uint8
housemaid         4000 non-null uint8
management        4000 non-null uint8
retired           4000 non-null uint8
self-employed     4000 non-null uint8
services          4000 non-null uint8
student           4000 non-null uint8
technician        4000 non-null uint8
unemployed        4000 non-null uint8
divorced          4000 non-null uint8
married           4000 non-null uint8
single            4000 non-null uint8
primary           4000 non-null uint8
secondary         4000 non-null uint8
tertiary          4000 non-null uint8
cellular          4000 non-null uint8
telephone         4000 non-null uint8
apr               4000 non-null uint8
aug               4000 non-null uint8
dec               4000 non-null uint8
feb               4000 non-null uint8
jan               4000 non-null uint8
jul               4000 non-null uint8
jun               4000 non-null uint8
mar               4000 non-null uint8
may               4000 non-null uint8
nov               4000 non-null uint8
oct               4000 non-null uint8
sep               4000 non-null uint8
CallLength        4000 non-null float64
BusinessHour      4000 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(12), uint8(34)
memory usage: 601.6 KB
In [31]:
df.to_csv('carInsurance_train_mod.csv')

Remember to convert CallStart and CallEnd back to DateTime objects if we save our dataframe as a csv and read it again:

df.CallStart = pd.to_datetime(df.CallStart, infer_datetime_format=True)
df.CallEnd = pd.to_datetime(df.CallEnd, infer_datetime_format=True)
df['CallLength'] = df.CallEnd - df.CallStart
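Alternatively, the timestamps can be restored in one step when reading the file back (assuming the filename used above):
In [ ]:
#Re-read the saved file, parsing the call timestamps directly
df = pd.read_csv('carInsurance_train_mod.csv', index_col=0,
                 parse_dates=['CallStart', 'CallEnd'])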
