- A US bank cold called 4000 customers to sell car insurance. The goal is to analyse the customer data and data relating to the calls to come up with models that predict the success of cold calls.
- There is no single feature of a customer that strongly predicts whether they would buy car insurance.
- The most significant features are:
- 1) whether a customer has household insurance (negative correlation),
- 2) whether previous attempts were made on a customer (positive correlation),
- 3) how many days have passed since a customer was last contacted (positive correlation),
- 4) whether a customer has a car loan (negative correlation). (See the Correlation Analysis section below.)
- I converted categorical data to dummy variables to make machine learning in Part 2 easier. (See the dummy variable sections below.)
- I created new features from existing features (feature engineering): call lengths and business hours. These are used to improve model performance and interpretation. (See the Further Feature Engineering section below.)
Car Insurance Cold Calls¶
The purpose of this project is to explore and analyse what factors affect cold calling success - a classification problem. The data set is available at https://www.kaggle.com/kondla/carinsurance/home (accessed 28/7/2018).
This project mostly follows the steps of the Analytics Vidhya tutorial "A Complete Tutorial to Learn Data Science with Python from Scratch" (link: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/).
Here is a brief description of the data from the data documentation:¶
Description¶
This is a dataset from one bank in the United States. Besides usual services, this bank also provides car insurance services. The bank organizes regular campaigns to attract new clients. The bank has potential customers’ data, and bank’s employees call them for advertising available car insurance options. We are provided with general information about clients (age, job, etc.) as well as more specific information about the current insurance sell campaign (communication, last contact day) and previous campaigns (attributes like previous attempts, outcome). You have data about 4000 customers who were contacted during the last campaign and for whom the results of the campaign (did the customer buy insurance or not) are known.
Data Summary¶
Let's begin by importing some libraries. I am using Seaborn for better looking charts.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Importing the data.
In [2]:
df = pd.read_csv('carInsurance_train.csv')
Taking a look at the top of the data and general summary statistics of the data:
In [3]:
df.head()
Out[3]:
In [4]:
df.describe()
Out[4]:
In [5]:
df.info()
Using the .describe and .info methods, we can see that there are 4000 lines of data. However, not all fields are completely filled: several fields have missing data. These missing values may or may not need to be interpolated, depending on the charts or machine learning models we run them through.
Also note that there are many non-numerical (non-null object) types present in the data.
There are 4000 lines of data but only 958 have clear outcomes.
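As a quick sanity check, we could count the customers whose Outcome field is actually filled in:
In [ ]:
# Count the customers with a recorded outcome from a previous campaign (expected: 958)
df['Outcome'].notnull().sum()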
Distribution Analysis¶
Let's take a look at the average yearly bank balance of the call recipients.
In [6]:
sns.distplot(df['Balance'].dropna(),kde=False, bins = 20 )
#Add labels
plt.title('Customer Balance Histogram, 20 Bins')
plt.xlabel('Average Yearly Balance, in USD')
plt.ylabel('Customer Count')
Out[6]:
As we can see, most customers have a pretty low bank balance, probably because the balance is net of overdraft or credit card debt.
How about the number of calls the customers were getting?
In [7]:
sns.countplot(df['NoOfContacts'].dropna())
#Add labels
plt.title('Number of Calls Customers Received')
plt.xlabel('Number of Calls')
plt.ylabel('Customer Count')
Out[7]:
As we can see, most customers received only a few calls; in fact, about half of them received only one or two. This might bias our machine learning results and make the estimated effect of the number of calls statistically insignificant. We should keep this in mind.
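To put a rough number on this, we could look at the share of customers at each contact count:
In [ ]:
# Share of customers by number of calls received during this campaign
df['NoOfContacts'].value_counts(normalize=True).sort_index().head()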
I suspect that those who have subscribed to car insurance (the successes) have a higher average yearly balance (and hence more money to spend).
Let's take a look at the distribution of bank balance by car insurance subscription:
In [8]:
#Set figure size
plt.figure(figsize = [5,15])
sns.boxplot(x=df.CarInsurance, y=df.Balance)
Out[8]:
The average balance of customers who were successfully sold car insurance is only slightly higher, and their balances show a greater variance. It looks like average balance is not as decisive as hypothesised.
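To read the same comparison numerically, we could group the balances by the outcome:
In [ ]:
# Summary statistics of Balance split by whether the customer bought car insurance
df.groupby('CarInsurance')['Balance'].describe()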
Does the number of contacts have an effect?
In [9]:
#Set figure size
plt.figure(figsize = [5,15])
sns.boxplot(x=df.CarInsurance, y=df.NoOfContacts)
Out[9]:
It looks like there is almost no effect at all! So what might affect a customer's decision to purchase car insurance?
Notice that in this section we are analysing only one continuous variable at a time.
Categorical Variable Analysis¶
To analyse the effects of categorical variables such as HHInsurance and CarLoan, we can use pivot tables.
HHInsurance:
In [10]:
pivot_table1 = df.pivot_table(values = 'CarInsurance', index = 'HHInsurance',aggfunc= 'mean')
print('Probability of getting car insurance by household insurance category: \n')
print(pivot_table1)
So, it looks like people who don't already have household insurance are more likely to purchase car insurance.
CarLoan:
In [11]:
pivot_table2 = df.pivot_table(values = 'CarInsurance', index = 'CarLoan', aggfunc = 'mean')
print('Probability of getting car insurance by car loan category: \n')
print(pivot_table2)
Again, we can see that people who do not have a car loan are far more likely to purchase car insurance from the cold calling campaign.
It seems we have some clear indications as to which sets of bank customers are most responsive to our cold calling campaign. Let's visualise these pivot tables:
In [12]:
fig, axes = plt.subplots(1,2, figsize = (15,5))
pivot_table1.plot(kind = 'bar',legend=False, ax = axes[0])
axes[0].set_ylabel('Probability')
axes[0].set_title('Probability of Purchasing Car Insurance by Household Insurance Status')
axes[0].set_ylim([0,0.6])
pivot_table2.plot(kind = 'bar',legend=False, ax = axes[1])
axes[1].set_ylabel('Probability')
axes[1].set_title('Probability of Purchasing Car Insurance by Car Loan Status')
axes[1].set_ylim([0,0.6])
Out[12]:
Bivariate Correlation Analysis¶
We want to see if any predictors/features are significantly correlated with each other, and to identify the most important variables. A correlation heatmap allows us to kill two birds with one stone.
In [13]:
correlations = df.corr()
correlations = correlations.drop('Id', axis =1).drop('Id', axis =0)
sns.heatmap(correlations, annot = True, linecolor = 'blue')
Out[13]:
The features are not strongly correlated with each other, with the exception of DaysPassed and PrevAttempts, which makes sense: DaysPassed is -1 when PrevAttempts is 0, and for everything else we would see positive numbers in both.
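We could verify this relationship directly:
In [ ]:
# Customers never contacted before should have DaysPassed == -1 and PrevAttempts == 0;
# both lines should print True if the relationship described above holds.
never_contacted = df['PrevAttempts'] == 0
print((df.loc[never_contacted, 'DaysPassed'] == -1).all())
print((df.loc[~never_contacted, 'DaysPassed'] > 0).all())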
The most significant features, in order of magnitude, are: HHInsurance, PrevAttempts, DaysPassed and CarLoan.
Data Wrangling¶
Next up, we need to prepare our data for machine learning by taking care of two things: 1) missing data - estimating values for all the NaNs in our dataframe; 2) extreme values - outliers need to be treated appropriately.
We will also look at categorical data for further information.
Checking for Missing values¶
We add up the missing ("null") values for each of our columns:
In [14]:
df.apply(lambda x : sum(x.isnull()), axis = 0)
Out[14]:
The data is almost complete. However, we have three variables/categories with missing data. We don't have data on the contact method for about a quarter of our customers, and a small number of customers have no job or education data. These missing data may not be significant because they are categorical: the absence of a category is unlikely to affect the result or the other categories.
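If we wanted these missing categories to show up explicitly later on, one option (not applied here) would be to fill them with a placeholder label before creating dummy variables; a minimal sketch, using an illustrative copy called df_filled:
In [ ]:
# Work on a copy so the main dataframe is left untouched
df_filled = df.copy()
for col in ['Job', 'Education', 'Communication']:
    # Replace missing categorical values with an explicit 'unknown' label
    df_filled[col] = df_filled[col].fillna('unknown')
df_filled[['Job', 'Education', 'Communication']].isnull().sum()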
Most customers have no "Outcome" data. This can be due either to poor data quality or to these customers being new to the campaign. Given this ambiguity, we need to treat the effect of this variable with suspicion.
No continuous variables have missing data, so we do not need to impute or interpolate any data.
How should we treat the extreme values in Balance and NoOfContacts?¶
As we saw in the Distribution Analysis section, these variables are highly skewed, with quite a few extreme outliers. Instead of dropping the outliers, we can try a log transformation to make the distribution less skewed. However, if we try to log the raw data, we will get invalid values because of the negative balances. Since the minimum balance is -3058, we can add 4000 (for simplicity) to all balances to make them positive. NoOfContacts contains only positive values, so it can be logged directly.
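For reference, we could check the most negative balance to see why a shift of 4000 is enough:
In [ ]:
# The most negative balance in the data; adding 4000 makes every value positive
df['Balance'].min()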
Log (Balance + 4000):¶
In [15]:
df['BalanceLog'] = np.log(df['Balance']+4000.0)
sns.distplot(df['BalanceLog'],kde=False, bins = 20 )
#Add labels
plt.title('Log of Customer Balance + 4000 Histogram, 20 Bins')
plt.xlabel('Log of Average Yearly Balance + 4000, in USD')
plt.ylabel('Customer Count')
Out[15]:
Even after transformation, our data for Balance is still highly skewed. We should keep this in mind.
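One way to quantify this is to compare the skewness before and after the transformation:
In [ ]:
# Skewness of the raw balance versus the log-transformed balance
print(df['Balance'].skew())
print(df['BalanceLog'].skew())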
Log NoOfContacts:¶
In [16]:
df['NoOfContactsLog'] = np.log(df['NoOfContacts'])
sns.distplot(df['NoOfContactsLog'], kde=False, bins = 20)
#Add labels
plt.title('Log of Number of Contacts Histogram, 20 Bins')
plt.xlabel('Log of Number of Contacts')
plt.ylabel('Customer Count')
Out[16]:
Even after transformation, our data for NoOfContacts is still highly skewed - taking the log does not really help - so I am removing these transformed columns from the dataset.
In [17]:
df = df.drop('NoOfContactsLog', axis =1)
df = df.drop('BalanceLog', axis =1)
Creating Dummies for Outcome¶
In [18]:
#Create a set of dummies for the Outcome variable
df_Outcome = pd.get_dummies(df['Outcome'])
In [19]:
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Outcome], axis = 1)
df.head()
Out[19]:
Creating Dummies for Job¶
In [20]:
#Create a set of dummies for the Job variable
df_Job = pd.get_dummies(df['Job'])
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Job], axis = 1)
df.head()
Out[20]:
Creating Dummies for Marital (Status)¶
In [21]:
#Create a set of dummies for the Marital variable
df_Marital = pd.get_dummies(df['Marital'])
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Marital], axis = 1)
df.head()
Out[21]:
Creating Dummies for Education¶
In [22]:
#Create a set of dummies for the Education variable
df_Education = pd.get_dummies(df['Education'])
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Education], axis = 1)
df.head()
Out[22]:
Creating Dummies for Communication (method)¶
In [23]:
#Create a set of dummies for the Communication variable
df_Communication = pd.get_dummies(df['Communication'])
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Communication], axis = 1)
df.head()
Out[23]:
Creating Month Dummies¶
We are converting the month to dummy variables since the relationship is unlikely to be linear.
In [24]:
#Create a set of dummies for the Month variable
df_Month = pd.get_dummies(df['LastContactMonth'])
#Join the dummy variable to the main dataframe
df = pd.concat([df, df_Month], axis = 1)
df.head()
Out[24]:
We want to drop the columns we have converted to dummies:
In [25]:
df = df.drop('Outcome',axis = 1)
df = df.drop('Job',axis = 1)
df = df.drop('Marital',axis = 1)
df = df.drop('Education',axis = 1)
df = df.drop('Communication',axis = 1)
df = df.drop('LastContactMonth',axis = 1)
Further Feature Engineering:¶
We want to convert CallStart and CallEnd to DateTime objects. We also want to calculate the length of the last call.
Note that these need to be converted to DateTime objects again if we save our dataframe as a CSV and read it back in.¶
In [26]:
import datetime as dt
df.CallStart = pd.to_datetime(df.CallStart, infer_datetime_format =True )
df.CallEnd = pd.to_datetime(df.CallEnd, infer_datetime_format =True)
df['CallLength'] = df.CallEnd - df.CallStart
df['CallLength'].head()
Out[26]:
It might also be useful to create a category called BusinessHour based on when the customer last received their call: we can call 9am to 5pm business hours, and anything else non-business hours.
We also want to convert the call lengths to seconds for easier analysis.
In [27]:
df['CallLength'] = df['CallLength'].dt.total_seconds()
In [28]:
df['CallLength'].head()
Out[28]:
In [29]:
def BusinessHour(x):
    #Return 1 if the call started between 9am and 5pm, 0 otherwise.
    #pd.Timestamp('09:00:00') resolves to today's date at 9am, matching the
    #dates pd.to_datetime assigned to CallStart above.
    if x >= pd.Timestamp('09:00:00') and x <= pd.Timestamp('17:00:00'):
        x = 1
    else:
        x = 0
    return x
df['BusinessHour'] = df.CallStart.apply(BusinessHour)
df['BusinessHour'].sample(20)
Out[29]:
Saving the modified dataset¶
In [30]:
df.info()
In [31]:
df.to_csv('carInsurance_train_mod.csv')
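As noted above, when this file is read back in the date columns need to be re-parsed; a minimal sketch of how this could be done (df_reloaded is just an illustrative name):
In [ ]:
# Re-read the saved CSV, restoring the saved index and re-parsing the datetime columns
df_reloaded = pd.read_csv('carInsurance_train_mod.csv', index_col=0,
                          parse_dates=['CallStart', 'CallEnd'])
df_reloaded[['CallStart', 'CallEnd']].dtypes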