Saturday, 1 September 2018

Cold Calling for Car Insurance - Part 2 Machine Learning Predictions

Summary:


  • Using machine learning techniques, I created models that could predict the success/failure of cold calling for car insurance with an accuracy of around 85%.
  • I splitted 1/4 of the data as a test set, since there was no independent test set loaded to Kaggle.
  • Train Test Split
  • I evaluated my models using K-fold cross valiations and the test data set.
  • I used GridSearchCV to search for better parameters in my Support Vector Machine and Random Forest Models (parameter tuning).
  • The tuned Random Forest model did the best on K-fold cross validation with 84.8% accuracy.
  • The tuned Support Vector Machine model did the best on the test set at 86% precision and recall.
  • The Light GBM model marginally came 2nd in accuracy with a K-fold cross validation accuracy of 84.7%. However, it was faster to run compare to the Random Forest model, taking seconds instead of minutes.

Car Insurance Cold Calls - Machine Learning Models

Car Insurance Cold Calls

Naive Bayes

The first machine learning model we are going to use to predict is a Naive Bayes algorithm. A Naive Bayes algorithm assumes the predictors are all independent. This is one of the simplest models for classification problems. It is often used as a base line for comparison with other models.

Importing Libaries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Importing the Modified Dataset

We are using the modified dataset so the predictors are matching.
In [2]:
df = pd.read_csv('carInsurance_train_mod.csv', index_col=0)
df.head()
Out[2]:
Id Age Default Balance HHInsurance CarLoan LastContactDay NoOfContacts DaysPassed PrevAttempts ... feb jan jul jun mar may nov oct sep CallLength
0 1 32 0 1218 1 0 28 2 -1 0 ... 0 1 0 0 0 0 0 0 0 70.0
1 2 32 0 1156 1 0 26 5 -1 0 ... 0 0 0 0 0 1 0 0 0 185.0
2 3 29 0 637 1 0 3 1 119 1 ... 0 0 0 1 0 0 0 0 0 340.0
3 4 25 0 373 1 0 11 2 -1 0 ... 0 0 0 0 0 1 0 0 0 819.0
4 5 30 0 2694 0 0 3 1 -1 0 ... 0 0 0 1 0 0 0 0 0 192.0
5 rows × 48 columns
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4000 entries, 0 to 3999
Data columns (total 48 columns):
Id                4000 non-null int64
Age               4000 non-null int64
Default           4000 non-null int64
Balance           4000 non-null int64
HHInsurance       4000 non-null int64
CarLoan           4000 non-null int64
LastContactDay    4000 non-null int64
NoOfContacts      4000 non-null int64
DaysPassed        4000 non-null int64
PrevAttempts      4000 non-null int64
CallStart         4000 non-null object
CallEnd           4000 non-null object
CarInsurance      4000 non-null int64
failure           4000 non-null int64
other             4000 non-null int64
success           4000 non-null int64
admin.            4000 non-null int64
blue-collar       4000 non-null int64
entrepreneur      4000 non-null int64
housemaid         4000 non-null int64
management        4000 non-null int64
retired           4000 non-null int64
self-employed     4000 non-null int64
services          4000 non-null int64
student           4000 non-null int64
technician        4000 non-null int64
unemployed        4000 non-null int64
divorced          4000 non-null int64
married           4000 non-null int64
single            4000 non-null int64
primary           4000 non-null int64
secondary         4000 non-null int64
tertiary          4000 non-null int64
cellular          4000 non-null int64
telephone         4000 non-null int64
apr               4000 non-null int64
aug               4000 non-null int64
dec               4000 non-null int64
feb               4000 non-null int64
jan               4000 non-null int64
jul               4000 non-null int64
jun               4000 non-null int64
mar               4000 non-null int64
may               4000 non-null int64
nov               4000 non-null int64
oct               4000 non-null int64
sep               4000 non-null int64
CallLength        4000 non-null float64
dtypes: float64(1), int64(45), object(2)
memory usage: 1.5+ MB
The data looks good with all rows and columns of our dataframe filled with data.

Train Test Split.

Since the Data source did not provide data for the Testing set, we have to create our own. I have decided to convert 1/4 of the training set to testing data.
We are also dropping the columns of CallStart, CallEnd, and LastContactDay, since a Naive Bayes model cannot interpret them. Also dropping Id since it is not useful.
In [4]:
#Dropping Columns we are not using:
df_mod = df.drop(['CallStart','CallEnd','LastContactDay','Id'], axis = 1)

#Use train test split to have separate training and test data sets from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(df_mod.drop('CarInsurance', axis =1), df_mod['CarInsurance'], test_size=0.25, random_state = 101)

Training the Naive Bayes Model with the Train dataset.

In [5]:
#Import Library
from sklearn.naive_bayes import GaussianNB

#Create Model Object
model = GaussianNB()

#Train the model with df
model.fit(X_train,y_train)

#Predictions:
predictions = model.predict(X_test)

Model Evaluation

Cross Validation using K-Folds With cross_val_score

In [6]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = model, X=X_train, y = y_train, cv =10, n_jobs = -1)

accuracies
Out[6]:
array([0.74418605, 0.75747508, 0.79      , 0.71333333, 0.72666667,
       0.72      , 0.76333333, 0.74      , 0.74916388, 0.76588629])
In [7]:
accuracies.mean()
Out[7]:
0.7470044630125521
Using 10-folds cross validation, we can reasonably say that our model has an average accuracy score of 75%. Not too good but not too bad.

Classification Report

In [8]:
from sklearn.metrics import classification_report
print (classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       0.77      0.88      0.82       598
          1       0.78      0.61      0.69       402

avg / total       0.77      0.77      0.77      1000

The Classification report shows the accuracies at which the different classes are correctly and incorrectly idenitified. Precision is the rate at which the predictions match their actual classifications (true positive rate and true negative rate). Recall is the rate at which actual classifications are identified/predicted. F1-scores are the the harmonic means of the precision and recall values. Support is the total in each class (0 and 1).
As you can see even though the model has pretty high average precisions and recall, the recall score for 1 (people who actually bought car insurance) is quite low at 61%. This means the Naive Bayes model is too selective and would miss a lot of customers who would buy car insurance.

Confusion Matrix

In [9]:
from sklearn.metrics import confusion_matrix
ConfMatrix = confusion_matrix(y_test, predictions)

row_label = ['Actual 0','Actual 1']
col_label = ['Predicted 0', 'Predicted 1']

pd.DataFrame(ConfMatrix, row_label, col_label)
Out[9]:
Predicted 0 Predicted 1
Actual 0 528 70
Actual 1 156 246
The confusion matrix breaks down the numbers in "Support" of the Classification Report further by prediction and actual classes.

Support Vector Machine (SVM)

Support Vector Machine is a supervised machine learning algorithm that optimises the hyperplane/line between the classes we are trying to separate. The goal is such that the hyperplane furthest from the data points (of either class).

Training SVM Model with the Train dataset.

In [10]:
#Import Library
from sklearn import svm

#Create Model Object
model = svm.SVC(random_state = 101)

#Train the model with df
model.fit(X_train,y_train)

#Predictions:
predictions = model.predict(X_test)

Model Evaluation

Cross Validation using K-Folds With cross_val_score

In [11]:
accuracies = cross_val_score(estimator = model, X=X_train, y = y_train, cv =10, n_jobs = -1)

accuracies.mean()
Out[11]:
0.5999993370296707
The basic SVM model looks pretty bad. With an average accuracy score at 60%, it's worse than our Naive Bayes model. However, we can improve on this.

Improving Model with scaling

If we scale the data by their distributions, we can improve the SVM result:
In [12]:
from sklearn import preprocessing
X_train_sd = preprocessing.scale(X_train)

model.fit(X_train_sd,y_train)
Out[12]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=101, shrinking=True,
  tol=0.001, verbose=False)
In [13]:
accuracies = cross_val_score(estimator = model, X=X_train_sd, y = y_train, cv =10, n_jobs = -1)

accuracies.mean()
Out[13]:
0.8196837038930062
By scaling our data, the average accuracy has improved to 82%!

Improving Model Using GridSearchCV

There are parameters of an SVM in which we can test out to improve our model. GridSearchCV allows us to do that. We first need to specify the different parameters we want to test - linear and rbf SVM models with different C and gamma parameters. C is the error penalty term. It controls the trade off between a smooth boundry and classifying the points correctly. gamma controls the inverse of the radius of influence of the samples selected by the model as selected vectors.
In [14]:
from sklearn.model_selection import GridSearchCV

#list of parameters we want to optimise their values
parameters = [{'C': [1, 5 , 10 ,100], 'kernel': ['linear']},
              {'C': [1, 5 , 10 ,100], 'kernel':['rbf'], 'gamma': [0.5, 0.1, 0.01, 0.001, 0.0001]}
                         ]


#Create GridSearchCV object
grid_search = GridSearchCV(estimator = model, param_grid = parameters, scoring = 'accuracy', cv = 5, n_jobs =-1)


#Fit it to the training set
grid_search = grid_search.fit(X_train_sd, y_train)
In [15]:
#Best accuracy (of the mean through k-fold cross validation of the best model)
best_accuracy = grid_search.best_score_
In [16]:
print('Best accurarcy: ', best_accuracy)
Best accurarcy:  0.83
So we have improved our accuracy from 82% to 83%.
And also, let's look at the best parameters for our Support Vector Machine.
In [17]:
#Best Parameters
best_parameters = grid_search.best_params_
print('Best Parameters: ' ,best_parameters)
Best Parameters:  {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}

Model with the Best Parameters

In [18]:
#Create Model Object
model = svm.SVC(C = 100, gamma = 0.001, kernel = 'rbf',random_state = 101)

#Train the model with df and fit it
model.fit(X_train_sd,y_train)

#Predictions:
X_test_sd = preprocessing.scale(X_test)
predictions = model.predict(X_test_sd)

Classfication Report

In [19]:
print (classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       0.88      0.89      0.88       598
          1       0.83      0.82      0.83       402

avg / total       0.86      0.86      0.86      1000

So the Support Vector Machine model did really well against our test set at 86% accuracy!

Random Forest

A random forest model uses random samples of features in the data set to create multiple decision trees. The trees "vote" on the classification of our target variable (y).

Training Random Forest Model with the Train dataset.

In [20]:
#Import Library
from sklearn.ensemble import RandomForestClassifier

#Create Model Object
model = RandomForestClassifier(n_estimators = 500 , random_state = 101)

#Train the model with df
model.fit(X_train,y_train)

#Predictions:
predictions = model.predict(X_test)

Model Evaluation

Cross Validation using K-Folds With cross_val_score

In [21]:
accuracies = cross_val_score(estimator = model, X=X_train, y = y_train, cv =10, n_jobs = -1)

accuracies.mean()
Out[21]:
0.8456760334374087
The default model is looking pretty good but I'm sure we can improve on this using GridSearchCV.
We are optimising the following parameters (they tend to make the most difference):
n_estimators = number of trees in the foreset
max_features = max number of features considered for splitting a node
max_depth = max number of levels in each decision tree

Improving Model Using GridSearchCV

In [22]:
from sklearn.model_selection import GridSearchCV

#list of parameters we want to optimise their values
parameters = [{'n_estimators': [100, 200], 'max_depth': [8, 10, 12], 'max_features':[ 12, 15, 17]}]


#Create GridSearchCV object
grid_search = GridSearchCV(estimator = model, param_grid = parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)


#Fit it to the training set
grid_search = grid_search.fit(X_train, y_train)
In [23]:
#Best accuracy (of the mean through k-fold cross validation of the best model)
best_accuracy = grid_search.best_score_
print('Best Accuracy: ', best_accuracy)
Best Accuracy:  0.8483333333333334
It does not look like our optimised random forest model is doing much better than the one using default parameters.
In [24]:
#Best Parameters
best_parameters = grid_search.best_params_
print('Best Parameters: ', best_parameters)
Best Parameters:  {'max_depth': 10, 'max_features': 15, 'n_estimators': 100}

Model with the Best Parameters

In [25]:
#Create Model Object
model = RandomForestClassifier(max_depth = 10, max_features = 15, n_estimators = 100, random_state = 101)

#Train the model with df
model.fit(X_train,y_train)

#Predictions:
predictions = model.predict(X_test)

Classfication Report

In [26]:
print (classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       0.88      0.87      0.88       598
          1       0.81      0.82      0.81       402

avg / total       0.85      0.85      0.85      1000

Judging performance, I say the optimised random forest model was the best model because of the highest average accuracy rate of 85%. It has done well on both the training and test set. However, due to chance, the optimised SVM model did better on the testing set at 86%. These are sustantial improvements from the average of 75% accuracy and low positive recall of 61% of the Naive Bayes model.

Gradient Boosting with Light GBM

Light GBM is a gradient boosting method that uses tree based algorithms. It is one of the most popular ones out there. It allows us to achieve similar results as deep learning without high computational power.

Training and setting Parameters.

In [27]:
import lightgbm as lgbm
In [28]:
#Create the training dataset by combining X_train and y_train
d_train = lgbm.Dataset(X_train, label = y_train)


#Setting out the Parameter dictionary
parameters = {}
parameters['learning_rate'] = 0.06 #learning rate - controlling how fast estimates change
parameters['boosting_type'] = 'gbdt' # for  traditional Gradient Boosting Decision Tree
parameters['objective'] = 'binary' #for binary classification
parameters['metric'] = 'binary_error' #for binary classification
parameters['feature_faction'] = 0.8 # LightGBM to select 80% of features for each tree
parameters['max_depth'] = 10
parameters['min_data'] = 15

model_lgbm = lgbm.train(parameters, d_train, 200) #Training for 200 iterations

Cross Validation

Light GBM has its own cross validation method, .cv:
In [29]:
cv_result = lgbm.cv(parameters, d_train, num_boost_round = 200, nfold = 10, early_stopping_rounds = 40)
The results of the Cross Valiation:
In [30]:
# Display results
print('Current parameters:\n', parameters)
print('\nBest num_boost_round:', len(cv_result['binary_error-mean']))
print('Best CV score (mean accuracy):', 1 - cv_result['binary_error-mean'][-1])
Current parameters:
 {'learning_rate': 0.06, 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'binary_error', 'feature_faction': 0.8, 'max_depth': 10, 'min_data': 15, 'verbose': 1}

Best num_boost_round: 118
Best CV score (mean accuracy): 0.8473248480538673

Classification Report

In [31]:
#Light GBM Predictions
predictions = model_lgbm.predict(X_test)


#Convert to binary values from percentage predictions
for i in range (0,1000):
    if predictions[i]>=0.5:
        predictions[i]=1
    else:
        predictions[i]=0
In [32]:
print (classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       0.87      0.87      0.87       598
          1       0.81      0.81      0.81       402

avg / total       0.85      0.85      0.85      1000

The Light GBM algorithm achieved similar result as the Random Forest Model. However, it took seconds instead of minutes to run.

No comments:

Post a Comment

Portfolio Optimisation with Python

 Recently I have been busy so I have been neglecting this blog for a very long time. Just want to put out some new content. So there is this...