- This is a continuation of Part 1 (http://dataecon1.blogspot.com/2018/09/cold-calling-for-car-insurance-part-1.html).
- Using machine learning techniques, I created models that could predict the success/failure of cold calling for car insurance with an accuracy of around 85%.
- I split off 1/4 of the data as a test set, since there was no independent test set loaded to Kaggle.
- I tried out modelling the data with Naive Bayes, Support Vector Machine (SVM), Random Forest, and Gradient Boosting with Light GBM.
- I evaluated my models using K-fold cross-validation and the test data set.
- I used GridSearchCV to search for better parameters in my Support Vector Machine and Random Forest Models (parameter tuning).
- The tuned Random Forest model did the best on K-fold cross validation with 84.8% accuracy.
- The tuned Support Vector Machine model did the best on the test set at 86% precision and recall.
- The Light GBM model came a close second in accuracy, with a K-fold cross-validation accuracy of 84.7%. However, it was faster to run compared to the Random Forest model, taking seconds instead of minutes.
Car Insurance Cold Calls¶
Naive Bayes¶
The first machine learning model we are going to use is a Naive Bayes classifier. A Naive Bayes algorithm assumes the predictors are all independent of each other. It is one of the simplest models for classification problems and is often used as a baseline for comparison with other models.
Importing Libraries:¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Importing the Modified Dataset¶
We are using the modified dataset so that the predictors match.
In [2]:
df = pd.read_csv('carInsurance_train_mod.csv', index_col=0)
df.head()
Out[2]:
In [3]:
df.info()
The data looks good, with every row and column of our dataframe filled in.
We are also dropping the columns of CallStart, CallEnd, and LastContactDay, since a Naive Bayes model cannot interpret them. Also dropping Id since it is not useful.
Train Test Split.¶
Since the data source did not provide a separate test set, we have to create our own. I have decided to set aside 1/4 of the training set as test data.
In [4]:
#Dropping Columns we are not using:
df_mod = df.drop(['CallStart','CallEnd','LastContactDay','Id'], axis = 1)
#Use train test split to have separate training and test data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_mod.drop('CarInsurance', axis =1), df_mod['CarInsurance'], test_size=0.25, random_state = 101)
Training the Naive Bayes Model with the Train dataset.¶
In [5]:
#Import Library
from sklearn.naive_bayes import GaussianNB
#Create Model Object
model = GaussianNB()
#Train the model with df
model.fit(X_train,y_train)
#Predictions:
predictions = model.predict(X_test)
Model Evaluation¶
Cross Validation using K-Folds With cross_val_score¶
In [6]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model, X=X_train, y = y_train, cv =10, n_jobs = -1)
accuracies
Out[6]:
In [7]:
accuracies.mean()
Out[7]:
Using 10-fold cross-validation, we can reasonably say that our model has an average accuracy of around 75%. Not too good, but not too bad.
Classification Report¶
In [8]:
from sklearn.metrics import classification_report
print (classification_report(y_test, predictions))
The classification report shows how accurately the different classes are identified. Precision is the fraction of predictions for a class that are actually correct; recall is the fraction of actual members of a class that are correctly identified. F1-scores are the harmonic means of the precision and recall values. Support is the number of samples in each class (0 and 1).
As you can see, even though the model has pretty high average precision and recall, the recall score for class 1 (people who actually bought car insurance) is quite low at 61%. This means the Naive Bayes model is too selective and would miss a lot of customers who would buy car insurance.
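One way to address this, shown as a minimal sketch rather than part of the original analysis, is to lower the decision threshold on the predicted class probabilities, trading some precision for higher recall on class 1. It reuses the fitted model, X_test, and y_test from the cells above, and the 0.3 threshold is just an illustrative value:
from sklearn.metrics import classification_report
#Predicted probability of class 1 (bought car insurance) for each customer
proba = model.predict_proba(X_test)[:, 1]
#Classify with a lower, more permissive threshold instead of the default 0.5
predictions_permissive = (proba >= 0.3).astype(int)
print(classification_report(y_test, predictions_permissive))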
Confusion Matrix¶
In [9]:
from sklearn.metrics import confusion_matrix
ConfMatrix = confusion_matrix(y_test, predictions)
row_label = ['Actual 0','Actual 1']
col_label = ['Predicted 0', 'Predicted 1']
pd.DataFrame(ConfMatrix, row_label, col_label)
Out[9]:
The confusion matrix breaks down the numbers in "Support" of the Classification Report further by predicted and actual class.
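To make the link with the classification report explicit, precision and recall for class 1 can be recomputed directly from the confusion matrix cells. This is a small sketch using the ConfMatrix object above, where rows are actual classes and columns are predicted classes:
#Unpack the matrix: [[TN, FP], [FN, TP]] for classes 0 and 1
tn, fp, fn, tp = ConfMatrix.ravel()
#Precision for class 1: of everything predicted as 1, how much actually was 1
precision_1 = tp / (tp + fp)
#Recall for class 1: of everything that actually was 1, how much was predicted as 1
recall_1 = tp / (tp + fn)
#Overall accuracy across both classes
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(precision_1, recall_1, accuracy)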
Support Vector Machine (SVM)¶
Support Vector Machine is a supervised machine learning algorithm that finds the hyperplane/line that best separates the classes. The goal is to place the hyperplane as far as possible from the nearest data points of either class, maximising the margin between them.
Training the SVM Model with the Train dataset.¶
In [10]:
#Import Library
from sklearn import svm
#Create Model Object
model = svm.SVC(random_state = 101)
#Train the model with df
model.fit(X_train,y_train)
#Predictions:
predictions = model.predict(X_test)
In [11]:
accuracies = cross_val_score(estimator = model, X=X_train, y = y_train, cv =10, n_jobs = -1)
accuracies.mean()
Out[11]:
The basic SVM model looks pretty bad. With an average accuracy score of 60%, it's worse than our Naive Bayes model. However, we can improve on this.
Improving Model with scaling¶
If we standardise each feature (rescaling it to zero mean and unit variance), we can improve the SVM result:
In [12]:
from sklearn import preprocessing
X_train_sd = preprocessing.scale(X_train)
model.fit(X_train_sd,y_train)
Out[12]:
In [13]:
accuracies = cross_val_score(estimator = model, X=X_train_sd, y = y_train, cv =10, n_jobs = -1)
accuracies.mean()
Out[13]:
By scaling our data, the average accuracy has improved to 82%!
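A note on methodology: preprocessing.scale standardises whatever array it is given in isolation. An alternative sketch, not what the notebook below does, is to fit a StandardScaler on the training data only and reuse the learned means and standard deviations for the test data, so both sets are scaled consistently. The variable names here are purely illustrative:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#Learn the scaling parameters from the training data only
X_train_scaled = scaler.fit_transform(X_train)
#Apply the same parameters to the test data
X_test_scaled = scaler.transform(X_test)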
Improving Model Using GridSearchCV¶
There are parameters of an SVM that we can tune to improve our model, and GridSearchCV allows us to do that. We first need to specify the different parameters we want to test: linear and rbf SVM kernels with different C and gamma values. C is the error penalty term; it controls the trade-off between a smooth decision boundary and classifying the training points correctly. gamma controls the inverse of the radius of influence of the samples the model selects as support vectors.
In [14]:
from sklearn.model_selection import GridSearchCV
#list of parameters we want to optimise their values
parameters = [{'C': [1, 5 , 10 ,100], 'kernel': ['linear']},
{'C': [1, 5 , 10 ,100], 'kernel':['rbf'], 'gamma': [0.5, 0.1, 0.01, 0.001, 0.0001]}
]
#Create GridSearchCV object
grid_search = GridSearchCV(estimator = model, param_grid = parameters, scoring = 'accuracy', cv = 5, n_jobs =-1)
#Fit it to the training set
grid_search = grid_search.fit(X_train_sd, y_train)
In [15]:
#Best accuracy (of the mean through k-fold cross validation of the best model)
best_accuracy = grid_search.best_score_
In [16]:
print('Best accuracy: ', best_accuracy)
So we have improved our accuracy from 82% to 83%.
Let's also look at the best parameters for our Support Vector Machine.
In [17]:
#Best Parameters
best_parameters = grid_search.best_params_
print('Best Parameters: ' ,best_parameters)
Model with the Best Parameters¶
In [18]:
#Create Model Object
model = svm.SVC(C = 100, gamma = 0.001, kernel = 'rbf',random_state = 101)
#Train the model with df and fit it
model.fit(X_train_sd,y_train)
#Predictions:
X_test_sd = preprocessing.scale(X_test)
predictions = model.predict(X_test_sd)
Classification Report¶
In [19]:
print (classification_report(y_test, predictions))
So the Support Vector Machine model did really well against our test set at 86% accuracy!
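Random Forest¶
A Random Forest is an ensemble of many decision trees, each trained on a random sample of the data and a random subset of the features; the individual trees' predictions are combined by majority vote.
Training the Random Forest Model with the Train dataset.¶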
In [20]:
#Import Library
from sklearn.ensemble import RandomForestClassifier
#Create Model Object
model = RandomForestClassifier(n_estimators = 500 , random_state = 101)
#Train the model with df
model.fit(X_train,y_train)
#Predictions:
predictions = model.predict(X_test)
In [21]:
accuracies = cross_val_score(estimator = model, X=X_train, y = y_train, cv =10, n_jobs = -1)
accuracies.mean()
Out[21]:
The default model is looking pretty good but I'm sure we can improve on this using GridSearchCV.
Improving Model Using GridSearchCV¶
We are optimising the following parameters (they tend to make the most difference):
n_estimators = number of trees in the forest
max_features = max number of features considered when splitting a node
max_depth = max number of levels in each decision tree
In [22]:
from sklearn.model_selection import GridSearchCV
#list of parameters we want to optimise their values
parameters = [{'n_estimators': [100, 200], 'max_depth': [8, 10, 12], 'max_features':[ 12, 15, 17]}]
#Create GridSearchCV object
grid_search = GridSearchCV(estimator = model, param_grid = parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)
#Fit it to the training set
grid_search = grid_search.fit(X_train, y_train)
In [23]:
#Best accuracy (of the mean through k-fold cross validation of the best model)
best_accuracy = grid_search.best_score_
print('Best Accuracy: ', best_accuracy)
It does not look like our optimised random forest model is doing much better than the one using default parameters.
In [24]:
#Best Parameters
best_parameters = grid_search.best_params_
print('Best Parameters: ', best_parameters)
Model with the Best Parameters¶
In [25]:
#Create Model Object
model = RandomForestClassifier(max_depth = 10, max_features = 15, n_estimators = 100, random_state = 101)
#Train the model with df
model.fit(X_train,y_train)
#Predictions:
predictions = model.predict(X_test)
Classification Report¶
In [26]:
print (classification_report(y_test, predictions))
Judging by performance, I'd say the optimised Random Forest model was the best model, with the highest average accuracy of 85%. It has done well on both the training and test sets. However, due to chance, the optimised SVM model did better on the test set at 86%. These are substantial improvements over the Naive Bayes model's average accuracy of 75% and its low positive recall of 61%.
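Gradient Boosting with Light GBM¶
Light GBM is a fast gradient boosting framework: it builds decision trees sequentially, with each new tree fitted to correct the errors of the trees built so far.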
In [27]:
import lightgbm as lgbm
In [28]:
#Create the training dataset by combining X_train and y_train
d_train = lgbm.Dataset(X_train, label = y_train)
#Setting out the Parameter dictionary
parameters = {}
parameters['learning_rate'] = 0.06 #learning rate - controlling how fast estimates change
parameters['boosting_type'] = 'gbdt' # for traditional Gradient Boosting Decision Tree
parameters['objective'] = 'binary' #for binary classification
parameters['metric'] = 'binary_error' #for binary classification
parameters['feature_fraction'] = 0.8 # LightGBM to select 80% of features for each tree
parameters['max_depth'] = 10
parameters['min_data'] = 15
model_lgbm = lgbm.train(parameters, d_train, 200) #Training for 200 iterations
Cross Validation¶
Light GBM has its own cross validation method, .cv:
In [29]:
cv_result = lgbm.cv(parameters, d_train, num_boost_round = 200, nfold = 10, early_stopping_rounds = 40)
The results of the cross-validation:
In [30]:
# Display results
print('Current parameters:\n', parameters)
print('\nBest num_boost_round:', len(cv_result['binary_error-mean']))
print('Best CV score (mean accuracy):', 1 - cv_result['binary_error-mean'][-1])
Classification Report¶
In [31]:
#Light GBM Predictions
predictions = model_lgbm.predict(X_test)
#Convert the predicted probabilities to binary class labels using a 0.5 threshold
predictions = (predictions >= 0.5).astype(int)
In [32]:
print (classification_report(y_test, predictions))
The Light GBM algorithm achieved a similar result to the Random Forest model. However, it took seconds instead of minutes to run.
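For reference, the speed difference can be checked with a simple timer. This is a rough sketch that reuses the X_train, y_train, parameters, and d_train objects defined above; exact timings will vary by machine:
import time
from sklearn.ensemble import RandomForestClassifier
#Time the Random Forest fit
start = time.perf_counter()
RandomForestClassifier(n_estimators = 500, random_state = 101).fit(X_train, y_train)
print('Random Forest training time: %.1f s' % (time.perf_counter() - start))
#Time the Light GBM training run
start = time.perf_counter()
lgbm.train(parameters, d_train, 200)
print('Light GBM training time: %.1f s' % (time.perf_counter() - start))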