Project: Customer Churn Prediction¶


Description:¶

In this project we will build a model based on tree methods to predict whether a customer will churn or not. We will start by exploring and analyzing the data using visualization libraries such as seaborn and matplotlib, then we will focus on model creation.

PART 1: Importing and Reading Data¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv('customer-churn.csv')
In [3]:
df.head()
Out[3]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

PART 2: Checking and Exploring Data¶

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   gender            7032 non-null   object 
 2   SeniorCitizen     7032 non-null   int64  
 3   Partner           7032 non-null   object 
 4   Dependents        7032 non-null   object 
 5   tenure            7032 non-null   int64  
 6   PhoneService      7032 non-null   object 
 7   MultipleLines     7032 non-null   object 
 8   InternetService   7032 non-null   object 
 9   OnlineSecurity    7032 non-null   object 
 10  OnlineBackup      7032 non-null   object 
 11  DeviceProtection  7032 non-null   object 
 12  TechSupport       7032 non-null   object 
 13  StreamingTV       7032 non-null   object 
 14  StreamingMovies   7032 non-null   object 
 15  Contract          7032 non-null   object 
 16  PaperlessBilling  7032 non-null   object 
 17  PaymentMethod     7032 non-null   object 
 18  MonthlyCharges    7032 non-null   float64
 19  TotalCharges      7032 non-null   float64
 20  Churn             7032 non-null   object 
dtypes: float64(2), int64(2), object(17)
memory usage: 1.1+ MB

We can see that there are no null values

Let's get a statistical summary of the numeric features

In [5]:
df.describe()
Out[5]:
SeniorCitizen tenure MonthlyCharges TotalCharges
count 7032.000000 7032.000000 7032.000000 7032.000000
mean 0.162400 32.421786 64.798208 2283.300441
std 0.368844 24.545260 30.085974 2266.771362
min 0.000000 1.000000 18.250000 18.800000
25% 0.000000 9.000000 35.587500 401.450000
50% 0.000000 29.000000 70.350000 1397.475000
75% 0.000000 55.000000 89.862500 3794.737500
max 1.000000 72.000000 118.750000 8684.800000

Part 3: Exploratory Data Analysis¶

Let's check whether there are any NaN values per column

In [7]:
df.isnull().sum()
Out[7]:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

There is no missing data in any of the columns

Let's verify the balance between the churn classes

In [12]:
plt.figure(figsize=(10, 4), dpi=150)
sns.countplot(data=df, x="Churn")
Out[12]:
<AxesSubplot:xlabel='Churn', ylabel='count'>

We can see that the two classes are not balanced: the "No" class is more than double the size of the "Yes" class.
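To quantify the imbalance, a quick check (not one of the original cells) can be run on the label column:

df["Churn"].value_counts(normalize=True)
# In the standard Telco churn data this comes out to roughly 73% "No" vs 27% "Yes"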

Let's explore the distribution of TotalCharges between Churn categories.

In [14]:
sns.violinplot(x="Churn", y="TotalCharges", data=df)
Out[14]:
<AxesSubplot:xlabel='Churn', ylabel='TotalCharges'>

We can see that most of the customers who churn have a TotalCharges between 0 and 2000.

Let's explore TotalCharges per Contract type

In [20]:
plt.figure(figsize=(10,4), dpi=100)
sns.boxplot(x="Contract", y="TotalCharges", data=df, hue="Churn")
plt.legend(loc=(1.1, 0.5));

We can clearly see that, for the One year and Two year contract types, customers who churn have higher average TotalCharges than customers who do not, while for the Month-to-month contract type the two groups are roughly equal.

Let's explore the correlation between our features and the class label (Churn).
Since most of our features are categorical, we first need to convert them to dummy variables. We will leave out the customerID feature: it is unique to each customer, so it carries no meaning in terms of correlation.

In [21]:
corr_df = pd.get_dummies(df[['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
                             'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                             'TechSupport', 'InternetService', 'StreamingTV', 'StreamingMovies',
                             'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']]).corr()
In [32]:
corr_df.iloc[1:-1]["Churn_Yes"]
Out[32]:
gender_Female                              0.008545
gender_Male                               -0.008545
Partner_No                                 0.149982
Partner_Yes                               -0.149982
Dependents_No                              0.163128
Dependents_Yes                            -0.163128
PhoneService_No                           -0.011691
PhoneService_Yes                           0.011691
MultipleLines_No                          -0.032654
MultipleLines_No phone service            -0.011691
MultipleLines_Yes                          0.040033
OnlineSecurity_No                          0.342235
OnlineSecurity_No internet service        -0.227578
OnlineSecurity_Yes                        -0.171270
OnlineBackup_No                            0.267595
OnlineBackup_No internet service          -0.227578
OnlineBackup_Yes                          -0.082307
DeviceProtection_No                        0.252056
DeviceProtection_No internet service      -0.227578
DeviceProtection_Yes                      -0.066193
TechSupport_No                             0.336877
TechSupport_No internet service           -0.227578
TechSupport_Yes                           -0.164716
InternetService_DSL                       -0.124141
InternetService_Fiber optic                0.307463
InternetService_No                        -0.227578
StreamingTV_No                             0.128435
StreamingTV_No internet service           -0.227578
StreamingTV_Yes                            0.063254
StreamingMovies_No                         0.130920
StreamingMovies_No internet service       -0.227578
StreamingMovies_Yes                        0.060860
Contract_Month-to-month                    0.404565
Contract_One year                         -0.178225
Contract_Two year                         -0.301552
PaperlessBilling_No                       -0.191454
PaperlessBilling_Yes                       0.191454
PaymentMethod_Bank transfer (automatic)   -0.118136
PaymentMethod_Credit card (automatic)     -0.134687
PaymentMethod_Electronic check             0.301455
PaymentMethod_Mailed check                -0.090773
Churn_No                                  -1.000000
Name: Churn_Yes, dtype: float64
In [46]:
plt.figure(figsize=(10, 4), dpi=150)
churn_sorted = corr_df.sort_values("Churn_Yes").iloc[1:-1]
sns.barplot(x=churn_sorted.index, y=churn_sorted["Churn_Yes"].values)
plt.xticks(rotation=90);

We can see that the features with the strongest positive correlation are "Contract_Month-to-month", "OnlineSecurity_No", and "TechSupport_No", while the feature with the most negative correlation is "Contract_Two year", meaning a customer who signs a two-year contract is much less likely to churn.
Note also that no single feature has a strong correlation with the class label.
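The same ranking can be read off programmatically. A minimal sketch, dropping the label's own dummy columns and the numeric SeniorCitizen column before sorting:

churn_corr = corr_df["Churn_Yes"].drop(["Churn_Yes", "Churn_No", "SeniorCitizen"])
print(churn_corr.sort_values(ascending=False).head())  # strongest positive correlations
print(churn_corr.sort_values().head())                 # strongest negative correlations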

Part 4: Churn Analysis¶

In this section we will focus on segmenting customers based on tenure, creating "cohorts" that allow us to examine the differences between cohort segments.

In [47]:
df["Contract"].unique()
Out[47]:
array(['Month-to-month', 'One year', 'Two year'], dtype=object)

Let's display the distribution of the tenure feature (tenure represents the number of months the customer has been, or was, a customer)

In [61]:
plt.figure(figsize=(10, 4), dpi=100)
sns.histplot(x="tenure", data=df, bins=70);

We can see that the distribution is bimodal: the largest groups of customers have either a very short tenure (a month or two) or a very long one (around 70 months or more).

Let's get into more detail and display tenure per Contract type and Churn category

In [67]:
sns.displot(x="tenure", bins=70, row="Churn", col="Contract", data=df);

It is clear that customers who sign a two-year contract are much less likely to churn

Let's display a scatter plot of the TotalCharges versus MonthlyCharges with hue of Churn

In [74]:
plt.figure(figsize=(10, 4), dpi=200)
sns.scatterplot(x="MonthlyCharges", y="TotalCharges", hue="Churn", data=df, alpha=.4);

We can see that customers with high monthly charges are more likely to churn, suggesting they may be looking for cheaper services.

Creating cohort based on tenure¶

Let's first treat each tenure (month) as its own cohort

In [89]:
no_churn = df.groupby(["Churn", "tenure"]).count().loc["No"]
yes_churn = df.groupby(["Churn", "tenure"]).count().loc["Yes"]
yes_churn_rate = 100 * yes_churn / (no_churn + yes_churn)
yes_churn_rate["customerID"]
Out[89]:
tenure
1     61.990212
2     51.680672
3     47.000000
4     47.159091
5     48.120301
        ...    
68     9.000000
69     8.421053
70     9.243697
71     3.529412
72     1.657459
Name: customerID, Length: 72, dtype: float64
In [104]:
plt.figure(figsize=(10, 4), dpi=100)
plt.plot(sorted(df["tenure"].value_counts().index), yes_churn_rate["customerID"].values)
Out[104]:
[<matplotlib.lines.Line2D at 0x1b6bcdebeb0>]

We can clearly see that the churn rate drops steadily as tenure increases: over 60% of first-month customers churn, while fewer than 10% of customers with 70 or more months of tenure do.
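Since yes_churn_rate is already indexed by tenure, the same plot can be produced directly from the Series; a sketch (the xlabel/ylabel keywords assume pandas ≥ 1.0):

yes_churn_rate["customerID"].plot(figsize=(10, 4), xlabel="tenure (months)", ylabel="churn rate (%)");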

Cohort Groups:¶

Based on the tenure feature we will create four separate cohorts:

  • '0-12 Months'
  • '12-24 Months'
  • '24-48 Months'
  • 'Over 48 Months'
In [105]:
def tenure_cohort(tenure):
    if tenure <= 12: return '0-12 Months'
    if tenure <= 24: return '12-24 Months'
    if tenure <= 48: return '24-48 Months'
    return 'Over 48 Months'
In [106]:
df["Tenure Cohort"] = df["tenure"].apply(tenure_cohort)
In [107]:
df["Tenure Cohort"]
Out[107]:
0          0-12 Months
1         24-48 Months
2          0-12 Months
3         24-48 Months
4          0-12 Months
             ...      
7027      12-24 Months
7028    Over 48 Months
7029       0-12 Months
7030       0-12 Months
7031    Over 48 Months
Name: Tenure Cohort, Length: 7032, dtype: object
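An equivalent vectorized alternative is pd.cut; a sketch where the bin edges mirror the function above (right-inclusive bins, and tenure starts at 1):

cohort_labels = ['0-12 Months', '12-24 Months', '24-48 Months', 'Over 48 Months']
cohorts = pd.cut(df["tenure"], bins=[0, 12, 24, 48, np.inf], labels=cohort_labels)
# cohorts matches df["Tenure Cohort"] label-for-label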

Let's now explore MonthlyCharges versus TotalCharges, with a hue of Tenure Cohort

In [110]:
plt.figure(figsize=(10, 4), dpi=100)
sns.scatterplot(x="MonthlyCharges", y="TotalCharges", hue="Tenure Cohort", alpha=.5, data=df)
Out[110]:
<AxesSubplot:xlabel='MonthlyCharges', ylabel='TotalCharges'>

We can see that the longer the tenure cohort, the higher the accumulated TotalCharges at any given level of MonthlyCharges.

Let's visualize the Churn count per Tenure Cohort

In [114]:
plt.figure(figsize=(10, 4), dpi=100)
sns.countplot(x="Tenure Cohort", hue="Churn", data=df)
Out[114]:
<AxesSubplot:xlabel='Tenure Cohort', ylabel='count'>

It is clear that customers are less likely to churn the longer they have stayed. In the 0-12 month cohort, the number of churners is about the same as the number of non-churners.
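The per-cohort churn rates can be quantified directly; a quick sketch:

pd.crosstab(df["Tenure Cohort"], df["Churn"], normalize="index")
# each row sums to 1, giving the churn share within that cohort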

Let's create a grid of Count plots showing counts per Tenure Cohort, separated out by Contract Type and colored by the Churn hue

In [118]:
sns.catplot(x="Tenure Cohort",col="Contract", kind="count", hue="Churn", data=df)
Out[118]:
<seaborn.axisgrid.FacetGrid at 0x1b6be791ac0>

We can confirm here too that customers are less likely to churn when they sign a long-term contract.

Part 5: Predictive Modeling¶

In this section we will explore three different tree-based methods:

  • Decision Trees
  • Random Forest
  • AdaBoost

Let's first separate our data

In [129]:
X = df.drop(["customerID", "Churn"], axis=1)
# We drop customerID since it is just a unique key identifying each customer
y = df["Churn"]
# Churn is the label

Now we need to convert our categorical data into numeric by creating dummy variables

In [122]:
X_obj = pd.get_dummies(X.select_dtypes(include="object"))
X_non_obj = X.select_dtypes(exclude="object")
In [136]:
X = pd.concat([X_obj, X_non_obj], axis=1)
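Note that the select/encode/concat steps above are equivalent to a single pd.get_dummies call on the whole frame, since numeric columns pass through unchanged (only the column order differs):

# Equivalent one-liner (sketch):
# X = pd.get_dummies(df.drop(["customerID", "Churn"], axis=1))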

Let's split our data into training and test data

In [133]:
from sklearn.model_selection import train_test_split, GridSearchCV
In [138]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

Decision Tree Performance¶

In [132]:
from sklearn.tree import DecisionTreeClassifier
In [147]:
dt = DecisionTreeClassifier(random_state=101, max_features="log2", max_depth=7)  # these values were found earlier using GridSearchCV
dt.fit(X_train, y_train)
Out[147]:
DecisionTreeClassifier(max_depth=7, max_features='log2', random_state=101)
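The search that produced these values is not shown in the notebook; a minimal sketch of what it could look like (the grid below is an assumption, not the original one):

# Hypothetical grid; the original search is not shown
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=101),
                       param_grid={"max_depth": np.arange(2, 12),
                                   "max_features": ["sqrt", "log2", None]})
dt_grid.fit(X_train, y_train)
dt_grid.best_params_  # e.g. {'max_depth': 7, 'max_features': 'log2'}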
In [148]:
preds = dt.predict(X_test)
In [156]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
In [150]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

          No       0.88      0.86      0.87       557
         Yes       0.51      0.56      0.53       147

    accuracy                           0.80       704
   macro avg       0.70      0.71      0.70       704
weighted avg       0.80      0.80      0.80       704

In [157]:
ConfusionMatrixDisplay.from_predictions(y_test, preds)
Out[157]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1b6ce53f460>

We got an overall accuracy of 0.80, but poor results for the "Yes" class because of the imbalanced data.
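One common mitigation for the imbalance, not applied in this notebook, is to re-weight the classes; a minimal sketch:

# Sketch only: class_weight="balanced" up-weights the minority "Yes" class
dt_balanced = DecisionTreeClassifier(random_state=101, max_depth=7,
                                     max_features="log2", class_weight="balanced")
dt_balanced.fit(X_train, y_train)
print(classification_report(y_test, dt_balanced.predict(X_test)))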

Let's visualize the feature importances of our model

In [176]:
fi = pd.DataFrame(index=X_train.columns, columns=["Feature Importance"],
                  data=dt.feature_importances_).sort_values("Feature Importance")
fi
Out[176]:
Feature Importance
TechSupport_No internet service 0.000000
OnlineBackup_No internet service 0.000000
OnlineSecurity_Yes 0.000000
StreamingTV_No internet service 0.000000
InternetService_No 0.000000
StreamingTV_Yes 0.000000
StreamingMovies_No 0.000000
DeviceProtection_No 0.000000
MultipleLines_No phone service 0.000000
StreamingMovies_No internet service 0.000000
StreamingTV_No 0.000205
PhoneService_No 0.000338
PaperlessBilling_Yes 0.000346
OnlineBackup_No 0.000376
SeniorCitizen 0.000561
Partner_Yes 0.000628
TechSupport_Yes 0.000691
Dependents_No 0.000809
DeviceProtection_Yes 0.000859
Tenure Cohort_24-48 Months 0.000915
PaymentMethod_Mailed check 0.001600
Partner_No 0.001707
gender_Female 0.001930
gender_Male 0.002238
StreamingMovies_Yes 0.002368
PaymentMethod_Bank transfer (automatic) 0.002534
MultipleLines_No 0.002775
PhoneService_Yes 0.002813
InternetService_Fiber optic 0.003303
OnlineBackup_Yes 0.005074
Dependents_Yes 0.006631
Tenure Cohort_12-24 Months 0.007046
MultipleLines_Yes 0.007685
Contract_One year 0.010213
PaymentMethod_Credit card (automatic) 0.011581
PaymentMethod_Electronic check 0.011629
DeviceProtection_No internet service 0.012838
PaperlessBilling_No 0.017618
Tenure Cohort_Over 48 Months 0.019423
Contract_Month-to-month 0.020610
Contract_Two year 0.023526
tenure 0.028138
TechSupport_No 0.031651
InternetService_DSL 0.034066
TotalCharges 0.076058
MonthlyCharges 0.076934
OnlineSecurity_No 0.121761
OnlineSecurity_No internet service 0.130077
Tenure Cohort_0-12 Months 0.320447
In [172]:
plt.figure(figsize=(10, 4), dpi=150)
sns.barplot(x=fi.index, y="Feature Importance", data=fi)
plt.xticks(rotation = 90);

So we can clearly see that the "Tenure Cohort_0-12 Months" feature is by far the most important for predicting whether a customer churns.

Random Forest Performance:¶

In [177]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
In [178]:
param_grid = {"n_estimators": [64, 100, 110, 124], 
              "max_depth": np.arange(1, 20), 
              "max_features": ["log2", "sqrt", "auto"], 
              "max_samples": [.3],
             "random_state": [101]}
In [179]:
grid = GridSearchCV(estimator=rf, param_grid=param_grid)
grid.fit(X_train, y_train)
Out[179]:
GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19]),
                         'max_features': ['log2', 'sqrt', 'auto'],
                         'max_samples': [0.3],
                         'n_estimators': [64, 100, 110, 124]})
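Before taking the refitted best estimator, the winning combination can be inspected:

grid.best_params_  # the parameter set with the best mean cross-validated score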
In [181]:
grid_rf = grid.best_estimator_
preds = grid_rf.predict(X_test)
In [182]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

          No       0.88      0.92      0.90       557
         Yes       0.62      0.51      0.56       147

    accuracy                           0.83       704
   macro avg       0.75      0.71      0.73       704
weighted avg       0.82      0.83      0.83       704

In [184]:
ConfusionMatrixDisplay.from_predictions(y_test, preds)
Out[184]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1b6ce6d0c40>

Using a random forest with hyperparameters tuned by GridSearchCV, we increased the overall accuracy from 0.80 to 0.83; more importantly, the precision on the "Yes" class improved from 0.51 to 0.62.

AdaBoost Performance:¶

In [185]:
from sklearn.ensemble import AdaBoostClassifier
In [186]:
ab = AdaBoostClassifier()
In [189]:
param_grid = {'random_state': [101], "n_estimators":[64, 80, 100,110, 124]}
grid = GridSearchCV(estimator=ab, param_grid=param_grid)
grid.fit(X_train, y_train)
Out[189]:
GridSearchCV(estimator=AdaBoostClassifier(),
             param_grid={'n_estimators': [64, 80, 100, 110, 124],
                         'random_state': [101]})
In [190]:
ab = grid.best_estimator_
preds = ab.predict(X_test)
In [191]:
print(classification_report(y_test, preds))
              precision    recall  f1-score   support

          No       0.88      0.92      0.90       557
         Yes       0.62      0.51      0.56       147

    accuracy                           0.83       704
   macro avg       0.75      0.71      0.73       704
weighted avg       0.82      0.83      0.83       704

In [192]:
ConfusionMatrixDisplay.from_predictions(y_test, preds)
Out[192]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1b6ce646fd0>

We can see that using AdaBoost did not increase the accuracy; it is the same as with the Random Forest.
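As a final check, the three fitted models can be compared side by side on the test set; a short sketch using the variables defined above:

from sklearn.metrics import accuracy_score

for name, model in [("Decision Tree", dt), ("Random Forest", grid_rf), ("AdaBoost", ab)]:
    print(f"{name:>13}: {accuracy_score(y_test, model.predict(X_test)):.3f}")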