Customer Churn Prediction Project
- pandas, numpy, matplotlib, seaborn, plotly, sklearn, xgboost
Develop a Churn Prediction model:
- Exploratory data analysis
- Feature engineering
- Investigating how the features affect Retention by using Logistic Regression - Building a classification model with XGBoost
Churn Prediction
Retention Rate is an indication of how good is your product market fit (PMF). If your PMF is not satisfactory, you should see your customers churning very soon. One of the powerful tools to improve Retention Rate (hence the PMF) is Churn Prediction. By using this technique, you can easily find out who is likely to churn in the given period. In this article, we will use a Telco dataset (https://www.kaggle.com/blastchar/telco-customer-churn) and go over the following steps to develop a Churn Prediction model:
- Exploratory data analysis
- Feature engineering
- Investigating how the features affect Retention by using Logistic Regression
- Building a classification model with XGBoost
Exploratory Data Analysis (EDA)
Data fall under two categories:
- Categorical features: gender, streaming tv, payment method &, etc.
- Numerical features: tenure, monthly charges, total charges
Feature Engineering
- Group the numerical columns by using clustering techniques
- Apply Label Encoder to categorical features which are binary
- Apply get_dummies() to categorical features which have multiple values
Categorical features
Gender |
Partner |
---|---|
![]() |
![]() |
Phone Service |
Multiple Lines |
![]() |
![]() |
Internet Service |
|
![]() |
This chart reveals customers who have Fiber optic as Internet Service are more likely to churn.
Online Security |
Online Backup |
---|---|
![]() |
![]() |
Device Protection |
Tech Support |
![]() |
![]() |
Customers don’t use Tech Support are more like to churn (~25% difference).
Streaming_TV |
Streaming_Movies |
---|---|
![]() |
![]() |
Contract |
Paperless |
![]() |
![]() |
Payment_methods |
|
![]() |
Automating the payment makes the customer more likely to retain in your platform (~30% difference).
Numerical features
Feature Engineering for Numerical columns
- Use Elbow Method to identify the appropriate number of clusters
- Apply K-means logic to the selected column and change the naming
- Observe the profile of clusters
Super apparent that the higher tenure means lower Churn Rate.
The appropriate number of clusters is 3 by Elbow Method
Logistic Regression
Predicting churn is a binary classification problem. Customers either churn or retain in a given period. Along with being a robust model, Logistic Regression provides interpretable outcomes too. As we did before, let’s sort out our steps to follow for building a Logistic Regression model:
- Prepare the data (inputs for the model)
- Fit the model and see the model summary
Binary Classification Model with XGBoost