
This GitHub repository contains the complete codebase for a telecom churn prediction binary classification task. It uses a specific dataset to demonstrate thorough exploratory data analysis (EDA), address class imbalances, select machine learning algorithms, deploy models, and develop API endpoints using GCP Vertex AI for production ML systems.


VLTSankalpa/TelcoChurnPrediction-VertexAI


Telco Customer Churn Prediction

The goal of this project is to predict customer churn for a telecommunications company. The dataset contains information about customers, including their demographics, services they subscribe to, account information, and whether they churned or not. The project involves exploratory data analysis (EDA), data visualization, feature engineering, and data preprocessing to prepare the data for modeling.

The expected outcome of this process is a clean, well-understood dataset that is ready for model development. Every step is documented and explained in a Jupyter notebook: identifying and handling missing values, encoding categorical variables, scaling numerical features, and splitting the data into training, validation, and test sets. The prepared splits are saved to an .npz file, which can then be loaded to train a machine learning model.

After the data is cleaned and prepared, the project will involve training, tuning and deploying a machine learning model to predict customer churn using Google Cloud Vertex AI. The model will be evaluated using metrics such as accuracy, precision, recall, and F1 score. The project will also involve identifying important features that contribute to customer churn and providing recommendations to reduce churn rate.

Exploratory Data Analysis (EDA)

  • List of Columns
  • Dataset Shape
  • Data Types
  • List all unique values in each column
  • Convert data types of columns
  • Handling missing values
  • Summary Statistics of numeric columns
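
The type conversion and missing-value steps listed above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the three-row DataFrame is made up, and both the assumption that TotalCharges arrives as text with blank entries and the choice to fill missing values with 0 are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the Telco dataset; the values are invented for illustration.
df = pd.DataFrame({
    "tenure": [1, 0, 24],
    "MonthlyCharges": [29.85, 52.55, 70.00],
    "TotalCharges": ["29.85", " ", "1680.0"],  # assumed: blank string marks a missing value
})

print(df.shape)    # dataset shape
print(df.dtypes)   # TotalCharges is read as object (text) at this point

# Convert the text column to numeric; entries that cannot be parsed become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
n_missing = int(df["TotalCharges"].isna().sum())

# One possible handling (an assumption): treat missing totals as 0,
# on the reasoning that a zero-tenure customer has not been billed yet.
df["TotalCharges"] = df["TotalCharges"].fillna(0.0)

print(df.describe())  # summary statistics of the numeric columns
```

Dropping the affected rows or imputing with the median are equally plausible alternatives; the right choice depends on how many rows are affected.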

Data Visualization

  • Kernel Density Estimate (KDE) plots
  • Q-Q plots
  • Histograms
  • Boxplots
  • Scatter plots
  • Heatmaps
  • Count plots

Feature Engineering

  • AverageMonthlyCharges: It's common for customers to have variations in their charges throughout their tenure. This feature represents the average spend per month.
  • TenureGroups: Grouping tenure into categorical bins could reveal patterns related to customer loyalty and churn rate.
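
A minimal sketch of how these two features might be derived is shown below. The formula for AverageMonthlyCharges (TotalCharges divided by tenure, falling back to MonthlyCharges when tenure is 0) and the bin edges for TenureGroups are assumptions made for illustration; the notebook may define them differently.

```python
import numpy as np
import pandas as pd

# Invented sample rows; column names follow the dataset described above.
df = pd.DataFrame({
    "tenure": [0, 5, 30, 72],
    "MonthlyCharges": [20.0, 50.0, 70.0, 100.0],
    "TotalCharges": [0.0, 250.0, 2100.0, 7200.0],
})

# Assumed definition: average spend per month of tenure.
# A tenure of 0 would divide by zero, so fall back to MonthlyCharges there.
safe_tenure = df["tenure"].replace(0, np.nan)
df["AverageMonthlyCharges"] = (df["TotalCharges"] / safe_tenure).fillna(df["MonthlyCharges"])

# Hypothetical bin edges in months; intervals are closed on the right,
# so -1 as the first edge keeps tenure == 0 inside the first group.
bins = [-1, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
df["TenureGroups"] = pd.cut(df["tenure"], bins=bins, labels=labels)

print(df[["tenure", "AverageMonthlyCharges", "TenureGroups"]])
```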

Identify outliers

  • IQR Method
  • Z-score Method
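
Both methods can be expressed in a few lines of NumPy. The sketch below uses the conventional defaults (1.5 times the IQR, and a z-score threshold of 3); these thresholds and the sample values are assumptions, not settings taken from the notebook.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outliers(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()  # population standard deviation (ddof=0)
    return np.abs(z) > threshold

# Invented sample: 20 evenly spaced charges plus one extreme value.
values = np.append(np.arange(20.0, 60.0, 2.0), 500.0)

iqr_mask = iqr_outliers(values)
z_mask = zscore_outliers(values)

print("IQR outliers:", values[iqr_mask])
print("Z-score outliers:", values[z_mask])
```

One caveat when comparing the two: with the population standard deviation, no z-score in a sample of n points can exceed sqrt(n - 1), so a threshold of 3 cannot flag anything in samples of 10 points or fewer. The IQR method has no such limit and is less affected by the outliers themselves.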

Encode Categorical Variables

  • Encode binary variables (gender, Partner, Dependents, PhoneService, PaperlessBilling, Churn) with 0 and 1.
  • Use one-hot encoding for nominal variables with more than two categories (MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaymentMethod, TenureGroups) to prepare them for modeling.
  • Scale Numerical Features: Standardize or normalize AverageMonthlyCharges, tenure, MonthlyCharges, and TotalCharges.
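
The three steps above might look like the following in pandas. This is a sketch on invented rows that covers only one column of each kind; the Yes/No mapping is an assumption and does not apply to every binary column (gender, for example, holds different labels).

```python
import pandas as pd

# Invented sample rows with one binary, one nominal, and one numeric column.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "MonthlyCharges": [30.0, 60.0, 90.0],
    "Churn": ["No", "Yes", "No"],
})

# Binary variables: map the two labels to 0 and 1.
for col in ["Partner", "Churn"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

# Nominal variables with more than two categories: one-hot encode.
df = pd.get_dummies(df, columns=["Contract"], dtype=int)

# Numerical features: standardize to zero mean and unit variance.
mean = df["MonthlyCharges"].mean()
std = df["MonthlyCharges"].std(ddof=0)
df["MonthlyCharges"] = (df["MonthlyCharges"] - mean) / std

print(df)
```

In a full pipeline, the mean and standard deviation would normally be computed on the training split only and then reused for the validation and test splits, so that no information leaks from held-out data.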

Training Data Preparation

  • Splits the data into feature (X) and label (y) arrays.
  • Uses train_test_split twice to create a train set (60% of the data), a validation set (20%), and a test set (20%).
  • Saves the training, validation, and test sets to an .npz file, which can then be loaded for training.
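
The 60/20/20 split and the .npz export described above can be sketched as follows. The random data, the file name churn_splits.npz, the array keys, the fixed random_state, and the use of stratify are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 rows, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# First split: 60% train, 40% held out.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
# Second split: divide the held-out 40% evenly into validation and test (20% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Save all six arrays to a single .npz file (hypothetical file name and keys).
path = os.path.join(tempfile.mkdtemp(), "churn_splits.npz")
np.savez(
    path,
    X_train=X_train, y_train=y_train,
    X_val=X_val, y_val=y_val,
    X_test=X_test, y_test=y_test,
)

# Loading the file back, as a training script would.
with np.load(path) as data:
    loaded_shape = data["X_train"].shape

print(len(X_train), len(X_val), len(X_test), loaded_shape)
```

Stratifying on the label keeps the churn ratio similar across the three sets, which matters when the classes are imbalanced.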

Machine Learning Model Development

The goal is to select and prototype suitable machine learning algorithms for predicting customer churn for a subscription-based telco service. This involves evaluating various models to identify the most effective approach for this specific churn prediction task.

Initial Model Prototyping

Several models were prototyped to assess their suitability and performance for the churn prediction task. Each of them can be built with standard libraries with minimal effort. Had the dataset and preprocessing requirements varied significantly from one model to another, making training costly, the selection would have relied on theoretical considerations instead: choosing a few ML algorithms well suited to the task and limiting the number of models tried. In this case, the following models were prototyped:

  • Logistic Regression Model Prototyping
  • Random Forest Model Prototyping
  • XGBoost Model Prototyping
  • DNN Model Prototyping
  • CNN for Tabular Data Prototyping

Evaluation Metrics

For each prototyped model, several key metrics were considered to evaluate performance, including accuracy, precision, recall, and the confusion matrix. These metrics provide a comprehensive view of each model's strengths and weaknesses in predicting customer churn. Based on those metrics, the best models for Vertex AI Vizier hyperparameter tuning will be selected.
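
To make the relationship between these metrics concrete, the sketch below derives them from the four cells of a binary confusion matrix. The helper name binary_metrics and the labels are hypothetical; the F1 score is included because it is listed among the evaluation metrics earlier in this README.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics; 1 = churned."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels: 3 churners out of 10 customers, only 1 of them caught.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example the accuracy is 0.7 even though two of the three churners are missed (recall of about 0.33), which illustrates why accuracy alone can be misleading on an imbalanced churn dataset.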

Vertex AI Training, Tuning, and Deployment

  • Training XGBoost Model on Vertex AI: Train the XGBoost Model on Vertex AI as a custom training job.
  • Training DNN Model on Vertex AI: Train the DNN Model on Vertex AI as a custom training job.
  • Tuning XGBoost Model on Vertex AI: Tune the XGBoost Model on Vertex AI using Vizier hyperparameter tuning.
  • Tuning DNN Model on Vertex AI: Tune the DNN Model on Vertex AI as part of a custom training job.
  • Deployment: Deploy the models to Vertex AI endpoints to serve predictions.
