Overview This project implements a fraud detection system using a Gradient Boosting algorithm, specifically LightGBM. The goal is to predict fraudulent transactions in a dataset using a combination of data preprocessing, feature engineering, and model evaluation. The project leverages LightGBM for efficient handling of imbalanced data, and a variety of metrics are used to assess the model's performance.
Folder Structure
folder/ ├── lgbm.ipynb # Jupyter Notebook containing the fraud detection model code ├── fraud.csv # Dataset in CSV format └── README.md # This README file Getting Started Prerequisites To run this project, ensure you have the following installed:
Python 3.6 or above Jupyter Notebook Libraries: pandas numpy scikit-learn lightgbm matplotlib seaborn statsmodels Install the necessary Python libraries using:
pip install pandas numpy scikit-learn lightgbm matplotlib seaborn statsmodels Installation
Clone the repository: Clone the project from GitHub to your local machine:
git clone Navigate to the project folder:
cd Fraud Detector and Analyzer using Gradient Boosting Machine Open the Jupyter Notebook:
jupyter notebook lgbm.ipynb Dataset The dataset used for training is named fraud.csv. It contains various features describing each transaction, such as transaction details, user information, etc. The target variable is binary: 1 for a fraudulent transaction and 0 for a legitimate one. Project Workflow Data Loading:
The dataset is loaded using pandas, and basic information is printed, including the dataset size. Data Preprocessing:
Cleaning: Unnecessary columns are removed. Handling Categorical Features: Dummy variables are created for categorical features. Scaling: Standard scaling is applied to numerical features using StandardScaler from scikit-learn. Train-Test Split: The dataset is split into training and testing sets using an 80-20 split to ensure good generalization. Feature Engineering:
Calculations like the Variance Inflation Factor (VIF) are used to detect and handle multicollinearity. New features are generated as needed to enhance model performance. Model Training:
LightGBM Model: The LightGBM LGBMClassifier is used to train the model. The training is done with StratifiedKFold cross-validation to handle class imbalance. Hyperparameter tuning is performed, including setting early stopping and logging evaluation for optimization. Model Evaluation:
The model is evaluated using metrics like Confusion Matrix, Classification Report, ROC-AUC Score, Precision-Recall Curve, and Feature Importance. ROC and Precision-Recall curves are plotted to visualize model performance, with a focus on minimizing false negatives. Visualization:
Seaborn and Matplotlib are used for visualizing the data distribution and model results. Feature importance charts are created to interpret which features have the most influence on fraud detection. Usage Run the Notebook: Open code.ipynb in Jupyter Notebook and run each cell to execute the workflow, from data preprocessing to model training and evaluation.
Adjust Hyperparameters: You can modify hyperparameters such as learning_rate, num_leaves, and n_estimators in the LightGBM model to achieve better results.
Evaluate Model: Observe model performance using different evaluation metrics. You can also experiment with the dataset split ratio to see its effect on performance.
Results The model is evaluated primarily on metrics such as precision, recall, f1-score, and ROC-AUC due to the imbalanced nature of the dataset. Visualization of feature importances helps in understanding which transaction characteristics are most indicative of fraud. Key Findings The dataset is highly imbalanced, and techniques like StratifiedKFold and appropriate metric selection (e.g., precision-recall rather than accuracy) are essential to properly evaluate the model. LightGBM performed well in detecting fraudulent transactions, particularly when parameters were tuned using early stopping and cross-validation. Customization Data Preprocessing: You can add more preprocessing steps if needed, such as applying SMOTE for oversampling. Model Choice: Experiment with other models such as XGBoost or RandomForestClassifier to compare performance with LightGBM. Feature Engineering: Create more derived features to enhance model accuracy, such as aggregating user behavior over time.