This project provides a comprehensive pipeline for static analysis of Android APKs to detect malware. It extracts features from APK files, preprocesses the data, and trains machine-learning models to classify applications as benign or malicious. To perform this analysis, I used 800 malware and 800 benign APK files, which I collected from https://m4lware.org.
android-static-analysis/
├── android_malware_preprocessing.py : Preprocesses cleaned feature data
├── apk_features_updated.csv : Output of feature extraction
├── benignSample/ : Directory for benign APK samples
│ └── [benign APKs]
├── cleaned_features.csv : Output of feature dropping
├── drop_irrelevant_features.py : Removes irrelevant features
├── extract_apk_features.py : Extracts features from APKs
├── malwareSample/ : Directory for malware APK samples
│ └── [malware APKs]
├── model_comparison.py : Trains and evaluates ML models
├── preprocessed_data_[timestamp]/ : Preprocessed data output directory
├── trainModel/ : Trained model output directory
├── requirements.txt : Python dependencies
└── run_pipeline.py : Orchestrates the full pipeline
- Python: Version 3.8 or higher
- Virtual Environment: Recommended (e.g.,
venv
) - APK Samples: Place benign APKs in
benignSample/
and malware APKs inmalwareSample/
-
Clone the Repository:
git clone <repository_url> cd android-static-analysis
-
Set Up a Virtual Environment:
python3 -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install Dependencies:
pip install -r requirements.txt
To execute the entire pipeline from feature extraction to model training:
python3 run_pipeline.py
--malware-dir
: Directory with malware APKs (default:malwareSample
)--workers
: Number of worker processes for feature extraction (default: 5)--save-interval
: Save interval for feature extraction (default: 50)--resume
: Resume from the last successful step--clean
: Clean output directories and files (e.g.,python3 run_pipeline.py --clean 1
)
If the pipeline is interrupted, resume the last successful step:
python3 run_pipeline.py --resume
-
Feature Extraction (
extract_apk_features.py
):- Extract static features (permissions, API calls, etc.) from APKs using Androguard.
- Outputs:
apk_features_updated.csv
-
Feature Dropping (
drop_irrelevant_features.py
):- Removes irrelevant features (e.g.,
file_name
,package_name
). - Outputs:
cleaned_features.csv
- Removes irrelevant features (e.g.,
-
Preprocessing (
android_malware_preprocessing.py
):- Handles missing values, outliers, and creates derived features.
- Performs feature selection and standardization.
- Outputs:
preprocessed_data_[timestamp]/
with train/test splits and visualizations.
-
Model Training (
model_comparison.py
):- Trains and evaluates multiple models (Random Forest, SVM, etc.).
- Saves trained models and evaluation metrics.
- Outputs:
trainModel/
with models (e.g.,best_model_random_forest.pkl
) and plots.
See requirements.txt
for a full list. Key packages include:
numpy
,pandas
: Data processingscikit-learn
: Machine learningmatplotlib
,seaborn
: Visualizationandroguard
: APK analysis
- Missing APKs: Ensure
benignSample/
andmalwareSample/
contain.apk
files. - Dependency Errors: Verify all packages are installed (
pip install -r requirements.txt
). - Permission Issues: Run with appropriate permissions if accessing restricted directories.
- Model Not Saved: Check
pipeline_run_*.log
for errors in the "Model Comparison" step.
Feel free to submit issues or pull requests to enhance the pipeline.
This project is unlicensed unless specified otherwise by the repository owner.