This mini competition is adapted from the Kaggle Rossmann challenge.
Goal: Predict Sales of a given Rossmann store on a given day.
Input: the train.csv and store.csv files. A description of the information contained in these files can be found on Kaggle.
Output: a submission.csv file generated from the holdout.csv file, with two columns: (store) ID and Sales prediction.
Sina Rampe
Oskar Klaja
Aloïs Villa
- From your terminal, clone the repository:
With SSH: git clone [email protected]:alovg/DSR_minicomp.git
With HTTPS: git clone https://github.com/alovg/DSR_minicomp.git
- Create environment. The compatible Python version is 3.8.12.
- run
conda create -n minicomp python=3.8.12 pip ipykernel
- install Jupyter Notebook in the minicomp environment with
pip3 install jupyter
- run
python -m ipykernel install --user --name minicomp --display-name "minicomp"
- Get requirements by running
pip3 install -r requirements.txt
- Run file
- from the cloned repository in your terminal, run
jupyter notebook
- navigate to xgb_group1.ipynb in the browser
- select the "minicomp" kernel
- run all cells (from the taskbar: Run > Run All Cells)
You should now have a submission.csv file in the same folder.
Once cloned, the structure of the folder should look like:
DSR_minicomp
├── data
├── development_files
├── visualisations
├── .gitignore
├── README.md
├── requirements.txt
└── xgb_group1.ipynb
The data folder should look like:
data
├── holdout_b29.csv
├── store.csv
└── train.csv
The predictions were created with a gradient-boosted tree model (XGBoost).
For the data exploration, pandas-profiling was used. The exported HTML profiles can be found in the visualisations folder.
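As a rough, hedged sketch of the modelling step (the file names, feature list, and hyperparameters below are illustrative assumptions, not the notebook's tuned values):

```python
import pandas as pd
import xgboost as xgb

# Hypothetical pre-processed files: the real notebook builds these frames
# from train.csv / store.csv with the features described below.
train = pd.read_csv("data/train_prepared.csv")
holdout = pd.read_csv("data/holdout_prepared.csv")

FEATURES = [
    "DayOfWeek", "SchoolHoliday", "CompetitionDistance", "Promo2", "Open",
    "Promo", "Store", "Store_encoded", "StoreType", "Assortment",
    "StateHoliday", "Year", "Month", "Day", "WeekOfYear", "PromoMonth",
]

# Illustrative hyperparameters, not the tuned values from the notebook.
model = xgb.XGBRegressor(n_estimators=300, max_depth=8, learning_rate=0.1)
model.fit(train[FEATURES], train["Sales"])

# Write the two-column submission described above.
pd.DataFrame({
    "Id": holdout["Store"],
    "Sales": model.predict(holdout[FEATURES]),
}).to_csv("submission.csv", index=False)
```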
Existing features (used as is):
'DayOfWeek', 'SchoolHoliday', 'CompetitionDistance', 'Promo2', 'Open', 'Promo', 'Store'.
Existing features (encoded):
'Store_encoded', 'StoreType', 'Assortment', 'StateHoliday'.
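The exact encoding scheme lives in the notebook; as a sketch under assumptions (label encoding for the small categorical columns, a mean-target encoding for the store id), it could look like:

```python
import pandas as pd

# Assumption: train.csv and store.csv are merged on the store id.
df = pd.read_csv("data/train.csv", low_memory=False).merge(
    pd.read_csv("data/store.csv"), on="Store"
)

# Label-encode the small categorical columns ('a', 'b', ... -> integers).
for col in ["StoreType", "Assortment", "StateHoliday"]:
    df[col] = df[col].astype(str).astype("category").cat.codes

# One possible 'Store_encoded': mean Sales per store on the training data.
df["Store_encoded"] = df["Store"].map(df.groupby("Store")["Sales"].mean())
```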
Engineered features:
'Year', 'Month', 'Day', 'WeekOfYear', 'PromoMonth'.
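Continuing with the merged df from the sketch above, the date-derived features can be built with pandas; the exact definition of 'PromoMonth' is not documented here, so the version below (whether the row's month falls in the store's Promo2 interval) is a guess:

```python
import pandas as pd

date = pd.to_datetime(df["Date"])
df["Year"] = date.dt.year
df["Month"] = date.dt.month
df["Day"] = date.dt.day
df["WeekOfYear"] = date.dt.isocalendar().week.astype(int)

# Hypothetical PromoMonth: 1 if the row's month abbreviation appears in the
# store's PromoInterval string (e.g. "Feb,May,Aug,Nov"), else 0.
month_abbr = date.dt.strftime("%b")
df["PromoMonth"] = [
    int(isinstance(interval, str) and month in interval)
    for month, interval in zip(month_abbr, df["PromoInterval"])
]
```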
Below are the features ranked by importance.
On Kaggle, this model was evaluated with the root mean square percentage error (RMSPE).
RMSPE = 0.164
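For reference, the metric as defined for the Kaggle Rossmann competition is

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$

where $y_i$ is a store's actual sales on a given day (days with zero sales are ignored in the score) and $\hat{y}_i$ is the corresponding prediction.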
The following attempts did not improve the overall performance of the model:
Models:
- Mean Lazy Estimator
- Linear Regression
- Random Forest Regression
- Tested Gradient Boosted Tree (XGBoost) parameters on a validation set
Feature Engineering:
- one-hot encoding all features
- target encoding all features
- a combination of the two encodings above
- cyclic encoding with sine and cosine for DayOfWeek, Day, Month, and Year (see the sketch after this list)
- upsampling Store and Assortment (see figures 2 and 3 below)
- imputing missing values for CompetitionDistance with 75 km, the maximum value of the feature (see figure 1 below)
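For reference, the cyclic encoding that was tried can be sketched as follows (shown for Month; the same pattern applies to the other columns), reusing the df from the sketches above:

```python
import numpy as np

# Map Month onto the unit circle so December and January end up adjacent.
df["Month_sin"] = np.sin(2 * np.pi * df["Month"] / 12)
df["Month_cos"] = np.cos(2 * np.pi * df["Month"] / 12)
```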