
Commit

permission fix + quarto added + rendered report added
Farhan-Faisal committed Dec 7, 2024
1 parent 75783ca commit 2a6184a
Showing 8 changed files with 7,608 additions and 6 deletions.
Binary file removed docs/.index.html.icloud
3,721 changes: 3,721 additions & 0 deletions docs/index.html


1 change: 0 additions & 1 deletion environment.yaml
@@ -22,7 +22,6 @@ dependencies:
- notebook=6.5.4
- jupyter_contrib_nbextensions=0.7.0
- quarto=1.5.57
- click=8.1.7
- tabulate=0.9.0
- pip:
- altair_ally>=0.1.1
Binary file added report/.ipynb_checkpoints/report-checkpoint.pdf
164 changes: 164 additions & 0 deletions report/.ipynb_checkpoints/report-checkpoint.qmd
@@ -0,0 +1,164 @@
---
title: Wine Chromatic Profile Prediction Project
author: "Adrian Leung, Daria Khon, Farhan Faisal & Zhiwei Zhang"
date: "2024/12/07"
format:
html:
toc: true
toc-depth: 2
embed-resources: true
fig-numbering: true
tbl-numbering: true
pdf:
toc: true
toc-depth: 2
fig-numbering: true
tbl-numbering: true
embed-resources: true
editor: source
bibliography: references.bib
jupyter: python3
execute:
echo: false
warning: false
---

```{python}
import pickle
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown, display
```


```{python}
# loading data
wine_train = pd.read_csv("../data/wine_train.csv")
wine_test = pd.read_csv("../data/wine_test.csv")
```

```{python}
info_df = pd.read_csv("../results/tables/feature_datatypes.csv")
summary_df = pd.read_csv("../results/tables/summary_statistics.csv")
```

```{python}
# Fitted preprocessing + model pipeline
with open("../results/models/wine_pipeline.pickle", "rb") as file:
    pipe = pickle.load(file)

# Cross-validation results for the default model
cross_validation_results = pd.read_csv("../results/tables/cross_validation.csv").rename(
    columns={"Unnamed: 0": "Scoring Metric"}
)

# Hyperparameter optimization (randomized search) artifacts
with open("../results/models/wine_random_search.pickle", "rb") as file:
    random_search = pickle.load(file)
random_search_results = pd.read_csv("../results/tables/random_search.csv")
best_params = random_search.best_params_["logisticregression__C"]
```

```{python}
X_test = wine_test.drop(columns = ["color"])
y_test = wine_test["color"]
best_model_accuracy = random_search.score(X_test, y_test)
```

```{python}
test_results = pd.read_csv("../results/tables/test_scores.csv")
# Extract the single value from each column before converting to float
test_accuracy = round(float(test_results["accuracy"].iloc[0]), 3)
test_precision = round(float(test_results["precision"].iloc[0]), 3)
test_recall = round(float(test_results["recall"].iloc[0]), 3)
test_f1 = round(float(test_results["F1 score"].iloc[0]), 3)
```

___
## Summary

In this project, we developed a machine learning model to predict the color of wine (red or white) using its physicochemical properties, such as acidity, pH, sugar content, and alcohol level. A logistic regression model with balanced class weights was implemented and optimized through hyperparameter tuning. The final model performed exceptionally well, achieving an accuracy of `{python} round(best_model_accuracy, 2)` on unseen test data. The precision-recall analysis indicated high precision and recall scores (above `{python} round(test_recall, 2) - 0.01`), further corroborated by the confusion matrix, which showed minimal misclassifications.

While the model demonstrated strong predictive accuracy, the near-perfect results raise potential concerns about overfitting, suggesting further evaluation on truly unseen data is necessary. This work emphasizes the potential for data-driven tools to optimize wine classification processes, offering a scalable and efficient approach for the wine industry.

___
## Introduction

Wine classification plays a crucial role in both production and quality assessment, yet traditional methods often rely on subjective evaluations by experts. This project therefore seeks to answer **whether we can accurately predict the color of wine using its physicochemical properties**.

Developing a machine learning model for wine classification has several advantages. For winemakers, it could provide a scalable method for analyzing large datasets, identifying trends, and optimizing production processes. For consumers and retailers, it could serve as a tool to verify wine characteristics without requiring advanced laboratory equipment. Through this project, we aim to contribute to the industry's adoption of data-driven approaches, enabling efficient, reproducible, and cost-effective methods for wine analysis.

___
## Methods

### Data

The dataset for this project is sourced from the UCI Machine Learning Repository [@dua2017uci] and focuses on wines from the Vinho Verde region in Portugal. It includes 11 physicochemical attributes, such as fixed acidity, volatile acidity, pH, and alcohol content, collected from 1,599 red wine samples and 4,898 white wine samples.

### Validation
We conducted several validation checks on our dataset, including assessments for duplicates, correct data types, and missing values, most of which passed successfully. However, the outlier check flagged a few variables with potential outliers. To keep the analysis straightforward, we chose not to remove these outliers for this iteration. Future iterations could explore handling these outliers more thoroughly to refine the analysis.
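
To make these checks concrete, they can be sketched in a few lines of pandas. This is an illustration of the idea rather than the project's actual validation code, and the 1.5 × IQR outlier rule is an assumed threshold:

```python
# Illustrative sketch of the validation checks described above (not the project's code)
duplicate_rows = wine_train.duplicated().sum()   # duplicate check
missing_counts = wine_train.isna().sum()         # missing-value check
column_dtypes = wine_train.dtypes                # data-type inspection

# Flag potential outliers with a simple 1.5 * IQR rule (assumed threshold)
numeric = wine_train.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
```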

### Analysis
The logistic regression algorithm was used to build a classification model to predict whether a wine sample is red or white (as defined by the `color` column in the dataset). All 11 physicochemical features in the dataset, including fixed acidity, volatile acidity, pH, and alcohol content, were used for model training. The dataset was split into 70% for the training set and 30% for the test set.

Preprocessing steps included removing duplicate entries, ordinal encoding for the `quality` feature, and standardizing all numeric features to ensure uniform scaling. A randomized search with 10-fold cross-validation was conducted, using F1 as the scoring metric, to fine-tune the regularization parameter (`C`). This process helped minimize classification bias and maximize accuracy while identifying the optimal model. Balanced class weights were employed to address potential class imbalances in the dataset.
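
For illustration only, this setup could be assembled roughly as follows. This is a hedged sketch rather than the project's actual training script: the search distribution for `C`, the random seed, and the `max_iter` value are assumptions.

```python
# Rough sketch of the preprocessing + tuning setup described above (assumed details)
from scipy.stats import loguniform
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

train = wine_train.drop_duplicates()                                # remove duplicate entries
X_train, y_train = train.drop(columns=["color"]), train["color"]
numeric_features = X_train.columns.drop("quality").tolist()         # the 11 physicochemical features

preprocessor = make_column_transformer(
    (OrdinalEncoder(), ["quality"]),        # ordinal-encode wine quality
    (StandardScaler(), numeric_features),   # standardize the numeric features
)
pipeline = make_pipeline(
    preprocessor,
    LogisticRegression(class_weight="balanced", max_iter=2000),
)

# Randomized search over C, scored with F1 using "red" as the positive class
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e3)},
    n_iter=10,
    cv=10,
    scoring=make_scorer(f1_score, pos_label="red"),
    random_state=522,
)
random_search.fit(X_train, y_train)
```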

The Python programming language [@python3reference] and the following libraries were utilized for the analysis: NumPy [@harris2020numpy] for numerical computations, Pandas [@mckinney2010pandas] for data manipulation, Altair [@vanderplas2018altair] for visualization, and scikit-learn [@pedregosa2011scikit] for model development and evaluation. The complete analysis code is available on GitHub: <https://github.com/UBC-MDS/DSCI522-2425-22-wine-quality.git>.

___
## Results

To evaluate the usefulness of each feature in predicting wine color, we visualized their distributions in the training dataset, color-coded by class (@fig-feat-dist). While most predictors showed some overlap, notable differences were observed in their central tendencies and spreads. Features like wine quality and residual sugar exhibited less distinction between classes, but we retained them, anticipating that their interactions with other features might enhance predictive power.
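
For reference, a per-class density plot in the style of @fig-feat-dist can be built with Altair roughly as follows. This is an illustrative sketch for a single feature, not the code that produced the saved figure:

```python
import altair as alt

# Sketch: per-class density for one feature, mirroring the feature-density figure
alt.Chart(wine_train).transform_density(
    "alcohol",
    as_=["alcohol", "density"],
    groupby=["color"],
).mark_area(opacity=0.5).encode(
    x=alt.X("alcohol:Q"),
    y=alt.Y("density:Q"),
    color=alt.Color("color:N"),
)
```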

We also examined multicollinearity among predictors (@fig-feat-cor), initially identifying high correlations between `alcohol` and `quality`, as well as between `total sulfur dioxide` and `free sulfur dioxide`. However, further validation using Deepchecks confirmed that none of these correlations exceeded the threshold of 0.8. As a result, all features were retained in the model to leverage potential interactions and maximize predictive insights.

![Distribution of Features per Target Class](../results/figures/feature_densities_by_class.png){#fig-feat-dist}

![Correlation between Wine Color Prediction Features](../results/figures/feature_correlation.png){#fig-feat-cor}
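
The multicollinearity screen can be approximated with a plain pandas correlation matrix. The sketch below assumes a Pearson correlation over the numeric features is comparable to the Deepchecks check, with the 0.8 threshold taken from the text above:

```python
import numpy as np

# Approximate the feature-feature correlation screen (sketch, not the Deepchecks run)
corr = wine_train.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])  # expected to be empty, consistent with the validation result
```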


```{python}
# #| label: tbl-cross-val
# #| tbl-cap: Cross Validation Results
# Markdown(cross_validation_results.to_markdown())
```

We employed a logistic regression model for our classification task, utilizing randomized search with 10 iterations for hyperparameter optimization. The primary goal was to determine the best regularization parameter (`C`), which was found to be `{python} round(best_params, 2)`, to maximize predictive performance. To evaluate model performance during the search, we used the F1 score (with "red" as the positive class) as our scoring metric. Cross-validation results for the best model are shown in @tbl-random-search.

```{python}
#| label: tbl-random-search
#| tbl-cap: Random Search Best Model Cross-Validation Results
Markdown(cross_validation_results.to_markdown())
```

```{python}
target_drift_score = pd.read_csv("../results/tables/drift_score.csv").iloc[0, 0]
```

Before evaluating model performance on the test set, we validated that the target distributions between the training and test sets were comparable using a prediction drift check. The validation was successful, with a low Prediction Drift Score of `{python} round(target_drift_score, 3)`, indicating that the model's predictions on both datasets are consistent and align with the expected target distribution.

Finally, we evaluated the model on the test set (@tbl-test-results). We also generated a confusion matrix (@fig-confusion) and a precision-recall (PR) curve (@fig-pr) to summarize the results.

```{python}
#| label: tbl-test-results
#| tbl-cap: Test Set Results
Markdown(test_results.to_markdown())
```

![Confusion Matrix](../results/figures/confusion_matrix.png){#fig-confusion}

![Precision Recall Curve](../results/figures/pr_curve.png){#fig-pr}
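
An equivalent view of these evaluation plots could be produced directly from the fitted search with scikit-learn's display helpers. The sketch below assumes the `random_search` object and the test split defined earlier in this report:

```python
from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay

# Sketch: rebuild the evaluation plots from the fitted model and the test split
ConfusionMatrixDisplay.from_estimator(random_search, X_test, y_test)
PrecisionRecallDisplay.from_estimator(random_search, X_test, y_test, pos_label="red")
```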

___
## Discussion

The confusion matrix (@fig-confusion) confirms that the number of false positives and false negatives is low, and the model achieved an impressive test accuracy of `{python} round(test_accuracy, 3)`. The precision-recall (PR) curve (@fig-pr) further demonstrates the model's robust performance across various thresholds (AP = 0.99), supported by a high test precision of `{python} test_precision` and test recall of `{python} test_recall`. These metrics collectively indicate that the model is highly effective at differentiating between wine colors.

The near-perfect results on the test data exceeded our expectations: we had anticipated more variability in performance, yet recall and precision both exceeded `{python} round(test_recall, 2) - 0.01`.

While these high scores suggest the model is likely to perform well on new data, they also raise concerns about potential overfitting. The model's exceptional performance on both the training and test sets may indicate limited generalizability to truly unseen data, warranting further evaluation to ensure robustness.

___
## References
