Commit 055a386 (1 parent: d746f93): update student version with curriculum book changes

Showing 6 changed files with 575,593 additions and 574,873 deletions.
@@ -1,10 +1,43 @@
# GeoSMART Curriculum Jupyter Book (ESS 469/569)

[![Deploy](https://github.com/geo-smart/mlgeo-instructor/actions/workflows/deploy.yaml/badge.svg)](https://github.com/geo-smart/mlgeo-instructor/actions/workflows/deploy.yaml)
[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://geo-smart.github.io/mlgeo-instructor)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/geo-smart/mlgeo-instructor/HEAD?urlpath=lab)
[![GeoSMART Library Badge](book/img/curricula_badge.svg)](https://geo-smart.github.io/curriculum)
[![Student Version](book/img/student_version_badge.svg)](https://geo-smart.github.io/mlgeo-book/)
## Repository Overview

This repository stores configuration for GeoSMART curriculum content, specifically the teacher version of the book. Only this version of the book should ever be edited, as the student version is automatically generated on push by GitHub Actions.
## Making Changes

Edit the book content by modifying the `_config.yml`, `_toc.yml` and `*.ipynb` files in the `book` directory. The book is hosted on GitHub Pages and is rebuilt automatically on push; the student book is also regenerated automatically on push.
Making changes requires that you set up a conda environment and build the book locally, to confirm that it will also build with GitHub Actions. We accept rendered notebooks, but some oddities, such as kernels other than Python, will make the build crash. We therefore recommend that contributors first build the book locally with any added notebooks.

```sh
conda env create -f ./conda/environment.yml
conda activate curriculum_book
jb build book
```

To modify the exact differences between this book and the student book, edit `.github/workflows/clean_book.py`. When you push, a GitHub Action clones the repo and runs this Python file, which modifies certain parts of the `*.ipynb` file contents and then pushes to the student repo. To edit the student repo's README, edit `STUDENT_README.md`; the workflow also automatically replaces `README.md` with `STUDENT_README.md` in the student repo.

### `Student Response Sections`

One modification made by the `clean_book.py` workflow is to clear sections marked for student response. Code cells marked for student response may contain code in the teacher version of the book, but their code is removed and replaced with a TODO comment in the student version.

To mark a code cell to be cleared, insert a markdown cell directly preceding it with the following content:

````markdown
```{admonition} Student response section
This section is left for the student to complete.
```
````
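
For reference, the clearing logic can be sketched as below. This is a minimal illustration working on the notebook's JSON structure, not the actual `clean_book.py` code; the marker string and TODO text are assumptions:

```python
MARKER = "Student response section"  # assumed marker text in the admonition

def clear_student_cells(nb: dict) -> dict:
    """Clear code cells that directly follow a student-response markdown cell."""
    cells = nb["cells"]
    for i, cell in enumerate(cells[:-1]):
        is_marker = (
            cell["cell_type"] == "markdown"
            and MARKER in "".join(cell["source"])
        )
        if is_marker and cells[i + 1]["cell_type"] == "code":
            # Replace the teacher's solution with a TODO stub.
            cells[i + 1]["source"] = ["# TODO: complete this section\n"]
            cells[i + 1]["outputs"] = []
    return nb
```

In the real workflow the notebook would be loaded and written back (e.g. with `json` or `nbformat`) before being pushed to the student repo.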

## Serving Locally

Activate the `curriculum_book` conda environment (or any conda environment that has the necessary Jupyter Book dependencies). Navigate to the root folder of the curriculum book repository in a terminal (e.g. Anaconda Prompt), then run `python server.py`.

On startup, the server runs `jb build book` to build all changes to the notebooks and create the compiled HTML. The server script takes a `--no-build` flag (or the `--nb` shorthand) if you don't want to rebuild changes you've made to the notebooks; in that case, run `python server.py --nb` from any terminal with Python installed.
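
The repository's `server.py` is the authoritative implementation; purely as an illustration, a minimal server with the same flags might look like this (the port, output directory, and option wiring here are assumptions):

```python
import argparse
import subprocess
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Serve the built book locally.")
    parser.add_argument("--no-build", "--nb", dest="no_build", action="store_true",
                        help="skip running `jb build book` before serving")
    return parser

def main() -> None:
    args = build_parser().parse_args()
    if not args.no_build:
        # Rebuild the compiled HTML from the notebooks.
        subprocess.run(["jb", "build", "book"], check=True)
    # Jupyter Book writes its HTML output under book/_build/html by default.
    handler = partial(SimpleHTTPRequestHandler, directory="book/_build/html")
    HTTPServer(("localhost", 8000), handler).serve_forever()
```

Calling `main()` from a `__main__` guard would build (unless `--nb` is given) and then serve the book at `http://localhost:8000`.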
@@ -0,0 +1,10 @@
# GeoSMART Curriculum Jupyter Book (ESS 469/569)

[![Deploy](https://github.com/geo-smart/mlgeo-book/actions/workflows/deploy.yaml/badge.svg)](https://github.com/geo-smart/mlgeo-book/actions/workflows/deploy.yaml)
[![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://geo-smart.github.io/mlgeo-book)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/geo-smart/mlgeo-book/HEAD?urlpath=lab)
[![GeoSMART Library Badge](book/img/curricula_badge.svg)](https://geo-smart.github.io/curriculum)

## About

This repository stores configuration for GeoSMART curriculum content, specifically the student version of the book. This version of the book should never be directly edited, as it is automatically generated on push.
book/Chapter2-DataManipulation/2.11_feature_engineering.ipynb: 902 changes (730 additions, 172 deletions). Large diffs are not rendered by default.
book/Chapter2-DataManipulation/2.20_Final_Project_Assignement.md: 114 additions, 0 deletions.

@@ -0,0 +1,114 @@
### Assignment: **Preparing AI-Ready Data for The Final Project**

#### **Objective**
This assignment focuses on organizing, cleaning, and preparing data in a form suitable for machine learning. By the end of this task, you should have an organized repository that contains the raw data, cleaned data, annotated attributes, and exploratory analysis that prepares the data for use in machine learning models.

#### **Structure of the Assignment**
1. **Project Repository Setup and Documentation**
   - **Task**: Create a public GitHub repository for the group project.
   - **Requirements**:
     - A clear and concise `README.md` file that explains:
       - The data source(s).
       - Project objectives, including the rationale for the project.
       - Instructions for setting up the environment (dependencies, packages).
       - A high-level description of each script/notebook.
     - Structure your repository using the [MLGEO guidelines](../Chapter1-GettingStarted/1.5_version_control_git.md).

2. **Data Download and Raw Data Organization**
   - **Task**: Download the raw geoscientific dataset relevant to your project and discuss the basic modalities.
   - **Requirements**:
     - Include a script or notebook (`scripts/download_data.py` or `notebooks/Download_Data.ipynb`) that downloads and verifies the dataset.
     - Ensure that the raw data is stored in a dedicated folder (`data/raw/`).
     - If applicable, document any API keys or access credentials required to obtain the data in the `README.md`.
     - Describe the data **modalities** and **formats**.
     - If applicable, describe large data archives that can be used for model inference, and their size.

3. **Basic Data Cleaning and Manipulation**
   - **Task**: Clean the raw data to handle missing values, outliers, or inconsistencies.
   - **Requirements**:
     - Write a script/notebook (`scripts/clean_data.py` or `notebooks/Data_Cleaning.ipynb`) that:
       - Handles missing values (e.g., imputation, removal).
       - Corrects or removes outliers.
       - Ensures data consistency (e.g., uniform date formatting, unit conversions).
       - Saves cleaned data in a new folder (`data/clean/`).

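A minimal sketch of the cleaning step (median imputation plus the common 1.5 × IQR outlier rule; your dataset may call for different choices):

```python
import pandas as pd

def clean_data(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Basic cleaning: impute missing values and clip outliers (IQR rule)."""
    df = df.copy()
    for col in numeric_cols:
        # Impute missing values with the column median.
        df[col] = df[col].fillna(df[col].median())
        # Clip values outside 1.5 * IQR beyond the quartiles (a common outlier rule).
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df
```

Whatever rules you choose, document them in the notebook so the cleaned data in `data/clean/` is reproducible from the raw data.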
4. **Organizing Data into AI-Ready Format**
   - **Task**: Prepare the cleaned data for machine learning, ensuring it is properly annotated and structured.
   - **Requirements**:
     - Convert your data into a format suitable for ML (e.g., a pandas DataFrame, NumPy arrays, or Xarray).
     - Ensure the data is well-documented with attributes, labels, and metadata.
     - Include a notebook (`notebooks/Prepare_AI_Ready_Data.ipynb`) that clearly describes:
       - The final shape of the data (number of samples, features, and target labels).
       - A description of each feature/attribute.
     - Save the final AI-ready data in a dedicated folder (`data/ai_ready/`).

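One possible way to persist the AI-ready arrays together with their metadata (the file names and metadata fields here are illustrative, not prescribed by the assignment):

```python
import json
import numpy as np
from pathlib import Path

def save_ai_ready(X: np.ndarray, y: np.ndarray, feature_names: list[str],
                  out_dir: Path) -> None:
    """Save the feature matrix, targets, and metadata under data/ai_ready/."""
    assert X.shape[0] == y.shape[0], "one target per sample"
    assert X.shape[1] == len(feature_names), "one name per feature"
    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / "X.npy", X)
    np.save(out_dir / "y.npy", y)
    metadata = {"n_samples": int(X.shape[0]),
                "n_features": int(X.shape[1]),
                "feature_names": feature_names}
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
```

Keeping the metadata next to the arrays makes the "final shape" and feature descriptions easy to verify later.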
5. **Exploratory Data Analysis (EDA)**
   - **Task**: Perform a basic exploration of the cleaned data to understand its structure and key characteristics.
   - **Requirements**:
     - Create a notebook (`notebooks/EDA.ipynb`) that includes:
       - Basic summary statistics of the dataset (mean, variance, min, max, etc.).
       - Visualization of feature distributions (histograms, box plots, etc.).
       - Correlation analysis between different features and target variables (correlation matrix, heatmaps).
       - A brief discussion of any patterns or insights observed during the analysis.

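The tabular part of the EDA can start from something as small as this (plots, e.g. `df.hist()` with matplotlib, would accompany it in the notebook):

```python
import pandas as pd

def eda_summary(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return summary statistics and a correlation matrix for numeric columns."""
    numeric = df.select_dtypes("number")
    stats = numeric.describe()   # count, mean, std, min, quartiles, max
    corr = numeric.corr()        # Pearson correlation between numeric columns
    return stats, corr
```

The correlation matrix is a natural input for a heatmap, and `describe()` covers the required summary statistics in one call.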
6. **Dimensionality Discussion and Reduction**
   - **Task**: Analyze the dimensionality of your dataset and propose methods to reduce it.
   - **Requirements**:
     - In a notebook (`notebooks/Dimensionality_Reduction.ipynb`):
       - Discuss the current dimensions of the dataset and any challenges they present (e.g., high dimensionality, sparse data).
       - Propose and implement at least two dimensionality reduction techniques:
         - Feature extraction techniques like PCA (Principal Component Analysis).
         - Non-linear methods like t-SNE (t-Distributed Stochastic Neighbor Embedding).
       - Visualize the results of dimensionality reduction (scatter plots, explained variance charts).
       - Discuss the implications of dimensionality reduction on your dataset.

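As a starting point for the PCA requirement, the projection and explained variance can be computed with plain NumPy via the SVD (in practice scikit-learn's `PCA` and `TSNE` are the usual tools):

```python
import numpy as np

def pca(X: np.ndarray, n_components: int) -> tuple[np.ndarray, np.ndarray]:
    """Project X onto its first n_components principal components.

    Returns the projected data and the explained-variance ratio per component.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T       # projection onto the components
    explained = (S ** 2) / np.sum(S ** 2)   # fraction of total variance
    return scores, explained[:n_components]
```

The explained-variance ratios feed directly into the required explained-variance chart, and the scores give the 2-D scatter plot.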
#### **Deliverables**
- A GitHub repository with the following structure:
  ```
  - data/
    - raw/
    - clean/
    - ai_ready/
  - scripts/
    - download_data.py
    - clean_data.py
  - notebooks/
    - Download_Data.ipynb
    - Data_Cleaning.ipynb
    - Prepare_AI_Ready_Data.ipynb
    - EDA.ipynb
    - Dimensionality_Reduction.ipynb
  - README.md
  ```
- Ensure all scripts and notebooks are well-documented, with comments explaining the code.
- Submit a link to your GitHub repository as your final assignment.

#### **Grading Criteria**
- **Repository Organization (10%)**: Clean structure with appropriate directories; well-documented `README.md`.
- **Data Download and Cleaning (20%)**: Script functionality, handling of missing/outlier data, clean data format.
- **AI-Ready Data Preparation (20%)**: Proper data annotation, clear dimensionality description, format suitability for ML.
- **Exploratory Data Analysis (20%)**: Quality of statistical analysis, insights, and visualizations.
- **Dimensionality Reduction (20%)**: Quality of analysis, use of techniques, and discussion of dimensionality challenges.
- **Documentation and Code Clarity (10%)**: Clear explanations and code readability.

## Required Self-Evaluation

* You should use ChatGPT (GPT-4o is best as of 2024) for self-assessment. You may use the following prompt:
  ```
  Can you grade the following repository <ENTER-URL> with the following rubric "Repository Organization (10%): Clean structure with appropriate directories, well-documented README.md.
  Data Download and Cleaning (20%): Script functionality, handling missing/outlier data, clean data format.
  AI-Ready Data Preparation (20%): Proper data annotation, clear dimensionality description, format suitability for ML.
  Exploratory Data Analysis (20%): Quality of statistical analysis, insights, and visualizations.
  Dimensionality Reduction (20%): Quality of analysis, use of techniques, and discussion on dimensionality challenges.
  Documentation and Code Clarity (10%): Clear explanations and code readability."
  ```
  It may also provide additional feedback to improve on if you use prompts like:
  ```
  Can you please provide more constructive feedback to improve?
  ```

* **Print & upload the reports** to Canvas to show 1) the initial assessment, and 2) the final assessment.