Snowflake for Data Science

Getting Started

Although we have recorded videos, we are constantly upgrading and adding to this repo, so the videos may differ slightly from its current contents. Overall they cover the same material, and we will continue to upload new videos as additions are made.

Configuration Setup

  1. Create a .env file and populate it with your account details:

    SNOWFLAKE_ACCOUNT = abc123.us-east-1
    SNOWFLAKE_USER = username
    SNOWFLAKE_PASSWORD = yourpassword
    SNOWFLAKE_ROLE = sysadmin
    SNOWFLAKE_WAREHOUSE = compute_wh
    SNOWFLAKE_DATABASE = snowpark
    SNOWFLAKE_SCHEMA = titanic
    
  2. Use the environment.yml file to set up your Python environment for the demo:

    • Example terminal commands:
      • conda env create -f environment.yml
      • micromamba create -f environment.yml -y
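
Once the .env file is in place and the environment is created, the notebooks open a Snowpark session from those variables. A minimal sketch, assuming python-dotenv and snowflake-snowpark-python are installed (the variable names mirror the .env example above):

    import os
    from dotenv import load_dotenv                # python-dotenv
    from snowflake.snowpark import Session

    load_dotenv()                                 # read the .env file into os.environ

    connection_parameters = {
        "account":   os.environ["SNOWFLAKE_ACCOUNT"],
        "user":      os.environ["SNOWFLAKE_USER"],
        "password":  os.environ["SNOWFLAKE_PASSWORD"],
        "role":      os.environ["SNOWFLAKE_ROLE"],
        "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
        "database":  os.environ["SNOWFLAKE_DATABASE"],
        "schema":    os.environ["SNOWFLAKE_SCHEMA"],
    }

    session = Session.builder.configs(connection_parameters).create()
    print(session.sql("SELECT CURRENT_WAREHOUSE()").collect())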

Why we partner with Anaconda


Review of distributed hyperparameter tuning benefits

Local run time: 8 min 27 seconds


Snowflake ML run time: 1 min 17 seconds (a ~6.5x speedup, leveraging a Large warehouse)

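The speedup comes from pushing the cross-validated grid search into the warehouse instead of running it on a laptop. A hedged sketch of what that looks like with snowflake-ml-python's GridSearchCV (the parameter grid and column names are illustrative, and the data is assumed to already be numeric/encoded):

    from snowflake.ml.modeling.model_selection import GridSearchCV
    from snowflake.ml.modeling.xgboost import XGBClassifier

    # Assumes an open Snowpark session and a prepared TITANIC table.
    train_df, test_df = session.table("TITANIC").random_split([0.8, 0.2], seed=42)
    feature_cols = [c for c in train_df.columns if c != "SURVIVED"]

    # The search executes inside Snowflake, so cross-validation folds and
    # parameter combinations run in parallel; a larger warehouse gives more parallelism.
    grid_search = GridSearchCV(
        estimator=XGBClassifier(),
        param_grid={                      # illustrative grid, not the notebook's exact values
            "n_estimators": [100, 200, 400],
            "max_depth": [3, 5, 7],
            "learning_rate": [0.05, 0.1, 0.3],
        },
        input_cols=feature_cols,
        label_cols=["SURVIVED"],
        output_cols=["PREDICTION"],
    )
    grid_search.fit(train_df)
    print(grid_search.to_sklearn().best_params_)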

Data Processing & ML Operations

Load & Transform Data

Execute the load_data notebook to accomplish the following (a condensed sketch follows the list):

  • Load the Titanic dataset from Seaborn, convert the column names to uppercase, and save it as a CSV
  • Upload the CSV file to a Snowflake Internal Stage
  • Create a Snowpark DataFrame from the staged CSV
  • Write the Snowpark DataFrame to Snowflake as a table
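
A condensed sketch of those steps, assuming an open Snowpark session named session and an internal stage named TITANIC (the notebook's actual names, options, and uppercasing step may differ):

    import seaborn as sns

    # Load the Titanic dataset locally and uppercase the column names
    titanic = sns.load_dataset("titanic")
    titanic.columns = [c.upper() for c in titanic.columns]
    titanic.to_csv("titanic.csv", index=False)

    # Upload the CSV to an internal stage
    session.sql("CREATE STAGE IF NOT EXISTS TITANIC").collect()
    session.file.put("titanic.csv", "@TITANIC", auto_compress=False, overwrite=True)

    # Create a Snowpark DataFrame from the staged CSV
    # (CSV schema inference requires a reasonably recent snowpark-python)
    df = (
        session.read.option("INFER_SCHEMA", True)
        .option("PARSE_HEADER", True)
        .csv("@TITANIC/titanic.csv")
    )

    # Persist it as a table
    df.write.mode("overwrite").save_as_table("TITANIC")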

Machine Learning Operations (snowml)

In the snowml notebook (a sketch of the core steps follows the list):

  • Generate a Snowpark DataFrame from the Titanic table
  • Validate and handle null values
  • Remove columns with high null counts and highly correlated columns
  • Adjust the Fare datatype and impute categorical nulls
  • One-hot encode categorical values
  • Split the data into train and test sets
  • Train an XGBoost classifier with hyperparameter tuning
  • Conduct predictions on the test set
  • Display Accuracy, Precision, and Recall metrics
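
A hedged sketch of the core modeling steps using snowflake-ml-python's sklearn-style APIs (column names are illustrative, and the null-handling and column-dropping steps above are omitted for brevity):

    from snowflake.ml.modeling.preprocessing import OneHotEncoder
    from snowflake.ml.modeling.xgboost import XGBClassifier
    from snowflake.ml.modeling.metrics import accuracy_score, precision_score, recall_score

    df = session.table("TITANIC")

    # One-hot encode the categorical columns (illustrative column names)
    encoder = OneHotEncoder(
        input_cols=["SEX", "EMBARKED"],
        output_cols=["SEX", "EMBARKED"],
        drop_input_cols=True,
        sparse=False,
    )
    df = encoder.fit(df).transform(df)

    # Split into train and test Snowpark DataFrames
    train_df, test_df = df.random_split([0.8, 0.2], seed=42)

    # Train the classifier inside Snowflake
    feature_cols = [c for c in train_df.columns if c != "SURVIVED"]
    clf = XGBClassifier(input_cols=feature_cols, label_cols=["SURVIVED"], output_cols=["PREDICTION"])
    clf.fit(train_df)

    # Predict on the held-out set and report metrics
    preds = clf.predict(test_df)
    print(accuracy_score(df=preds, y_true_col_names="SURVIVED", y_pred_col_names="PREDICTION"))
    print(precision_score(df=preds, y_true_col_names="SURVIVED", y_pred_col_names="PREDICTION"))
    print(recall_score(df=preds, y_true_col_names="SURVIVED", y_pred_col_names="PREDICTION"))

Hyperparameter tuning follows the same pattern as the GridSearchCV sketch shown earlier.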

Advanced MLOps with Live/Batch Inference & Streamlit

After completing the load_data steps, use the deployment notebook to do the following (a sketch of the registration and batch-inference steps follows the list):

  • Create a Snowpark DataFrame from the Titanic table
  • Assess and eliminate columns with high null counts and correlated columns
  • Adjust Fare datatype and handle categorical nulls
  • One-hot encode categorical values
  • Split the data into train and test sets
  • Train an XGBoost classifier, optimizing with grid search
  • Display model accuracy and best parameters
  • Register the model in the model registry
  • Deploy the model as a vectorized UDF (User Defined Function)
  • Execute batch predictions on a table
  • Perform real-time predictions using Streamlit for interactive inference
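
A hedged sketch of the registration and batch-inference portion, using the snowflake.ml.registry API (model and version names are placeholders; the notebook may instead deploy an explicit vectorized UDF or use an older registry interface depending on the snowflake-ml-python version):

    from snowflake.ml.registry import Registry

    # Register the trained model (clf from the training step) in the model registry
    reg = Registry(session=session)
    model_version = reg.log_model(
        clf,
        model_name="TITANIC_XGB",        # placeholder name
        version_name="V1",
        comment="XGBoost survival classifier from the deployment notebook",
    )

    # Batch inference: run the registered model's predict function over a table
    scored = model_version.run(session.table("TITANIC"), function_name="predict")
    scored.write.mode("overwrite").save_as_table("TITANIC_SCORED")

For interactive inference, the Streamlit app can collect passenger features from the user and call the same registered model (or its UDF) for a real-time prediction.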

About

Introduction to performing Machine Learning on Snowflake
