
Using ChatGPT to generate Kedro pipeline for a Kaggle Dataset #90

Open
yetudada opened this issue Jun 9, 2023 · 1 comment
Labels
Blog post creation Blog posts (ideas and execution)

Comments


yetudada commented Jun 9, 2023

Description

This task documents a blog post we could explore: using ChatGPT to generate a Kedro pipeline for a Kaggle dataset. Here is the prompt I've used to do this:

Can you please create a step-by-step tutorial that uses Kedro to create a data preprocessing pipeline? 

Additional requirements: 
- You must use the `kedro pipeline create` command to create the files for the pipeline instead of recommending that users create their own `nodes.py` and `pipeline.py` files.
- I want to see example code for the Python functions
- You must also recommend use of a `requirements.txt` file
- You do not need to create a virtual environment
- Use Kedro 0.18.6

The dataset to be used is on Kaggle: https://www.kaggle.com/datasets/datazng/telecom-company-churn-rate-call-center-data
@yetudada yetudada added the Blog post creation Blog posts (ideas and execution) label Jun 9, 2023
@astrojuanlu

Output from ChatGPT 4:

Step-by-Step Tutorial: Creating a Data Preprocessing Pipeline with Kedro

In this tutorial, we will walk through the process of setting up a data preprocessing pipeline using Kedro, a development workflow framework that helps structure reproducible, scalable, and maintainable data science code. We will use a telecom company churn rate dataset from Kaggle.

Prerequisites:

  • Python installed on your system
  • Kedro version 0.18.6
  • Access to the dataset from Kaggle

Step 1: Install Kedro

First, ensure you have Kedro installed. You can install Kedro using pip:

pip install kedro==0.18.6

Step 2: Create a New Kedro Project

Create a new directory for your project and navigate into it:

mkdir kedro_telecom
cd kedro_telecom

Now, create a new Kedro project:

kedro new --starter=pandas-iris

Follow the prompts to set up the project. For this tutorial, you can name the project "Telecom Churn Analysis".

Step 3: Set Up Your requirements.txt

Create a requirements.txt file in the root of your project directory and add the following dependencies:

kedro==0.18.6
pandas
scikit-learn

Install the requirements:

pip install -r requirements.txt

Step 4: Download and Configure the Dataset

Download the dataset from Kaggle and place it in the data/01_raw directory of your Kedro project.

Step 5: Create a Pipeline

Use Kedro CLI to create a new pipeline for data preprocessing:

kedro pipeline create data_preprocessing

This command creates a new pipeline in src/telecom_churn_analysis/pipelines/data_preprocessing.

Step 6: Define Nodes in the Pipeline

Navigate to src/telecom_churn_analysis/pipelines/data_preprocessing/nodes.py and define the functions for your data preprocessing steps. For example:

import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_data(data_path: str) -> pd.DataFrame:
    """Loads data from the raw data directory."""
    return pd.read_csv(data_path)

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans the DataFrame by handling missing values, etc."""
    df.dropna(inplace=True)
    return df

def scale_features(df: pd.DataFrame) -> pd.DataFrame:
    """Scales numerical features using StandardScaler."""
    scaler = StandardScaler()
    numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
    df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
    return df

Step 7: Build the Pipeline

In src/telecom_churn_analysis/pipelines/data_preprocessing/pipeline.py, import the nodes and build the pipeline:

from kedro.pipeline import Pipeline, node
from .nodes import load_data, clean_data, scale_features

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=load_data,
                inputs="params:raw_data_path",
                outputs="preprocessed_data",
                name="load_data_node"
            ),
            node(
                func=clean_data,
                inputs="preprocessed_data",
                outputs="cleaned_data",
                name="clean_data_node"
            ),
            node(
                func=scale_features,
                inputs="cleaned_data",
                outputs="scaled_data",
                name="scale_features_node"
            ),
        ]
    )

Step 8: Configure Data Paths and Parameters

In conf/base/parameters.yml, specify the path to your raw data:

raw_data_path: data/01_raw/your_dataset.csv

Step 9: Run the Pipeline

Navigate back to the root of your Kedro project and run the pipeline:

kedro run

Step 10: Review the Results

Check the outputs in the data directory under the appropriate processing folder to see the results of your pipeline.

This tutorial outlines the basic setup and steps to get started with a data preprocessing pipeline using Kedro. You can expand this pipeline by adding more complex data transformations, integrating different data sources, or creating additional pipelines for data analysis and modeling.

Looks like, to this day, ChatGPT still doesn't understand Kedro very well.
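For contrast: idiomatic Kedro wouldn't use a `load_data` node at all. The Data Catalog declares the raw CSV and the intermediate outputs, `kedro run` handles all the I/O, and nodes only transform in-memory DataFrames. A minimal `conf/base/catalog.yml` sketch for Kedro 0.18.x (the entry names and file paths are illustrative, not from the dataset):

```yaml
# conf/base/catalog.yml (sketch; entry names and paths are illustrative)
raw_churn_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/telecom_churn.csv

cleaned_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/cleaned_data.csv
```

With this in place, `clean_data` would take `raw_churn_data` as its input directly, which is exactly the part the generated tutorial gets wrong: its `load_data` node reads the CSV inside a node and passes the file path in as a parameter, bypassing the catalog entirely.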

@astrojuanlu astrojuanlu removed this from the Developer Advocacy and Messaging milestone Jun 5, 2024