[du] consistency issues #26416
Changes from 1 commit
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -10,8 +10,8 @@ An asset is an object in persistent storage that captures some understanding of

 - **A database table or view**, such as those in a Google BigQuery data warehouse
 - **A file**, such as a file in your local machine or blob storage like Amazon S3
-- **A machine learning model**
-- **An asset from an integration,** like a dbt model or a Fivetran connector
+- **A machine learning model**, such as TensorFlow or PyTorch
Adding in example since it was the only bullet point without one
+- **An asset from an integration,** such as a dbt model or a Fivetran connector

 Assets aren’t limited to just the objects listed above - these are just some common examples.

Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -30,7 +30,7 @@ The asset you built should look similar to the following code. Click **View answ
     deps=["taxi_zones_file"]
 )
 def taxi_zones() -> None:
-    sql_query = f"""
+    query = f"""
Standardizing around
         create or replace table zones as (
             select
                 LocationID as zone_id,
@@ -42,5 +42,5 @@ def taxi_zones() -> None:
     """

     conn = duckdb.connect(os.getenv("DUCKDB_DATABASE"))
-    conn.execute(sql_query)
+    conn.execute(query)
 ```
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -25,7 +25,7 @@ Now that you have a query that produces an asset, let’s use Dagster to manage
     """
     The raw taxi trips dataset, loaded into a DuckDB database
     """
-    sql_query = """
+    query = """
         create or replace table trips as (
             select
                 VendorID as vendor_id,
@@ -43,7 +43,7 @@ Now that you have a query that produces an asset, let’s use Dagster to manage
     """

     conn = duckdb.connect(os.getenv("DUCKDB_DATABASE"))
-    conn.execute(sql_query)
+    conn.execute(query)
 ```

 Let’s walk through what this code does:
@@ -52,13 +52,13 @@ Now that you have a query that produces an asset, let’s use Dagster to manage

    2. The `taxi_trips_file` asset is defined as a dependency of `taxi_trips` through the `deps` argument.

-   3. Next, a variable named `sql_query` is created. This variable contains a SQL query that creates a table named `trips`, which sources its data from the `data/raw/taxi_trips_2023-03.parquet` file. This is the file created by the `taxi_trips_file` asset.
+   3. Next, a variable named `query` is created. This variable contains a SQL query that creates a table named `trips`, which sources its data from the `data/raw/taxi_trips_2023-03.parquet` file. This is the file created by the `taxi_trips_file` asset.

    4. A variable named `conn` is created, which defines the connection to the DuckDB database in the project. To do this, it uses the `.connect` method from the `duckdb` library, passing in the `DUCKDB_DATABASE` environment variable to tell DuckDB where the database is located.

       The `DUCKDB_DATABASE` environment variable, sourced from your project’s `.env` file, resolves to `data/staging/data.duckdb`. **Note**: We set up this file in Lesson 2 - refer to this lesson if you need a refresher. If this file isn’t set up correctly, the materialization will result in an error.

-   5. Finally, `conn` is paired with the DuckDB `execute` method, where our SQL query (`sql_query`) is passed in as an argument. This tells the asset that, when materializing, to connect to the DuckDB database and execute the query in `sql_query`.
+   5. Finally, `conn` is paired with the DuckDB `execute` method, where our SQL query (`query`) is passed in as an argument. This tells the asset that, when materializing, to connect to the DuckDB database and execute the query in `query`.

 3. Save the changes to the file.

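Step 4 of the walkthrough depends on `DUCKDB_DATABASE` being set. As a minimal sketch, assuming a hypothetical helper named `resolve_database_path` (it is not part of the course project, which calls `os.getenv` inline), the lookup could fail early with a clear message instead of handing `None` to the connection call:

```python
import os


def resolve_database_path(var_name: str = "DUCKDB_DATABASE") -> str:
    """Return the database path stored in an environment variable.

    Hypothetical helper for illustration only; the course code calls
    os.getenv("DUCKDB_DATABASE") directly.
    """
    path = os.getenv(var_name)  # None when the variable is not set
    if not path:
        raise RuntimeError(f"{var_name} is not set; check the project's .env file")
    return path


if __name__ == "__main__":
    # The value the walkthrough says the variable resolves to.
    os.environ["DUCKDB_DATABASE"] = "data/staging/data.duckdb"
    print(resolve_database_path())
```

Whether a check like this belongs in the lesson is a judgment call; it only makes the "file isn’t set up correctly" failure easier to recognize.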
@@ -98,9 +98,9 @@ This is because you’ve told Dagster that taxi_trips depends on the taxi_trips_
 To confirm that the `taxi_trips` asset materialized properly, you can access the newly made `trips` table in DuckDB. In a new terminal session, open a Python REPL and run the following snippet:

 ```python
-> import duckdb
-> conn = duckdb.connect(database="data/staging/data.duckdb") # assumes you're writing to the same destination as specified in .env.example
-> conn.execute("select count(*) from trips").fetchall()
+import duckdb
Removing
+conn = duckdb.connect(database="data/staging/data.duckdb") # assumes you're writing to the same destination as specified in .env.example
+conn.execute("select count(*) from trips").fetchall()
 ```

 The command should succeed and return a row count of the taxi trips that were ingested. When finished, make sure to stop the terminal process before continuing or you may encounter an error. Use `Control+C` or `Command+C` to stop the process.
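The pattern in the snippet above (build a query string, open a connection, call `execute`, then `fetchall`) can be tried without the project database. The sketch below swaps in Python's built-in `sqlite3` module as a stand-in for DuckDB, using an in-memory database and made-up rows, so the table contents and the resulting count are illustrative only:

```python
import sqlite3

# Stand-in for duckdb.connect(...): an in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# Made-up rows that play the role of the ingested trips table.
conn.execute("create table trips (vendor_id integer)")
conn.executemany("insert into trips values (?)", [(1,), (2,), (1,)])

query = "select count(*) from trips"
rows = conn.execute(query).fetchall()

# fetchall() returns a list of row tuples, so the count sits at rows[0][0].
row_count = rows[0][0]
print(row_count)
conn.close()
```

With DuckDB the result should have the same shape: a list containing a single tuple that holds the count.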
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -63,7 +63,7 @@ To add the partition to the asset:
 @asset(
     partitions_def=monthly_partition
 )
-def taxi_trips_file(context) -> None:
+def taxi_trips_file(context: AssetExecutionContext) -> None:
First example snippet uses type annotation. Including for consistency
     partition_date_str = context.partition_key
 ```

@@ -73,7 +73,7 @@ To add the partition to the asset:
 @asset(
     partitions_def=monthly_partition
 )
-def taxi_trips_file(context) -> None:
+def taxi_trips_file(context: AssetExecutionContext) -> None:
     partition_date_str = context.partition_key
     month_to_fetch = partition_date_str[:-3]
 ```

@@ -86,7 +86,7 @@ from ..partitions import monthly_partition
 @asset(
     partitions_def=monthly_partition
 )
-def taxi_trips_file(context) -> None:
+def taxi_trips_file(context: AssetExecutionContext) -> None:
     """
     The raw parquet files for the taxi trips dataset. Sourced from the NYC Open Data portal.
     """

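The `[:-3]` slice in the second hunk drops the last three characters of the partition key. A minimal sketch, assuming the key is a `YYYY-MM-DD` string such as `2023-03-01` (the key format is not shown in this diff) and using a hypothetical helper name:

```python
def month_from_partition_key(partition_date_str: str) -> str:
    """Trim a YYYY-MM-DD partition key down to YYYY-MM.

    Hypothetical helper for illustration; the course code applies the
    slice inline inside the asset function.
    """
    return partition_date_str[:-3]


print(month_from_partition_key("2023-03-01"))  # 2023-03
```

If the partition key used a different date format, the slice would need to change accordingly.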
Adding in `uv` as an option since we use that in other places in the repo and it was what I used to complete the class.

I would capitalize "poetry" here for consistency and add a comma after it.