How to write an I/O manager that can handle BackfillPolicy.single_run
#21515
jamiedemaria
started this conversation in
General
If you write an asset that can be backfilled in a single run (i.e. with backfill_policy=BackfillPolicy.single_run()) and use I/O managers to store and load the outputs of your assets, you'll need to ensure that the I/O manager can handle single-run backfills. Some Dagster-provided I/O managers support single-run backfills, including the Snowflake, BigQuery, and DuckDB I/O managers, but others do not, including the filesystem, GCS, and S3 I/O managers. This is because how you store the outputs from a single-run backfill is highly dependent on your individual setup and storage needs, especially for file-based storage systems. This discussion will walk you through some of the decisions you will need to make when writing your I/O manager, and the tools you have available in the I/O manager.

Let's say we have two assets: one to fetch the daily temperature high in Fahrenheit and a second to convert those daily highs to Celsius.
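As a rough sketch, the two assets might look something like the following. To keep it runnable without Dagster installed, they are written as plain functions, and the comments note the `@asset` arguments they would likely carry. The `FAKE_HIGHS_F` readings are made up for illustration and stand in for a real data source.

```python
from typing import Sequence

# Hypothetical readings standing in for a real weather API call.
FAKE_HIGHS_F = {
    "2024-01-01": 41,
    "2024-01-02": 50,
    "2024-01-03": 32,
    "2024-01-04": 59,
}


# In Dagster this would be decorated with something like:
#   @asset(
#       partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
#       backfill_policy=BackfillPolicy.single_run(),
#   )
# and partition_keys would come from context.partition_keys.
def daily_high_fahrenheit(partition_keys: Sequence[str]) -> list[int]:
    """Return one Fahrenheit high per partition key, in key order."""
    return [FAKE_HIGHS_F[key] for key in partition_keys]


# Same decorator arguments; the input would be loaded by the I/O manager.
def daily_high_celsius(daily_high_fahrenheit: Sequence[int]) -> list[int]:
    """Convert each Fahrenheit high to Celsius, rounded to whole degrees."""
    return [round((f - 32) * 5 / 9) for f in daily_high_fahrenheit]
```

With a single-run backfill, one call to each function covers every date in the range, which is why the I/O manager receives a whole sequence rather than one value.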
For example, if we ran a backfill of the dates 2024-01-01, 2024-01-02, 2024-01-03, and 2024-01-04, daily_high_fahrenheit would return a sequence of four integers, and we want the I/O manager to be able to store those values. Then we want to be able to load all four integers as a list to pass to daily_high_celsius.

When designing an I/O manager to store this data, there are several use cases to consider:
- What happens if you materialize daily_high_fahrenheit for one date (e.g. 2024-01-02)?
- What happens if you materialize daily_high_celsius for one date (e.g. 2024-01-03)?

For example, if our I/O manager simply put the outputs of each asset materialization in a file, after the backfill we would have a single file containing all four Fahrenheit values, and another file containing all four Celsius values. During a materialization of a single value, as described above, we'd need to extract a single value from the Fahrenheit file, and have a system in place to store the new Celsius value and remove the value from the previous materialization.
Having a strategy for how you will handle these use cases will largely inform your I/O manager design. In this case, one option could be to return a mapping of date: temperature so that we can store each value individually. We may also want to consider a storage option other than files. A table in a database might be a better option, since we can select the rows in a range of dates.

Before we implement one of these solutions, let's go over some useful context properties you have available in the I/O manager. Three of these properties provide information about the partitions being materialized in the backfill, each in a different format. For example, if we materialized the backfill described above (2024-01-01, 2024-01-02, 2024-01-03, 2024-01-04), each property would describe those four partitions in its own format. Note that for static partitions, time_window is unavailable and will raise an exception.

Now, we will go over writing an I/O manager that will store the output values in files. Our strategy will be to return the partition key along with the temperature value so that we can store each temperature in its own file. Note that this is just one solution to this situation; you will likely need to write an I/O manager specific to your needs.
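Since the implementation leans on the partition properties described earlier, it may help to picture their shapes first. The names are not spelled out in this text; as an assumption, they are most likely `asset_partition_keys`, `asset_partition_key_range`, and `asset_partitions_time_window` on the I/O manager's input and output contexts. The sketch below does not call Dagster at all: `KeyRange`, `Window`, and `describe_daily_partitions` are hypothetical stand-ins that mimic the three formats for a run of daily partitions, with the window's end treated as exclusive.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class KeyRange(NamedTuple):
    """Stand-in for the first/last key pair of a partition range."""
    start: str
    end: str


class Window(NamedTuple):
    """Stand-in for a time window; end is exclusive."""
    start: datetime
    end: datetime


def describe_daily_partitions(keys: list[str]) -> tuple[list[str], KeyRange, Window]:
    """Express a run of consecutive daily keys in three formats."""
    ordered = sorted(keys)  # ISO dates sort chronologically as strings
    first = datetime.strptime(ordered[0], "%Y-%m-%d")
    last = datetime.strptime(ordered[-1], "%Y-%m-%d")
    return (
        ordered,                                   # list of keys
        KeyRange(ordered[0], ordered[-1]),         # first and last key
        Window(first, last + timedelta(days=1)),   # covers through the last day
    )
```

For the four-day backfill, this yields the four keys, a range from 2024-01-01 to 2024-01-04, and a window starting at midnight on 2024-01-01 and ending at midnight on 2024-01-05. The stand-in uses naive datetimes for brevity; the values Dagster reports are timezone-aware.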
First, we will update the assets to return a mapping of date: temperature. Then we can write our I/O manager.
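A minimal sketch of both steps, under stated assumptions, could look like this. It is written as plain Python so it runs without Dagster: in a real project the class would subclass Dagster's `IOManager` (whose `handle_output` and `load_input` methods it mirrors), and the contexts would be real `OutputContext` / `InputContext` objects. `PerPartitionFileIOManager`, `fake_context`, `FAKE_HIGHS_F`, `run_demo`, and the one-`.txt`-file-per-date layout are all hypothetical choices made for illustration.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Mapping, Sequence

# Hypothetical readings standing in for a real data source.
FAKE_HIGHS_F = {"2024-01-01": 41, "2024-01-02": 50, "2024-01-03": 32, "2024-01-04": 59}


def daily_high_fahrenheit(partition_keys: Sequence[str]) -> dict[str, int]:
    """Return {date: Fahrenheit high} so each value can be stored on its own."""
    return {key: FAKE_HIGHS_F[key] for key in partition_keys}


def daily_high_celsius(daily_high_fahrenheit: Mapping[str, int]) -> dict[str, int]:
    """Convert each dated Fahrenheit high to whole-degree Celsius."""
    return {key: round((f - 32) * 5 / 9) for key, f in daily_high_fahrenheit.items()}


class PerPartitionFileIOManager:
    """Stores one file per (asset, partition key) under base_dir."""

    def __init__(self, base_dir: str):
        self.base_dir = Path(base_dir)

    def _path(self, context, partition_key: str) -> Path:
        # e.g. <base_dir>/daily_high_fahrenheit/2024-01-02.txt
        return self.base_dir.joinpath(*context.asset_key.path, f"{partition_key}.txt")

    def handle_output(self, context, obj: Mapping[str, int]) -> None:
        # The asset returned {partition_key: temperature}; write each value separately.
        for partition_key, temperature in obj.items():
            path = self._path(context, partition_key)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(str(temperature))

    def load_input(self, context) -> dict[str, int]:
        # Read back only the partitions requested for this run.
        return {
            key: int(self._path(context, key).read_text())
            for key in context.asset_partition_keys
        }


def fake_context(asset_name: str, partition_keys: list[str]) -> SimpleNamespace:
    """Stand-in exposing only the two context attributes used above."""
    return SimpleNamespace(
        asset_key=SimpleNamespace(path=[asset_name]),
        asset_partition_keys=partition_keys,
    )


def run_demo() -> dict[str, int]:
    """Backfill four days in one run, redo one day, then load all four."""
    keys = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    with tempfile.TemporaryDirectory() as tmp:
        io = PerPartitionFileIOManager(tmp)
        io.handle_output(fake_context("daily_high_fahrenheit", keys), daily_high_fahrenheit(keys))
        # Re-materializing a single date overwrites only that date's file.
        io.handle_output(fake_context("daily_high_fahrenheit", ["2024-01-02"]), {"2024-01-02": 51})
        loaded = io.load_input(fake_context("daily_high_fahrenheit", keys))
    return daily_high_celsius(loaded)
```

Because every date has its own file, a single-date materialization and a four-date backfill go through the same code path: the first writes one file, the second writes four, and loading a range just reads the files for the requested keys.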