Replies: 1 comment
Hi @sryza , I really like this idea (and would definitely be an early adopter). I had a chat with @chenbobby about this yesterday, so I'd like to add my 2 cents about my use cases, some suggestions for making it user friendly, and some (perhaps long-shot) ideas for version/dependency inference.

I use Dagster to drive an ML pipeline (currently only in the model development phase, but at some point potentially encapsulating deployment as well). My main interest is defining the data flow to a set of candidate models and collecting and comparing their performance (including some preprocessing, splitting steps, etc.). I'd like to be sure that the model artifacts (pickle files of the models) and the results that the pipeline produces are always up to date with respect to the upstream data and the code called by solid definitions.

The easiest way to guarantee this is to rerun the pipeline, and the memoized development you mentioned would make that much faster. The way I originally managed this was through a Makefile, using the approach discussed here: http://zmjones.com/make/
The major drawbacks of this are:
My current workaround with Dagster for similar behaviour to a Makefile is as follows:

```python
@solid(
    output_defs=[OutputDefinitionCached(filename='data/subset')]
)
@cache
def example_solid(context, data) -> IntermediatePickle:
    return create_subset(data)
```
Pros of this approach:
Cons of this approach:
Some long shot suggestions:
An alternative to this may be to use something like
Suggestions for user friendliness:
I'm happy to provide some of the code I've used, if you think it would be useful, or to discuss this further (and potentially contribute if there's scope for that).
Motivation
If Dagster, when it stores a result, also stores information that uniquely identifies the computation that produced it, then it can compare that information against the corresponding information for unexecuted computations to determine whether executing them would produce a new result.
Intended Uses
General memoized re-execution - When I’m executing a pipeline, if a step would produce a result identical to one produced by a step in a previous run, I’d like to reuse that earlier result as the input to subsequent steps instead of re-executing the step.
Memoized development - When I’m iterating on a pipeline in development, I want to avoid manually keeping track of which steps I need to re-run after I make a change.
Version-based backfills - When I update code, data, or configuration that an asset depends on, I want to automatically backfill the asset to reflect the change.
Where might versions come from?
Versioning definitions:
Versioning external inputs:
Version relationships
API proposal
Solids: add a version attr to SolidDefinition
Hardcoding a version for a solid definition:
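For illustration, hardcoding might look like the sketch below. The `version` argument is the API proposed here, and the `solid` decorator is a minimal stand-in (so the snippet is self-contained without Dagster); the real decorator would attach the version to the `SolidDefinition` it builds:

```python
def solid(version=None, **kwargs):
    # Stand-in for the proposed @solid(version=...) API; illustrative only.
    def wrap(fn):
        fn.version = version
        return fn
    return wrap

@solid(version="1")
def example_solid(context, data):
    # Bump the hardcoded version whenever this logic changes, so memoized
    # downstream results are invalidated.
    return data[:100]
```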
Writing a decorator to compute versions based on the SQL transform:
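One way this could look (a sketch; `sql_versioned` is a hypothetical helper, and the computed version would be passed through to the proposed `@solid(version=...)` in practice):

```python
import hashlib

def sql_versioned(sql):
    # Hypothetical decorator: derive the solid's version from its SQL text,
    # so editing the query automatically dirties downstream results.
    version = hashlib.sha1(sql.encode("utf-8")).hexdigest()
    def wrap(fn):
        fn.sql = sql
        fn.version = version  # would feed into @solid(version=...) for real
        return fn
    return wrap

@sql_versioned("SELECT id, amount FROM orders WHERE amount > 0")
def clean_orders(context):
    ...
```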
Setting a version based on the contents of relevant files:
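A sketch of how this could work (`version_from_files` is a hypothetical helper; the resulting string would be supplied as the solid's `version`):

```python
import hashlib

def version_from_files(paths):
    # Hash file contents in a stable order, so that editing any of the
    # listed files changes the computed version.
    h = hashlib.sha1()
    for path in sorted(paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```

A solid whose logic lives in a helper module could then declare something like `version=version_from_files(["helpers.py"])`.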
Resources: add a version attr to ResourceDefinition
Versioning a resource definition:
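As a hedged sketch, this might look as follows; `resource` below is a stand-in for the proposed `@resource(version=...)` API so the example is self-contained:

```python
def resource(version=None):
    # Stand-in for the proposed @resource(version=...) API; illustrative only.
    def wrap(fn):
        fn.version = version
        return fn
    return wrap

@resource(version="1")
def warehouse_client(init_context):
    # Bump the version when the client's construction logic changes in a way
    # that should invalidate memoized results of solids that use it.
    ...
```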
Solid configuration: add a config_version next to config
In most cases, we can compute the version of a config subtree by hashing some normalized form of it - a simple way would be the string repr of a stably-sorted copy of the config dict.
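A minimal sketch of that hashing scheme (assuming config values are JSON-serializable; the function name is illustrative):

```python
import hashlib
import json

def config_version(config):
    # Stably-sorted normalized form, so key order no longer affects the hash.
    normalized = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

With this, `config_version({"a": 1, "b": 2})` equals `config_version({"b": 2, "a": 1})`, while any change to a value produces a different version.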
In some rare cases, a user might need to hardcode the version of a config. E.g. perhaps they removed a vestigial config property that was having no effect, and they want to avoid dirtying their pipelines. They could do:
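For illustration, the escape hatch might look like this (a hypothetical sketch; `config_version` is the proposed attribute, and the `solid` decorator is a stand-in so the snippet runs without Dagster):

```python
def solid(config_schema=None, config_version=None):
    # Stand-in for the proposed decorator; illustrative only.
    def wrap(fn):
        fn.config_schema = config_schema
        fn.config_version = config_version
        return fn
    return wrap

@solid(
    config_schema={"verbose": bool},
    # Pinned manually: removing a vestigial config field would otherwise
    # change the computed config hash and dirty existing pipelines.
    config_version="3",
)
def report_solid(context):
    ...
```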
We probably shouldn’t spend time implementing this until someone requests it.
Composite solids: no changes required
Composite solids don’t have versions. Instead, the versions of the solids in the flattened DAG are used. When a composite solid involves a config-mapping function, it’s taken into account in the versioning of the relevant steps by using its output as the processed run config for each step.
Configured solids and resources: no changes required
Configured solids and resources don’t have versions. Instead, the versions of the inner definitions are used. When a configured solid or resource involves a config-mapping function, the function is taken into account in the versioning of the relevant objects by using its output as the processed run config for the object.
Dagster type loaders: add a version attr and an external_version_fn attr to DagsterTypeLoader
The version attr refers to the version of the loading code, and the external_version_fn computes the version of the external input that's being loaded.
This is for modeling the versions of external inputs to pipelines. Unlike the static version attributes on definitions elsewhere in the proposal, we use a function here to reflect that the versions of external inputs can depend on when we query them.
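As an illustration, an external-version function for a file-based input might derive a cheap version from filesystem metadata (a hypothetical helper; a content hash would also work, at the cost of reading the file):

```python
import os

def file_external_version(path):
    # Cheap version from (mtime, size): changes whenever the file does,
    # without reading its contents. Still subject to the race condition
    # noted below, since the file can change between this check and the
    # actual load.
    st = os.stat(path)
    return f"{st.st_mtime_ns}-{st.st_size}"
```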
A risk worth noting is race conditions - we might ascertain the version of an external input, and then it might change before we have the chance to load it. It’s tough to address this in the general case.