Replies: 1 comment
Hi @sryza , I really like this idea (and would definitely be an early adopter). I had a chat with @chenbobby about this yesterday, so I'd like to add my 2 cents about my use cases, some suggestions for making it user friendly, and some (perhaps long-shot) ideas for version/dependency inference.

I use Dagster to drive an ML pipeline (currently only in the model development phase, but at some point potentially encapsulating deployment as well). My main interest is defining the data flow to a set of candidate models and collecting and comparing their performance (including some preprocessing, splitting steps, etc.). I'd like to be sure that the model artifacts (pickle files of the models) and the results that the pipeline produces are always up to date with respect to the upstream data and the code called by solid definitions.

The easiest way to guarantee this is to rerun the pipeline, and the memoized development you mentioned would make that much faster. The way I originally managed this was through a Makefile, using the approach discussed here: http://zmjones.com/make/
The major drawbacks of this are:
My current workaround with Dagster for similar behaviour to a Makefile is as follows:

```python
@solid(
    output_defs=[OutputDefinitionCached(filename='data/subset')]
)
@cache
def example_solid(context, data) -> IntermediatePickle:
    return create_subset(data)
```
Pros of this approach:
Cons of this approach:
Some long shot suggestions:
An alternative to this may be to use something like
Suggestions for user friendliness:
I'm happy to provide some of the code I've used, if you think it would be useful, or to discuss this further (and potentially contribute if there's scope for that).
Motivation
If Dagster, when it stores a result, also stores information that uniquely identifies the computation that produced it, then it can compare that information against the corresponding information for unexecuted computations to determine whether executing them would produce a new result.
Intended Uses
General memoized re-execution - When I’m executing a pipeline, if a step would produce a result identical to one produced by a step in a previous run, I’d like to reuse that earlier result as the input to subsequent steps instead of re-executing the step.
Memoized development - When I’m iterating on a pipeline in development, I want to avoid manually keeping track of which steps I need to re-run after I make a change.
Version-based backfills - When I update code, data, or configuration that an asset depends on, I want to automatically backfill the asset to reflect the change.
Where might versions come from?
Versioning definitions:
Versioning external inputs:
Version relationships
API proposal
Solids: add a version attr to SolidDefinition
Hardcoding a version for a solid definition:
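For illustration, hardcoding might look like the sketch below. The `version` argument is the API proposed here, and the `solid` decorator is a minimal stand-in (so the snippet is self-contained without Dagster); the real decorator would attach the version to the `SolidDefinition` it builds:

```python
def solid(version=None, **kwargs):
    # Stand-in for the proposed @solid(version=...) API; illustrative only.
    def wrap(fn):
        fn.version = version
        return fn
    return wrap

@solid(version="1")
def example_solid(context, data):
    # Bump the hardcoded version whenever this logic changes, so memoized
    # downstream results are invalidated.
    return data[:100]
```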
Writing a decorator to compute versions based on the SQL transform:
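One way this could look (a sketch; `sql_versioned` is a hypothetical helper, and the computed version would be passed through to the proposed `@solid(version=...)` in practice):

```python
import hashlib

def sql_versioned(sql):
    # Hypothetical decorator: derive the solid's version from its SQL text,
    # so editing the query automatically dirties downstream results.
    version = hashlib.sha1(sql.encode("utf-8")).hexdigest()
    def wrap(fn):
        fn.sql = sql
        fn.version = version  # would feed into @solid(version=...) for real
        return fn
    return wrap

@sql_versioned("SELECT id, amount FROM orders WHERE amount > 0")
def clean_orders(context):
    ...
```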
Setting a version based on the contents of relevant files:
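A sketch of how this could work (`version_from_files` is a hypothetical helper; the resulting string would be supplied as the solid's `version`):

```python
import hashlib

def version_from_files(paths):
    # Hash file contents in a stable order, so that editing any of the
    # listed files changes the computed version.
    h = hashlib.sha1()
    for path in sorted(paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```

A solid whose logic lives in a helper module could then declare something like `version=version_from_files(["helpers.py"])`.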
Resources: add a version attr to ResourceDefinition
Versioning a resource definition:
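As a hedged sketch, this might look as follows; `resource` below is a stand-in for the proposed `@resource(version=...)` API so the example is self-contained:

```python
def resource(version=None):
    # Stand-in for the proposed @resource(version=...) API; illustrative only.
    def wrap(fn):
        fn.version = version
        return fn
    return wrap

@resource(version="1")
def warehouse_client(init_context):
    # Bump the version when the client's construction logic changes in a way
    # that should invalidate memoized results of solids that use it.
    ...
```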
Solid configuration: add a config_version next to config
In most cases, we can compute the version of a config subtree by hashing some normalized form of it - a simple way would be the string repr of a stably-sorted copy of the config dict.
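A minimal sketch of that hashing scheme (assuming config values are JSON-serializable; the function name is illustrative):

```python
import hashlib
import json

def config_version(config):
    # Stably-sorted normalized form, so key order no longer affects the hash.
    normalized = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

With this, `config_version({"a": 1, "b": 2})` equals `config_version({"b": 2, "a": 1})`, while any change to a value produces a different version.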
In some rare cases, a user might need to hardcode the version of a config. E.g. perhaps they removed a vestigial config property that was having no effect, and they want to avoid dirtying their pipelines. They could do:
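For illustration, the escape hatch might look like this (a hypothetical sketch; `config_version` is the proposed attribute, and the `solid` decorator is a stand-in so the snippet runs without Dagster):

```python
def solid(config_schema=None, config_version=None):
    # Stand-in for the proposed decorator; illustrative only.
    def wrap(fn):
        fn.config_schema = config_schema
        fn.config_version = config_version
        return fn
    return wrap

@solid(
    config_schema={"verbose": bool},
    # Pinned manually: removing a vestigial config field would otherwise
    # change the computed config hash and dirty existing pipelines.
    config_version="3",
)
def report_solid(context):
    ...
```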
We probably shouldn’t spend time implementing this until someone requests it.
Composite solids: no changes required
Composite solids don’t have versions. Instead, the versions of the solids in the flattened DAG are used. When a composite solid involves a config-mapping function, it’s taken into account in the versioning of the relevant steps by using its output as the processed run config for each step.
Configured solids and resources: no changes required
Configured solids and resources don’t have versions. Instead, the versions of the inner definitions are used. When a configured solid or resource involves a config-mapping function, the function is taken into account in the versioning of the relevant objects by using its output as the processed run config for the object.
Dagster type loaders: add a version attr and an external_version_fn attr to DagsterTypeLoader
The version attr refers to the version of the loading code, and the external_version_fn computes the version of the external input that's being loaded.
This is for modeling the versions of external inputs to pipelines. Unlike the static version attributes on definitions elsewhere in the proposal, we use a function here to reflect that the versions of external inputs can depend on when we query them.
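As an illustration, an external-version function for a file-based input might derive a cheap version from filesystem metadata (a hypothetical helper; a content hash would also work, at the cost of reading the file):

```python
import os

def file_external_version(path):
    # Cheap version from (mtime, size): changes whenever the file does,
    # without reading its contents. Still subject to the race condition
    # noted below, since the file can change between this check and the
    # actual load.
    st = os.stat(path)
    return f"{st.st_mtime_ns}-{st.st_size}"
```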
A risk worth noting is race conditions - we might ascertain the version of an external input, and then it might change before we have the chance to load it. It’s tough to address this in the general case.