-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AssetGraphView, TemporalContext, and AssetSlice #20312
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
d83cda2
to
7574878
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broadly in support of this pattern, had some comments on naming / structure
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
7574878
to
f976741
Compare
Thanks for feedback @OwenKephart updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! Tiny remaining nits but I'm feeling good about this structure.
self._partitions_def, | ||
{AssetPartition(self.asset_key, partition_key) for partition_key in partition_keys}, | ||
) | ||
& self._compatible_subset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this should be self._compatible_subset & AssetSubset.from_partitions_set(...)
for perf. This method will often be called on an AssetSlice which contains an AllPartitionsSubset, which has a "free" implementation of __and__
(AllPartitionsSubset & X = X
), but if we do X & AllPartitionsSubset
, the default __and__
implementation will force us to construct the keys in the AllPartitionsSubset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's specialize the base class operators instead so we don't have to think about it. #20354
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
f976741
to
7ffcdd1
Compare
""" | ||
|
||
effective_dt: datetime | ||
last_event_id: Optional[int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming opinion: Since TemporalContext
represents a point in time/event history, IMO these should just be datetime
and event_id
-- the qualifiers "effective" and "last" are already implied by the fact that these are on a TemporalContext
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the only issue here is that there values will appear on their own without a TemporalContext object in lots of places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I understand e.g. "effective datetime" will became a new standard phrase, but I think my reasoning still holds. Like if I had a cursor class that represented an x, y upper bound:
cursor = MaxCursor(x=..., y=...) # not max_x, max_y
# then later if I need to extract one of these values
max_x = cursor.x
I think the situation is similar here. Just kinda a matter of taste though, NBD.
|
||
|
||
PartitionKey: TypeAlias = Optional[str] | ||
AssetPartition: TypeAlias = AssetKeyPartitionKey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming opinion: Having one of these qualified with Key
and the other not is IMO a bit confusing, since they are both actually keys. If you want to lose AssetKeyPartitionKey
, IMO should be either:
Partition
AssetPartition
or
PartitionKey
AssetPartitionKey
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great point. The only thing is that people have been quite happy renaming variables asset_partitions
. How do we feel about asset_partition_keys
instead everywhere? cc: @OwenKephart on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like asset_partition_key
and it also matches what we do with asset_check_key
/AssetCheckKey
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K i'm sold. Would like consensus from @clairelin135 @OwenKephart @sryza before we do a mass renaming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like AssetKeyPartitionKey
-> AssetPartitionKey
.
A couple concerns I have a about PartitionKey
:
- Main concern: if it's just a type alias, it's not enforced by the type system, which means we're likely to end up with
a mix ofstr
andPartitionKey
all over the codebase. And policing this in reviews feels like a potentially large time sync. - If we do go this route, I think it should be
str
, notOptional[str]
. There are some situations in which a partition key is required, and we'd need a way to express that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OwenKephart I think that's a great call. I'll take that approach and respin this 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm actually I think pyright is too permissive on this:
from typing import NewType
PartitionKey = NewType('PartitionKey', str)
def _takes_str(string: str):
...
def foo() -> None:
_takes_str(PartitionKey('foo'))
This works. We might need a custom class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the following is the scenario we'd be more worried about (which pyright does catch):
from typing import NewType
PartitionKey = NewType("PartitionKey", str)
def _takes_pk(pk: PartitionKey) -> None: ...
def foo() -> None:
_takes_pk("foo")
imagining _takes_pk as something like:
def get_children_partitions(
self,
dynamic_partitions_store: DynamicPartitionsStore,
current_time: datetime,
asset_key: AssetKey,
partition_key: PartitionKey,
) -> AbstractSet[AssetKeyPartitionKey]:
One issue there is that there are plenty of user-facing methods like execute_in_process
which take a partition_key
argument so we wouldn't be able to change those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to quick do a version that makes a completely new type to see how it feels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue there is that there are plenty of user-facing methods like execute_in_process which take a partition_key argument so we wouldn't be able to change those.
Yes I think the normal, public API should be in raw strings regardless. However tradeoffs are different there than places where a user will be writing custom AMP rules that require deep understanding of the partitioning system.
|
||
@property | ||
def _partitions_def(self) -> Optional["PartitionsDefinition"]: | ||
return self._asset_graph_view.asset_graph.get_partitions_def(self.asset_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My second AssetGraph callsites PR removes AssetGraph.get_partitions_def
. We should use new API asset_graph.get(self.asset_key).partitions_def
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great I just approved. Merge and I'll restack.
6210591
to
99cd2dc
Compare
## Summary & Motivation Owen made the sensible suggestion (#20312 (comment)) to take advantage of native partition set features to avoid materializing partition keys with the AllPartitionsSubset is involved. This required changing operation order. Instead, this PR proposes to also special case the base class operators when "other" is AllPartitionsSubset so we don't have to think about order. ## How I Tested These Changes BK
b92518d
to
6bb9be9
Compare
""" | ||
PartitionKey is a string that represents a partition key. It is a new type so | ||
that we can track the use of strs as partition keys. | ||
|
||
PartitionKey should only be created via the static factory methods on AssetPartitionKey. | ||
We want to consolidate all partition key creation so we have a leverage point to change | ||
how they. Eventually we want to create a new singleton string partition to represent | ||
the "all" partition key for unpartitioned assets. | ||
""" | ||
|
||
|
||
class AssetPartitionKey(NamedTuple): | ||
"""Represents a a partition in a particular asset. Prefer usage to AssetKeyPartitionKey. | ||
If you need to convert AssetKeyPartitionKeys to AssetPartitionKeys, use the static | ||
factory methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OwenKephart and @smackesey would like your sign-off here. I want to make sure that we are aligned that this executes on the plan we discussed in the Slack huddle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The docblock on AssetPartitionKey
and PartitionKey
)
Heads up @OwenKephart @smackesey and @sryza I decided not to introduce This PR is now written in terms of |
b2a2725
to
25abc9f
Compare
Seems sensible to me. Also feel like it would be good to fully replace |
partition windows. | ||
""" | ||
|
||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use a dataclass/namedtuple/basemodel here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do not play nicely with @cached_method
and @cached_property
.
I actually experimented with declaring __slots__
for the cached method instances, but that seemed like premature optimization.
25abc9f
to
1a3f697
Compare
Deploy preview for dagster-docs ready! Preview available at https://dagster-docs-36mj4somo-elementl.vercel.app Direct link to changed pages: |
## Summary & Motivation Owen made the sensible suggestion (#20312 (comment)) to take advantage of native partition set features to avoid materializing partition keys with the AllPartitionsSubset is involved. This required changing operation order. Instead, this PR proposes to also special case the base class operators when "other" is AllPartitionsSubset so we don't have to think about order. ## How I Tested These Changes BK
Summary & Motivation
With the additional of AMP, asset partitions, dynamic partitioning, and other related features, the complexity of our system has outstripped the ability of our abstractions to model it. A shallow indication of this our repeated threading of current time/evaluation time, storage_id/cursors, and dynamic_partitions_store up and down our stack. Another is that also have one off "caching" classes like
CachingStaleStatusResolver
,CachingInstanceQueryer
,CachingDataTimeResolver
and perhaps others I do not know about that present a wide range of capabilities inconsistently. Superficially it is annoying to have to thread time, storage_id, and dynamic_partitions_store around everywhere and hard to understand what class to use when interrogating the asset graph.This belies a more profound problem: Some of our capabilities respect current time; some do not. Some of our capabilities respect storage_id; some do not. That means the engineers do not know what reads are volatile with respect to time and underlying storage. It is also difficult to know what is to safe to cache or not. This means also that as an engineer navigates the state of the asset graph, it is difficult to know what operations are cheap to compute, versus potentially extremely expensive to compute.
This PR introduces
AssetGraphView
,AssetSlice
andTemporalContext
to address this issue.Temporal Context: TemporalContext represents an effective time, used for business logic, and last_event_id which is used to identify that state of the event log at some point in time. Put another way, the value of a TemporalContext represents a point in time and a snapshot of the event log.
Asset Graph View. It is a view of the asset graph and its state from the perspective of a specific temporal context. From the asset graph you can access asset slices, the main API for navigating the view of an asset graph.
Asset Slice: Represents a set of partitions attached to a particular asset. By having
AssetSlice
contain a reference toAssetGraphView
this enables a more elegant traversal of an asset graph's partition space than before.AssetSlice strives to be "partition-native". Very specifically, it deliberately does not have properties like
is_partitioned
. Instead they are just represented a slice with a single asset partition. Right now we do an inordinate (and unnecessary) special casing for unpartitioned assets. This will take some adjustment but will result in much cleaner code and and mental model.e.g.
Before:
After:
We also have naming conventions to indicate the performance characteristics of methods:
Naming conventions
get_
do some work in-memory but not hugely expensivecompute_
do potentially expensive work, like computeThese can potentially be very expensive if the underlying partition set has
an in-memory representation that involves large time windows. I.e. if
the underlying PartitionsSubset in the ValidAssetSubset is a TimeWindowPartitionsSubset
Usage of these methods should be avoided if possible if you are potentially
dealing with slices with large time-based partition windows.
This PR also addes two type aliases:
I consider the rename of
AssetKeyPartitionKey
toAssetPartition
merely an acknowledgement of the current ground truth in the code base, where nearly all local variables and method names refer toasset_partition
becauseasset_key_partition_key
is self evidently gross.The
PartitionKey
alias is perhaps more controversial.FAQ:
AssetSlice
and not reuseValidAssetSubset
?AssetSlice
is different as it contains a reference to the asset graph view. This makes it fundamentally different. It also allow for elegant traversal of the graph without having to thread a datetime and a reference to an instance or instance queryer everywhere, or having to convert betweenValidAssetSubset
andAssetSubset
.slice
andsubset
as local variables is very clear.bool_value
andsubset_value
onAssetSubset
to be quite gross, so this abstractions seeks to encapsulate that.The plan here is to introduce this abstraction then use it instead of direct use of various "resolver" and "queryer" classes throughout the code base. This will allow capabilities such as AMP, the backfill daemon, and state/outdated calculations int the web server to be more consistent and done with less code.
Two concrete items this will help with immediately:
Another objective is to allow user-defined AMP rules against a higher level API that is either this or something directly backed by this.
How I Tested These Changes
Included unit tests