-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dagster-airlift] complete revamp #23369
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
# Each task_id.yaml will look something like this: | ||
# - asset: my/asset/key | ||
# tags: {} | ||
# metadata: {} | ||
# deps: [upstream/asset/key] | ||
# compute_pointer: my.compute.pointer | ||
# - asset: my/other/asset/key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is totally configurable on a per-blueprint basis, correct?
@@ -0,0 +1,5 @@ | |||
dags: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can easily change this later but I do think we will want one file per dag
class AssetComputePointer(BaseModel): | ||
module_name: str | ||
fn_name: str | ||
|
||
|
||
class AssetInfo(BaseModel): | ||
asset_key: str | ||
tags: Dict[str, str] = {} | ||
metadata: Dict[str, Any] = {} | ||
deps: List[str] = [] | ||
asset_compute_pointer: Optional[AssetComputePointer] = None | ||
|
||
|
||
class MultiAssetSpec(BaseModel): | ||
asset_infos: List[AssetInfo] | ||
|
||
|
||
class MigratingAssetsBlueprint(BaseModel): | ||
type: Literal["external_asset"] = "external_asset" | ||
asset_infos: List[AssetInfo] | ||
migrated: bool | ||
compute_kind: str | ||
name: str | ||
|
||
def build_defs(self) -> Definitions: | ||
asset_specs = [ | ||
AssetSpec( | ||
key=AssetKey.from_user_string(info.asset_key), | ||
metadata=info.metadata, | ||
tags=info.tags, | ||
deps=[AssetDep(AssetKey.from_user_string(dep)) for dep in info.deps], | ||
) | ||
for info in self.asset_infos | ||
] | ||
|
||
@multi_asset( | ||
specs=asset_specs, | ||
name=self.name, | ||
compute_kind=self.compute_kind, | ||
) | ||
def _external_asset() -> None: | ||
if self.migrated: | ||
for info in self.asset_infos: | ||
ptr = check.not_none(info.asset_compute_pointer) | ||
module = import_module(ptr.module_name) | ||
compute_fn = getattr(module, ptr.fn_name) | ||
compute_fn() | ||
|
||
return Definitions(assets=[_external_asset]) | ||
|
||
|
||
class TaskMigrationState(BaseModel): | ||
task_id: str | ||
migrated: bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our own sake, I think we should split these into different files, and make it super clear what we expect to be in the airflow process versus the dagster process. We will probably want to have different modules eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heavily agreed about the different modules. I figure for a lot of people adding a heavyweight dependency to dagster within their airflow code will be a non-starter. I've been thinking that airlift will consist of the dagster component, and then a lightweight-pipes esque package with the only dependency being requests. Maybe we could even get away without that.
I'll put them into their own file for now.
class MigratingAssetsBlueprint(BaseModel): | ||
type: Literal["external_asset"] = "external_asset" | ||
asset_infos: List[AssetInfo] | ||
migrated: bool | ||
compute_kind: str | ||
name: str | ||
|
||
def build_defs(self) -> Definitions: | ||
asset_specs = [ | ||
AssetSpec( | ||
key=AssetKey.from_user_string(info.asset_key), | ||
metadata=info.metadata, | ||
tags=info.tags, | ||
deps=[AssetDep(AssetKey.from_user_string(dep)) for dep in info.deps], | ||
) | ||
for info in self.asset_infos | ||
] | ||
|
||
@multi_asset( | ||
specs=asset_specs, | ||
name=self.name, | ||
compute_kind=self.compute_kind, | ||
) | ||
def _external_asset() -> None: | ||
if self.migrated: | ||
for info in self.asset_infos: | ||
ptr = check.not_none(info.asset_compute_pointer) | ||
module = import_module(ptr.module_name) | ||
compute_fn = getattr(module, ptr.fn_name) | ||
compute_fn() | ||
|
||
return Definitions(assets=[_external_asset]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also like to see the pattern where have an existing blueprint (e.g. dbt) and have it only return specs when it is not migrated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one probably worth talking through live.
# metadata: {} | ||
# deps: [upstream/asset/key] | ||
# compute_pointer: my.compute.pointer | ||
# - asset: my/other/asset/key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type: python
python_fn: my_module.python_fn
assets:
- key:
tags:
metadata:
deps:
if self.migrated: | ||
for info in self.asset_infos: | ||
ptr = check.not_none(info.asset_compute_pointer) | ||
module = import_module(ptr.module_name) | ||
compute_fn = getattr(module, ptr.fn_name) | ||
compute_fn() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accessing the attr should happen in the factory function, not the body of the multi-asset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing through comments
1eb3e6b
to
587e720
Compare
4ff80f5
to
de7ca4a
Compare
587e720
to
2c7d6eb
Compare
de7ca4a
to
a57a8c1
Compare
2c7d6eb
to
3247e12
Compare
a57a8c1
to
7c7c8b4
Compare
e745d9a
to
15efcd4
Compare
1ad114f
to
a70502a
Compare
class FromEnvVar(BaseModel): | ||
env_var: str | ||
|
||
|
||
class DbtProjectDefs(Blueprint): | ||
type: Literal["dbt_project"] = "dbt_project" | ||
dbt_project_path: FromEnvVar | ||
group: Optional[str] = None | ||
|
||
def build_defs(self) -> Definitions: | ||
source_file_name_without_ext = self.source_file_name.split(".")[0] | ||
dbt_project_path = Path(os.environ[self.dbt_project_path.env_var]) | ||
dbt_manifest_path = dbt_project_path / "target" / "manifest.json" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to see FromEnvVar
go away so we don't rely on it
-e ../../../../../python_modules/libraries/dagster-databricks | ||
-e ../../../../../python_modules/libraries/dagster-pandas | ||
-e ../../../../../python_modules/libraries/dagster-pyspark | ||
-e ../../../../../python_modules/libraries/dagster-spark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need all of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok given that this is experimental we can fix forward stuff, but
- Eliminating unnecessary dependencies very important (e.g. early design partners who do not use databricks should not install dagster-databricks)
- We should cut our dependency on blueprints.
a70502a
to
ba0ad3f
Compare
ba0ad3f
to
6ea9e6b
Compare
Complete project revamp.