-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply hints for nested tables #2165
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
Adding this type annotation fixed 69 failing tests. The missing Optional impacted the dlt.common.validation.validate_dict().validate_prop() functions to parse the RESTAPIConfig object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my suggestion how to deal with naming convention. Docs requirements are in the ticket.
bef7a3f
to
e51bd25
Compare
The current implementation adds the tables to the schema (as tested), but it doesn't affect how the data is loaded. For example, the hints will appear in pipeline.default_schema.tables.keys()
# ignoring the dlt tables
# 'nested_data', 'override_child_outer1', 'override_child_outer1_innerfoo','nested_data__outer1', 'nested_data__outer1__innerfoo' Whereas the normalizer row counts show no ingested data for the tables pipeline.last_trace.last_normalize_info.row_counts
# 'nested_data': 2, 'nested_data__outer1': 2, 'nested_data__outer1__innerfoo': 2 I believe changes need to be made to The extractor would need to hold some mapping, but it could be more appropriate to move the logic to |
Relational normalizer follows its logic of creating nested tables and column names. it comes only from the data. there's no mechanism to rename those, except the root table name which the user must set.
I assume that in example you are giving, you used a custom table name for nested table. If this is not the case ping me on slack. maybe there's a bug somewhere in the ticket above, there's a note:
so I'd say we block setting table name on nested hints (also parent name and incremental do make sense) |
Working on the requested documentation (#1647), I encounter many edge cases of the current implementation. My understanding is that our goal is for these two snippets to be equivalent Current testFrom @dlt.source(schema=schema)
def ethereum(slice_: slice = None):
@dlt.resource(
table_name="blocks",
write_disposition={"disposition": "merge", "strategy": merge_strategy},
)
def data():
with open(
"tests/normalize/cases/ethereum.blocks.9c1d9b504ea240a482b007788d5cd61c_2.json",
"r",
encoding="utf-8",
) as f:
yield json.load(f) if slice_ is None else json.load(f)[slice_]
# also modify the child tables (not nested)
schema_ = dlt.current.source_schema()
blocks__transactions = schema_.tables["blocks__transactions"]
blocks__transactions["write_disposition"] = "merge"
blocks__transactions["x-merge-strategy"] = merge_strategy # type: ignore[typeddict-unknown-key]
blocks__transactions["table_format"] = destination_config.table_format
blocks__transactions__logs = schema_.tables["blocks__transactions__logs"]
blocks__transactions__logs["write_disposition"] = "merge"
blocks__transactions__logs["x-merge-strategy"] = merge_strategy # type: ignore[typeddict-unknown-key]
blocks__transactions__logs["table_format"] = destination_config.table_format
return data Target API@dlt.source(schema=schema)
def ethereum(slice_: slice = None):
@dlt.resource(
table_name="blocks",
write_disposition={"disposition": "merge", "strategy": "delete-insert"}
)
def data():
with open(
"./ethereum.blocks.9c1d9b504ea240a482b007788d5cd61c_2.json",
"r",
encoding="utf-8",
) as f:
yield json.load(f) if slice_ is None else json.load(f)[slice_]
data.apply_hints(
path=["transactions"],
write_disposition="merge",
additional_table_hints={"x-merge-strategy": "delete-insert"},
table_format=None,
)
data.apply_hints(
path=["transactions", "logs"],
write_disposition="merge",
additional_table_hints={"x-merge-strategy": "delete-insert"},
table_format=None,
)
return data ProblemCurrently, inspecting the source via When in the lifecycle should the hints be assigned? After Setting parents
This PR explicitly sets odd bugWhen debugging the call to def ensure_compatible_tables(
schema_name: str, tab_a: TTableSchema, tab_b: TPartialTableSchema, ensure_columns: bool = True
) -> None:
"""Ensures that `tab_a` and `tab_b` can be merged without conflicts. Conflicts are detected when
- tables have different names
- nested tables have different parents
- tables have any column with incompatible types
Note: all the identifiers must be already normalized
"""
if tab_a["name"] != tab_b["name"]:
raise TablePropertiesConflictException(
schema_name, tab_a["name"], "name", tab_a["name"], tab_b["name"]
)
table_name = tab_a["name"]
# check if table properties can be merged
# breakpoint()
# tab_a.get("parent"): None
# tab_b.get("parent"): None
if tab_a.get("parent") != tab_b.get("parent"):
# breakpoint()
# tab_a.get("parent"): None
# tab_b.get("parent"): "blocks"
raise TablePropertiesConflictException(
schema_name, table_name, "parent", tab_a.get("parent"), tab_b.get("parent")
)
if not ensure_columns:
return |
right!
this is pretty deep PR. we'll take time to iron this out. I still have some issues with user interface (ie. @dlt.resource should accept a list of nested resources so users do not need to do many |
d268006
to
87f8a4e
Compare
… table name from meta was not applied in compute_table_schema of hints
87f8a4e
to
61c8872
Compare
633cad8
to
ec8d2bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed all files. I believe there are some typos and dev comments to remove.
I'm very excited about this feature. Awesome work!
) | ||
# this is a new table so allow evolve once | ||
if schema_contract["columns"] != "evolve" and self.schema.is_new_table(table_name): | ||
computed_table["x-normalizer"] = {"evolve-columns-once": True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sidenote: Reading through the codebase "evolve-columns-once": True
is set after the table sees data for the first time. Maybe the name "columns-evolved-once"
would better communicate that.
…ommented leftovers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was a good review thx! stuff fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All reviews were addressed. Looks good to me.
Just fix failing tests. Seems related to comparing lists and item ordering. Could use sets instead
ah yeah I know exactly what happens here. this must pass. it checks the inferred column order... |
6f44e48
to
14c55cc
Compare
22816dc
to
318530d
Compare
Description
Draft of nested table hints implementation:
Is working so far but there are some bugs and tests needed.
Related Issues
Additional Context