Single input & multiple output: Best practices with combining Dagster and star schema / normalization of data #22803
We receive a wide, transaction-style dataset that is then modeled into more of a star schema in our DW; effectively, we are normalizing the data. Currently we extract the dimension values from the transaction data (the product, for example), check whether each value exists in the dimension table, and add it if not. We repeat this for all other dimensions. All of this is done in a single big IO manager for the asset that imports the transaction data.
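The per-dimension step described above can be sketched in plain Python (hypothetical function and column names, with a dict standing in for the DW dimension table): extract a dimension's values from the wide rows, look each one up, and insert it with a fresh surrogate key if it is missing.

```python
def upsert_dimension(rows, dim_key, dim_table, next_id):
    """Return (dim_table, next_id) after adding any unseen dimension values.

    rows      -- wide transaction records (list of dicts)
    dim_key   -- column holding the dimension value, e.g. "product"
    dim_table -- existing dimension as {value: surrogate_id}
    next_id   -- first free surrogate key
    """
    for row in rows:
        value = row[dim_key]
        if value not in dim_table:      # dimension member not seen before
            dim_table[value] = next_id  # insert with a fresh surrogate key
            next_id += 1
    return dim_table, next_id

transactions = [
    {"product": "widget", "amount": 10},
    {"product": "gadget", "amount": 5},
    {"product": "widget", "amount": 7},
]
dim_product, next_id = upsert_dimension(transactions, "product", {}, 1)
# dim_product == {"widget": 1, "gadget": 2}
```

In a real warehouse the dict lookup would be a `SELECT`/`INSERT` against the dimension table, but the shape of the loop is the same for every dimension, which is why it ends up duplicated per source.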
The issue is that the IO manager gets very complicated, and once we add another source transaction that shares dimensions, we have to repeat the IO-manager mess.
How should this be handled? For one case I made a "master" IO manager that kicks off these sub IO managers, so that at least the dimension-level "sub" IO managers can be shared. Another way, I suppose, would be to make each dimension its own asset and reuse it at the asset level. With the asset approach, though, you cannot guarantee atomicity (either all transactions and dimensions are added, or none are), and it carries some overhead and complexity in handling partitions and loading inputs.
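The atomicity requirement mentioned above is the key constraint: dimension inserts and fact inserts must land together or not at all. A minimal sketch (hypothetical table and column names, shown with sqlite3 purely for illustration) is to do all the upserts and fact loads inside one database transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_transactions (product_id INTEGER, amount REAL);
""")

transactions = [("widget", 10.0), ("gadget", 5.0), ("widget", 7.0)]

with conn:  # one transaction: commits on success, rolls back on any exception
    for name, amount in transactions:
        # upsert the dimension member, then resolve its surrogate key
        conn.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (name,))
        (product_id,) = conn.execute(
            "SELECT product_id FROM dim_product WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_transactions (product_id, amount) VALUES (?, ?)",
            (product_id, amount),
        )
```

If splitting dimensions into separate Dagster assets means separate database sessions, this all-or-nothing property is lost, which is the trade-off the asset-level approach has to accept or work around.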
How have you handled this? Are there any good examples or guidance / best practices for handling this kind of single-input -> multiple-output scenario?
Thanks!