Single input & multiple output: Best practices with combining Dagster and star schema / normalization of data #22803
We receive a wide, transaction-style dataset that is then modeled into more of a star schema in our DW; effectively, we are normalizing the data. Currently we extract the dimension values from the transaction data (the product, for example), check whether each value exists in the dimension table, and add it if not. We repeat this for all other dimensions. All of this is done in a single big IO manager for the asset that imports the transaction data.
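The per-dimension step described above can be sketched in plain Python (hypothetical function and column names, with a dict standing in for the DW dimension table): extract a dimension's values from the wide rows, look each one up, and insert it with a fresh surrogate key if it is missing.

```python
def upsert_dimension(rows, dim_key, dim_table, next_id):
    """Return (dim_table, next_id) after adding any unseen dimension values.

    rows      -- wide transaction records (list of dicts)
    dim_key   -- column holding the dimension value, e.g. "product"
    dim_table -- existing dimension as {value: surrogate_id}
    next_id   -- first free surrogate key
    """
    for row in rows:
        value = row[dim_key]
        if value not in dim_table:      # dimension member not seen before
            dim_table[value] = next_id  # insert with a fresh surrogate key
            next_id += 1
    return dim_table, next_id

transactions = [
    {"product": "widget", "amount": 10},
    {"product": "gadget", "amount": 5},
    {"product": "widget", "amount": 7},
]
dim_product, next_id = upsert_dimension(transactions, "product", {}, 1)
# dim_product == {"widget": 1, "gadget": 2}
```

In a real warehouse the dict lookup would be a `SELECT`/`INSERT` against the dimension table, but the shape of the loop is the same for every dimension, which is why it ends up duplicated per source.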
The issue is that the IO manager gets very complicated, and once we add another source transaction that shares dimensions, we have to repeat the IO-manager mess.
How should this be handled? For one case I made a "master" IO manager that kicks off these sub IO managers, so that at least the dimension-level "sub" IO managers can be shared. Another way, I suppose, would be to make each dimension its own asset and reuse it at the asset level. With the asset approach, though, you cannot guarantee atomicity (either all transactions and dimensions are added, or none are), and it carries some overhead and complexity in handling partitions and loading inputs.
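The atomicity requirement mentioned above is the key constraint: dimension inserts and fact inserts must land together or not at all. A minimal sketch (hypothetical table and column names, shown with sqlite3 purely for illustration) is to do all the upserts and fact loads inside one database transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_transactions (product_id INTEGER, amount REAL);
""")

transactions = [("widget", 10.0), ("gadget", 5.0), ("widget", 7.0)]

with conn:  # one transaction: commits on success, rolls back on any exception
    for name, amount in transactions:
        # upsert the dimension member, then resolve its surrogate key
        conn.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (name,))
        (product_id,) = conn.execute(
            "SELECT product_id FROM dim_product WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_transactions (product_id, amount) VALUES (?, ?)",
            (product_id, amount),
        )
```

If splitting dimensions into separate Dagster assets means separate database sessions, this all-or-nothing property is lost, which is the trade-off the asset-level approach has to accept or work around.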
How have you handled this? Are there any good examples or guidance / best practices for handling this kind of single-input -> multiple-output scenario?
Thanks!