Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dagster-elt #16681

Merged
merged 24 commits into from
Oct 3, 2023
Merged

Add dagster-elt #16681

merged 24 commits into from
Oct 3, 2023

Conversation

PedramNavid
Copy link
Contributor

@PedramNavid PedramNavid commented Sep 21, 2023

Summary & Motivation

This adds the dagster-elt integration which aims to be a set of embedded ELT/ETL tooling to help data engineers with common ingestion/transform tasks such as ingesting data to and from databases, warehouses, and files.

This first iteration is a wrapper around the sling package. Sling is a lightweight CLI tool that allows users to sync data from a source/destination without the extra burden of heavyweight tooling that other solutions require.

How I Tested These Changes

Unit tests, local testing.

@petehunt
Copy link
Contributor

Can you confirm that this library works on Serverless?

@schrockn
Copy link
Member

I want to surface the concern I have (which @sryza also shares and flagged yesterday) that the dagster-elt without any reference to sling as the underlying technology is pretty problematic, as there is 1) more than one underlying tech to do ELT and 2) we don't want to overpromise our capabilties.

I fully support having elt in the same somewhere, but I need to be convinced that dagster-elt is the right thing to GTM with w/r/t to alternatives.

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Req'ing changes for q mgmt

@petehunt
Copy link
Contributor

I want to surface the concern I have (which @sryza also shares and flagged yesterday) that the dagster-elt without any reference to sling as the underlying technology is pretty problematic, as there is 1) more than one underlying tech to do ELT and 2) we don't want to overpromise our capabilties.

Let's call it dagster_embedded_elt?

@PedramNavid
Copy link
Contributor Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@github-actions
Copy link

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-q2tdyhick-elementl.vercel.app
https://pdrm-dagster-elt.dagster-university.dagster-docs.io

Built with commit 2b9c00d.
This pull request is being automatically deployed with vercel-action

@github-actions
Copy link

github-actions bot commented Sep 27, 2023

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-axpkygfg3-elementl.vercel.app
https://pdrm-dagster-elt.components-storybook.dagster-docs.io

Built with commit 17f3570.
This pull request is being automatically deployed with vercel-action

@github-actions
Copy link

github-actions bot commented Sep 27, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-3vkrfmi38-elementl.vercel.app
https://pdrm-dagster-elt.core-storybook.dagster-docs.io

Built with commit 17f3570.
This pull request is being automatically deployed with vercel-action

@PedramNavid PedramNavid marked this pull request as ready for review September 29, 2023 03:29
@PedramNavid PedramNavid force-pushed the pdrm/dagster-elt branch 3 times, most recently from 5fd7cc2 to 88977bb Compare September 29, 2023 04:06
Copy link
Contributor

@petehunt petehunt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not appropriate to be the sole accept, but the api design and docs lgtm, barring a few small notes

update_key = [update_key]

@multi_asset(
compute_kind="sling", specs=[asset_spec], required_resource_keys={sling_resource_key}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice and well-document. A few comments:

  • This is a very large PR. We generally like to work in more incremental, digestible pieces. This leads to higher quality, more efficient code review and better outcomes.
  • There are some complicated components (esp regexs) that are worthy on independent unit tests, so I'm little concerned about test coverage here.

However:

spice_must_flow

Take my comments as things to note for future and followups.


@staticmethod
def _exec_sling_cmd(cmd, stdin=None, stdout=PIPE, stderr=STDOUT) -> Generator[str, None, None]:
ansi_escape = re.compile(r"\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this regex is complex enough that it is worthy of it's own comment and independent unit tests

assert proc.stdout

for line in proc.stdout:
fmt_line = str(line, "utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this another way of going line.decode("utf-8")?

Comment on lines +141 to +153
def _exec_sling_cmd(cmd, stdin=None, stdout=PIPE, stderr=STDOUT) -> Generator[str, None, None]:
ansi_escape = re.compile(r"\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])")
with Popen(cmd, shell=True, stdin=stdin, stdout=stdout, stderr=stderr) as proc:
assert proc.stdout

for line in proc.stdout:
fmt_line = str(line, "utf-8")
clean_line = ansi_escape.sub("", fmt_line).replace("INF", "")
yield clean_line

proc.wait()
if proc.returncode != 0:
raise Exception("Sling command failed with error code %s", proc.returncode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole code path def a bit scary and I wish it had its own test suite

Comment on lines +168 to +172
if self.source_connection.type == "file" and not source_stream.startswith("file://"):
source_stream = "file://" + source_stream

if self.target_connection.type == "file" and not target_object.startswith("file://"):
target_object = "file://" + target_object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems both very magical and very undocumented

Comment on lines +35 to +36
primary_key (Optional[Union[str, List[str]]], optional): The optional primary key to use when syncing.
update_key (Optional[Union[str, List[str]]], optional): The optional update key to use when syncing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to clarify that this is not an asset key. Threw me for a second.

compute_kind="sling", specs=[asset_spec], required_resource_keys={sling_resource_key}
)
def sync(context: AssetExecutionContext) -> MaterializeResult:
sling: SlingResource = getattr(context.resources, sling_resource_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should really have a better API for this where you can just get a dict of resources

match = re.search(r"(\d+) rows", stdout_line)
if match:
last_row_count_observed = int(match.group(1))
context.log.info(stdout_line)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

info or debug?

@PedramNavid PedramNavid merged commit 5f07c5a into master Oct 3, 2023
@PedramNavid PedramNavid deleted the pdrm/dagster-elt branch October 3, 2023 16:07
yuhan pushed a commit that referenced this pull request Oct 3, 2023
## Summary & Motivation

This adds the `dagster-elt` integration which aims to be a set of embedded ELT/ETL tooling to help data engineers with common ingestion/transform tasks such as ingesting data to and from databases,
warehouses, and files.

This first iteration is a wrapper around the [sling](https://slingdata.io) package. Sling is a lightweight CLI tool that allows users to sync data from a source/destination without the extra burden of heavyweight tooling that other solutions require.

## How I Tested These Changes

Unit tests, local testing.

---------

Co-authored-by: Ben Pankow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants