How to execute meltano prod runs with dagster? #44

Closed
ReneTC opened this issue Oct 10, 2023 · 11 comments

Comments

@ReneTC
Contributor

ReneTC commented Oct 10, 2023

Sorry for spamming this repo. I see huge potential in it and I am already quite invested.

I've run into a problem and would like to know whether someone has already solved it.

With this package you can easily run a job in Dagster by putting this in meltano.yml:

- name: raw_data_to_duckdb
  tasks:
  - tap-spreadsheets-anywhere target-duckdb

However, if you need to run this in prod, you need to add the flag --environment=prod, like so:

- name: raw_data_to_duckdb_prod
  tasks:
  - --environment=prod  tap-spreadsheets-anywhere target-duckdb

But running
meltano invoke dagster:start
Results in an error:
dagster._core.errors.DagsterInvalidDefinitionError: "__environment=prod__tap_spreadsheets_anywhere_target_duckdb" is not a valid name in Dagster. Names must be in regex ^[A-Za-z0-9_]+$

Any ideas?

@ReneTC
Contributor Author

ReneTC commented Oct 10, 2023

I tried using meltano_run_op() instead:

from dagster import repository, job
from dagster_meltano import meltano_resource, meltano_run_op

@job(resource_defs={"meltano": meltano_resource})
def meltano_run_job():
    tap_done = meltano_run_op("--environment=prod  tap-1 target-1")()
    meltano_run_op("--environment=prod  tap-2 target-2")(tap_done)

@repository()
def repository():
    return [meltano_run_job]

This gives the same error.

@ReneTC
Contributor Author

ReneTC commented Oct 10, 2023

It seems to me this could be fixed by changing how the Dagster name is generated here.

Just remove every character that does not match the regex ^[A-Za-z0-9_]+$, but keep the executed command separate from the Dagster name, so the command itself is not altered.

@JulesHuisman
Contributor

This could either be fixed here:

def generate_dagster_name(input_string) -> str:
    """
    Generate a dagster safe name (^[A-Za-z0-9_]+$.)
    """
    return input_string.replace("-", "_").replace(" ", "_").replace(":", "_")

By also replacing the =.

But it might be easier to set the MELTANO_ENVIRONMENT environment variable to prod.
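
As a minimal sketch of the first option (an assumption, not the package's actual implementation), the chained replace calls could be swapped for a single regex substitution, so that = and any other unsupported character are covered at once. The command string stays in its own variable, so only the Dagster name is altered:

```python
import re

# Any character outside the set Dagster accepts in names.
_INVALID_NAME_CHARS = re.compile(r"[^A-Za-z0-9_]")


def generate_dagster_name(input_string: str) -> str:
    """Return a Dagster-safe name by replacing every invalid character with "_"."""
    return _INVALID_NAME_CHARS.sub("_", input_string)


# The command that gets executed is left untouched; only the name is sanitized.
command = "--environment=prod  tap-spreadsheets-anywhere target-duckdb"
dagster_name = generate_dagster_name(command)
```

Note that this only makes the name valid for Dagster. It does not change how the command is passed to Meltano.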

@ReneTC
Contributor Author

ReneTC commented Oct 10, 2023

Would you like me to fix it, test it, and send a PR? (It might not be done until tomorrow.)
For me, replacing the = works best, but I am not sure which direction you want to take as the package owner.

@JulesHuisman
Contributor

Would be great! I will see the PR appear.

@ReneTC
Contributor Author

ReneTC commented Oct 10, 2023

Draft here: #45
I was not able to test it, because I am confused about how Meltano installs this package.

I know you can point a plugin at a custom GitHub URL (e.g. my fork, for testing) like so:

  - name: dagster
    variant: quantile-development
    pip_url: dagster-ext git+https://github.com/my_fork.git
    config:
      repository_dir: ${MELTANO_PROJECT_ROOT}/orchestrate

But I am not sure where to swap the main package dagster-meltano for a custom git URL.

@ReneTC
Contributor Author

ReneTC commented Oct 17, 2023

Okay, now that #47 is merged, it sadly still does not work.
If I have the prod task in meltano.yml:

- name: task1
  tasks:
  - tap-spreadsheets-anywhere target-duckdb
- name: task1_prod
  tasks:
  - tap-spreadsheets-anywhere target-duckdb --environment=prod

When dagster-meltano runs, it will execute:
meltano run tap-spreadsheets-anywhere target-duckdb --environment=prod
but that is wrong; it returns the error:
Error: No such option: --environment

The correct syntax is meltano --environment=prod run tap-spreadsheets-anywhere target-duckdb, but I don't see how that is possible with this package. I've asked in the Meltano Slack how to execute a Dagster run in another environment here.
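
To illustrate the ordering problem: --environment is a global option, so it has to come before the run subcommand. A rough sketch of building the argument list that way follows; build_meltano_argv is a hypothetical helper, not part of dagster-meltano:

```python
import shlex
from typing import List, Optional


def build_meltano_argv(command: str, environment: Optional[str] = None) -> List[str]:
    """Build a `meltano run` argument list with the global option placed first."""
    argv = ["meltano"]
    if environment:
        # Global options go between `meltano` and the subcommand.
        argv.append(f"--environment={environment}")
    argv.append("run")
    # shlex.split also collapses repeated spaces in the command string.
    argv.extend(shlex.split(command))
    return argv


prod_argv = build_meltano_argv("tap-spreadsheets-anywhere target-duckdb", "prod")
```

The resulting list could be handed to subprocess.run, but that still leaves the question of where the environment comes from in meltano.yml.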

@JulesHuisman
Contributor

You should use the MELTANO_ENVIRONMENT variable to specify which environment to use.

@ReneTC
Contributor Author

ReneTC commented Oct 17, 2023

Thanks Jules but I don't see how to use MELTANO_ENVIRONMENT in this example. Do you mind providing an example?

@JulesHuisman
Contributor

For example, we deploy Meltano using a Docker container. In the Docker container we set:

ENV MELTANO_ENVIRONMENT=prod

That way we run meltano in production in our production environment.
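
Outside Docker, the same variable can be set per process. A minimal sketch, assuming Meltano is launched through subprocess; build_meltano_env and run_meltano are hypothetical names, not part of Meltano or dagster-meltano:

```python
import os
import shlex
import subprocess
from typing import Dict, Mapping, Optional


def build_meltano_env(
    environment: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Copy the base environment and set MELTANO_ENVIRONMENT on the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["MELTANO_ENVIRONMENT"] = environment
    return env


def run_meltano(command: str, environment: str) -> subprocess.CompletedProcess:
    """Run `meltano run <command>` with MELTANO_ENVIRONMENT set for that process only."""
    return subprocess.run(
        ["meltano", "run", *shlex.split(command)],
        env=build_meltano_env(environment),
        check=True,
    )


# Example call (requires Meltano to be installed and a project in the working directory):
# run_meltano("tap-spreadsheets-anywhere target-duckdb", "prod")
```

Because the variable is set on a copy of the environment, the parent process and other runs are not affected.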

@ReneTC
Contributor Author

ReneTC commented Oct 18, 2023

Thanks for the specific example, @JulesHuisman, I appreciate that.
However, we are not using a Docker container, so that solution does not fix the issue.

I found one kind-of-working solution. If you run:
meltano --environment=prod invoke dagster:start
all of the jobs will be executed as prod. This is not ideal, because if you want to run Dagster with --environment=dev next time, the Dagster logs do not distinguish between the two, so execution time, number of failures and so on are very confusing to read in the Dagster UI.

I wouldn't mark this as closed, at least for my case. A possible solution for me could be a meltano-dagster operator that also accepts the environment as input, i.e. something like:

    return meltano_command_op_with_env(
        command=f"--environment={env} run {command} --force", dagster_name=dagster_name
    )

But I am not sure it is the direction to go.
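
To make the idea a bit more concrete, here is a plain-Python sketch of the inputs such an operator would need. It does not call dagster-meltano at all; env_scoped_job is a hypothetical helper, and meltano_command_op_with_env above does not exist in the package. Including the environment in the Dagster name is an assumption on my side, meant to keep prod and dev runs apart in the Dagster UI:

```python
import re
from typing import Dict


def env_scoped_job(command: str, env: str) -> Dict[str, str]:
    """Return a Dagster-safe name and a Meltano command for one environment."""
    # The name includes the environment so runs are distinguishable in the UI.
    dagster_name = re.sub(r"[^A-Za-z0-9_]", "_", f"{command}_{env}")
    # The global --environment option is placed before the `run` subcommand.
    meltano_command = f"--environment={env} run {command} --force"
    return {"dagster_name": dagster_name, "command": meltano_command}


prod_job = env_scoped_job("tap-1 target-1", "prod")
dev_job = env_scoped_job("tap-1 target-1", "dev")
```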
