dagster SDA workflow using @asset #28
Hi, I want to define dagster `@asset`s for meltano runs, as the dagster SDA seems to be the recommended/most reasonable way to design new data flows. How could this be configured with dagster-meltano?

Dagster recommends using SDAs for new pipelines, and I'd like to follow that mindset when setting up a pipeline. It would be great if this library supported this new way of pipeline design. Please correct me if my reasoning is wrong or I'm missing something! I'm new to the field of DataOps, with a history of tooling like Jenkins, Rundeck, GitLab, etc. :)
Yeah, I think this would help! I tried using:

```python
@multi_asset(
    resource_defs={"meltano": meltano_resource},
    compute_kind="meltano",
    group_name="sources",
    outs={
        "departments": AssetOut(key_prefix=["hr"]),
        "employees": AssetOut(key_prefix=["hr"]),
        "jobs": AssetOut(key_prefix=["hr"]),
    }
)
def meltano_run_job():
    meltano_command_op("run tap-oracle target-postgres")()
```

Now there's a continuous lineage from the Meltano-produced tables (dbt sources) to the downstream dbt assets. However, when running it I get an error, and I hit the same error with the other invocation I tried.
I don't think you can call an op like that inside the asset function. You might be able to achieve this by replicating this line: `dagster-meltano/dagster_meltano/ops.py`, line 94 at `0517c3c`.
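A minimal sketch of that suggestion, shelling out to the Meltano CLI from inside an asset body instead of calling an op (hypothetical; the asset name and command are taken from the example above, and the Meltano CLI is assumed to be on PATH):

```python
import subprocess

from dagster import OpExecutionContext, asset

@asset(compute_kind="meltano", group_name="sources")
def hr_meltano_run(context: OpExecutionContext):
    # Run the tap/target pair directly via the Meltano CLI.
    result = subprocess.run(
        ["meltano", "run", "tap-oracle", "target-postgres"],
        capture_output=True,
        text=True,
        check=True,  # raise if meltano exits non-zero
    )
    context.log.info(result.stdout)
```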
We are currently stuck at creating the meltano assets dynamically from a list. Our approach at the moment looks somewhat like this, but it keeps overwriting the prior assets and only displays the last one. @JulesHuisman, do you know of any way this could be achieved?

```python
names = ["a", "b", "c"]

for name in names:
    @multi_asset(
        compute_kind="meltano",
        group_name="sources",
        outs={
            name: AssetOut(key_prefix=["hr"])
        }
    )
    def meltano_run_single(context: OpExecutionContext):
        return "a"
```
You could do something like this: use a factory design to automatically create assets. In this example these are individual assets, but you could do the same to dynamically create multi assets.

```python
def meltano_asset_factory(names: list) -> list:
    def make_asset(name: str):
        # An explicit name keeps the assets distinct; otherwise every
        # asset would be keyed "compute".
        @asset(name=name)
        def compute():
            # `name` is bound per make_asset() call, so each asset keeps
            # its own value instead of the loop's last one.
            return name
        return compute

    return [make_asset(name) for name in names]
```
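Hypothetical usage of that factory, with placeholder stream names, registering the produced assets:

```python
from dagster import Definitions

names = ["departments", "employees", "jobs"]  # placeholders

defs = Definitions(assets=meltano_asset_factory(names))
```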
Thank you. With the help of the dagster slack channel we kind of figured it out.

```python
def meltano_run_job(context, table: str):
    context.log.info(context.selected_output_names)
    # Run the meltano job "import_hr" with logging
    execute_shell_command(
        f"NO_COLOR=1 TAP_ORACLE__HR_FILTER_TABLES={table} meltano run import_hr",
        output_logging="STREAM",
        log=context.log,
        cwd=MELTANO_PROJECT,
        # env={"FILTER_TABLES": table},
    )
```

Code to build the assets out of a YAML spec (a sketch of the YAML side follows below):

```python
def build_asset(spec) -> AssetsDefinition:
    @asset(name=spec["name"], group_name="sources", key_prefix="hr", compute_kind="meltano")
    def _asset(context):
        meltano_run_job(context=context, table=spec["table"])
    return _asset

assets = [build_asset(spec) for spec in asset_list]
```

The last thing we are missing is how to correctly use the filter_table env variable.
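For context, a minimal sketch of what the YAML side could look like. The actual spec format isn't shown in the thread, so the file name, fields, and loading code are assumptions:

```python
import yaml  # requires PyYAML

# Hypothetical assets.yaml:
#   - name: departments
#     table: DEPARTMENTS
#   - name: employees
#     table: EMPLOYEES
with open("assets.yaml") as f:
    asset_list = yaml.safe_load(f)  # a list of {"name": ..., "table": ...} dicts
```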
@LeqitSebi I think the SELECT env var might help if you just want to run meltano to update a single table. Did you move away from the multi asset approach? A multi asset makes more sense if you have many child streams; with the single asset approach, would you be repeating your API calls? A multi asset fits better with my intended outcome: any request to any meltano table causes a run of the whole tap and target combo.
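A sketch of that env-var idea. Meltano exposes a plugin's `select` extra as an environment variable; `TAP_ORACLE__SELECT` below is a guess based on that convention, so verify the exact name your Meltano project generates:

```python
import json
import os
import subprocess

# Hypothetical: restrict the tap to a single stream for one run by
# setting the plugin's `select` extra through the environment.
env = {**os.environ, "TAP_ORACLE__SELECT": json.dumps(["HR-EMPLOYEES.*"])}
subprocess.run(
    ["meltano", "run", "tap-oracle", "target-postgres"],
    env=env,
    check=True,
)
```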
I figured out a workflow that more or less works and gives the correct lineage, assuming you want to run the whole tap plus downstream tables. This goes into the dagster definitions:

```python
import enum
import os
from pathlib import Path

from dagster import (
    AssetOut,
    AssetSelection,
    ConfigurableResource,
    DefaultScheduleStatus,
    Definitions,
    OpExecutionContext,
    ScheduleDefinition,
    define_asset_job,
    multi_asset,
)
from dagster_dbt import DbtCliResource, load_assets_from_dbt_project
from dagster_meltano import meltano_resource

DBT_PROJECT_PATH = str(Path(__file__).parent.parent.parent.parent / "my_dbt_directory")
DBT_PROFILE = os.getenv("DBT_PROFILE")
DBT_TARGET = os.getenv("DBT_TARGET")

class MeltanoEnv(enum.Enum):
    dev = enum.auto()
    prod = enum.auto()

MELTANO_PROJECT_DIR = os.getenv("MELTANO_PROJECT_ROOT", os.getcwd())
MELTANO_BIN = os.getenv("MELTANO_BIN", "meltano")

resources = {
    "dbt": DbtCliResource(project_dir=DBT_PROJECT_PATH, target=DBT_TARGET, profile=DBT_PROFILE),
    # "meltano": meltano_resource,
}

ALL_TAP_STREAMS = {
    "freshdesk": [
        "conversations",
        "ticket_fields",
        "tickets_detail",
    ],
    "mailchimp": [
        "campaigns",
        "lists",
        "lists_members",
        "reports_email_activity",
        "reports_sent_to",
        "reports_unsubscribes",
    ],
    "instagram": [
        "media",
        "media_children",
        "media_insights",
        "stories",
        "story_insights",
    ],
    "tiktok": [
        "accounts",
        "videos",
        "comments",
    ],
}

def meltano_asset_factory(all_tap_streams: dict) -> tuple:
    multi_assets = []
    jobs = []
    schedules = []
    for tap_name, tap_streams in all_tap_streams.items():
        @multi_asset(
            name=tap_name,
            resource_defs={"meltano": meltano_resource},
            compute_kind="meltano",
            group_name=tap_name,
            outs={
                stream: AssetOut(key_prefix=[f"raw_{tap_name}"])
                for stream in tap_streams
            },
        )
        def compute(context: OpExecutionContext, meltano: ConfigurableResource):
            # The tap name is recovered from the op name at runtime, which
            # avoids closing over the loop variable.
            command = f"run tap-{context.op.name} target-postgres"
            meltano.execute_command(f"{command}", dict(), context.log)
            return tuple([None for _ in context.selected_output_names])

        multi_assets.append(compute)
        asset_job = define_asset_job(f"{tap_name}_assets", AssetSelection.groups(tap_name))
        basic_schedule = ScheduleDefinition(
            job=asset_job,
            cron_schedule="@hourly",
            default_status=DefaultScheduleStatus.RUNNING,
        )
        jobs.append(asset_job)
        schedules.append(basic_schedule)
    return multi_assets, jobs, schedules

meltano_assets, jobs, schedules = meltano_asset_factory(ALL_TAP_STREAMS)

dbt_assets = load_assets_from_dbt_project(DBT_PROJECT_PATH, profiles_dir=DBT_PROJECT_PATH)

defs = Definitions(
    assets=(dbt_assets + meltano_assets),
    resources=resources,
    jobs=jobs,
    schedules=schedules,
)
```
@jaceksan and I are working on an extension of the dagster-meltano plugin that includes functionality to automatically load tap streams into dagster as assets. We're just getting started on it; collaborators are welcome!
If I understand correctly, that would mean it's no longer necessary to keep a list of taps and streams in the dagster code?
We are now struggling with how streams/attributes are defined in meltano.yml and, correspondingly, in meltano_manifest.json. I loaded dbt assets into Dagster in my demo project, and the asset names are equal to the underlying table names.
The issue I've found is consistency in naming between meltano and dbt. I don't worry too much about the asset names in dagster, really. In my instance, every tap in meltano has an equivalently named source in dbt. Without this info, dagster can't infer that the dbt source is downstream of the associated meltano stream. Does that explain it a bit more?
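To illustrate the matching, a sketch using the `raw_<tap>` prefix from the earlier example (the exact key layout depends on how your dbt sources are defined, so treat the names as assumptions):

```python
from dagster import AssetKey

# dagster-dbt keys a dbt source as [source_name, table_name] by default,
# so a dbt source raw_freshdesk.tickets_detail resolves to the same key
# that the multi_asset emits via AssetOut(key_prefix=["raw_freshdesk"]).
dbt_source_key = AssetKey(["raw_freshdesk", "tickets_detail"])
meltano_stream_key = AssetKey(["raw_freshdesk", "tickets_detail"])
assert dbt_source_key == meltano_stream_key  # matching keys give dagster the lineage edge
```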