How can I upsert partitions with an IO manager? #18066

2023-11-16T14:34:28Z

dagsir[bot]
bot Nov 16, 2023

What is the best practice for IO management of a statically partitioned table using Dagster (v1.5.8) and Postgres (v15)? In my scenario I have a table that I intend to partition by 'state', i.e. the states in a country. I'm unclear how I'm supposed to write the IO manager to 'upsert' data only for each 'state'. Dagster's documentation says:
"I/O managers can be written to handle partitioned assets. For a partitioned asset, each invocation of handle_output will (over)write a single partition, and each invocation of load_input will load one or more partitions."
I'd like my IO manager to write only the data related to a partition, but the only options I see available are DataFrame.to_sql(if_exists="fail"/"replace"/"append"), so it either appends data or rebuilds the table (wipes it and starts again). Take the states 'California', 'New York' and 'Texas'. I'd like handle_output() to process the data per state, so that after I've added data for 'California' and then write data for 'New York', I don't lose 'California'.

On my previous projects I've used dbt to manage incremental tables and Dagster for the pipeline partitioning. This is the first time I'm trying to do things with Dagster alone, and I'm struggling with this simple scenario.

Partitioned Software Asset Code

from dagster import asset, OpExecutionContext, AssetIn
from rewild_dagster.partitions import state_partitions
import pandas as pd
from datetime import datetime
import os
import requests
import json

@asset(compute_kind="python", io_manager_key="stage_species_occurrence_io_manager", partitions_def=state_partitions)
def stage_species_occurrence(context: OpExecutionContext, unique_location: pd.DataFrame) -> pd.DataFrame:

  state = context.asset_partition_key_for_output()
  context.log.info(f"Partition {state}")

  flimit = 10000
  radius = 10

  location_id_list = []
  lat_list = []
  long_list = []
  occurrence_list = []

  partitioned_df = unique_location.loc[unique_location['state'] == state]
  partitioned_df_len = len(partitioned_df)
  counter = 1

  for index, row in partitioned_df.iterrows():

      location_id = row['location_id']
      lat = row['lat']
      long = row['long']

      context.log.info(f"Processing location_id {location_id}, # {counter} of {partitioned_df_len}")
      

      #call ala occurrence API
      ala_occurrences_api_endpoint_env = os.getenv("ELT_ALA_OCCURRENCES_API_ENDPOINT")
      ala_occurrences_api_endpoint_query_params = os.getenv("ELT_ALA_OCCURRENCES_API_ENDPOINT_QUERY_PARAMS")

      ala_occurrences_api_endpoint = ala_occurrences_api_endpoint_env + ala_occurrences_api_endpoint_query_params.format(radius=radius, lat=lat, lon=long,flimit=flimit)

      request = requests.Request('GET', url=ala_occurrences_api_endpoint)
      prepared = request.prepare()

      request_session = requests.Session()
      response = request_session.send(prepared)
      occurrence_json = json.loads(response.content)

      location_id_list.append(location_id)
      lat_list.append(lat)
      long_list.append(long)
      occurrence_list.append(occurrence_json)

      counter+=1

  data = {'location_id': location_id_list, 'lat': lat_list, 'long': long_list, 'occurrence':occurrence_list}
  df = pd.DataFrame(data)
  df['state'] = state
  df['_run_id'] = context.run_id
  df['_dwh_processed_change_dtm'] = datetime.now()

  return df

IO Manager Code

import pandas as pd
import os

from dagster import IOManager, io_manager, InputContext, OutputContext
from sqlalchemy import create_engine
from sqlalchemy.types import String, DateTime,DECIMAL, JSON
from dotenv import load_dotenv

load_dotenv()

class Stage_Species_Occurrence_IOManager(IOManager):
    def __init__(self):
        pass

    def handle_output(self, context: OutputContext, df: pd.DataFrame) -> None:
        engine = create_engine(os.getenv("ELT_DATABASE_CONN_STRING"))

        dtype = {'location_id':String(), 'state':String(), 'lat':DECIMAL(), 'long':DECIMAL(), 'occurrence':JSON(), '_run_id':String(), '_dwh_processed_change_dtm':DateTime()}

        df.to_sql(name='stage_species_occurrence',schema='public', con=engine, if_exists='append', index=False, dtype=dtype)

    def load_input(self, context: "InputContext") -> pd.DataFrame:
        table_name = context.upstream_output.asset_key.path[-1]
        return pd.read_sql(f"SELECT * FROM {table_name}", con=os.getenv("ELT_DATABASE_CONN_STRING"))
    
@io_manager()
def stage_species_occurence_io_manager(context):
    return Stage_Species_Occurrence_IOManager()
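
The load_input above reads the whole table regardless of partition. If a downstream asset only needs specific states, a filtered read could look like the sketch below. It uses the standard-library sqlite3 module as a stand-in so it runs without a Postgres server, and load_partitions is a hypothetical helper name, not part of Dagster or pandas. In the real IO manager the keys would presumably come from the input context's partition properties.

```python
import sqlite3


def load_partitions(conn, table_name, partition_keys):
    """Fetch only the rows whose state is one of partition_keys."""
    # Table names cannot be bound as query parameters, so validate the name
    # before interpolating it into the SQL string.
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    if not partition_keys:
        return []
    placeholders = ", ".join("?" for _ in partition_keys)
    query = (
        f"SELECT location_id, state FROM {table_name} "
        f"WHERE state IN ({placeholders}) ORDER BY location_id"
    )
    return conn.execute(query, list(partition_keys)).fetchall()


# In-memory stand-in for the Postgres table from the question.
read_conn = sqlite3.connect(":memory:")
read_conn.execute("CREATE TABLE stage_species_occurrence (location_id TEXT, state TEXT)")
read_conn.executemany(
    "INSERT INTO stage_species_occurrence VALUES (?, ?)",
    [("loc-1", "California"), ("loc-2", "New York"), ("loc-3", "Texas")],
)

loaded = load_partitions(read_conn, "stage_species_occurrence", ["California", "Texas"])
print(loaded)
```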

The question was originally asked in Dagster Slack.

alangenfeld · 2023-11-16T15:33:46Z

alangenfeld
Nov 16, 2023
Maintainer

Some additional details about what partition information is available on the context are documented here:
https://docs.dagster.io/concepts/io-management/io-managers#handling-partitioned-assets
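
As a rough illustration of the partition properties that page describes, the helper below reads has_asset_partitions and asset_partition_keys from an output context. The SimpleNamespace object is only a stand-in so the snippet runs without Dagster installed; in a real handle_output the object would be Dagster's OutputContext. partition_keys_for is a hypothetical helper name.

```python
from types import SimpleNamespace


def partition_keys_for(context):
    """Return the partition keys being written, or [] for an unpartitioned asset."""
    # Check has_asset_partitions first so unpartitioned assets are handled
    # explicitly instead of relying on how the keys property behaves for them.
    if not context.has_asset_partitions:
        return []
    return list(context.asset_partition_keys)


# Stand-in for the context of a run materializing the 'New York' partition.
fake_context = SimpleNamespace(
    has_asset_partitions=True,
    asset_partition_keys=["New York"],
)
print(partition_keys_for(fake_context))
```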

If you want to look at internal implementation details, see https://github.com/dagster-io/dagster/blob/1.5.8/python_modules/dagster/dagster/_core/storage/db_io_manager.py which powers our DB IO managers, such as the Snowflake and BigQuery ones.


blackdigitaldotnet · 2023-11-16T21:09:36Z

blackdigitaldotnet
Nov 16, 2023

Thanks @sryza @tacastillo @alangenfeld, so I need to write a Postgres IO manager that extends Dagster's DB IO manager? Are there any plans on the cards for Dagster to provide an IO manager for Postgres that handles partitioned assets? I've had a look at the code in your DB IO manager and I wouldn't even know where to begin if I were looking to extend it for Postgres partition upsert behaviour.

Why does it feel like I'm trying to crack a nut with a sledgehammer? I've got a basic scenario of a statically partitioned software asset that is materialized in Postgres, and I'd like to get the benefit of saving partitions independently of each other without having to introduce something like dbt.

1 reply

alangenfeld Nov 17, 2023
Maintainer

"so I need to write a Postgres IO manager that extends Dagster's DB IO manager?"

Not necessarily. It was provided as a reference to consult, though I understand it is quite complex and hard to grok.

The way the DbIOManager handles the upsert problem is that it deletes the "TableSlice" before appending the new replacement rows:
https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/storage/db_io_manager.py#L131-L138

For example, pandas to Snowflake would clear the rows with
https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-snowflake/dagster_snowflake/snowflake_io_manager.py#L332-L342

and then append the rows via to_sql with

https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-snowflake-pandas/dagster_snowflake_pandas/snowflake_pandas_type_handler.py#L133-L139

So you could take a similar approach in your IO manager: use context.asset_partition_keys (or the state values in the DataFrame) to delete the conflicting rows before appending their replacements.
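
A minimal sketch of that delete-then-append pattern follows. It uses the standard-library sqlite3 module as a stand-in so it runs without a Postgres server, a cut-down two-column version of the table from the question, and a hypothetical helper named replace_partitions. It is not the DbIOManager implementation.

```python
import sqlite3


def replace_partitions(conn, partition_keys, rows):
    """Delete the rows for the given states, then append their replacements."""
    if not partition_keys:
        raise ValueError("expected at least one partition key")
    placeholders = ", ".join("?" for _ in partition_keys)
    # 'with conn' wraps both statements in one transaction, so a failed
    # insert rolls the delete back and the old partition data survives.
    with conn:
        conn.execute(
            f"DELETE FROM stage_species_occurrence WHERE state IN ({placeholders})",
            list(partition_keys),
        )
        conn.executemany(
            "INSERT INTO stage_species_occurrence (location_id, state) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_species_occurrence (location_id TEXT, state TEXT)")

replace_partitions(conn, ["California"], [("loc-1", "California")])
replace_partitions(conn, ["New York"], [("loc-2", "New York")])
# Re-materializing 'New York' replaces its rows and leaves 'California' alone.
replace_partitions(conn, ["New York"], [("loc-3", "New York")])

remaining = conn.execute(
    "SELECT location_id, state FROM stage_species_occurrence ORDER BY location_id"
).fetchall()
print(remaining)
```

In the IO manager from the question, the equivalent would likely be a DELETE ... WHERE state IN (...) executed through the SQLAlchemy engine, followed by the existing df.to_sql(..., if_exists='append') call, ideally inside the same transaction.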

Alternatively, looking at the pandas docs, it appears you can pass a custom callable as the method argument of to_sql to implement "on conflict update" behavior.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
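
The pandas docs describe that callable as taking (pd_table, conn, keys, data_iter). What it would need to emit is an INSERT ... ON CONFLICT ... DO UPDATE statement. The sketch below shows only that statement, using the standard-library sqlite3 module as a stand-in (SQLite 3.24+ accepts the same clause as Postgres). The unique constraint on (location_id, state) is an assumption, since the question does not say what the table's key is, and the table is cut down to three columns.

```python
import sqlite3

# 'excluded' refers to the row that failed to insert because of the conflict.
UPSERT_SQL = """
INSERT INTO stage_species_occurrence (location_id, state, occurrence)
VALUES (?, ?, ?)
ON CONFLICT (location_id, state)
DO UPDATE SET occurrence = excluded.occurrence
"""

upsert_conn = sqlite3.connect(":memory:")
upsert_conn.execute(
    """
    CREATE TABLE stage_species_occurrence (
        location_id TEXT,
        state TEXT,
        occurrence TEXT,
        UNIQUE (location_id, state)  -- assumed key; ON CONFLICT needs one
    )
    """
)

# First load: two states.
with upsert_conn:
    upsert_conn.executemany(
        UPSERT_SQL,
        [("loc-1", "California", "v1"), ("loc-2", "New York", "v1")],
    )
# Second load: only 'New York' is rewritten, 'California' is untouched.
with upsert_conn:
    upsert_conn.executemany(UPSERT_SQL, [("loc-2", "New York", "v2")])

upserted = upsert_conn.execute(
    "SELECT location_id, state, occurrence "
    "FROM stage_species_occurrence ORDER BY location_id"
).fetchall()
print(upserted)
```

One design difference worth noting: an upsert updates and adds rows but never removes rows that have disappeared from a partition's new data, whereas deleting the partition's rows before appending does.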

m-o-leary · 2023-11-19T12:09:28Z

m-o-leary
Nov 19, 2023

We have this exact same use case, and the missing piece that would make this straightforward is an “upsert” option in the pandas API.

I used this gist as the basis for our implementation: https://gist.github.com/pedrovgp/b46773a1240165bf2b1448b3f70bed32


blackdigitaldotnet · 2023-11-19T19:17:39Z

blackdigitaldotnet
Nov 19, 2023

@alangenfeld and @m-o-leary thanks for the advice and guidance. I will give it a go when time permits and let you know how I went.
