Load CDK: Insert Loader Interface #56428

Open
wants to merge 3 commits into
base: jschmidt/dest-mssql/mssql-uses-bulk-load
from


johnny-schmidt
Contributor

Creates an InsertLoader interface, which drives the case where you're collecting records into bulk queries or API calls without needing a specific connection held open (i.e., not like a JDBC prepared statement, but more like a BigQuery bulk insert).

I tested this with MSSQL StandardInsert and it works, but it's not a good fit. (In a separate PR I'm moving MSSQL to DirectLoader.) I'm leaving this here so people have something to build on for BigQuery when the time comes.
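
For readers unfamiliar with the pattern described above, here is a minimal sketch of an accumulate-then-emit loader. Every name in it (`BatchInsertLoader`, `StringBatchLoader`, `accept`, `finish`) is hypothetical and is not the CDK's actual InsertLoader API. It only illustrates the idea that records are collected into a self-contained bulk request and no connection is held open while collecting.

```kotlin
// Hypothetical sketch only: these names are NOT the Airbyte CDK's actual API.
interface BatchInsertLoader<R, Q> {
    /** Adds a record; returns a finished request when the batch is full, else null. */
    fun accept(record: R): Q?

    /** Emits whatever is still buffered at end of stream; null if nothing is pending. */
    fun finish(): Q?
}

// Toy implementation that builds INSERT statements as strings.
// No escaping is done, so this is for illustration and not for real SQL.
class StringBatchLoader(
    private val table: String,
    private val maxRows: Int,
) : BatchInsertLoader<String, String> {
    private val rows = mutableListOf<String>()

    override fun accept(record: String): String? {
        rows.add(record)
        return if (rows.size >= maxRows) build() else null
    }

    override fun finish(): String? = if (rows.isEmpty()) null else build()

    // Builds one bulk statement from the buffered rows and resets the buffer.
    private fun build(): String {
        val values = rows.joinToString(", ") { "('$it')" }
        rows.clear()
        return "INSERT INTO $table VALUES $values"
    }
}

fun main() {
    val loader = StringBatchLoader("users", maxRows = 2)
    val emitted = listOf("a", "b", "c").mapNotNull { loader.accept(it) } +
        listOfNotNull(loader.finish())
    emitted.forEach(::println)
}
```

Because each emitted request is complete on its own, whoever executes it can borrow a connection or make the API call at that point, which is the contrast with a JDBC prepared statement drawn in the description.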

@johnny-schmidt johnny-schmidt requested a review from a team as a code owner March 27, 2025 01:01
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/dest-mssql/mssql-uses-bulk-load branch from c7c25f3 to 0d82af4 Compare March 28, 2025 19:05
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/load-cdk/insert-loader-only branch from 8220b2a to 2ac2a07 Compare March 28, 2025 19:06
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/dest-mssql/mssql-uses-bulk-load branch 2 times, most recently from e650f75 to 4bac608 Compare March 28, 2025 20:03
import kotlinx.coroutines.runBlocking
import org.jetbrains.annotations.VisibleForTesting

class ResourceReservingPartitionedQueue<T>(
Contributor

checking: this code already existed elsewhere, and this PR just moves it all into here?

Contributor Author

Yeah, the logic existed, spread out over a few beans. I packed it into one thing and migrated the tests as well.

Contributor

One could say you opened the can

import io.airbyte.cdk.load.message.DestinationRecordRaw
import io.airbyte.cdk.load.write.LoadStrategy

/**
Contributor

@edgao edgao Mar 31, 2025

still reviewing the PR, but I'm curious:

  • Why is MSSQL a bad fit for this? My read from this comment is that it should fit exactly into this paradigm (... not that I've really read how MSSQL standard inserts work).
  • I think BigQuery actually falls into the case where the insert query is built in a streaming fashion via an open, shared connection? Unless you're thinking of the Kafka case.

Contributor Author

Because with MSSQL you can't start a new query for table X until you've fully committed the previous one, so you might as well do the work in one step.

Contributor Author

It's possible BigQuery works the same way and we don't need this at all.

private val reservation = runBlocking {
reservationManager.reserveOrThrow(requestedResourceAmount, this)
}
private val minNumUnits: Int = numProducers + numConsumers * 2
Contributor

Just making sure, you don't mean (numProducers + numConsumers) * 2 ?

Contributor Author

Right, I don't mean that; the precedence is intentional. What I'm calculating here is "number of units each producer will hold" (numProducers) + "number of units each consumer will hold" (numConsumers) + "number of units if there's exactly 1 thing enqueued in each consumer's partition" (numConsumers).
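
To make the arithmetic concrete, here is a small sketch that spells out the three terms and compares the result with the parenthesized alternative from the question. The helper `minNumUnits` and the example counts are hypothetical, and the one-unit-per-holder reading is an interpretation of the explanation above; only the expression `numProducers + numConsumers * 2` comes from the diff.

```kotlin
// Hypothetical helper mirroring the expression `numProducers + numConsumers * 2`.
fun minNumUnits(numProducers: Int, numConsumers: Int): Int {
    val heldByProducers = numProducers     // units held by producers
    val heldByConsumers = numConsumers     // units held by consumers
    val queuedPerPartition = numConsumers  // exactly 1 thing enqueued in each consumer's partition
    return heldByProducers + heldByConsumers + queuedPerPartition
}

fun main() {
    val producers = 3
    val consumers = 2
    println(minNumUnits(producers, consumers))  // 3 + 2 + 2 = 7
    println((producers + consumers) * 2)        // 10, the alternative the reviewer asked about
}
```

The two expressions agree only when `numProducers` is 0, so the missing parentheses change the result whenever there is at least one producer.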

4 participants