Log based mode performance #972
Comments
What do you wanna speed up exactly? There are many moving parts here.
I think general speed. If I have more than 100K inserts/updates/deletes per 1.5 minutes, the sync job will start to back up.
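To put those numbers in perspective, here is a rough back-of-the-envelope sketch. The function names are hypothetical, the batch figures are the ones quoted in this thread (100K rows per 1.5-minute batch), and it assumes a constant change rate, which real workloads rarely have.

```python
def sustained_rows_per_second(batch_rows: int, batch_seconds: float) -> float:
    """Throughput implied by one batch of `batch_rows` taking `batch_seconds`."""
    return batch_rows / batch_seconds


def backlog_after(incoming_rate: float, capacity: float, elapsed_seconds: float) -> float:
    """Rows left unsynced after `elapsed_seconds` when changes arrive at
    `incoming_rate` rows/s and the pipeline drains `capacity` rows/s."""
    return max(0.0, (incoming_rate - capacity) * elapsed_seconds)


# 100K rows per 90 seconds, as reported above.
capacity = sustained_rows_per_second(100_000, 90)

# Hypothetical source load of 1500 changes/s sustained for one hour.
backlog = backlog_after(1500, capacity, 3600)
```

Anything above roughly 1,100 changes per second on the source side would make the backlog grow for as long as that rate lasts.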
Just a disclaimer that there will always be a lag; this will not do real-time native replication unless you're willing to provision powerful machines or Snowflake warehouses. You need to investigate where your bottleneck is, since the replication is both CPU- and IO-bound. Is the tap processing insert/update/delete events fast enough for your needs? If not, there is no config to change here; you have to bump the CPU of the machines on which Pipelinewise runs. But if the tap is fast and the Snowflake warehouse you're using is small or has other workloads running on it, there might be queues in the warehouse, so the batches will take longer to flush and the pipeline will not be doing anything in the meantime, regardless of how fast the tap is at consuming change logs. If the warehouse is well provisioned, you could try a smaller batch size or time-based batch flushing (check out the pipelinewise-target-snowflake README).
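As an illustration of what "smaller batch size or time-based batch flushing" means in general, here is a minimal sketch of a buffer that flushes on whichever limit is hit first. This is not the target's actual implementation; `BatchBuffer`, `max_rows` and `max_wait_seconds` are hypothetical names, and the real option names should be taken from the pipelinewise-target-snowflake README.

```python
import time
from typing import Any, Callable, List, Optional


class BatchBuffer:
    """Collects rows and flushes when the batch is big enough or old enough."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        max_rows: int,
        max_wait_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_rows = max_rows
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._rows: List[Any] = []
        self._first_row_at: Optional[float] = None

    def add(self, row: Any) -> None:
        if not self._rows:
            # Remember when the current batch started filling up.
            self._first_row_at = self._clock()
        self._rows.append(row)
        if self._should_flush():
            self.flush()

    def _should_flush(self) -> bool:
        if len(self._rows) >= self._max_rows:
            return True
        if self._max_wait is None:
            return False
        # The age check only runs when a row arrives; there is no timer thread.
        return self._clock() - self._first_row_at >= self._max_wait

    def flush(self) -> None:
        if self._rows:
            batch, self._rows = self._rows, []
            self._first_row_at = None
            self._flush(batch)
```

A smaller size limit or a time limit shortens the delay before changes show up in the warehouse, at the cost of more, smaller load operations. Whether that helps depends on where the bottleneck is.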
The main issue is that most of the CPU-intensive operations for log-based replication, in both the tap and target components, are single-threaded and cannot be easily scaled. Your only options there are to use multiple replication slots with multiple instances of Pipelinewise, each syncing different tables, or to use CPU cores with a higher clock speed or IPC. If you have hstore or array type columns in your tables then this problem here will also seriously slow you down.
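If you go the multiple-instances route, the tables have to be divided between the instances somehow. The following is only a sketch of one way to do that, assuming you can estimate a per-table change rate; `split_tables` and the table names are hypothetical and nothing here is part of Pipelinewise.

```python
from typing import Dict, List


def split_tables(change_rates: Dict[str, int], instances: int) -> List[List[str]]:
    """Greedily assign tables to instances so the estimated load is balanced.

    `change_rates` maps table name to an estimated number of changes per
    minute. The busiest tables are placed first, each on the instance with
    the lowest load so far.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")

    groups: List[List[str]] = [[] for _ in range(instances)]
    loads = [0] * instances

    # Sort by descending rate, then by name so the result is deterministic.
    for table, rate in sorted(change_rates.items(), key=lambda kv: (-kv[1], kv[0])):
        target = loads.index(min(loads))
        groups[target].append(table)
        loads[target] += rate

    return groups


# Hypothetical tables and change rates.
groups = split_tables({"orders": 900, "events": 500, "users": 300, "audit": 100}, 2)
```

Each resulting group would then be configured as its own tap with its own replication slot.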
This one is a follow-up to #971 but a different case.
I was also testing LOG_BASED replication and it works, but it feels like it could be faster.
E.g. I've set batch_size_rows to 100000 and I see in the logs that each batch of 100K takes 1.5 minutes to sync, which looks a bit slow. Is there any way to speed this up? (I did try fastsync_parallelism.)