
Add an onBatchCopyRetry hook to allow more fine-grained error handling #1499

grodowski opened this issue Feb 19, 2025 · 0 comments

grodowski commented Feb 19, 2025

Following #1498, my team is looking to add one more feature to support our gh-ost workflow. We sometimes run online migrations on databases with a low max_binlog_cache_size, which may result in batch inserts failing with FATAL Error 1197 (HY000): Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage. In our current approach with LHM, we mitigated this with retry logic that gradually reduces the batch size inside the retry loop (original PR: Shopify/lhm#165).

I don't think we need to copy this feature 1:1; instead we could leverage dynamic reconfiguration and hooks:

  • a batch insert fails with error 1197
  • a hook is executed and updates chunk-size to a smaller value
  • the batch gets retried until default-retries is reached

The only issue I found was the lack of a suitable hook to use this way. For example, onFailure is called for the error in question, but by that point the gh-ost process is already about to abort with Fatale. So I'd like to propose adding a new hook that gets invoked before each retry inside iterateChunks and lets us dynamically reconfigure the chunk-size.
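
To make the proposed control flow concrete, here is a rough, self-contained sketch of what "invoke a hook before each retry" could look like. This is not gh-ost's actual code: retryWithHook, executeHook and the hook path are purely illustrative, and the real change would hook into the existing retry logic in iterateChunks and reuse gh-ost's hooks executor.

// Illustrative sketch only; names and signatures are hypothetical.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// executeHook runs an external hook script and passes the last batch-copy
// error through an environment variable, which is what the Ruby hook
// below reads as GH_OST_LAST_BATCH_COPY_ERROR.
func executeHook(hookPath string, lastErr error) error {
	cmd := exec.Command(hookPath)
	cmd.Env = append(os.Environ(),
		fmt.Sprintf("GH_OST_LAST_BATCH_COPY_ERROR=%s", lastErr.Error()),
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// retryWithHook retries a batch-copy operation up to maxRetries times,
// invoking the hook before every retry so an operator script can shrink
// chunk-size (or otherwise reconfigure gh-ost) between attempts.
func retryWithHook(maxRetries int, hookPath string, operation func() error) error {
	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		if lastErr != nil {
			if hookErr := executeHook(hookPath, lastErr); hookErr != nil {
				fmt.Fprintf(os.Stderr, "hook failed: %v\n", hookErr)
			}
		}
		if lastErr = operation(); lastErr == nil {
			return nil
		}
	}
	return lastErr
}

func main() {
	// Simulate a batch copy that fails twice with error 1197, then succeeds.
	attempts := 0
	err := retryWithHook(3, "./gh-ost-on-batch-copy-retry", func() error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("Error 1197 (HY000): Multi-statement transaction " +
				"required more than 'max_binlog_cache_size' bytes of storage")
		}
		return nil
	})
	fmt.Println("final error:", err)
}

The sketch only shows where in the retry cycle the hook would fire; the actual implementation would follow gh-ost's existing hook conventions rather than a standalone loop.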

We have a working prototype in Shopify#2 that solves our issue with max_binlog_cache_size and also seems like a sound general addition to gh-ost, providing more flexible error-handling options.

For context, let me also share the draft gh-ost-on-batch-copy-retry script that we wrote to handle the binlog cache errors:

#!/usr/bin/env ruby
# frozen_string_literal: true

require 'socket'

socket_path = '/tmp/ghost.sock'
backoff_factor = 0.8
min_chunk_size = 10
output = nil
handled_error_re = %r{Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage}

# gh-ost passes the failing batch's error message to the hook via this env var
mysql_error_message = ENV.fetch("GH_OST_LAST_BATCH_COPY_ERROR")

unless mysql_error_message.match?(handled_error_re)
  puts "Nothing to do for error: #{mysql_error_message}"
  exit
end

# Ask the running gh-ost for the current chunk-size over its interactive socket
Socket.unix(socket_path) do |socket|
  socket.puts 'chunk-size=?'
  output = socket.gets
end

chunk_size = output.to_i
# Back off, but never drop below the configured minimum chunk size
new_chunk_size = [(chunk_size * backoff_factor).to_i, min_chunk_size].max

exit if chunk_size == new_chunk_size

Socket.unix(socket_path) do |socket|
  socket.puts "chunk-size=#{new_chunk_size}"
end
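
(For local testing, the script can be exercised by hand by exporting GH_OST_LAST_BATCH_COPY_ERROR with a sample 1197 message while a gh-ost instance is serving its interactive socket at the socket_path above.)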

Please share any suggestions before I open a PR here, thanks!
