[m3msg] Support parallel consumerWriter Flushes to pick next Write #4331
+707
−146
What this PR does / why we need it:
When m3msg writes to multiple replicas of a consumer service's shards, it does so serially, blocking when required. That blocking introduces unnecessary latency and queue build-up in the producer service. Writes are serial to avoid sending each message to all replicas, which would significantly increase network usage; the tradeoff is that the consumer writer for another replica might have had enough room in its send buffer to accommodate the write. This PR introduces support for forced flushes, which avoid the extra network usage while still choosing a consumerWriter that we know will not block.
This is done by invoking ForcedFlush() concurrently on all consumerWriter replicas and then using the consumerWriter that returns first. A flush does not write any new data to the connection; it only flushes the data buffered so far. This way no additional data is put on the wire that would not have been written anyway.
Details:
A consumerWriter corresponds to one instance of the downstream service. Since an instance owns multiple shards of the m3msg topic, messages from multiple shards are multiplexed onto the same consumerWriter.
This can quickly fill up the flush buffer of a consumerWriter. When the consuming service is slow (for example, while it is starting up and warming its internal caches), it might not ACK messages in time, causing a build-up of messages in the m3msg producer queue. That leads to elevated consume latencies, memory pressure, and possibly OOMs. The existing method of picking a consumerWriter was random, so it could pick the slower one.
With this change we pick the consumerWriter that has the most available capacity in its flush buffer. Moreover, we initiate a ForcedFlush on all replicas in parallel and then pick the one that completes first. Note that this could still select a consumerWriter that does not have enough capacity, since other goroutines are queueing onto the same consumerWriter, but at minimum it picks one that was able to clear its flush buffer quickly. We emit a metric, "forced-flush-not-enough-buffer", that counts how often we return from a flush but still lack capacity in the consumerWriter for the upcoming write(). When that counter is elevated, WriteBufferSize should be tuned via the config.
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?: