
ProduceSync may block indefinitely despite context cancellation #930

Open
iwinux opened this issue Mar 12, 2025 · 1 comment

iwinux commented Mar 12, 2025

Description

The ProduceSync method in the kafka client can block indefinitely even when the context is cancelled. This occurs because ProduceSync uses a sync.WaitGroup to wait for all records to be processed, but there's no mechanism to break out of wg.Wait() when the context is cancelled.
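
A minimal sketch of the shape of the problem (plain Go, not franz-go code, assuming the internal flow described above): a bare wg.Wait() never consults the context, so cancelling it cannot unblock the wait.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(1)

	// Stand-in for a record whose delivery promise never fires while the
	// broker is unreachable: wg.Done() only happens far in the future.
	go func() {
		time.Sleep(time.Hour)
		wg.Done()
	}()

	go func() {
		<-ctx.Done()
		fmt.Println("ctx cancelled:", ctx.Err()) // printed after ~1s
	}()

	wg.Wait() // still blocked long after ctx expired; ctx is never consulted
	fmt.Println("all records flushed")
}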

Steps to Reproduce

  1. start a broker locally with default settings
  2. env KAFKA_HOST=127.0.0.1:9092 KAFKA_TOPIC=test go run produce.go
  3. the client starts producing records every 5s
  4. shut down the broker
  5. the client hangs at ProduceSync(), even after the deadline (1 min) has been exceeded
produce.go - client code that produces records repeatedly until the deadline
package main

import (
	"context"
	"fmt"
	"os"

	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func produce(ctx context.Context) error {
	client, err := kgo.NewClient(
		kgo.SeedBrokers(os.Getenv("KAFKA_HOST")),
		kgo.RecordDeliveryTimeout(10*time.Second),
		kgo.AllowAutoTopicCreation(),
	)
	if err != nil {
		return fmt.Errorf("failed to create client: %w", err)
	}
	defer client.Close()

	var sequence int64
	topic := os.Getenv("KAFKA_TOPIC")
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			sequence++

			record := &kgo.Record{
				Topic: topic,
				Value: []byte(fmt.Sprintf("%d", sequence)),
			}

			deadline, _ := ctx.Deadline()
			timeLeft := time.Until(deadline).Round(time.Second)
			log.Printf("Producing message with sequence %d to topic %s (%s remaining)", sequence, topic, timeLeft)

			if err := client.ProduceSync(ctx, record).FirstErr(); err != nil {
				return fmt.Errorf("failed to produce message: %w", err)
			}

			log.Printf("Message produced to partition %d at offset %d", record.Partition, record.Offset)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	if err := produce(ctx); err != nil {
		log.Fatalf("Failed to produce: %v", err)
	}
}

Possible Root Cause(s)

#520 (comment)

So, the client retries sending the same batch over and over again (which will be deduplicated due to the sequence numbers) until the client receives a response -- and the client ignores any end-user context cancellation or retry limit.

Proposed Fix

If the goal is to unblock ProduceSync, either of these might work:

  1. replace the sync.WaitGroup with golang.org/x/sync/errgroup.Group
  2. call wg.Wait() in another goroutine and signal completion with close(waitCh), so that waitCh can be checked alongside ctx.Done() in a select {} block (see the sketch below)
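
A rough sketch of option 2 (plain Go; waitOrCancel and waitCh are illustrative names, not franz-go API): wg.Wait() runs in its own goroutine and closes waitCh when it returns, so the waiting side can also observe ctx.Done().

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waitOrCancel waits for wg but gives up once ctx is done. The helper
// goroutine stays blocked on wg.Wait() until the count eventually reaches zero.
func waitOrCancel(ctx context.Context, wg *sync.WaitGroup) error {
	waitCh := make(chan struct{})
	go func() {
		wg.Wait()
		close(waitCh)
	}()
	select {
	case <-waitCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(1) // a "record" whose promise never fires while the broker is down

	fmt.Println(waitOrCancel(ctx, &wg)) // prints "context deadline exceeded" after ~1s
}

Note this only stops the caller from waiting; the records themselves are still in flight inside the client.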

However, the linked comment also mentions data loss and idempotent writes, which I don't quite understand.

@tr4pezohedron

+1. We ran into the same problem using ProduceSync. At some point, Kafka went down, and our requests hung indefinitely until Kafka came back up or we restarted our application pods, even though we expected them to fail once the context timed out.
