
Can the API give a clear warning message if BatchLogRecordProcessorBuilder's maxQueueSize and maxExportBatchSize are misconfigured? #6454

Open
tongshushan opened this issue May 11, 2024 · 10 comments · May be fixed by #7045
Labels: Feature Request, help wanted

Comments


tongshushan commented May 11, 2024

Hello,
For BatchLogRecordProcessorBuilder configurations, if the user misconfigures maxQueueSize < maxExportBatchSize, can the API give a clear warning message? At present there is no hint, and the logs are silently lost.

io.opentelemetry: 1.37.0

related link:
#6443

Thanks.


tongshushan commented May 13, 2024

Additionally: if the user misconfigures maxQueueSize < maxExportBatchSize, then besides giving a warning message, could we set maxExportBatchSize = maxQueueSize, so that we can ensure the logs are not lost?
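In the meantime, callers can avoid the trap by keeping the two values consistent themselves. A sketch using the SDK builder (the `exporter` instance and the concrete values are illustrative, not prescriptive):

```java
// Assumes an existing LogRecordExporter named `exporter`.
int maxQueueSize = 2048;        // SDK default
int maxExportBatchSize = 512;   // SDK default; keep <= maxQueueSize

BatchLogRecordProcessor processor =
    BatchLogRecordProcessor.builder(exporter)
        .setMaxQueueSize(maxQueueSize)
        // Clamp defensively so the size-based export trigger can still fire:
        .setMaxExportBatchSize(Math.min(maxExportBatchSize, maxQueueSize))
        .build();
```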

jkwatson (Contributor) commented:

@tongshushan Are you able to put in a PR to address this?

chukunx added a commit to chukunx/opentelemetry-java that referenced this issue Jan 26, 2025

chukunx commented Jan 28, 2025

Hey @jkwatson, I took a stab at it; please let me know if it looks good: #7045


trask commented Jan 28, 2025

I have a similar question to @breedx-splk's #7024 (comment): I'm not clear on why maxQueueSize must be greater than maxExportBatchSize to ensure data loss doesn't occur.

At the same time, I agree that I would probably recommend configuring maxQueueSize >= maxExportBatchSize. I'd just like to be clear whether we're recommending this as a must to avoid data loss, or as a recommendation / best practice.

jack-berg (Member) commented:

These lines of code are the problem:

Together, they mean that the worker thread is never notified that it's time to export based on the queue filling up. Instead, it always has to wait for the next export time based on scheduleDelayNanos.

And so, as the issue poster points out, the seemingly benign mistake of setting maxExportBatchSize greater than maxQueueSize results in data loss as soon as the queue starts filling up faster than the time allotted in scheduleDelayNanos.

We don't necessarily have to throw an exception when the user misconfigures like this, but we need to fix the behavior so that when the queue fills up, the worker is properly signaled to perform an export.
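A minimal sketch of the failure mode (heavily simplified; the real BatchLogRecordProcessor worker is more involved, and all names here are hypothetical): with maxExportBatchSize > maxQueueSize, the size-based export trigger can never fire, so between scheduled exports the bounded queue overflows and records are silently dropped.

```java
import java.util.ArrayDeque;

// Simplified model: a producer offers records to a bounded queue between
// scheduled exports; the worker only drains early when the batch would
// reach maxExportBatchSize, which is impossible if it exceeds maxQueueSize.
final class QueueOverflowSketch {
  static int droppedBetweenExports(int maxQueueSize, int maxExportBatchSize, int produced) {
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    int dropped = 0;
    for (int i = 0; i < produced; i++) {
      if (queue.size() >= maxExportBatchSize) {
        queue.clear(); // size-based export would happen here
      }
      if (queue.size() >= maxQueueSize) {
        dropped++;     // queue full: record is silently dropped
      } else {
        queue.add(i);
      }
    }
    return dropped;
  }

  public static void main(String[] args) {
    // 2048 records arrive before the next scheduleDelayNanos tick:
    System.out.println(droppedBetweenExports(512, 1024, 2048)); // trigger never fires: 1536 dropped
    System.out.println(droppedBetweenExports(512, 256, 2048));  // trigger fires early: 0 dropped
  }
}
```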


trask commented Jan 28, 2025

Together, they mean that the worker thread is never notified that it's time to export based on the queue filling up. Instead, it always has to wait for the next export time based on scheduleDelayNanos.

oh, yikes! I totally missed the wait/notify, I was thinking the queue was continuously drained (but I like the cleverness of limiting the context switching 👍)


chukunx commented Feb 2, 2025

Hey, thanks folks for chiming in. I did some legwork to see how the other language SDKs handle this matter; please let me know if that helps and how you'd like to proceed.

jack-berg (Member) commented:

Thanks for that @chukunx. I think opentelemetry-java's behavior I described here is a bug. Options to fix:

  1. Continue to allow maxExportBatchSize > maxQueueSize, but fix the bug so that export is triggered when the maxQueueSize is reached. Log a warning when maxExportBatchSize > maxQueueSize.
  2. Throw an exception when maxExportBatchSize > maxQueueSize.

Option 1 represents a more lenient approach. We accept the invalid config and essentially ignore it, since maxExportBatchSize doesn't play any role once maxQueueSize triggers the export and the queue is drained into an export batch of size maxQueueSize.

Option 2 is more rigid, representing the fail fast mentality.

We generally fail fast in this repo, although this is a bit of a special case because nothing is actually broken (after we fix the bug) when maxExportBatchSize > maxQueueSize. In a related situation, we don't throw an exception when users configure an OTLP connection timeout to be greater than the overall request timeout, despite there being no practical value in this config. This suggests option 1 is the solution.


chukunx commented Feb 4, 2025

Glad that helped, @jack-berg! Option 1 sounds better from a backward-compatibility point of view as well.

For implementation I can see two approaches:

a. Add an additional check to this condition

```java
if (batch.size() >= maxExportBatchSize || System.nanoTime() >= nextExportTime) {
```

so that it becomes something like

```java
if (batch.size() >= maxExportBatchSize || batch.size() >= maxQueueSize || System.nanoTime() >= nextExportTime) {
```

which effectively means batch.size() is checked against min(maxExportBatchSize, maxQueueSize). (Second thought on this: data loss can still happen when the queue fills up fast.)

b. Compare the two values in the builder and adjust them so that the max export batch size does not exceed the max queue size when creating the processor. The Go implementation is a good one to borrow, in my opinion.

```go
if maxExportBatchSize > maxQueueSize {
	if DefaultMaxExportBatchSize > maxQueueSize {
		maxExportBatchSize = maxQueueSize
	} else {
		maxExportBatchSize = DefaultMaxExportBatchSize
	}
}
```

Do you have a preference between the two approaches?
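For reference, approach (b) translates directly to Java. A sketch (the class and method names are hypothetical; the constant mirrors opentelemetry-java's default export batch size of 512):

```java
// Hypothetical builder-side clamp, mirroring the Go SDK snippet above.
final class BatchSizeClamp {
  static final int DEFAULT_MAX_EXPORT_BATCH_SIZE = 512; // opentelemetry-java default

  static int clamp(int maxExportBatchSize, int maxQueueSize) {
    if (maxExportBatchSize > maxQueueSize) {
      // Fall back to the default if it still fits, otherwise cap at the queue size.
      return DEFAULT_MAX_EXPORT_BATCH_SIZE > maxQueueSize
          ? maxQueueSize
          : DEFAULT_MAX_EXPORT_BATCH_SIZE;
    }
    return maxExportBatchSize;
  }

  public static void main(String[] args) {
    System.out.println(clamp(1000, 100));  // default 512 also exceeds the queue: cap at 100
    System.out.println(clamp(1000, 600));  // default fits: fall back to 512
    System.out.println(clamp(128, 2048));  // valid config is left unchanged: 128
  }
}
```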

jack-berg (Member) commented:

(Second thought on this: data loss can still happen when queue is filling up fast)

Yes, data loss can happen with the batch processor. This is necessary to protect an application from unbounded resource utilization. Users can detect data loss and reconfigure (turn off instrumentation, reduce the sampling rate, increase the batch processor queue size) by looking at the processedSpans counter where dropped=true, which is incremented here.

a. Add an additional check to this condition

I would also want to double check that the batch and spansNeeded fields are being sized / set appropriately, since they are involved in signalling as well.

Compare the two values in the builder and adjust them so that the max export batch size does not exceed the max queue size when creating the processor.

This is easier to reason about and implement, IMO. I think there is a minor semantic difference between the two approaches: the first allows a situation where the worker thread is triggered by the max queue size being reached, but produces an export batch bigger than the max queue size, since spans may continue to flow into the queue as it is being drained. But I think we should probably ignore this edge case and opt for the simpler solution unless needed.
