
Can the API give a clear warning message if BatchLogRecordProcessorBuilder's maxQueueSize and maxExportBatchSize are misconfigured? #6454

Open
tongshushan opened this issue May 11, 2024 · 10 comments · May be fixed by #7045
Labels: Feature Request, help wanted

Comments


tongshushan commented May 11, 2024

Hello,
For BatchLogRecordProcessorBuilder configurations, if the user misconfigures maxQueueSize < maxExportBatchSize, can the API give a clear warning message? At present there is no hint, and the logs are silently lost.

io.opentelemetry: 1.37.0

related link:
#6443

Thanks.


tongshushan commented May 13, 2024

Additionally: if the user misconfigures maxQueueSize < maxExportBatchSize, then besides giving a warning message, could we set maxExportBatchSize = maxQueueSize, so that we can ensure the logs are not lost?
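In the meantime, callers can avoid the trap by keeping the two values consistent themselves. A sketch using the SDK builder (the `exporter` instance and the concrete values are illustrative, not prescriptive):

```java
// Assumes an existing LogRecordExporter named `exporter`.
int maxQueueSize = 2048;        // SDK default
int maxExportBatchSize = 512;   // SDK default; keep <= maxQueueSize

BatchLogRecordProcessor processor =
    BatchLogRecordProcessor.builder(exporter)
        .setMaxQueueSize(maxQueueSize)
        // Clamp defensively so the size-based export trigger can still fire:
        .setMaxExportBatchSize(Math.min(maxExportBatchSize, maxQueueSize))
        .build();
```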

jkwatson (Contributor) commented:

@tongshushan Are you able to put in a PR to address this?

chukunx added a commit to chukunx/opentelemetry-java that referenced this issue Jan 26, 2025

chukunx commented Jan 28, 2025

Hey @jkwatson, I took a stab at it; please let me know if it looks good: #7045


trask commented Jan 28, 2025

I have a similar question to @breedx-splk's #7024 (comment): I'm not clear on why maxQueueSize must be greater than maxExportBatchSize to ensure data loss doesn't occur.

At the same time, I agree that I would probably recommend configuring maxQueueSize >= maxExportBatchSize. I'd just like to be clear whether we're recommending this as a must to avoid data loss, or as a recommendation / best practice.

jack-berg (Member) commented:

These lines of code are the problem:

Together, they mean that the worker thread is never notified that it's time to export based on the queue filling up. Instead, it always has to wait for the next export time based on scheduleDelayNanos.

And so, as the issue poster points out, the seemingly benign mistake of setting maxExportBatchSize greater than maxQueueSize results in data loss as soon as the queue starts filling up faster than the time allotted in scheduleDelayNanos.

We don't necessarily have to throw an exception when the user misconfigures like this, but we need to fix the behavior so that when the queue fills up, the worker is properly signaled to perform an export.
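A minimal sketch of the failure mode (heavily simplified; the real BatchLogRecordProcessor worker is more involved, and all names here are hypothetical): with maxExportBatchSize > maxQueueSize, the size-based export trigger can never fire, so between scheduled exports the bounded queue overflows and records are silently dropped.

```java
import java.util.ArrayDeque;

// Simplified model: a producer offers records to a bounded queue between
// scheduled exports; the worker only drains early when the batch would
// reach maxExportBatchSize, which is impossible if it exceeds maxQueueSize.
final class QueueOverflowSketch {
  static int droppedBetweenExports(int maxQueueSize, int maxExportBatchSize, int produced) {
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    int dropped = 0;
    for (int i = 0; i < produced; i++) {
      if (queue.size() >= maxExportBatchSize) {
        queue.clear(); // size-based export would happen here
      }
      if (queue.size() >= maxQueueSize) {
        dropped++;     // queue full: record is silently dropped
      } else {
        queue.add(i);
      }
    }
    return dropped;
  }

  public static void main(String[] args) {
    // 2048 records arrive before the next scheduleDelayNanos tick:
    System.out.println(droppedBetweenExports(512, 1024, 2048)); // trigger never fires: 1536 dropped
    System.out.println(droppedBetweenExports(512, 256, 2048));  // trigger fires early: 0 dropped
  }
}
```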


trask commented Jan 28, 2025

Together, they mean that the worker thread is never notified that it's time to export based on the queue filling up. Instead, it always has to wait for the next export time based on scheduleDelayNanos.

oh, yikes! I totally missed the wait/notify, I was thinking the queue was continuously drained (but I like the cleverness of limiting the context switching 👍)


chukunx commented Feb 2, 2025

Hey, thanks folks for chiming in. I did some legwork to see how the other language SDKs handle this matter; please let me know if that helps and how you'd like to proceed.

jack-berg (Member) commented:

Thanks for that @chukunx. I think opentelemetry-java's behavior I described here is a bug. Options to fix:

  1. Continue to allow maxExportBatchSize > maxQueueSize, but fix the bug so that export is triggered when the maxQueueSize is reached. Log a warning when maxExportBatchSize > maxQueueSize.
  2. Throw an exception when maxExportBatchSize > maxQueueSize.

Option 1 represents a more lenient approach. We accept the invalid config and essentially ignore it, since maxExportBatchSize doesn't play any role once maxQueueSize triggers the export and the queue is drained into an export batch of size maxQueueSize.

Option 2 is more rigid, representing the fail fast mentality.

We generally fail fast in this repo, although this is a bit of a special case because nothing is actually broken (after we fix the bug) when maxExportBatchSize > maxQueueSize. In a related situation, we don't throw an exception when users configure an OTLP connection timeout to be greater than the overall request timeout, despite there being no practical value in this config. This suggests option 1 is the solution.


chukunx commented Feb 4, 2025

Glad that helped, @jack-berg! Option 1 sounds better from a backward-compatibility point of view as well.

For implementation I can see two approaches:

a. Add an additional check to this condition

```java
if (batch.size() >= maxExportBatchSize || System.nanoTime() >= nextExportTime) {
```

so that it becomes something like

```java
if (batch.size() >= maxExportBatchSize || batch.size() >= maxQueueSize || System.nanoTime() >= nextExportTime) {
```

which effectively means batch.size() is checked against min(maxExportBatchSize, maxQueueSize). (Second thought on this: data loss can still happen when the queue fills up fast.)

b. Compare the two values in the builder and adjust them so that the max export batch size does not exceed the max queue size when creating the processor. The Go implementation is a good one to borrow, in my opinion.

```go
if maxExportBatchSize > maxQueueSize {
	if DefaultMaxExportBatchSize > maxQueueSize {
		maxExportBatchSize = maxQueueSize
	} else {
		maxExportBatchSize = DefaultMaxExportBatchSize
	}
}
```

Do you have a preference between the two approaches?
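For reference, approach (b) translates directly to Java. A sketch (the class and method names are hypothetical; the constant mirrors opentelemetry-java's default export batch size of 512):

```java
// Hypothetical builder-side clamp, mirroring the Go SDK snippet above.
final class BatchSizeClamp {
  static final int DEFAULT_MAX_EXPORT_BATCH_SIZE = 512; // opentelemetry-java default

  static int clamp(int maxExportBatchSize, int maxQueueSize) {
    if (maxExportBatchSize > maxQueueSize) {
      // Fall back to the default if it still fits, otherwise cap at the queue size.
      return DEFAULT_MAX_EXPORT_BATCH_SIZE > maxQueueSize
          ? maxQueueSize
          : DEFAULT_MAX_EXPORT_BATCH_SIZE;
    }
    return maxExportBatchSize;
  }

  public static void main(String[] args) {
    System.out.println(clamp(1000, 100));  // default 512 also exceeds the queue: cap at 100
    System.out.println(clamp(1000, 600));  // default fits: fall back to 512
    System.out.println(clamp(128, 2048));  // valid config is left unchanged: 128
  }
}
```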

jack-berg (Member) commented:

(Second thought on this: data loss can still happen when queue is filling up fast)

Yes, data loss can happen with the batch processor. This is necessary to protect an application from unbounded resource utilization. Users can detect data loss and reconfigure (turn off instrumentation, reduce the sampling rate, increase the batch processor queue size) by looking at the processedSpans counter where dropped=true, which is incremented here.

a. Add an additional check to this condition

I would also want to double check that the batch and spansNeeded fields are being sized / set appropriately, since they are involved in signalling as well.

Compare the two values in the builder and adjust them so that the max export batch size does not exceed the max queue size when creating the processor.

This is easier to reason about and implement, IMO. I think there is a minor semantic difference between the two approaches: the first allows a situation where the worker thread is triggered by the max queue size being reached, but produces an export batch bigger than the max queue size, since spans may continue to flow into the queue as it is being drained. But I think we should probably ignore this edge case and opt for the simpler solution unless needed.
