[processors/batch] Size calculations inconsistent, causing unbounded batch size in bytes #3262
Comments
There's also the added question of how batch should handle records where a single unsplittable record exceeds the configured maximum size.
Counting the bytes is unfortunately very exporter-protocol specific, and there is no easy way for us to do it in a way that will work for all protocols. Even for OTLP it is pretty expensive to calculate size, so it would be better if this were an additional component that splits by size rather than part of the core batch processor.
Hey Bogdan, thanks for your response! I totally understand that counting bytes is pretty exporter-specific and can be expensive, so adding it to this module may not be desired. I think there's a use case for a core processor that batches by size though, as many of the core exporters (otlp, otlphttp, jaeger, etc.) may send data to an endpoint behind a reverse proxy like nginx or AWS API Gateway that limits payload sizes. It wouldn't be perfect for sure, and if the limit on the gateway is 5MB then people may have to set their batch size limit to 4.5MB, but I think a central size-based batcher is still helpful/necessary given how many exporters are likely to have some sort of payload size limitation in place, and it would be easier to build a central batch size processor than to add size/splitting logic to all of the relevant exporters.
If I'm not mistaken, the payload size limit for API Gateway is 10MB, and for Lambda it's 6MB.
Can a byte size limiting option be implemented by counting the bytes of the in-memory representation of the records? Obviously, being a generic component, it's not possible for the batch processor to know the final serialized size for an unknown downstream exporter, but each exporter likely has a semi-predictable translation ratio between the size of the in-memory representation and the serialized representation. E.g. maybe, on average, the OTLP representation is 1.2x the size of the in-memory representation. This would give a better, although still imperfect, way of limiting the size of the exported payload.
I think being able to batch by in-memory bytes, although it wouldn't be accurate without knowing the exporter representation, would give a more accurate way of defining batch limits than item count when the receiver expects payloads of a given size. Systems that behave better with a more constant payload size would also benefit, regardless of payload limits, as currently the distribution of payload sizes can vary tremendously depending on the nature of the spans within them. Sending small (in byte size) batches can affect performance.
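A minimal sketch of the ratio idea above, assuming a hypothetical per-exporter `serializationRatio` (the 1.2x figure is the example from the comment, not a measured value) and greedy batching by estimated serialized size:

```go
package main

import "fmt"

// Estimated ratio between in-memory size and serialized size.
// Hypothetical value; it would need to be measured per exporter.
const serializationRatio = 1.2

// batchBySize greedily groups records into batches whose *estimated*
// serialized size stays under maxBytes. recordSizes are in-memory sizes.
func batchBySize(recordSizes []int, maxBytes float64) [][]int {
	var batches [][]int
	var current []int
	currentBytes := 0.0
	for _, sz := range recordSizes {
		est := float64(sz) * serializationRatio
		// Cut a new batch when adding this record would exceed the limit.
		if len(current) > 0 && currentBytes+est > maxBytes {
			batches = append(batches, current)
			current, currentBytes = nil, 0
		}
		current = append(current, sz)
		currentBytes += est
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}

func main() {
	// Three records of 400 in-memory bytes each; with a 1.2x ratio each is
	// ~480 estimated serialized bytes, so a 1000-byte limit fits two per batch.
	fmt.Println(batchBySize([]int{400, 400, 400}, 1000)) // [[400 400] [400]]
}
```

As the thread notes, the estimate is imperfect, but it bounds the flushed payload by bytes rather than by item count.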
I'm considering 3 approaches for this task:
The second idea comes from the assumption that the cost of calling bm.sizer.MetricsSize(req) to get the request size is the biggest concern in this thread, and we would like to minimise this while still having a good enough solution. Do we actually have a preference between the approaches?
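A sketch of how the sizing cost could be amortised, assuming the expensive call is the per-request sizing (the `request.Size` method below is a stand-in for something like `bm.sizer.MetricsSize(req)`; the types are hypothetical, not collector code): keep a running byte total so the expensive computation runs once per incoming request rather than on every flush check.

```go
package main

import "fmt"

// request is a stand-in for an incoming pdata request; Size simulates an
// expensive per-request size computation such as sizer.MetricsSize(req).
type request struct{ bytes int }

var sizeCalls int // counts how many times the expensive sizing ran

func (r request) Size() int {
	sizeCalls++
	return r.bytes
}

// batcher keeps a running byte total so Size() is called exactly once
// per request instead of re-sizing the whole batch on each check.
type batcher struct {
	totalBytes int
	items      int
}

func (b *batcher) add(r request) {
	b.totalBytes += r.Size() // single expensive call per request
	b.items++
}

func (b *batcher) shouldFlush(maxBytes int) bool {
	return b.totalBytes >= maxBytes // cheap comparison, no re-sizing
}

func main() {
	b := &batcher{}
	for _, r := range []request{{3000}, {3000}, {3000}} {
		b.add(r)
	}
	fmt.Println(b.shouldFlush(8000), sizeCalls) // true 3
}
```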
Describe the bug

The `batch` processor is not limiting batch size by bytes, but rather by item count. I believe this is due to confusing nomenclature, let me try to explain...

In `config.go` we define `SendBatchSize` and `SendBatchMaxSize`, neither of which specify the unit being counted, i.e. whether the size is counting bytes or items.

In `batch_processor.go`, `size()` explicitly refers to bytes, and `itemCount()` is explicitly the number of items based on the documentation.

The problem begins in `processItem()`, where the line `for bp.batch.itemCount() >= bp.sendBatchSize` begins to mix definitions of count and size and compares the number of items to the configured batch size.

When we later `sendItems()` we have two metrics, `statBatchSendSize` and `statBatchSendSizeBytes`, which are properly counted using `itemCount()` and `size()` respectively.

In the `export()` functions we double down on the mix-up with `if sendBatchMaxSize > 0 && bl.logCount > sendBatchMaxSize`, again mixing count and size terms when the config and function documentation shows that they are distinct values.

In `split_logs.go` the comparison conflates count and size once again.

The tests for this appear to be misleading, because they are explicitly counting bytes for the size, but then use a `logsPerRequest` variable when calculating the expected number of batches.

This causes logs/metrics/traces to be batched based on item count rather than byte size, which means we have no way to account for payload size limits when using the `otlphttpexporter` or any other exporter that can be payload size sensitive.

Steps to reproduce
Any configuration will reproduce this issue.
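The conflation can be illustrated in miniature (the types below are simplified stand-ins, not the actual processor code): a threshold that users may read as bytes is compared against an item count, so the byte size of a flushed batch is effectively unbounded.

```go
package main

import "fmt"

// batch is a simplified stand-in: spanSizes holds per-span byte sizes.
type batch struct{ spanSizes []int }

func (b *batch) itemCount() int { return len(b.spanSizes) }

func (b *batch) size() int {
	total := 0
	for _, s := range b.spanSizes {
		total += s
	}
	return total
}

func main() {
	sendBatchSize := 8192 // documented ambiguously; easily read as bytes

	// Two spans of 1 MiB each: only 2 items, but 2 MiB of payload.
	b := &batch{spanSizes: []int{1 << 20, 1 << 20}}

	// The processor flushes on item count, as in
	// `for bp.batch.itemCount() >= bp.sendBatchSize`:
	fmt.Println("flush by count:", b.itemCount() >= sendBatchSize) // false: batch keeps growing
	fmt.Println("flush by bytes:", b.size() >= sendBatchSize)      // true: 2 MiB > 8 KiB
}
```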
What did you expect to see?

I expected setting `send_batch_size` would limit the size of a batch in bytes, not by item count. I would even go as far as to point out that the default of 8192, a common power-of-two memory multiple (8 KiB, 8 MiB, 8 GiB depending on units), creates a super misleading experience.

What did you see instead?

The batch processor batches by item count, which causes us to hit payload size limits in common reverse proxies or frameworks, with no way to restrict the batch size in bytes to stay under them.
What version did you use?
Version: main branch
What config did you use?
Environment
OS: `otel/opentelemetry-collector-contrib:0.27.0` on AWS EKS
Compiler (if manually compiled): N/A, Docker image
Additional context
I was considering putting together a PR for this, but decided that given the pervasiveness of the problem (affects logs, traces, metrics, and processor telemetry) it would be better to open an issue first to discuss the desired state.
I consider this a pretty big issue, potentially even a blocker for 1.0, as it's a common configuration (payload size limits) that can cause undeliverable batches that are unrecoverable.
I would propose that we deprecate the configurations `send_batch_size` and `send_batch_max_size` and replace them with two new values, `send_batch_min_size_bytes` and `send_batch_max_size_bytes`, and temporarily map the old names to the new names internally with a warning log so we don't break anything. Then we will need to update all code to look at `size()` rather than `itemCount()` when determining whether or not to split the batch. This serves two purposes: it clarifies the difference between the two configuration variables, and it clarifies the units and purpose of the variables.

I can take a stab at this if the maintainers agree with my proposed approach, but I am new to Go so it may take some time, and given the implications it may be wiser to have someone more familiar take this on.
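A sketch of the proposed deprecation mapping (the key names come from the proposal above; the struct and mapping function are hypothetical, not existing collector code):

```go
package main

import (
	"fmt"
	"log"
)

// Config sketches the proposed shape: old keys kept temporarily and
// mapped onto the new byte-based keys with a deprecation warning.
type Config struct {
	SendBatchSize         uint32 `mapstructure:"send_batch_size"`     // deprecated
	SendBatchMaxSize      uint32 `mapstructure:"send_batch_max_size"` // deprecated
	SendBatchMinSizeBytes uint32 `mapstructure:"send_batch_min_size_bytes"`
	SendBatchMaxSizeBytes uint32 `mapstructure:"send_batch_max_size_bytes"`
}

// applyDeprecations maps old names onto the new ones when only the
// old ones are set, logging a warning so users can migrate.
func (c *Config) applyDeprecations() {
	if c.SendBatchSize != 0 && c.SendBatchMinSizeBytes == 0 {
		log.Println("warning: send_batch_size is deprecated, use send_batch_min_size_bytes")
		c.SendBatchMinSizeBytes = c.SendBatchSize
	}
	if c.SendBatchMaxSize != 0 && c.SendBatchMaxSizeBytes == 0 {
		log.Println("warning: send_batch_max_size is deprecated, use send_batch_max_size_bytes")
		c.SendBatchMaxSizeBytes = c.SendBatchMaxSize
	}
}

func main() {
	cfg := &Config{SendBatchSize: 8192}
	cfg.applyDeprecations()
	fmt.Println(cfg.SendBatchMinSizeBytes) // 8192
}
```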