LogsDB compatibility: specify if the ordering of arrays needs to be preserved #2372

Closed
felixbarny opened this issue Aug 27, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@felixbarny
Member

We're about to roll out LogsDB for all integrations. LogsDB uses synthetic _source. The result is that _source may differ from the original one in several ways. For example, the ordering of arrays is not preserved and values in an array are de-duplicated (internally arrays are stored in a sorted set).
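
To make the effect concrete, here is a minimal Python sketch of what "stored in a sorted set" implies for an array value. It is a simplified model for illustration only, not Elasticsearch's actual implementation, and the function name `synthetic_array` is hypothetical.

```python
def synthetic_array(values):
    """Simplified model of an array rebuilt from a sorted set:
    duplicates collapse and elements come back in sorted order.
    (Illustrative assumption; not Elasticsearch code.)"""
    return sorted(set(values))


original = ["-v", "build", "-v", "--all"]
rebuilt = synthetic_array(original)

print(original)  # ['-v', 'build', '-v', '--all']
print(rebuilt)   # ['--all', '-v', 'build']
```

The repeated "-v" is gone and the remaining values no longer appear in the order they were sent.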

I'd like to propose that ECS define for which fields the ordering is important, so that store_array_source should be enabled for them. This comes with a storage overhead but allows us to return the original values.

An example for a field where the ordering is important is process.args:

ecs/schemas/process.yml

Lines 143 to 153 in 5376570

- name: args
  level: extended
  type: keyword
  short: Array of process arguments.
  description: >
    Array of process arguments, starting with the absolute path to the executable.
    May be filtered to protect sensitive information.
  example: "[\"/usr/bin/ssh\", \"-l\", \"user\", \"10.0.0.16\"]"
  normalize:
    - array

The ordering isn't always important. For example, I'd consider the storage tradeoff for process.thread.capabilities.permitted to not be worth it. What matters here is the set of capabilities a thread permits, not in which order.

ecs/schemas/process.yml

Lines 205 to 215 in 5376570

- name: thread.capabilities.permitted
  level: extended
  type: keyword
  short: Array of capabilities a thread could assume.
  pattern: ^(CAP_[A-Z_]+|\d+)$
  description: >
    This is a limiting superset for the effective capabilities that the
    thread may assume.
  example: "[\"CAP_BPF\", \"CAP_SYS_ADMIN\"]"
  normalize:
    - array
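
The difference between the two fields can be sketched in Python using the example values from the schema snippets. This assumes the simplified sorted-set model described in this issue; the helper names `as_sorted_set`, `order_preserved`, and `same_members` are hypothetical and exist only for this illustration.

```python
def as_sorted_set(values):
    """Simplified model of a sorted-set round trip (illustrative assumption)."""
    return sorted(set(values))


def order_preserved(values):
    """True if the round trip returns exactly the original sequence."""
    return as_sorted_set(values) == list(values)


def same_members(values):
    """True if the round trip keeps the same set of distinct values."""
    return set(as_sorted_set(values)) == set(values)


# process.args: the sequence is the command line, so order carries meaning.
args = ["/usr/bin/ssh", "-l", "user", "10.0.0.16"]
print(as_sorted_set(args))    # ['-l', '/usr/bin/ssh', '10.0.0.16', 'user']
print(order_preserved(args))  # False

# process.thread.capabilities.permitted: only membership matters.
caps = ["CAP_SYS_ADMIN", "CAP_BPF"]
print(same_members(caps))     # True
```

Under this model, membership always survives the round trip, while sequence survives only if the input already happened to be sorted and free of duplicates. That is why a per-field decision makes sense.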

@ruflin
Contributor

ruflin commented Aug 28, 2024

++ on the idea that Elasticsearch is smart about if a field contains an array or not and how to store it. If you send an array, ordering is preserved without having to specify any special mappings. This also ensures users don't have to learn about any new concept.

@andrewkroh
Member

> ++ on the idea that Elasticsearch is smart about if a field contains an array or not and how to store it. If you send an array, ordering is preserved without having to specify any special mappings. This also ensures users don't have to learn about any new concept.

👍

I like this approach. Then we turn the question around and ask which ECS array fields we can optimize by never storing source because they are truly sets.

@andrewkroh
Member

Closing in favor of #2376 because the approach changed to opt-in.
