Do not respect synthetic_source_keep=arrays if type parses arrays #127796

@parkertimmins (Contributor) commented May 7, 2025:

Types that parse arrays directly should not need to store values in _ignored_source if synthetic_source_keep=arrays. Since they have custom handling of arrays, it provides no benefit to store in _ignored_source when there are multiple values of the type.

For example, consider the following document, where point is a field of type geo_point:

{
    "obj":  { "point": [[2,3], [4, 5]] }
}

This set of points is not subject to synthetic_source_keep=arrays since the geo_point mapper does custom array parsing. Since this is the case, it does not make sense to require that the following equivalent document use _ignored_source:

{
    "obj":  [
         { "point": [2,3] },
         { "point": [4,5] }
   ]
}

Currently, this second document does use _ignored_source, which is a waste of space. This PR causes the second document to behave the same as the first, not using _ignored_source.
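To illustrate why the two documents above carry the same information for a type that parses its own arrays, here is a minimal Python sketch. It is a toy model, not Elasticsearch's parser: the helper names `is_single_point` and `collect_points` are hypothetical, and points are assumed to be given only in the `[lon, lat]` array form used in the examples.

```python
from typing import Any


def is_single_point(value: Any) -> bool:
    """True for a bare [lon, lat] pair, i.e. a two-element list of numbers."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value)
    )


def collect_points(doc: dict) -> list:
    """Gather every point under obj.point, whether obj is one object or an array of objects."""
    obj = doc["obj"]
    objects = obj if isinstance(obj, list) else [obj]
    points = []
    for o in objects:
        value = o["point"]
        if is_single_point(value):
            points.append(value)   # a single point
        else:
            points.extend(value)   # an array of points, parsed by the type itself
    return points


doc_a = {"obj": {"point": [[2, 3], [4, 5]]}}
doc_b = {"obj": [{"point": [2, 3]}, {"point": [4, 5]}]}

# Both shapes yield the same multi-valued field.
assert collect_points(doc_a) == collect_points(doc_b)
```

Under this model the second document adds nothing that the first does not already contain, which is the argument for treating them alike.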

Fixes #126155

@elasticsearchmachine added the needs:triage and v9.1.0 labels May 7, 2025
@parkertimmins requested a review from lkts May 7, 2025 03:28
@parkertimmins added the :StorageEngine/Mapping label May 7, 2025
@elasticsearchmachine added the Team:StorageEngine label and removed the needs:triage label May 7, 2025

@elasticsearchmachine (Collaborator) commented:

Hi @parkertimmins, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@parkertimmins requested a review from martijnvg May 7, 2025 14:37

@parkertimmins (Contributor, Author) commented May 7, 2025:

Should we add the getFieldNameOffset machinery to geo_point to preserve the order of arrays? If so, perhaps we don't want this change? I think that if we are using getFieldNameOffset, we will probably want to use _ignored_source for object arrays 🤔.

@elasticsearchmachine (Collaborator) commented:

Hi @parkertimmins, I've updated the changelog YAML for you.

@lkts (Contributor) left a comment:

I think an offset-based implementation would definitely be a better solution than the "do nothing" approach we currently have. We can still do that later, though, and this PR does not block that work. So I am in favour of merging this.

Should we add a note about this here https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/geo-point#geo-point-synthetic-source?

@@ -460,7 +460,7 @@ static void parseObjectOrField(DocumentParserContext context, Mapper mapper) thr
 if (context.canAddIgnoredField()
     && (fieldMapper.syntheticSourceMode() == FieldMapper.SyntheticSourceMode.FALLBACK
         || sourceKeepMode == Mapper.SourceKeepMode.ALL
-        || (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS && context.inArrayScope())
+        || (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS && context.inArrayScope() && parsesArrayValue(mapper) == false)
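
The changed condition can be read as a small truth table. The Python sketch below models only the clauses visible in the hunk above; the hunk is cut off, so any further clauses in the real condition are not represented. The function name `keeps_in_ignored_source` and its parameters are hypothetical stand-ins for the Java calls, not part of the Elasticsearch code base.

```python
from enum import Enum


class SourceKeepMode(Enum):
    NONE = "none"
    ARRAYS = "arrays"
    ALL = "all"


def keeps_in_ignored_source(
    keep_mode: SourceKeepMode,
    in_array_scope: bool,
    parses_array_value: bool,
    fallback: bool = False,
    can_add_ignored_field: bool = True,
) -> bool:
    """Model of the visible clauses: should the value go to _ignored_source?"""
    if not can_add_ignored_field:
        return False
    return bool(
        fallback
        or keep_mode is SourceKeepMode.ALL
        # New in this PR: ARRAYS no longer applies to types that parse arrays themselves.
        or (keep_mode is SourceKeepMode.ARRAYS and in_array_scope and not parses_array_value)
    )


# A type without custom array parsing, inside an object array: still stored.
print(keeps_in_ignored_source(SourceKeepMode.ARRAYS, True, parses_array_value=False))
# A type that parses arrays itself (geo_point in the description): not stored.
print(keeps_in_ignored_source(SourceKeepMode.ARRAYS, True, parses_array_value=True))
```

Only the ARRAYS branch changes; FALLBACK and ALL behave as before in this model.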

@lkts (Contributor) commented May 7, 2025:

This is a correct fix in my opinion. But now that you mention offsets, I wonder how this works if offsets are enabled. It seems that (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS && context.inArrayScope()) is true, so we'll proceed with storing the value in ignored source. But that seems wrong, since we should be using offsets?

@parkertimmins (Contributor, Author) commented May 7, 2025:

If we enable offsets, parsesArrayValue will still be true for geo_point, and we won't use ignored source, right? I'm not totally sure if this is safe though. Maybe by adding offsets, geo_point will be subject to the same issue that other types have when they are in an array context. In which case we would want to use ignored_source.

Contributor replied:

Right, I was asking about existing fields that implement offsets logic, like ip. Do they not have the same problem?

@parkertimmins (Contributor, Author) replied:

I think existing fields with offsets do have this issue, and need to use stored values. For example, if the clause

|| (sourceKeepMode == Mapper.SourceKeepMode.ARRAYS && context.inArrayScope() && parsesArrayValue(mapper) == false)

is commented out, the following requests:


curl -XPUT localhost:9200/idx7 -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "obj": {
        "properties": {
          "str": { 
            "synthetic_source_keep": "arrays",
            "type": "keyword"
          }
        }
      }
    }
  }
}
'

curl -XPOST localhost:9200/idx7/_doc -H 'Content-Type: application/json' -d'
{
    "obj":  [
        { 
            "str": ["cat", "dog"]
        },
        { 
            "str": "cat"
        }
    ]
}
'

Results in:

"_source": {
  "obj": {
    "str": [
      "cat",
      "dog"
    ]
  }
}

Meaning we do need the stored source. Or is that not what you're talking about?
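
The collapse seen in that result can be mimicked with a short Python sketch. It is a simplified model that assumes the values of obj.str from all objects are merged into one sorted, de-duplicated set when nothing is kept in stored source; the function names are hypothetical and the real reconstruction in Elasticsearch is more involved.

```python
def flatten_values(doc: dict, object_field: str, leaf_field: str) -> list:
    """Collect leaf values from an object or an array of objects, in document order."""
    obj = doc[object_field]
    objects = obj if isinstance(obj, list) else [obj]
    values = []
    for o in objects:
        v = o[leaf_field]
        values.extend(v if isinstance(v, list) else [v])
    return values


def synthesize_without_stored_source(doc: dict) -> dict:
    """Rebuild the source from the merged values alone: object boundaries are lost."""
    merged = sorted(set(flatten_values(doc, "obj", "str")))
    return {"obj": {"str": merged}}


indexed = {"obj": [{"str": ["cat", "dog"]}, {"str": "cat"}]}
print(synthesize_without_stored_source(indexed))
```

In this model the second object and its duplicate "cat" disappear, matching the result shown above. That is why a type without custom array handling still needs the stored copy when it sits inside an object array.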

@parkertimmins requested a review from lkts May 7, 2025 21:27

@martijnvg (Member) left a comment:

LGTM - Let's also backport this to 8.19 branch?

-"point": { "type": "geo_point" }
+"point": {
+  "type": "geo_point",
+  "synthetic_source_keep": "arrays"

Contributor commented:

Do you want to flip the points here to emphasize that order is different? Right now it happens to be a correct order.

@parkertimmins merged commit c04a956 into elastic:main May 9, 2025
16 of 17 checks passed
@parkertimmins deleted the parkertimmins/synthetic-source-geo-point-arrays branch May 9, 2025 19:49

@parkertimmins (Contributor, Author) commented:

💚 All backports created successfully

Status Branch Result
8.19

Questions?

Please refer to the Backport tool documentation

parkertimmins added a commit to parkertimmins/elasticsearch that referenced this pull request May 9, 2025
…astic#127796)

(cherry picked from commit c04a956)
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
…astic#127796)

Successfully merging this pull request may close these issues.

synthetic_source_keep: "arrays" overtriggers for fields that parse arrays
4 participants