Adding support for commas in field names #1896

masseyke · 2022-02-02T19:33:23Z

This commit adds support for commas in field names, which Elasticsearch allows.
Closes #1380

jbaiera · 2022-02-10T18:12:09Z

I'm wondering, would it be easier if we just HTTP encoded the data when concatenating it like this? Do we display concatenated information anywhere? If so, maybe we just HTTP encode the fields before concatenating them for the field setting?
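A minimal sketch of the idea suggested above, assuming the fields are joined with a plain comma. `EncodedFieldList`, `join`, and `split` are hypothetical names used only for illustration and are not the `StringUtils` methods touched by this PR. Each field is percent-encoded with the JDK's `URLEncoder` before joining, so a comma inside a name becomes `%2C` and can no longer be mistaken for the delimiter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the code in this PR.
class EncodedFieldList {

    // Percent-encode each field, then join with the plain comma delimiter.
    static String join(List<String> fields) {
        List<String> encoded = new ArrayList<>();
        for (String field : fields) {
            encoded.add(encode(field));
        }
        return String.join(",", encoded);
    }

    // Split on the delimiter first, then decode each token.
    static List<String> split(String joined) {
        List<String> fields = new ArrayList<>();
        for (String token : joined.split(",", -1)) {
            fields.add(decode(token));
        }
        return fields;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

With this sketch, joining the fields `a,b` and `c` yields `a%2Cb,c`, and splitting that string returns the two original names. The trade-off is that the stored setting is only readable after decoding.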

masseyke · 2022-02-10T19:02:17Z

I had thought about base64 encoding them but figured it would be nice to keep them readable. HTTP encoding would be a good balance. I'll see if there are any problems with that.

masseyke · 2022-02-10T20:07:13Z

That works nicely for my tests, and is a good bit less code. The only downside is that it runs on every call to tokenize() and concatenate(), rather than only on calls where a string contains the delimiter. I'm not sure if we'll run into any cases where that will hurt us (for example, things getting double-HTTP-encoded or something like that).
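To make the double-encoding concern concrete, here is a small sketch using the JDK's `URLEncoder` and `URLDecoder`. The class name `DoubleEncodingDemo` is hypothetical, and this is not code from the PR. It only shows that a value encoded twice does not come back after a single decode.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; illustrates the concern, not the PR's code.
class DoubleEncodingDemo {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String once = encode("a,b");   // a%2Cb
        String twice = encode(once);   // a%252Cb, because '%' is itself encoded
        // A single decode of the twice-encoded value yields a%2Cb, not a,b.
        System.out.println(once + " " + twice + " " + decode(twice));
    }
}
```

If a caller passes an already-encoded string through an encoding concatenate, the matching tokenize would hand back the encoded form rather than the original name.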

masseyke · 2022-02-10T21:02:41Z

Actually the http encoding wound up making me a little too nervous. It broke one unit test (HttpEncodingToolsTest). But worse, it does something that I don't think most people would expect of methods called tokenize and concatenate (they're both public methods). So for now I've reverted back to the escaping commas solution, unless someone can think of something better.

jbaiera · 2022-05-02

Should we pick this back up? We should probably avoid letting it fall through the cracks.

jbaiera · 2022-05-02T19:10:33Z

mr/src/main/java/org/elasticsearch/hadoop/util/StringUtils.java

+        boolean inQuotedToken = false;
+        for (char character : string.toCharArray()) {
+            if (character == '\"') {
+                inQuotedToken = !inQuotedToken;


I still think just going in and encoding/decoding the fields where needed would be simpler. For instance, this logic now breaks if we include quotation marks in the field names. Granted, that's probably even more unlikely than commas, but if we're talking about respecting the tokenize method contracts, it's still broken.
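The failure mode described above can be reproduced with a simplified tokenizer. Only the `inQuotedToken` toggle comes from the diff hunk; the surrounding loop, the class name `QuoteToggleTokenizer`, and the choice to drop the quote characters are assumptions made for this sketch, so the real `StringUtils` code may behave differently in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical reconstruction; only the quote toggle is from the diff.
class QuoteToggleTokenizer {

    static List<String> tokenize(String string) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotedToken = false;
        for (char character : string.toCharArray()) {
            if (character == '"') {
                // Every quotation mark flips the state, including one that
                // is really part of a field name.
                inQuotedToken = !inQuotedToken;
            } else if (character == ',' && !inQuotedToken) {
                // An unquoted comma ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(character);
            }
        }
        tokens.add(current.toString());
        return tokens;
    }
}
```

In this sketch the input `"a,b",c` splits into `a,b` and `c` as intended, but the input `a"b,c` (a field name containing a single quotation mark, followed by `c`) comes back as the single token `ab,c`, because the stray quote leaves the tokenizer in the quoted state.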

masseyke · 2022-05-02

I'll have to come back to this when I have a little time to get into it. I don't remember exactly what I meant by "it does something that I don't think most people would expect of methods called tokenize and concatenate", but I remember it being bad enough that it seemed like a deal-killer for encoding/decoding the fields.

Adding support for commas in field names

9717105

masseyke added v8.2.0 :Core feature labels Feb 3, 2022

masseyke marked this pull request as ready for review February 3, 2022 19:46

masseyke requested a review from jbaiera February 7, 2022 17:07

Trying to force a build

a333af2

masseyke added 3 commits February 10, 2022 14:12

Switching to http encoding

1ee6992

optimizing imports

b6be271

Reverting back to the escaped commas solution

7bd4997

jbaiera added v8.3.0 and removed v8.2.0 labels May 2, 2022

jbaiera reviewed May 2, 2022

mark-vieira changed the base branch from master to main May 6, 2022 17:01

