Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for commas in field names #1896

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

masseyke
Copy link
Member

@masseyke masseyke commented Feb 2, 2022

This commit adds support for commas in field names, which is allowed by Elasticsearch.
Closes #1380

@masseyke masseyke marked this pull request as ready for review February 3, 2022 19:46
@masseyke masseyke requested a review from jbaiera February 7, 2022 17:07
@jbaiera
Copy link
Member

jbaiera commented Feb 10, 2022

I'm wondering, would it be easier if we just HTTP encoded the data when concatenating it like this? Do we display concatenated information anywhere? If so, maybe we just HTTP encode the fields before concatenating them for the field setting?

@masseyke
Copy link
Member Author

I had thought about base64 encoding them but figured it would be nice to keep them readable. HTTP encoding would be a good balance. I'll see if there are any problems with that.

@masseyke
Copy link
Member Author

That works nicely for my tests, and is a good bit less code. The only downside is that it runs on every call to tokenize() and concatenate() rather than only calls on strings with the delimiter. I'm not sure if we'll run into any cases where that will hurt us (for example things getting double-http-encoded or something like that).

@masseyke
Copy link
Member Author

Actually the http encoding wound up making me a little too nervous. It broke one unit test (HttpEncodingToolsTest). But worse, it does something that I don't think most people would expect of methods called tokenize and concatenate (they're both public methods). So for now I've reverted back to the escaping commas solution, unless someone can think of something better.

@jbaiera jbaiera added v8.3.0 and removed v8.2.0 labels May 2, 2022
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pick this back up? Should probably avoid letting it fall between the cracks.

boolean inQuotedToken = false;
for (char character : string.toCharArray()) {
if (character == '\"') {
inQuotedToken = !inQuotedToken;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think just going in an encoding/decoding the fields where needed would be simpler. For instance, this logic now breaks if we include quotation marks in the field names. Granted, that's probably even more unlikely to happen than commas, but if we're talking about respecting the tokenize method contracts, it's still broken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to come back to this when I have a little time to get into it. I don't remember exactly what I meant by

it does something that I don't think most people would expect of methods called tokenize and concatenate

But I remember it being bad enough that it seemed like a deal-killer for encoding/decoding the fields.

@mark-vieira mark-vieira changed the base branch from master to main May 6, 2022 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ES Hadoop Library doesn't support fields with commas in them
2 participants