-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise constraints on name/ID. #483
Conversation
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
8739811
to
3be28f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely an improvement on the dataset name, but still kind of concerning on the node name requirement for data repos as it puts us in some tough situations. Take the Economy (GDP per Capita)
column. If parentheses aren't allowed, we have to somehow force it into shape by converting invalid characters to valid ones, which is why you see the name of Economy--GDP-per-Capita-
. And then when you load and work with the data, that's what you're coding against. I understand we want end users to generate these semantically from what they know about the data and avoid these sorts of things, but the constraint still feels limiting. Did the JSON-LD @id
field not prove beneficial (I admit I didn't look into it myself). It does seem a little weird to use a JSON-based standard the relies on XML naming standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@@ -178,9 +177,16 @@ def validate_name(self): | |||
self.add_error( | |||
f'The identifier "{name}" is too long (>{_MAX_ID_LENGTH} characters).' | |||
) | |||
regex = re.compile(rf"^{ID_REGEX}$") | |||
if self.ctx.is_v0(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for the unfamiliar reader it might be useful to add a comment specifying that we don't enforce this on jsonld conforming to <1.0 ? ctx.is_v0
is not necessarily very explicative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx.is_v0
is likely to appear many times in the code, before we do the refacto of a config-based parsing of the JSON-LD. So I think it should be explicit without needing a comment each time we use it. What would you change to make it more explicit while still readable/short?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just a comment line above? Not sure though, maybe it is unnecessary. You call :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acknowledge.
I would rather change everywhere ctx.is_v0
is used by a more explicit wording. Otherwise, we'd have to add a comment at each occurrence. We can think of it and do the change in another PR.
if not regex.match(name): | ||
self.add_error(f'The identifier "{name}" contains forbidden characters.') | ||
self.add_error( | ||
f'The identifier "{name}" contains forbidden characters. Make sure' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is correct only if conforms_to >= 1, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be correct in both cases as we check a regular expression in both cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I probably understand the code wrongly.
I thought only conforms_to => 1
would conform to the guidelines in https://www.w3.org/TR/xml/#sec-common-syn (e.g. r"^[:A-Z_a-z][:A-Z_a-z-.0-9]*$", etc).
While r"[a-zA-Z0-9\-_\.]+", which is checked for conforms_to<1
, would not conform to https://www.w3.org/TR/xml-id
Is this not correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct! Didn't understand you were questioning the URL in the error.
' "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"' | ||
" is too long (>255 characters)." | ||
' "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"' | ||
" is too long (>256 characters)." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂
3be28f5
to
38a2875
Compare
38a2875
to
1274a28
Compare
Closing this PR as |
Fixes: #449