Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise constraints on name/ID. #483

Closed
wants to merge 2 commits into from
Closed

Conversation

marcenacp
Copy link
Contributor

@marcenacp marcenacp commented Feb 5, 2024

Fixes: #449

@marcenacp marcenacp requested a review from a team as a code owner February 5, 2024 18:31
Copy link

github-actions bot commented Feb 5, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@marcenacp marcenacp force-pushed the feature/spec-changes-4 branch from 8739811 to 3be28f5 Compare February 5, 2024 19:30
Copy link
Contributor

@goeffthomas goeffthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely an improvement on the dataset name, but still kind of concerning on the node name requirement for data repos as it puts us in some tough situations. Take the Economy (GDP per Capita) column. If parentheses aren't allowed, we have to somehow force it into shape by converting invalid characters to valid ones, which is why you see the name of Economy--GDP-per-Capita-. And then when you load and work with the data, that's what you're coding against. I understand we want end users to generate these semantically from what they know about the data and avoid these sorts of things, but the constraint still feels limiting. Did the JSON-LD @id field not prove beneficial (I admit I didn't look into it myself). It does seem a little weird to use a JSON-based standard the relies on XML naming standards.

Copy link
Contributor

@ccl-core ccl-core left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -178,9 +177,16 @@ def validate_name(self):
self.add_error(
f'The identifier "{name}" is too long (>{_MAX_ID_LENGTH} characters).'
)
regex = re.compile(rf"^{ID_REGEX}$")
if self.ctx.is_v0():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for the unfamiliar reader it might be useful to add a comment specifying that we don't enforce this on jsonld conforming to <1.0 ? ctx.is_v0 is not necessarily very explicative

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.is_v0 is likely to appear many times in the code, before we do the refacto of a config-based parsing of the JSON-LD. So I think it should be explicit without needing a comment each time we use it. What would you change to make it more explicit while still readable/short?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just a comment line above? Not sure though, maybe it is unnecessary. You call :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledge.

I would rather change everywhere ctx.is_v0 is used by a more explicit wording. Otherwise, we'd have to add a comment at each occurrence. We can think of it and do the change in another PR.

if not regex.match(name):
self.add_error(f'The identifier "{name}" contains forbidden characters.')
self.add_error(
f'The identifier "{name}" contains forbidden characters. Make sure'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is correct only if conforms_to >= 1, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be correct in both cases as we check a regular expression in both cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I probably understand the code wrongly.

I thought only conforms_to => 1 would conform to the guidelines in https://www.w3.org/TR/xml/#sec-common-syn (e.g. r"^[:A-Z_a-z][:A-Z_a-z-.0-9]*$", etc).

While r"[a-zA-Z0-9\-_\.]+", which is checked for conforms_to<1, would not conform to https://www.w3.org/TR/xml-id

Is this not correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct! Didn't understand you were questioning the URL in the error.

' "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"'
" is too long (>255 characters)."
' "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"'
" is too long (>256 characters)."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂

@marcenacp marcenacp force-pushed the feature/spec-changes-4 branch from 3be28f5 to 38a2875 Compare February 6, 2024 14:59
@marcenacp marcenacp force-pushed the feature/spec-changes-4 branch from 38a2875 to 1274a28 Compare February 6, 2024 16:48
@marcenacp
Copy link
Contributor Author

Closing this PR as @id replaces name. We will re-open another PR to change constraints on names.

@marcenacp marcenacp closed this Feb 26, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Release constraints on name
3 participants