Revise constraints on name/ID. #483

marcenacp · 2024-02-05T18:31:01Z

Fixes: #449

github-actions · 2024-02-05T18:31:16Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

goeffthomas

This is definitely an improvement on the dataset name, but the node name requirement is still concerning for data repos because it puts us in some tough situations. Take the Economy (GDP per Capita) column. If parentheses aren't allowed, we have to force it into shape by converting invalid characters to valid ones, which is why you see the name Economy--GDP-per-Capita-. And then when you load and work with the data, that's what you're coding against. I understand we want end users to generate these semantically from what they know about the data and avoid these sorts of things, but the constraint still feels limiting. Did the JSON-LD @id field not prove beneficial? (I admit I didn't look into it myself.) It does seem a little weird for a JSON-based standard to rely on XML naming standards.
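
To make the concern concrete, here is a minimal sketch of the kind of character replacement described above. It assumes the allowed character set is the pre-1.0 one quoted further down in this review (`[a-zA-Z0-9\-_\.]`), and `sanitize_name` is a hypothetical helper, not a function from mlcroissant.

```python
import re

# Assumption: allowed characters follow the pre-1.0 pattern quoted in this review.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9\-_\.]")


def sanitize_name(raw: str) -> str:
    """Hypothetical helper: replace every disallowed character with a hyphen."""
    return _INVALID_CHARS.sub("-", raw)


print(sanitize_name("Economy (GDP per Capita)"))  # Economy--GDP-per-Capita-
```

Each space and parenthesis becomes its own hyphen, which is where the doubled and trailing hyphens in the resulting name come from.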

ccl-core

Thanks!

ccl-core · 2024-02-06T08:55:43Z

python/mlcroissant/mlcroissant/_src/structure_graph/base_node.py

@@ -178,9 +177,16 @@ def validate_name(self):
            self.add_error(
                f'The identifier "{name}" is too long (>{_MAX_ID_LENGTH} characters).'
            )
-        regex = re.compile(rf"^{ID_REGEX}$")
+        if self.ctx.is_v0():


Maybe for the unfamiliar reader it would be useful to add a comment specifying that we don't enforce this on JSON-LD conforming to <1.0? ctx.is_v0 is not necessarily self-explanatory.

ctx.is_v0 is likely to appear many times in the code before we refactor to a config-based parsing of the JSON-LD. So I think it should be explicit without needing a comment each time we use it. What would you change to make it more explicit while still readable and short?

Maybe just a comment line above? Not sure though, maybe it is unnecessary. Your call :)

Acknowledged.

I would rather replace ctx.is_v0 with more explicit wording everywhere it is used. Otherwise, we'd have to add a comment at each occurrence. We can think about it and make the change in another PR.

ccl-core · 2024-02-06T08:56:18Z

python/mlcroissant/mlcroissant/_src/structure_graph/base_node.py

        if not regex.match(name):
-            self.add_error(f'The identifier "{name}" contains forbidden characters.')
+            self.add_error(
+                f'The identifier "{name}" contains forbidden characters. Make sure'


Although this is correct only if conforms_to >= 1, right?

This should be correct in both cases, as we check a regular expression in both cases.

Then I have probably misunderstood the code.

I thought only conforms_to >= 1 would conform to the guidelines in https://www.w3.org/TR/xml/#sec-common-syn (e.g. r"^[:A-Z_a-z][:A-Z_a-z-.0-9]*$", etc).

While r"[a-zA-Z0-9\-_\.]+", which is checked for conforms_to < 1, would not conform to https://www.w3.org/TR/xml-id.

Is this not correct?

This is correct! I didn't realize you were questioning the URL in the error.
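
For reference, the difference between the two patterns quoted in this thread can be checked with a short sketch. The patterns are copied from the comments above with the anchors dropped in favour of `fullmatch`; the names `V0_PATTERN`, `V1_PATTERN`, and `check` are made up for illustration and are not part of mlcroissant.

```python
import re

# Patterns as quoted in this thread; anchors are dropped because fullmatch is used.
V0_PATTERN = re.compile(r"[a-zA-Z0-9\-_\.]+")          # checked for conforms_to < 1
V1_PATTERN = re.compile(r"[:A-Z_a-z][:A-Z_a-z-.0-9]*")  # XML-style name, conforms_to >= 1


def check(name: str) -> dict:
    """Report whether a name is accepted by each of the two patterns."""
    return {
        "v0": bool(V0_PATTERN.fullmatch(name)),
        "v1": bool(V1_PATTERN.fullmatch(name)),
    }


for candidate in ["1st_column", "ns:field", "Economy--GDP-per-Capita-"]:
    print(candidate, check(candidate))
```

Under these assumptions, a leading digit is accepted only by the older pattern, a colon only by the XML-style one, and the hyphenated column name from the earlier comment passes both.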

ccl-core · 2024-02-06T08:56:52Z

python/mlcroissant/mlcroissant/_src/structure_graph/base_node_test.py

-                ' "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"'
-                " is too long (>255 characters)."
+                ' "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"'
+                " is too long (>256 characters)."


marcenacp · 2024-02-26T08:48:11Z

Closing this PR as @id replaces name. We will re-open another PR to change constraints on names.

marcenacp requested a review from a team as a code owner February 5, 2024 18:31

marcenacp force-pushed the feature/spec-changes-4 branch from 8739811 to 3be28f5 Compare February 5, 2024 19:30

goeffthomas reviewed Feb 6, 2024

View reviewed changes

ccl-core approved these changes Feb 6, 2024

View reviewed changes

marcenacp force-pushed the feature/spec-changes-4 branch from 3be28f5 to 38a2875 Compare February 6, 2024 14:59

marcenacp added 2 commits February 6, 2024 16:48

Revise constraints on name/ID.

07ddb95

Change the error message between v0 and v1.

1274a28

marcenacp force-pushed the feature/spec-changes-4 branch from 38a2875 to 1274a28 Compare February 6, 2024 16:48

marcenacp closed this Feb 26, 2024

github-actions bot locked and limited conversation to collaborators Feb 26, 2024
