
Define multiple record sets deriving from the same node #458

Closed. ccl-core wanted to merge 43 commits from the ccl-core-26 branch.

Conversation

@ccl-core (Contributor) commented Jan 22, 2024

This is needed in order to have multiple RecordSets (e.g. derived from different ReadFields operations) that all derive from the same Read operation. As an example, see the coco2014-mini dataset below.

[Figure: operation graph for the coco2014-mini dataset, with multiple RecordSets deriving from the same Read operation.]

@ccl-core requested a review from a team as a code owner January 22, 2024 15:33

github-actions bot commented Jan 22, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core requested a review from marcenacp January 22, 2024 15:50
@marcenacp (Contributor) left a comment

Thanks, LGTM! I think there's possibly a simplification, because Read operations should not have to know about the RecordSets (their children).

# Multiple edges (e.g. to ReadFields operations) could generate from a Read
# operation.
other_candidates = [
    operation for operation in self.nodes if str(operation).startswith("Read")
]
@marcenacp (Contributor):

isinstance(operation, Read) or isinstance(operation, ReadFields)

(or something similar)
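
A minimal sketch of what that isinstance-based filter could look like, assuming `Read` and `ReadFields` are importable here (the cyclic-dependency concern raised in the reply below notwithstanding):

# Sketch only: filter operations by type instead of by string prefix.
# Assumes Read and ReadFields can be imported in this module.
other_candidates = [
    operation
    for operation in self.nodes
    if isinstance(operation, (Read, ReadFields))
]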

@ccl-core (Contributor, Author) replied:

I thought about this when implementing; in the end I didn't do it because I didn't want to restructure the scripts to avoid cyclic dependencies (ReadFields and Read depend on Operation, which is defined in the same script as Operations, i.e. base_operation.py).
But sure, I can move the Operations class to a new file, something like operations.py. WDYT?

@@ -59,7 +59,10 @@ def execute_operations_sequentially(record_set: str, operations: Operations):
         if previous_operation in results
     ]
     logging.info("Executing %s", operation)
-    results[operation] = operation(*previous_results)
+    if isinstance(operation, Read):
+        results[operation] = operation(*previous_results, record_set=record_set)  # type: ignore  # Force mypy types.
+    else:
+        results[operation] = operation(*previous_results)
@marcenacp (Contributor):

The Read operation should not need to know about the RecordSet, but only about the FileSets/Files it reads.

@ccl-core (Contributor, Author) replied:

I made some changes, WDYT?

"""Parsed all JSONs defined in the fields of RecordSet and outputs a pd.DF."""
series = {}
for field in fields:
# Only select fields relevant to the requested record_set (if given).
if record_set and not field.uid.startswith(record_set):
@marcenacp (Contributor):

Can you filter here at the field level (fields) rather than adding another parameter (record_set)?
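
A sketch of what that could look like at the call site; `fields_to_dataframe` is a hypothetical name for the parsing function quoted above:

# Sketch only: pre-filter the fields so the parsing function no longer
# needs a record_set parameter.
relevant_fields = [field for field in fields if field.uid.startswith(record_set)]
df = fields_to_dataframe(relevant_fields)  # hypothetical function name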

@ccl-core (Contributor, Author) replied:

Good point, changed. WDYT?

ccl-core and others added 24 commits January 23, 2024 09:08
This PR also fixes the GitHub Actions (some of them are broken and don't
trigger anymore).
At the time when Kaggle/HuggingFace/OpenML implemented their respective APIs, hashes weren't checked, so those integrations didn't implement hash checking in 0.8.

This PR makes checking hashes optional in 0.8 and mandatory in 1.0.
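
An illustrative sketch (not the PR's actual code) of the policy this commit describes; the function and parameter names are made up:

def validate_hash(expected: str | None, actual: str, conforms_to_1_0: bool) -> None:
    """Illustrative only: hashes are mandatory in 1.0, optional in 0.8."""
    if expected is None:
        if conforms_to_1_0:
            raise ValueError("Croissant 1.0 requires a hash for every downloaded file.")
        return  # 0.8: missing hashes are tolerated.
    if expected != actual:
        raise ValueError(f"Hash mismatch: expected {expected}, got {actual}.")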
We've fully launched our integration to the public, and I'm preparing a forum post on Kaggle that will link to this documentation, so I'm freshening it up a little bit.

- Expand on feature details
- Fix typo in name
…use `fileObject` / `fileSet` instead. (#473)
…s`, `rdf`, etc. (#474)

This will allow us to always know the version of Croissant
(`ctx.conforms_to`) that is being used.
Added the `jsonl` output to support the Kaggle base64 example, even
though it can only be run with the correct environment variables
configured. I have that set up locally, and the tests pass after the
change to `download.py`. If the preference is to remove the commented-out
test case, I can do that. We could set up a robot account for use in
the GH actions/CI tests, but the tests would still break for developers
unless they set up their Kaggle creds locally.

Also includes a drive-by fix, as the logs from the failed tests (while I
was developing) printed commands before we accounted for a parameterized
`version`.

Addresses #471
marcenacp and others added 17 commits February 7, 2024 16:25
Before: 675.2771615982056 seconds
After: 11.278749465942383 seconds
The notebook tests fail for the TFDS CroissantBuilder, which tries to get `metadata.citation` instead of `metadata.cite_as` to populate the citation field in its DatasetInfo. I will merge this PR despite the failing tests and will fix the CroissantBuilder on the TFDS side.
The uploadCsv test was sometimes failing to verify that the Description
field was modified.

One of the possible issues was not finding the Description field to type
into and check, as it wouldn't be visible in the current view. This was
also made more likely by warnings to migrate usages of
`st.experimental_get_query_params`
and `st.experimental_set_query_params`, which are now replaced with
`st.query_params` (see the sketch below).
The test itself now waits after clicking the "Edit fields" button, so
that the necessary input fields appear, instead of waiting after
typing, and modifies the element lookup.
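
For reference, a minimal sketch of the Streamlit migration this commit mentions; the `recordSet` key is made up for the example:

import streamlit as st

# Deprecated: st.experimental_get_query_params() / st.experimental_set_query_params().
# Replacement: st.query_params, a dict-like accessor.
current = st.query_params.get("recordSet")  # read a query parameter
st.query_params["recordSet"] = "captions"   # write a query parameter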
Raise ValueError also for live datasets when hashes of downloaded files
do not match
is_ancestor(field1, node2, ancestor_leaf) for field1 in node1.fields
)

if hasattr(ancestor_leaf, "is_read_operation") and isinstance(node2, Field):
@marcenacp (Contributor):

If Read overrode an is_ancestor method, we could avoid this workaround :)
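
A hypothetical sketch of that suggestion; the method body and signature are invented for illustration, not taken from the PR:

class Read(Operation):
    # Sketch only: let Read answer ancestry queries itself, so callers can
    # drop the hasattr(ancestor_leaf, "is_read_operation") workaround.
    def is_ancestor(self, node) -> bool:
        # Treat a Read operation as an ancestor of any Field it feeds.
        return isinstance(node, Field)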

    A dictionary of RecordSet names, and the Fields relative to that RecordSet.
"""
recordset_to_fields = {}
for field in file_obejct.successors:
@marcenacp (Contributor):
file_object

Returns:
    A dictionary of RecordSet names, and the Fields relative to that RecordSet.
"""
recordset_to_fields = {}
@marcenacp (Contributor):

Use a defaultdict?
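
A minimal sketch of the defaultdict suggestion; how a field maps to its RecordSet name is assumed for the example:

from collections import defaultdict

# Appending to a missing key creates the list automatically, so no
# explicit key checks or setdefault calls are needed.
recordset_to_fields = defaultdict(list)
for field in file_object.successors:
    # Assumption for the sketch: each field knows its parent RecordSet.
    recordset_to_fields[field.parent.name].append(field)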

@marcenacp (Contributor) commented:

Closing this PR as it is now possible to define multiple RecordSets deriving from the same node.

@marcenacp closed this Mar 12, 2024
github-actions bot locked and limited conversation to collaborators Mar 12, 2024
@ccl-core deleted the ccl-core-26 branch June 24, 2024 14:50