
Define multiple record sets deriving from the same node #458

Closed. ccl-core wanted to merge 43 commits from the ccl-core-26 branch.

Conversation

@ccl-core (Contributor) commented Jan 22, 2024

This is needed in order to have multiple RecordSets (e.g. derived from different ReadFields operations) that all derive from the same Read operation. As an example, see the coco2014-mini dataset below.

[Figure: operation graph for the coco2014-mini dataset, with multiple RecordSets deriving from the same Read operation.]

@ccl-core requested a review from a team as a code owner January 22, 2024 15:33

github-actions bot commented Jan 22, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core requested a review from marcenacp January 22, 2024 15:50
@marcenacp (Contributor) left a comment

Thanks, LGTM! I think there's possibly a simplification, because Read operations should not have to know about the RecordSets (their children).

# Multiple edges (e.g. to ReadFields operations) could generate from a Read
# operation.
other_candidates = [
    operation for operation in self.nodes if str(operation).startswith("Read")
]
@marcenacp (Contributor):

isinstance(operation, Read) or isinstance(operation, ReadFields)

(or something similar)
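
A minimal sketch of what that isinstance-based filter could look like, assuming `Read` and `ReadFields` are importable here (the cyclic-dependency concern raised in the reply below notwithstanding):

# Sketch only: filter operations by type instead of by string prefix.
# Assumes Read and ReadFields can be imported in this module.
other_candidates = [
    operation
    for operation in self.nodes
    if isinstance(operation, (Read, ReadFields))
]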

@ccl-core (Contributor, Author) replied:

I thought about this when implementing; in the end I didn't do it because I didn't want to restructure the scripts to avoid cyclic dependencies (ReadFields and Read depend on Operation, which is defined in the same script as Operations, i.e. base_operation.py).
But sure, I can move the Operations class to a new file, something like operations.py. WDYT?

@@ -59,7 +59,10 @@ def execute_operations_sequentially(record_set: str, operations: Operations):
         if previous_operation in results
     ]
     logging.info("Executing %s", operation)
-    results[operation] = operation(*previous_results)
+    if isinstance(operation, Read):
+        results[operation] = operation(*previous_results, record_set=record_set)  # type: ignore  # Force mypy types.
+    else:
+        results[operation] = operation(*previous_results)
@marcenacp (Contributor):

The Read operation should not need to know about the RecordSet, but only about the FileSets/Files it reads.

@ccl-core (Contributor, Author) replied:

I made some changes, WDYT?

"""Parsed all JSONs defined in the fields of RecordSet and outputs a pd.DF."""
series = {}
for field in fields:
# Only select fields relevant to the requested record_set (if given).
if record_set and not field.uid.startswith(record_set):
@marcenacp (Contributor):

Can you filter here at the field level (fields) rather than adding another parameter (record_set)?
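
A sketch of what that could look like at the call site; `fields_to_dataframe` is a hypothetical name for the parsing function quoted above:

# Sketch only: pre-filter the fields so the parsing function no longer
# needs a record_set parameter.
relevant_fields = [field for field in fields if field.uid.startswith(record_set)]
df = fields_to_dataframe(relevant_fields)  # hypothetical function name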

@ccl-core (Contributor, Author) replied:

Good point, changed. WDYT?

ccl-core and others added 24 commits January 23, 2024 09:08
This PR also fixes the GitHub Actions (some of them are broken and don't
trigger anymore).
At the time when Kaggle/HuggingFace/OpenML implemented their respective APIs, hashes weren't checked, so those integrations didn't implement hash checking in 0.8.

This PR makes checking hashes optional in 0.8 and mandatory in 1.0.
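
An illustrative sketch (not the PR's actual code) of the policy this commit describes; the function and parameter names are made up:

def validate_hash(expected: str | None, actual: str, conforms_to_1_0: bool) -> None:
    """Illustrative only: hashes are mandatory in 1.0, optional in 0.8."""
    if expected is None:
        if conforms_to_1_0:
            raise ValueError("Croissant 1.0 requires a hash for every downloaded file.")
        return  # 0.8: missing hashes are tolerated.
    if expected != actual:
        raise ValueError(f"Hash mismatch: expected {expected}, got {actual}.")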
We've fully launched our integration to the public, and I'm preparing a forum post on Kaggle that will link to this documentation, so I'm freshening it up a little bit.

- Expand on feature details
- Fix typo in name
…use `fileObject` / `fileSet` instead. (#473)
…s`, `rdf`, etc. (#474)

This will allow us to always know the version of Croissant
(`ctx.conforms_to`) that is being used.
Added the `jsonl` output to support the Kaggle base64 example, even
though it can only be run with the correct environment variables
configured. I have that set up locally, and the tests pass after the
change to `download.py`. If the preference is to remove the commented-out
test case, I can do that. We could set up a robot account for use in
the GH actions/CI tests, but the tests would still break for developers
unless they set up their Kaggle creds locally.

Also includes a drive-by fix, as the logs from the failed tests (while I
was developing) printed commands before we accounted for a parameterized
`version`.

Addresses #471
marcenacp and others added 17 commits February 7, 2024 16:25
Before: 675.2771615982056 seconds
After: 11.278749465942383 seconds
The notebook tests fail for the TFDS CroissantBuilder, which tries to get `metadata.citation` instead of `metadata.cite_as` to populate the citation field in its DatasetInfo. I will merge this PR despite the failing tests and will fix the CroissantBuilder on the TFDS side.
The uploadCsv test was sometimes failing to verify that the Description
field was modified.

One of the possible issues was not finding the Description field to type
into and check, as it wouldn't be visible in the current view. This was
also made more likely by warnings to migrate usages of
`st.experimental_get_query_params`
and `st.experimental_set_query_params`, which are now replaced with
`st.query_params` (see the sketch below).
The test itself now waits after clicking the "Edit fields" button, so
that the necessary input fields appear, instead of waiting after
typing, and modifies the element lookup.
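
For reference, a minimal sketch of the Streamlit migration this commit mentions; the `recordSet` key is made up for the example:

import streamlit as st

# Deprecated: st.experimental_get_query_params() / st.experimental_set_query_params().
# Replacement: st.query_params, a dict-like accessor.
current = st.query_params.get("recordSet")  # read a query parameter
st.query_params["recordSet"] = "captions"   # write a query parameter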
Raise ValueError also for live datasets when hashes of downloaded files
do not match
is_ancestor(field1, node2, ancestor_leaf) for field1 in node1.fields
)

if hasattr(ancestor_leaf, "is_read_operation") and isinstance(node2, Field):
@marcenacp (Contributor):

If Read overrode an is_ancestor method, we could avoid this workaround :)
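
A hypothetical sketch of that suggestion; the method body and signature are invented for illustration, not taken from the PR:

class Read(Operation):
    # Sketch only: let Read answer ancestry queries itself, so callers can
    # drop the hasattr(ancestor_leaf, "is_read_operation") workaround.
    def is_ancestor(self, node) -> bool:
        # Treat a Read operation as an ancestor of any Field it feeds.
        return isinstance(node, Field)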

    A dictionary of RecordSet names, and the Fields relative to that RecordSet.
"""
recordset_to_fields = {}
for field in file_obejct.successors:
@marcenacp (Contributor):
file_object

Returns:
    A dictionary of RecordSet names, and the Fields relative to that RecordSet.
"""
recordset_to_fields = {}
@marcenacp (Contributor):

Use a defaultdict?
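
A minimal sketch of the defaultdict suggestion; how a field maps to its RecordSet name is assumed for the example:

from collections import defaultdict

# Appending to a missing key creates the list automatically, so no
# explicit key checks or setdefault calls are needed.
recordset_to_fields = defaultdict(list)
for field in file_object.successors:
    # Assumption for the sketch: each field knows its parent RecordSet.
    recordset_to_fields[field.parent.name].append(field)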

@marcenacp (Contributor) commented:

Closing this PR as it is now possible to define multiple RecordSets deriving from the same node.

@marcenacp closed this Mar 12, 2024
github-actions bot locked and limited conversation to collaborators Mar 12, 2024
@ccl-core deleted the ccl-core-26 branch June 24, 2024 14:50