Define multiple record sets deriving from the same node #458
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Thanks, LGTM! I think there's possibly a simplification, because `Read` operations should not have to know about the RecordSets (children).
```python
# Multiple edges (e.g. to ReadFields operations) could generate from a Read
# operation.
other_candidates = [
    operation for operation in self.nodes if str(operation).startswith("Read")
]
```
`isinstance(operation, Read) or isinstance(operation, ReadFields)` (or something similar)
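For illustration, a minimal runnable sketch of that suggestion; the stand-in classes and the `nodes` list here are hypothetical, not the repository's real API:

```python
# Sketch: filter operations by type rather than by string prefix.
class Operation: ...
class Read(Operation): ...
class ReadFields(Operation): ...
class Download(Operation): ...

nodes = [Read(), ReadFields(), Download()]

# isinstance with a tuple of types covers both Read and ReadFields in one check.
other_candidates = [op for op in nodes if isinstance(op, (Read, ReadFields))]
assert len(other_candidates) == 2
```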
I thought about this when implementing; in the end I didn't do it because I didn't want to restructure the scripts to avoid cyclic dependencies (`ReadFields` and `Read` depend on `Operation`, which is defined in the same script as `Operations`, i.e. `base_operation.py`).
But sure, I can move the `Operations` class to a new file, something like `operations.py`. WDYT?
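A rough sketch of how that split could break the cycle; the file names follow the discussion, but the class bodies are placeholders, not the real code (comments mark the hypothetical file boundaries):

```python
# base_operation.py: only the base class lives here, so concrete
# operations can import it without pulling in Operations.
class Operation:
    """Base class for all operations in the graph."""

# read.py (imports Operation from base_operation):
class Read(Operation):
    """Reads a file or file set."""

class ReadFields(Operation):
    """Extracts the fields of a RecordSet."""

# operations.py (imports Operation, and may import Read/ReadFields):
# the container no longer shares a module with the base class, so the
# Read -> Operation import no longer forms a cycle.
class Operations:
    def __init__(self) -> None:
        self.nodes: list[Operation] = []
```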
```diff
@@ -59,7 +59,10 @@ def execute_operations_sequentially(record_set: str, operations: Operations):
             if previous_operation in results
         ]
         logging.info("Executing %s", operation)
-        results[operation] = operation(*previous_results)
+        if isinstance(operation, Read):
+            results[operation] = operation(*previous_results, record_set=record_set)  # type: ignore  # Force mypy types.
```
The `Read` operation should not need to know about the RecordSet, but only about the FileSets/Files it reads.
I made some changes, WDYT?
"""Parsed all JSONs defined in the fields of RecordSet and outputs a pd.DF.""" | ||
series = {} | ||
for field in fields: | ||
# Only select fields relevant to the requested record_set (if given). | ||
if record_set and not field.uid.startswith(record_set): |
Here, can you filter at the field level (`fields`) rather than adding another parameter (`record_set`)?
Good point, changed. WDYT?
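For illustration, a minimal sketch of filtering at the field level before the call, so the function never needs a `record_set` parameter; the `Field` class and the uids here are hypothetical stand-ins:

```python
import dataclasses

@dataclasses.dataclass
class Field:
    uid: str  # assumed shape: "<record_set>/<field_name>"

fields = [Field("images/id"), Field("images/url"), Field("captions/text")]

# Filter once at the call site instead of threading record_set through.
record_set = "images"
relevant_fields = [f for f in fields if f.uid.startswith(record_set)]
assert [f.uid for f in relevant_fields] == ["images/id", "images/url"]
```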
This PR also fixes the GitHub Actions (some of them are broken and don't trigger anymore).
At the time when Kaggle/HuggingFace/OpenML implemented their respective APIs, hashes weren't checked so they didn't implement it in 0.8. This PR makes checking hashes optional in 0.8 and mandatory in 1.0.
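As a rough illustration of that policy (not the library's actual implementation; `check_hash`, its signature, and the version strings are invented here), the version gate could look like:

```python
import hashlib

def check_hash(path: str, expected: str | None, conforms_to: str) -> None:
    """Verify a downloaded file's hash, gated on the Croissant version."""
    if expected is None:
        if conforms_to.endswith("1.0"):  # assumed version marker
            raise ValueError("file hashes are mandatory in Croissant 1.0")
        return  # optional in 0.8: accept datasets that never recorded hashes
    with open(path, "rb") as file:
        actual = hashlib.sha256(file.read()).hexdigest()
    if actual != expected:
        raise ValueError(f"hash mismatch: expected {expected}, got {actual}")
```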
…formsTo" or nothing. (#470)
We've fully launched our integration to the public, and I'm preparing a forum post on Kaggle that will link to this documentation, so I'm freshening it up a little bit.
- Expand on feature details
- Fix typo in name
…use `fileObject` / `fileSet` instead. (#473)
…s`, `rdf`, etc. (#474) This makes it possible to always know the version of Croissant (`ctx.conforms_to`) that is being used.
Added the `jsonl` output to support the Kaggle base64 example, even though it can only be run with the correct environment variables configured. I have that set up locally, and the tests pass after the change to `download.py`. If the preference is to remove the commented-out test case, I can do that. We could set up a robot account for use in the GH Actions/CI tests, but the tests would still break for developers unless they set up their Kaggle creds locally. Also included is a drive-by fix, as the logs from the failed tests (while I was developing) printed commands before we accounted for a parameterized `version`. Addresses #471.
Before: 675.2771615982056 seconds
After: 11.278749465942383 seconds
The notebook tests fail for the TFDS `CroissantBuilder`, which tries to get `metadata.citation` instead of `metadata.cite_as` to populate the citation field in its `DatasetInfo`. I will merge this PR despite the failing tests, and will fix the `CroissantBuilder` on the TFDS side.
The `uploadCsv` test was sometimes failing to verify that the Description field was modified. One of the possible issues was not finding the Description field to type into and check, as it wouldn't be visible in the current view. This was also made more likely by warnings to migrate usages of `st.experimental_get_query_params` and `st.experimental_set_query_params`, which are now replaced with `st.query_params`. The test itself now waits after clicking the "Edit fields" button, so that the necessary input fields appear, instead of waiting after typing; it also modifies the element lookup.
Raise `ValueError` also for live datasets when hashes of downloaded files do not match.
Changes
…eter (record_set)
```python
        is_ancestor(field1, node2, ancestor_leaf) for field1 in node1.fields
    )

    if hasattr(ancestor_leaf, "is_read_operation") and isinstance(node2, Field):
```
If `Read` overrode an `is_ancestor` method, we could avoid this workaround :)
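A minimal sketch of that idea with hypothetical stand-in classes (the real `is_ancestor` in the snippet above is a free function with more parameters):

```python
class Field: ...

class Operation:
    def is_ancestor(self, node: object) -> bool:
        # Default: operations claim no special ancestry.
        return False

class Read(Operation):
    def is_ancestor(self, node: object) -> bool:
        # A Read operation acts as an ancestor of any Field it feeds,
        # replacing the hasattr(..., "is_read_operation") workaround.
        return isinstance(node, Field)

assert Read().is_ancestor(Field())
assert not Operation().is_ancestor(Field())
```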
```python
        A dictionary of RecordSet names, and the Fields relative to that RecordSet.
    """
    recordset_to_fields = {}
    for field in file_obejct.successors:
```
`file_object`
```python
    Returns:
        A dictionary of RecordSet names, and the Fields relative to that RecordSet.
    """
    recordset_to_fields = {}
```
Use a `defaultdict`?
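For illustration, a sketch of that suggestion with made-up data; the real code iterates over graph successors, not tuples:

```python
from collections import defaultdict

# Hypothetical (record_set, field) pairs standing in for graph successors.
successors = [
    ("images", "images/id"),
    ("images", "images/url"),
    ("captions", "captions/text"),
]

recordset_to_fields: defaultdict[str, list[str]] = defaultdict(list)
for record_set, field in successors:
    # Missing keys get a fresh list, so no key-existence check is needed.
    recordset_to_fields[record_set].append(field)

assert recordset_to_fields["images"] == ["images/id", "images/url"]
```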
Closing this PR as it is now possible to define multiple RecordSets deriving from the same node. |
This is needed in order to have multiple RecordSets (deriving, e.g., from different `ReadFields` operations) that derive from the same `Read` operation. As an example, refer to the coco2014-mini dataset below.
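A toy sketch of the operation-graph shape this enables, with stand-in classes (the coco2014-mini example itself is not reproduced here):

```python
class Operation: ...
class Read(Operation): ...
class ReadFields(Operation): ...

# One shared Read feeding two ReadFields, one per RecordSet.
read = Read()
read_fields = {"images": ReadFields(), "captions": ReadFields()}
edges = [(read, rf) for rf in read_fields.values()]
assert len(edges) == 2  # both RecordSets derive from the same Read
```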