
Add json schema validation for source data at the interface level #1090

Merged: 17 commits merged into main on Sep 20, 2024

Conversation

h-mayorquin (Collaborator)

Should close #42.

@h-mayorquin h-mayorquin marked this pull request as ready for review September 19, 2024 00:48
@h-mayorquin (Collaborator, Author)

@CodyCBakerPhD

This is ready. Three things:

  1. Some of the typing was missing. Some arguments were annotated as value: float = None, and I had to mark them as Optional because the JSON schema would otherwise fail to validate. I am surprised that Pydantic missed that. It adds to the point you made before that the two have different strengths, maybe?
  2. I created a way to make explicit which source arguments are used to initialize an extractor in the extractor interfaces. The problem is that we pass everything to the super init, where it is validated, but sometimes we modify the argument names within the interface and the JSON validation fails. I think it is better to have this explicit anyway (explicit is better than implicit).
  3. Some of the typing was underspecified. For example, in the MockRecording the annotation for durations was tuple[int], but it needs to be tuple[int, ...] to indicate a variable-length tuple.
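The two annotation issues above (points 1 and 3) can be sketched as follows; the function name `set_conversion_factor` is purely illustrative, not NeuroConv API:

```python
from typing import Optional, get_args

# Point 1: `value: float = None` contradicts itself (None is not a float),
# so a JSON schema generated from the annotation rejects the default.
# Marking the argument Optional makes None an allowed value:
def set_conversion_factor(value: Optional[float] = None) -> None:
    """Previously annotated as `value: float = None`."""

# Point 3: tuple[int] means "a tuple of exactly one int";
# tuple[int, ...] means "a tuple of ints of any length":
assert get_args(tuple[int]) == (int,)
assert get_args(tuple[int, ...]) == (int, Ellipsis)
```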

@@ -23,7 +23,7 @@ Convert LightningPose pose estimation data to NWB using :py:class:`~neuroconv.da
>>> labeled_video_file_path = str(folder_path / "labeled_videos/test_vid_labeled.mp4")

>>> converter = LightningPoseConverter(file_path=file_path, original_video_file_path=original_video_file_path, labeled_video_file_path=labeled_video_file_path, verbose=False)

Source data is valid!
Member

Here I was thinking you despised verbosity in the testing suite 😉

Collaborator (Author)

I do, but I did not want to change it there because I saw that @pauladkisson explicitly set verbose=True, so I went with what I assumed was the intent.

@CodyCBakerPhD (Member)

  1. Some of the typing was missing. Some arguments were annotated as value: float = None, and I had to mark them as Optional because the JSON schema would otherwise fail to validate. I am surprised that Pydantic missed that. It adds to the point you made before that the two have different strengths, maybe?

I too am surprised by this, but maybe not so much. docval in HDMF behaves similarly: you don't need to indicate that None is a possible value for an argument even when None is exactly its default. I think Pydantic does not consider the typing-vs-default mismatch a fundamental incompatibility, since the validation call only checks whether user-specified (not default) input matches the annotated type. If the user explicitly passes the same value as the default, I don't know what would happen either, but I suppose it would be ignored.
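A minimal sketch of the behavior being discussed, assuming pydantic v2 is installed; `f` is a toy function, not NeuroConv code:

```python
from pydantic import ValidationError, validate_call

@validate_call
def f(value: float = None):  # annotation and default disagree
    return value

f()           # passes: validate_call does not validate default values
f(value=1.0)  # passes: user input matches the annotation

try:
    f(value=None)  # user-supplied None IS checked against `float`
except ValidationError:
    print("explicit None rejected")
```

This is consistent with the point above: the mismatch only surfaces when a consumer (like a generated JSON schema) takes the annotation at face value.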

  2. I created a way to make explicit which source arguments are used to initialize an extractor in the extractor interfaces. The problem is that we pass everything to the super init, where it is validated, but sometimes we modify the argument names within the interface and the JSON validation fails. I think it is better to have this explicit anyway (explicit is better than implicit).

Nice - yeah, that has always proved a subtle pain point, since we do use additional kwargs in the interface __init__ that are not part of the extractor kwargs, usually to perform 'extra' metadata or Neo-related operations.

  3. Some of the typing was underspecified. For example, in the MockRecording the annotation for durations was tuple[int], but it needs to be tuple[int, ...] to indicate a variable-length tuple.

Good catches - accurate validation must rely on accurate annotations, which until now have only been used as a form of developer communication.

@CodyCBakerPhD (Member)

Just beware: this (like the Pydantic one before it) could potentially break custom conversion pipelines using latest/dev NeuroConv by requiring accurate annotations on every interface __init__.

@h-mayorquin (Collaborator, Author)

Just beware: this (like the Pydantic one before it) could potentially break custom conversion pipelines using latest/dev NeuroConv by requiring accurate annotations on every interface __init__.

This is a very important point.

Just to be clear: if they were using NWBConverter before, this PR should not change things, right? Validation was already done there:

    @classmethod
    def get_source_schema(cls) -> dict:
        """Compile input schemas from each of the data interface classes."""
        source_schema = get_base_schema(
            root=True,
            id_="source.schema.json",
            title="Source data schema",
            description="Schema for the source data, files and directories",
            version="0.1.0",
        )
        for interface_name, data_interface in cls.data_interface_classes.items():
            source_schema["properties"].update({interface_name: unroot_schema(data_interface.get_source_schema())})
        return source_schema

    @classmethod
    def validate_source(cls, source_data: dict[str, dict], verbose: bool = True):
        """Validate source_data against Converter source_schema."""
        cls._validate_source_data(source_data=source_data, verbose=verbose)

    def _validate_source_data(self, source_data: dict[str, dict], verbose: bool = True):
        encoder = _NWBSourceDataEncoder()
        # The encoder produces a serialized object, so we deserialize it for comparison
        serialized_source_data = encoder.encode(source_data)
        decoded_source_data = json.loads(serialized_source_data)
        validate(instance=decoded_source_data, schema=self.get_source_schema())
        if verbose:
            print("Source data is valid!")

    @validate_call
    def __init__(self, source_data: dict[str, dict], verbose: bool = True):
        """Validate source_data against source_schema and initialize all data interfaces."""
        self.verbose = verbose
        self._validate_source_data(source_data=source_data, verbose=self.verbose)
        self.data_interface_objects = {
            name: data_interface(**source_data[name])
            for name, data_interface in self.data_interface_classes.items()
            if name in source_data
        }
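The core of that validation path is the jsonschema `validate` call. A standalone sketch, using a hypothetical stand-in schema in place of the one `get_source_schema()` compiles:

```python
from jsonschema import ValidationError, validate

# Hypothetical source schema; the real one is compiled from the interfaces.
source_schema = {
    "type": "object",
    "properties": {"file_path": {"type": "string"}},
    "required": ["file_path"],
    "additionalProperties": False,
}

# Valid source data passes silently:
validate(instance={"file_path": "data/session.dat"}, schema=source_schema)
print("Source data is valid!")

# Invalid source data raises ValidationError with a descriptive message:
try:
    validate(instance={"file_path": 123}, schema=source_schema)
except ValidationError as err:
    print(f"Source data is invalid: {err.message}")
```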

@CodyCBakerPhD (Member)

Just to be clear: if they were using NWBConverter before, this PR should not change things, right? Validation was already done there:

I suppose that's true, but I've been guilty myself of using ConverterPipes sometimes.

What is mildly annoying now is that JSON validation occurs in both NWBConverter initialization and individual interface initialization, but the overall benefit is probably worth it in the end.

@h-mayorquin (Collaborator, Author)

What is mildly annoying now is that JSON validation occurs in both NWBConverter initialization and individual interface initialization, but the overall benefit is probably worth it in the end.

Agreed. Long term, I would think about how to avoid the overlap.

@h-mayorquin h-mayorquin merged commit 9f67ec6 into main Sep 20, 2024
39 checks passed
@h-mayorquin h-mayorquin deleted the add_source_data_checks_to_interface branch September 20, 2024 00:36

codecov bot commented Sep 20, 2024

Codecov Report

Attention: Patch coverage is 98.34711% with 2 lines in your changes missing coverage. Please review.

Project coverage is 90.57%. Comparing base (36464df) to head (da8396f).
Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/neuroconv/basedatainterface.py | 92.30% | 1 Missing ⚠️ |
| ...tainterfaces/ecephys/plexon/plexondatainterface.py | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1090      +/-   ##
==========================================
+ Coverage   90.44%   90.57%   +0.12%     
==========================================
  Files         129      129              
  Lines        8055     8155     +100     
==========================================
+ Hits         7285     7386     +101     
+ Misses        770      769       -1     
| Flag | Coverage Δ |
|---|---|
| unittests | 90.57% <98.34%> (+0.12%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| src/neuroconv/baseextractorinterface.py | 100.00% <100.00%> (ø) |
| ...aces/ecephys/alphaomega/alphaomegadatainterface.py | 100.00% <100.00%> (ø) |
| ...datainterfaces/ecephys/axona/axonadatainterface.py | 90.32% <100.00%> (+1.29%) ⬆️ |
| ...erfaces/ecephys/baserecordingextractorinterface.py | 93.81% <100.00%> (ø) |
| ...rfaces/ecephys/blackrock/blackrockdatainterface.py | 95.00% <100.00%> (+0.45%) ⬆️ |
| .../ecephys/cellexplorer/cellexplorerdatainterface.py | 85.71% <100.00%> (+0.32%) ⬆️ |
| ...datainterfaces/ecephys/intan/intandatainterface.py | 100.00% <100.00%> (ø) |
| ...rfaces/ecephys/neuralynx/neuralynxdatainterface.py | 86.07% <100.00%> (+0.74%) ⬆️ |
| ...terfaces/ecephys/spikeglx/spikeglxdatainterface.py | 95.00% <100.00%> (+0.66%) ⬆️ |
| ...terfaces/ecephys/spikeglx/spikeglxnidqinterface.py | 98.03% <100.00%> (+0.36%) ⬆️ |

... and 9 more

Successfully merging this pull request may close these issues:

Discussion: Unify run_conversion interfaces between base classes and nwbconverter