
Add json schema validation for source data at the interface level #1090

Merged: 17 commits merged into main on Sep 20, 2024

Conversation

h-mayorquin (Collaborator)

Should close #42.

@h-mayorquin h-mayorquin marked this pull request as ready for review September 19, 2024 00:48
@h-mayorquin (Collaborator, Author)

@CodyCBakerPhD

This is ready. Three things:

  1. Some of the typing was missing. Some arguments were annotated as value: float = None, and I had to mark them as Optional because the JSON schema would otherwise fail to validate. I am surprised that Pydantic missed that. It adds to the point you made before that the two have different strengths, maybe?
  2. I created a way to make explicit which source arguments are used to initialize an extractor in the extractor interfaces. The problem is that we pass everything to the super init, where it is validated, but sometimes we modify the argument names within the interface and the JSON validation fails. I think it is better to have this explicit anyway (explicit is better than implicit).
  3. Some of the typing was underspecified. For example, in the MockRecording the annotation for durations was tuple[int], but it needs to be tuple[int, ...] to indicate a variable-length tuple.
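The two annotation issues above (points 1 and 3) can be sketched as follows; the function name `set_conversion_factor` is purely illustrative, not NeuroConv API:

```python
from typing import Optional, get_args

# Point 1: `value: float = None` contradicts itself (None is not a float),
# so a JSON schema generated from the annotation rejects the default.
# Marking the argument Optional makes None an allowed value:
def set_conversion_factor(value: Optional[float] = None) -> None:
    """Previously annotated as `value: float = None`."""

# Point 3: tuple[int] means "a tuple of exactly one int";
# tuple[int, ...] means "a tuple of ints of any length":
assert get_args(tuple[int]) == (int,)
assert get_args(tuple[int, ...]) == (int, Ellipsis)
```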

@@ -23,7 +23,7 @@ Convert LightningPose pose estimation data to NWB using :py:class:`~neuroconv.da
>>> labeled_video_file_path = str(folder_path / "labeled_videos/test_vid_labeled.mp4")

>>> converter = LightningPoseConverter(file_path=file_path, original_video_file_path=original_video_file_path, labeled_video_file_path=labeled_video_file_path, verbose=False)

Source data is valid!
Member

Here I was thinking you despised verbosity in the testing suite 😉

Collaborator (Author)

I do, but I did not want to change it there because I saw that @pauladkisson explicitly set verbose=True, so I went with what I assumed was the intent.

@CodyCBakerPhD (Member)

  1. Some of the typing was missing. Some arguments were annotated as value: float = None, and I had to mark them as Optional because the JSON schema would otherwise fail to validate. I am surprised that Pydantic missed that. It adds to the point you made before that the two have different strengths, maybe?

I too am surprised by this, but maybe not so much. docval in HDMF behaves similarly: you don't need to indicate that None is a possible value for an argument even when None is exactly its default. I think Pydantic does not consider the typing-vs-default mismatch a fundamental incompatibility, since the validation call only checks whether user-specified (not default) input matches the annotated type. If the user explicitly passes the same value as the default, I don't know what would happen either, but I suppose it would be ignored.
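A minimal sketch of the behavior being discussed, assuming pydantic v2 is installed; `f` is a toy function, not NeuroConv code:

```python
from pydantic import ValidationError, validate_call

@validate_call
def f(value: float = None):  # annotation and default disagree
    return value

f()           # passes: validate_call does not validate default values
f(value=1.0)  # passes: user input matches the annotation

try:
    f(value=None)  # user-supplied None IS checked against `float`
except ValidationError:
    print("explicit None rejected")
```

This is consistent with the point above: the mismatch only surfaces when a consumer (like a generated JSON schema) takes the annotation at face value.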

  2. I created a way to make explicit which source arguments are used to initialize an extractor in the extractor interfaces. The problem is that we pass everything to the super init, where it is validated, but sometimes we modify the argument names within the interface and the JSON validation fails. I think it is better to have this explicit anyway (explicit is better than implicit).

Nice - yeah, that has always proved a subtle pain point, since we do use additional kwargs in the interface __init__ that are not part of the extractor kwargs, usually to perform 'extra' metadata or Neo-related operations.

  3. Some of the typing was underspecified. For example, in the MockRecording the annotation for durations was tuple[int], but it needs to be tuple[int, ...] to indicate a variable-length tuple.

Good catches - accurate validation must rely on accurate annotations, which until now have only been used as a form of developer communication.

@CodyCBakerPhD (Member)

Just beware: this (like the Pydantic one before it) could potentially break custom conversion pipelines using latest/dev NeuroConv by requiring accurate annotations on every interface __init__.

@h-mayorquin (Collaborator, Author)

Just beware: this (like the Pydantic one before it) could potentially break custom conversion pipelines using latest/dev NeuroConv by requiring accurate annotations on every interface __init__.

This is a very important point.

Just to be clear: if they were using NWBConverter before, this PR should not change things, right? Validation was already done there:

    @classmethod
    def get_source_schema(cls) -> dict:
        """Compile input schemas from each of the data interface classes."""
        source_schema = get_base_schema(
            root=True,
            id_="source.schema.json",
            title="Source data schema",
            description="Schema for the source data, files and directories",
            version="0.1.0",
        )
        for interface_name, data_interface in cls.data_interface_classes.items():
            source_schema["properties"].update({interface_name: unroot_schema(data_interface.get_source_schema())})
        return source_schema

    @classmethod
    def validate_source(cls, source_data: dict[str, dict], verbose: bool = True):
        """Validate source_data against Converter source_schema."""
        cls._validate_source_data(source_data=source_data, verbose=verbose)

    def _validate_source_data(self, source_data: dict[str, dict], verbose: bool = True):
        encoder = _NWBSourceDataEncoder()
        # The encoder produces a serialized object, so we deserialize it for comparison
        serialized_source_data = encoder.encode(source_data)
        decoded_source_data = json.loads(serialized_source_data)
        validate(instance=decoded_source_data, schema=self.get_source_schema())
        if verbose:
            print("Source data is valid!")

    @validate_call
    def __init__(self, source_data: dict[str, dict], verbose: bool = True):
        """Validate source_data against source_schema and initialize all data interfaces."""
        self.verbose = verbose
        self._validate_source_data(source_data=source_data, verbose=self.verbose)
        self.data_interface_objects = {
            name: data_interface(**source_data[name])
            for name, data_interface in self.data_interface_classes.items()
            if name in source_data
        }
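The core of that validation path is the jsonschema `validate` call. A standalone sketch, using a hypothetical stand-in schema in place of the one `get_source_schema()` compiles:

```python
from jsonschema import ValidationError, validate

# Hypothetical source schema; the real one is compiled from the interfaces.
source_schema = {
    "type": "object",
    "properties": {"file_path": {"type": "string"}},
    "required": ["file_path"],
    "additionalProperties": False,
}

# Valid source data passes silently:
validate(instance={"file_path": "data/session.dat"}, schema=source_schema)
print("Source data is valid!")

# Invalid source data raises ValidationError with a descriptive message:
try:
    validate(instance={"file_path": 123}, schema=source_schema)
except ValidationError as err:
    print(f"Source data is invalid: {err.message}")
```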

@CodyCBakerPhD (Member)

Just to be clear: if they were using NWBConverter before, this PR should not change things, right? Validation was already done there:

I suppose that's true, but I've been guilty myself of using ConverterPipes sometimes.

What is mildly annoying now is that JSON validation occurs in both NWBConverter initialization and individual interface initialization, but the overall benefit is probably worth it in the end.

@h-mayorquin (Collaborator, Author)

What is mildly annoying now is that JSON validation occurs in both NWBConverter initialization and individual interface initialization, but the overall benefit is probably worth it in the end.

Agreed. Long term, I would think about how to avoid the overlap.

@h-mayorquin h-mayorquin merged commit 9f67ec6 into main Sep 20, 2024
39 checks passed
@h-mayorquin h-mayorquin deleted the add_source_data_checks_to_interface branch September 20, 2024 00:36

codecov bot commented Sep 20, 2024

Codecov Report

Attention: Patch coverage is 98.34711% with 2 lines in your changes missing coverage. Please review.

Project coverage is 90.57%. Comparing base (36464df) to head (da8396f).
Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/neuroconv/basedatainterface.py | 92.30% | 1 Missing ⚠️ |
| ...tainterfaces/ecephys/plexon/plexondatainterface.py | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1090      +/-   ##
==========================================
+ Coverage   90.44%   90.57%   +0.12%     
==========================================
  Files         129      129              
  Lines        8055     8155     +100     
==========================================
+ Hits         7285     7386     +101     
+ Misses        770      769       -1     
| Flag | Coverage Δ |
|---|---|
| unittests | 90.57% <98.34%> (+0.12%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| src/neuroconv/baseextractorinterface.py | 100.00% <100.00%> (ø) |
| ...aces/ecephys/alphaomega/alphaomegadatainterface.py | 100.00% <100.00%> (ø) |
| ...datainterfaces/ecephys/axona/axonadatainterface.py | 90.32% <100.00%> (+1.29%) ⬆️ |
| ...erfaces/ecephys/baserecordingextractorinterface.py | 93.81% <100.00%> (ø) |
| ...rfaces/ecephys/blackrock/blackrockdatainterface.py | 95.00% <100.00%> (+0.45%) ⬆️ |
| .../ecephys/cellexplorer/cellexplorerdatainterface.py | 85.71% <100.00%> (+0.32%) ⬆️ |
| ...datainterfaces/ecephys/intan/intandatainterface.py | 100.00% <100.00%> (ø) |
| ...rfaces/ecephys/neuralynx/neuralynxdatainterface.py | 86.07% <100.00%> (+0.74%) ⬆️ |
| ...terfaces/ecephys/spikeglx/spikeglxdatainterface.py | 95.00% <100.00%> (+0.66%) ⬆️ |
| ...terfaces/ecephys/spikeglx/spikeglxnidqinterface.py | 98.03% <100.00%> (+0.36%) ⬆️ |

... and 9 more

Successfully merging this pull request may close these issues:

Discussion: Unify run_conversion interfaces between base classes and nwbconverter