
[Bug]: write_recording() fails with NWBZarrIO backend #202

Closed

alejoe91 opened this issue Nov 21, 2022 · 17 comments

@alejoe91 (Contributor) commented Nov 21, 2022

What happened?

When trying to write a recording object using the NWBZarrIO backend from hdmf-zarr, I get the following error, related to calling len() on the SpikeInterfaceRecordingDataChunkIterator.

The same nwbfile does not produce any error when writing to HDF5.

@oruebel @rly I think this might be due to the way neuroconv handles the data iteration, but tagging you here in case it's a Zarr backend issue.

Steps to Reproduce

from pynwb import NWBFile
from hdmf_zarr.nwb import NWBZarrIO
from datetime import datetime
from neuroconv.tools.spikeinterface import write_recording

import spikeinterface.full as si


# create toy objects
rec, sort = si.toy_example()

# instantiate nwbfile
nwb_metadata = dict(session_description="toy_example", identifier="tpy", session_start_time=datetime.now())
nwbfile = NWBFile(**nwb_metadata)

# write recording
write_recording(rec, nwbfile=nwbfile)

# write to Zarr
with NWBZarrIO('test_recording_zarr.nwb', "w") as io:
    io.write(nwbfile)

Traceback

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:904, in ZarrIO.__list_fill__(self, parent, name, data, options)
    903 try:
--> 904     dset[:] = data  # If data is an h5py.Dataset then this will copy the data
    905 # For compound data types containing strings Zarr sometimes does not like wirting multiple values
    906 # try to write them one-at-a-time instead then

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/zarr/core.py:1373, in Array.__setitem__(self, selection, value)
   1372 else:
-> 1373     self.set_basic_selection(pure_selection, value, fields=fields)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/zarr/core.py:1468, in Array.set_basic_selection(self, selection, value, fields)
   1467 else:
-> 1468     return self._set_basic_selection_nd(selection, value, fields=fields)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/zarr/core.py:1772, in Array._set_basic_selection_nd(self, selection, value, fields)
   1770 indexer = BasicIndexer(selection, self)
-> 1772 self._set_selection(indexer, value, fields=fields)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/zarr/core.py:1800, in Array._set_selection(self, indexer, value, fields)
   1799         value = np.asanyarray(value, like=self._meta_array)
-> 1800     check_array_shape('value', value, sel_shape)
   1802 # iterate over chunks in range

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/zarr/util.py:547, in check_array_shape(param, array, shape)
    546 if array.shape != shape:
--> 547     raise ValueError('parameter {!r}: expected array with shape {!r}, got {!r}'
    548                      .format(param, shape, array.shape))

ValueError: parameter 'value': expected array with shape (300000, 4), got ()

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In [22], line 2
      1 with NWBZarrIO('nwb-test-files/test_recording_zarr.nwb', "w") as io:
----> 2     io.write(nwbfile)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 def func_call(*args, **kwargs):
    644     pargs = _check_args(args, kwargs)
--> 645     return func(args[0], **pargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:151, in ZarrIO.write(self, **kwargs)
    149 """Overwrite the write method to add support for caching the specification"""
    150 cache_spec = popargs('cache_spec', kwargs)
--> 151 super(ZarrIO, self).write(**kwargs)
    152 if cache_spec:
    153     self.__cache_spec()

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 def func_call(*args, **kwargs):
    644     pargs = _check_args(args, kwargs)
--> 645     return func(args[0], **pargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/backends/io.py:51, in HDMFIO.write(self, **kwargs)
     49 container = popargs('container', kwargs)
     50 f_builder = self.__manager.build(container, source=self.__source, root=True)
---> 51 self.write_builder(f_builder, **kwargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 def func_call(*args, **kwargs):
    644     pargs = _check_args(args, kwargs)
--> 645     return func(args[0], **pargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:248, in ZarrIO.write_builder(self, **kwargs)
    246 f_builder, link_data, exhaust_dci = getargs('builder', 'link_data', 'exhaust_dci', kwargs)
    247 for name, gbldr in f_builder.groups.items():
--> 248     self.write_group(parent=self.__file,
    249                      builder=gbldr,
    250                      link_data=link_data,
    251                      exhaust_dci=exhaust_dci)
    252 for name, dbldr in f_builder.datasets.items():
    253     self.write_dataset(parent=self.__file,
    254                        builder=dbldr,
    255                        link_data=link_data,
    256                        exhaust_dci=exhaust_dci)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 def func_call(*args, **kwargs):
    644     pargs = _check_args(args, kwargs)
--> 645     return func(args[0], **pargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:283, in ZarrIO.write_group(self, **kwargs)
    281 if subgroups:
    282     for subgroup_name, sub_builder in subgroups.items():
--> 283         self.write_group(parent=group,
    284                          builder=sub_builder,
    285                          link_data=link_data,
    286                          exhaust_dci=exhaust_dci)
    288 datasets = builder.datasets
    289 if datasets:

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 def func_call(*args, **kwargs):
    644     pargs = _check_args(args, kwargs)
--> 645     return func(args[0], **pargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:291, in ZarrIO.write_group(self, **kwargs)
    289 if datasets:
    290     for dset_name, sub_builder in datasets.items():
--> 291         self.write_dataset(parent=group,
    292                            builder=sub_builder,
    293                            link_data=link_data,
    294                            exhaust_dci=exhaust_dci)
    296 # write all links (haven implemented)
    297 links = builder.links

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/utils.py:645, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 def func_call(*args, **kwargs):
    644     pargs = _check_args(args, kwargs)
--> 645     return func(args[0], **pargs)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:736, in ZarrIO.write_dataset(self, **kwargs)
    734     self.__dci_queue.append(dataset=dset, data=data)
    735 elif hasattr(data, '__len__'):
--> 736     dset = self.__list_fill__(parent, name, data, options)
    737 else:
    738     dset = self.__scalar_fill__(parent, name, data, options)

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf_zarr/backend.py:908, in ZarrIO.__list_fill__(self, parent, name, data, options)
    905     # For compound data types containing strings Zarr sometimes does not like wirting multiple values
    906     # try to write them one-at-a-time instead then
    907     except ValueError:
--> 908         for i in range(len(data)):
    909             dset[i] = data[i]
    910 return dset

File ~/anaconda3/envs/nwb/lib/python3.9/site-packages/hdmf/data_utils.py:1028, in DataIO.__len__(self)
   1026 if not self.valid:
   1027     raise InvalidDataIOError("Cannot get length of data. Data is not valid.")
-> 1028 return len(self.data)

TypeError: object of type 'SpikeInterfaceRecordingDataChunkIterator' has no len()

Operating System

Linux

Python Executable

Python

Python Version

3.9

Package Versions

No response

alejoe91 added the bug label Nov 21, 2022
@CodyCBakerPhD (Member)

Looking through the error stack, I expect this is more an issue with hdmf-zarr than with anything in NeuroConv, but I will try reproducing with much simpler examples to be sure. There is definitely something about how the DCI communicates with the final write attempt that differs from vanilla HDMF.

Not related to the primary issue, but note a nice new convenience we've added:

nwb_metadata = dict(session_description="toy_example", identifier="tpy", session_start_time=datetime.now())
nwbfile = NWBFile(**nwb_metadata)

which can be done as

from nwbinspector.tools import make_minimal_nwbfile

nwbfile = make_minimal_nwbfile()

@alejoe91 (Contributor, Author)

Nice! Thanks for the tip! Should we move the discussion over to hdmf-zarr then?

@CodyCBakerPhD (Member)

If you have the time (if not I'll try it later), try

(i) writing a TimeSeries directly to acquisition with the same data as the recording. Something like

from pynwb import TimeSeries
from nwbinspector.tools import make_minimal_nwbfile

nwbfile = make_minimal_nwbfile()
nwbfile.add_acquisition(TimeSeries(name="idk", data=recording.get_traces(), rate=1.))

if that works fine then

(ii) try the more fundamentally basic SliceableDataChunkIterator (this will end up in HDMF at some point; unlike the SpikeInterfaceRecordingDataChunkIterator, there is nothing NeuroConv-specific about its design)

from pynwb import TimeSeries
from neuroconv.tools.hdmf import SliceableDataChunkIterator

nwbfile = make_minimal_nwbfile()
nwbfile.add_acquisition(TimeSeries(name="idk", data=SliceableDataChunkIterator(recording.get_traces()), rate=1.))

@alejoe91 (Contributor, Author)

I'll do it right now!

@CodyCBakerPhD (Member)

If that second one gives a similar error then yes, hdmf-zarr needs to change something to be compatible with the existing data iteration tools.

I guess to be extra-sure, you could try

(iii) the previous iterator; not as efficient, but it 'always works' (? we'll see...)

from pynwb import TimeSeries
from hdmf.data_utils import DataChunkIterator

nwbfile = make_minimal_nwbfile()
nwbfile.add_acquisition(TimeSeries(name="idk", data=DataChunkIterator(recording.get_traces()), rate=1.))

@alejoe91 (Contributor, Author)

> If you have the time (if not I'll try it later), try (i) writing a TimeSeries directly to acquisition with the same data as the recording [...] if that works fine then (ii) try the more fundamentally basic SliceableDataChunkIterator [...]

All three examples run fine (though the DataChunkIterator is extremely slow!). So I guess it might be related to the SpikeInterfaceRecordingDataChunkIterator?

@CodyCBakerPhD (Member)

> So I guess it might be related to the SpikeInterfaceRecordingDataChunkIterator?

Can't quite conclude that just yet. Good to know the iterators work, though.

Ah... so the problem now is likely that we wrap a lot of things in H5DataIO in NeuroConv, but if you're using Zarr that is no longer necessary. (What is the method for specifying compression options with the Zarr backend, by the way?)

@alejoe91 (Contributor, Author)

For zarr you can use the following arguments:

  • compressor: any numcodecs-compatible compressor
  • chunks: tuple with chunk dimensions
  • filters: list of other numcodecs objects that are run before the compressor

Not sure how this is wrapped in hdmf-zarr; we need to take a look.
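
For illustration (not from the thread), a minimal sketch of how those three arguments look in plain zarr + numcodecs; the file name, chunk shape, and codec choices here are arbitrary:

import numpy as np
import zarr
from numcodecs import Blosc, Delta

data = np.random.rand(300_000, 4).astype("float32")

z = zarr.open(
    "traces.zarr",
    mode="w",
    shape=data.shape,
    dtype=data.dtype,
    chunks=(10_000, 4),                        # chunk dimensions (a shape, not a byte size)
    filters=[Delta(dtype="float32")],          # run before the compressor
    compressor=Blosc(cname="zstd", clevel=5),  # any numcodecs-compatible compressor
)
z[:] = data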

@CodyCBakerPhD (Member)

> chunks: tuple with chunk dimensions

Are these equivalent to HDF5 'chunks'? (~1 MB in size; our iterator buffers multiple chunks, 1 GB by default)

> Not sure how this is wrapped in hdmf-zarr; we need to take a look.

Cool, let me know. We're used to just specifying the string name plus a dict of options through the other wrapper, but if we need to specify a specific codec class we can do the string-to-class mapping ourselves in NeuroConv (and optionally, of course, allow the user to specify the class explicitly), as sketched below.
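
A hypothetical sketch of such a string-to-class mapping (illustrative only, not actual NeuroConv code):

from numcodecs import Blosc, GZip, Zstd

# Hypothetical registry of string names to numcodecs codec classes
compressor_classes = {"blosc": Blosc, "gzip": GZip, "zstd": Zstd}

def get_zarr_compressor(method, options=None):
    """Resolve a string name plus an options dict into a codec instance."""
    return compressor_classes[method](**(options or {}))

compressor = get_zarr_compressor("blosc", dict(cname="zstd", clevel=5))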

@alejoe91 (Contributor, Author)

> chunks: tuple with chunk dimensions
>
> Are these equivalent to HDF5 'chunks'? (~1 MB in size; our iterator buffers multiple chunks, 1 GB by default)

No, they are specified not by size but by shape, e.g. (1000, 100) or (1000, None).
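
For illustration (assuming zarr's convention that None in a chunk tuple means the full extent of that axis):

import zarr

z = zarr.zeros((300_000, 4), chunks=(1000, None))
print(z.chunks)  # (1000, 4) -- None expands to the full size of that axis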

> Cool, let me know [...]

Ok, sounds good. I'll keep testing different options in the coming days! Thanks for the help @CodyCBakerPhD

@oruebel commented Nov 21, 2022

Based on the error, I think this is probably an issue with the processing of the DataChunkIterator in hdmf-zarr. What is the data type of the data? I won't be able to look at this until later today. I created the following issue on hdmf-zarr as a reminder: hdmf-dev/hdmf-zarr#43

@alejoe91 (Contributor, Author)

For this toy example it's float

@CodyCBakerPhD (Member)

@oruebel @alejoe91 I just tested the example locally and determined it has nothing to do with the iterators: it's the fact that we automatically wrap all kinds of things in NeuroConv with H5DataIO, which has no meaning here, and that's what was causing the length problem.

A fix is in progress on #203, but it might take some discussion as to how we want to properly go about it (it could also factor into a more general overhaul of our compression option specification, which we've wanted to do for a while).
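
For illustration (not from the thread), a minimal sketch of that failure mode, reusing the SliceableDataChunkIterator from above; the exact wrapping inside write_recording may differ:

import numpy as np
from hdmf.backends.hdf5 import H5DataIO
from neuroconv.tools.hdmf import SliceableDataChunkIterator

# Wrap an iterator in H5DataIO, as NeuroConv does internally for HDF5 writes
wrapped = H5DataIO(data=SliceableDataChunkIterator(np.random.rand(300_000, 4)))

# ZarrIO falls through to DataIO.__len__, which calls len(self.data) on the
# iterator and raises TypeError, matching the traceback above
len(wrapped)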

As an aside, the resulting NWB file that I get out of this... is it supposed to look like a folder? This is my first time testing the Zarr backend.

@alejoe91 (Contributor, Author)

Yep! A Zarr store is indeed a structured folder ;)

@oruebel commented Nov 22, 2022

> the fact that we automatically wrap all kinds of things in NeuroConv with H5DataIO which clearly has no meaning here

That makes sense. Yes, ZarrIO does not know how to deal with H5DataIO. There is an equivalent ZarrDataIO class that works in much the same way as H5DataIO, but some of the options, in particular for compressors, are different. A basic tutorial on ZarrDataIO is here: https://hdmf-zarr.readthedocs.io/en/latest/tutorials/plot_zarr_dataset_io.html#sphx-glr-tutorials-plot-zarr-dataset-io-py
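
Following the linked tutorial, wrapping would look roughly like this (reusing recording and nwbfile from the snippets above; the exact option values are illustrative):

from numcodecs import Blosc
from hdmf_zarr import ZarrDataIO
from pynwb import TimeSeries

wrapped = ZarrDataIO(
    data=recording.get_traces(),  # any array-like
    chunks=(10_000, 4),           # chunk shape, as discussed above
    compressor=Blosc(cname="zstd", clevel=5),
)
nwbfile.add_acquisition(TimeSeries(name="idk", data=wrapped, rate=1.0))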

For basic setups, e.g., chunking only, we could in principle think about having some form of a "translator" to take H5DataIO objects and translate them to ZarrDataIO, but I don't think that will be possible in all cases, as HDF5 probably has settings that Zarr does not support and vice versa.
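
Such a translator might look roughly like this (a hypothetical sketch, not an official API; h5_to_zarr_dataio is an invented name):

from hdmf.backends.hdf5 import H5DataIO
from hdmf_zarr import ZarrDataIO

def h5_to_zarr_dataio(h5_io):
    # H5DataIO keeps its constructor options in .io_settings
    settings = h5_io.io_settings
    return ZarrDataIO(
        data=h5_io.data,
        chunks=settings.get("chunks"),  # chunk shape carries over directly
        # HDF5-specific options (e.g. compression="gzip", compression_opts=...)
        # have no 1:1 Zarr equivalent and would need mapping to numcodecs codecs
    )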

Let me know if there is anything I can help with to resolve this issue.

@h-mayorquin (Collaborator)

We should close this after we merge #578.

@h-mayorquin (Collaborator)

I am closing this, as it is now supported through the backend configuration.
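
For readers landing here later, a sketch of that workflow based on the NeuroConv backend-configuration docs (treat the exact names as assumptions):

from neuroconv.tools.nwb_helpers import (
    configure_backend,
    get_default_backend_configuration,
)

# Build a default Zarr configuration for the datasets in the file, then apply it
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="zarr")
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

with NWBZarrIO("test_recording_zarr.nwb", "w") as io:
    io.write(nwbfile)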
