[Feature]: Standardizing Conversion Config Options #280
Comments
I favor the configuration class as well, but the problem with it is that it will be slightly more complex to develop and, if we do it wrong, it could hamper usability. I am uncertain about whether this will reduce or increase our maintenance costs. I was trying to make the same point before, but I don't think I explained myself very well. That said, I still think it is the best option. I think that a data class is not sufficient, though. Take compression, for example, given its inconsistent signature. This is also valid for the other conversion configurations. Here is the full table:
All of them have a validation part that is done internally and an operation that wraps or reshapes the data for the desired end, which is called in our writing pipeline. Encapsulating this logic out of our pipelines makes it easier to test, and I think it makes the code more readable (the programmer reading the code knows that this is where stubbing, compression, or iterative write configuration happens but is not shown the details). Having classes for configuration also allows a single object to be passed around, working as a global configuration if that is desirable. Finally, this means that we only need to provide rich documentation in one place, as mentioned in #278.
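As a rough sketch of what I mean for compression only (the class and method names here are hypothetical, not an existing NeuroConv API):

```python
# Hypothetical sketch only: class and method names are illustrative.
from dataclasses import dataclass, field

from hdmf.backends.hdf5 import H5DataIO


@dataclass
class CompressionConfig:
    method: str = "gzip"
    options: dict = field(default_factory=dict)

    def validate(self) -> None:
        # The validation part that is currently done internally.
        if self.method not in ("gzip", "lzf"):
            raise ValueError(f"Unsupported compression method: {self.method}")

    def wrap(self, data):
        # The operation that wraps the data for writing, called in the pipeline.
        self.validate()
        return H5DataIO(data, compression=self.method, **self.options)
```

The writing pipeline would then only call something like `config.wrap(data)` without needing to know the backend-specific details.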
Concerning this:
I think we can alleviate things on the usability front somewhat by allowing a two-tier level of configuration. For example, for stubbing, we could offer two simple possible options. I think this division can also work for compression, where the simple tier is deciding between standard gzip level=4 compression and no compression at all, whereas the fine-tuning tier allows passing more detailed options. I am less certain how this can work with the iterative data write specification. What is the thing that needs to be changed, @CodyCBakerPhD? I thought that here the simple tier could be choosing between simple defaults with 1GB chunks and not using iterative write at all, whereas the fine-tuning tier would allow control of chunks, shapes, axis, etc. What do you think? Does this make sense or am I missing something important?
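As an illustration of the two-tier idea for compression only (the function and parameter names are hypothetical, not a concrete proposal):

```python
# Hypothetical sketch of the two tiers for compression; not a concrete API.
from typing import Optional, Union


def resolve_compression(compression: Union[bool, dict]) -> Optional[dict]:
    """Map the simple tier (True/False) or the fine-tuning tier (a dict) onto
    concrete keyword arguments for the writer."""
    if compression is True:   # simple tier: standard gzip, level 4
        return dict(compression="gzip", compression_opts=4)
    if compression is False:  # simple tier: no compression at all
        return None
    return dict(compression)  # fine-tuning tier: user-specified options pass through
```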
This is precisely the way the default already works. The only reason I don't do that all the time is just to communicate more easily in the code 'what' that default equates to (otherwise you have to go and read through a bunch of upstream documentation to determine that it's level 4 GZIP).
Yep, this is the biggest point at the end of the day to resolve the concerns that originated this discussion.
This goes beyond what I was imagining, which was just a data object that passes kwargs to the respective methods. But I can also see how this could provide an opportunity to standardize how those methods are wrapped. I will have to think on that more... For now, it might be best for a first PR to simply pass parameters and a follow-up PR to introduce the more data-interacting side.
The simple kwarg passing is especially true for iteration, so nothing would need to be changed. Each 'type' of iterator takes and passes its own respective kwargs.
I think we have spent too much time on this and I'd like to wrap it up. One challenge here is building a specification that works across the different use cases: built-ins, hdf5plugin, and Zarr. Another is that the allowable args for the options change depending on the type of compression. We could formally specify the input options as they change for each different compressor, which would require us to build a class for each compressor, but honestly I don't think this is really complex enough to require the class approach. Besides, I think that we would need to create (n_compression_methods x n_backends) classes, which I think is too many.

Thanks @h-mayorquin for pointing out the class-based specification of the kernels in scikit-learn. That's an interesting case study; however, I don't think that approach fits here. There, they have a few different methods with lots of interdependencies among the parameters. The complexity is in the parameterization of each of a few different configuration options, so there are a few classes with complex logic in each. Here, we have lots of different types of configs but not much complexity within each, which would translate to many small classes. I don't see much benefit in a single class for all compression options. I'll cede the point that a dictionary is more explicit than a list of args, and a list of args may have been overfitting to the hdf5plugin API.

How about this:

```python
AVAILABLE_COMPRESSION_METHODS = [
    "gzip",
    "blosc",
    ...
]

run_conversion(
    ...,
    compression_method: Literal[True, False, *AVAILABLE_COMPRESSION_METHODS] = True,
    compression_options: Optional[dict] = None,
)

"""
Parameters
----------
...
compression_method : {"gzip", "blosc", ..., True, False}
compression_options : dict
    Changes depending on the compression_method. See docs on compression [here].
"""
```

Then all of the complications regarding how compression_options changes based on the compression_method and the data backend can be off-loaded to a docs page. It's too complicated to try to do in a docstring. I know most of you wanted to group the method and the options into one dictionary, but I prefer my way for a few reasons.
Here are a few examples of different configs and how they would be translated.

Example 1: auto

```yaml
compression: True
```

would translate to:

```python
H5DataIO(data, compression=True)  # gzip, lvl 4 by default in h5py
```

Example 2: gzip with custom level

```yaml
compression_method: gzip
compression_options:
  clevel: 4
```

would translate to:

```python
H5DataIO(data, compression="gzip", compression_opts=4)
```

Example 3: hdf5plugin

```yaml
compression_method: blosc
compression_options:
  cname: zstd
  clevel: 4
  shuffle: 2
```

would translate to, for HDF5:

```python
filter = hdf5plugin.get_filters("blosc")[0]
H5DataIO(data, allow_plugin_filters=True, **filter(cname="zstd", clevel=4, shuffle=2))
```

For Zarr:

```python
from numcodecs import Blosc

ZarrDataIO(data, compressor=Blosc(cname="zstd", clevel=4, shuffle=2))
```

I don't think this would be too hard to do with a simple function and a few if/else clauses. Basically something like:
```python
from functools import partial

import hdf5plugin
from hdmf.backends.hdf5 import H5DataIO


def configure_compression(compression_method, compression_options, output_type="h5"):
    if output_type == "h5":
        if compression_method in ("gzip",):
            return partial(H5DataIO, compression="gzip", compression_opts=compression_options["clevel"])
        elif compression_method in ("blosc", ...):
            filter = hdf5plugin.get_filters("blosc")[0]
            filter_kws = filter(**compression_options)
            return partial(H5DataIO, allow_plugin_filters=True, **filter_kws)
    ...


...
TimeSeries(
    ...,
    data=configure_compression(compression_method, compression_options)(data),
)
```

I am imagining for the GUI forms we could have a drop-down for the available compressor types, and do a little bit of custom JavaScript to load the `compression_options` form based on which compression method is selected. Am I missing something? Are there any problems with this approach that I am not seeing? Unless there are, I would like to move forward with this.
Many of these ideas and/or the demonstration of #281 could be incorporated into a dataclass:

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class DatasetConfig:
    stub_shape: Optional[Tuple[int]] = None  # specifying this is the same as saying `stub=True`
    compression: Union[bool, int, str] = True
    compression_options: Optional[dict] = None
    iteration: bool = True  # TODO: remove old v1 type; argument becomes a simple enable/disable of the wrapper
    iterator_options: Optional[dict] = None
```

This is a low-priority feature to add, however.
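A hypothetical sketch of how it might be used (the `dataset_config` argument does not exist in `run_conversion` and is only illustrative):

```python
# Hypothetical usage of the DatasetConfig sketched above; `dataset_config`
# is not an existing run_conversion argument.
config = DatasetConfig(
    stub_shape=(100, 64),                  # implies stub=True
    compression="gzip",
    compression_options=dict(level=4),
    iteration=True,
    iterator_options=dict(buffer_gb=1.0),
)
converter.run_conversion(nwbfile_path="stub.nwb", metadata=metadata, dataset_config=config)
```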
I am going to close this. With all the work on the backend configuration, all the compression options are now specified: https://neuroconv.readthedocs.io/en/main/user_guide/backend_configuration.html There is still the issue of stubbing, metadata keys, iterative writing configuration, and the like, but I think we can start those in issues of their own.
What would you like to see added to NeuroConv?
@bendichter
Summary of conversation in PR #272.
I've broken it into several expandable sections as each point relates to multiple comments.
I've linked to original comments whenever they contain more information than the summary.
Conversations about this began back in #116 between @h-mayorquin and me.
The Problem
It has become annoying to compile the various fields in what we refer to as the `conversion_options` of `run_conversion` calls. These overlap with their corresponding parts of the `neuroconv.tools`. They tended to evolve differently, which led to some splits in how identical fields are specified between certain types of interfaces:
- In our ecephys data interfaces and tools we have
- In our ophys data interfaces and tools we have
- The behavior interfaces can be all over the place in which convention they choose, as raised in an older issue #42.
NeuroConv is not always consistent with external APIs (nor should it be)
From #272 (comment) in response to #272 (comment)
NeuroConv, like SpikeInterface, is a universal API for data ingestion from multiple sources.
However, upstream methods that do similar things can have different call signatures across different libraries.
For consistency, SpikeInterface and NeuroConv alias upstream `kwargs` with common references:
- `file_path` and `folder_path` + `stream_name`
- `filename` and `dirname`
- `file_path` and `folder_path`

Even the names of `DataIO` arguments, like `compression`/`compression_opts`, change between HDF5 and Zarr.

Readability of argument names is important
Abbreviations in argument names (and variable names in general) can actually make code unnecessarily hard to read.
Example:
`_opts` could mean any of: `optics`, `optical_series`, `operations`, or it is itself an actual word. This would likely be even more confusing for a coding novice who may never have seen that abbreviation used that way before. This is also only one of many examples; I can find more if you like.

It only takes a fraction of a second to clarify this by writing the full name, `_options`. Use of tab completion during development also completely negates the drawback of using longer variable names.

NeuroConv presents an opportunity for us to alias such things in a way that maximizes user-friendly readability.
The primary issue is not just about reaching an agreement on standard option names, but a reflection on the practice of adding more and more arguments every time we want to propagate something new
From Heberto,
With Szonja indicating a similar feeling at an in-person meeting.
If our own developers feel this way, it's quite likely that inexperienced users would as well.
NeuroConv is focusing more and more on not just being easy to use for Python novices, but also for non-coding individuals in general.
This echoes the code smell of too many parameters pointed out by Heberto.
Too many parameters is also the main cause of repeated docstrings #278
The total number of docstring repetitions for the current arguments grows quadratically: roughly the number of times each argument is repeated across interfaces multiplied by the number of repeated arguments at the top level. That amount of duplication is not trivial to maintain, and it is the main thing SpikeInterface and matplotlib try to avoid with their dynamic docstrings.
Keyword arguments, as used in the YAML, are not ordered
From #272 (comment), the current design allows the possibility to do things like
which causes an error and takes longer than necessary to debug because the inter-dependent options are not visually linked.
It only gets worse when we want to add new features
From #272 (comment)
- Better support for a Zarr backend and data IO has been requested ([Bug]: `write_recording()` fails with `NWBZarrIO` backend #202) and would be a great thing to add in NeuroConv. This would introduce `filter_type` and `filter_options` (these apply before the compression in Zarr, such as a Delta).
- We'd also like to eventually add more stub options.
For both of these additions, if we just add new pairs of arguments to the outer level, we'd make the above problems even worse.
In conclusion, @h-mayorquin, @weiglszonja, and I all see this as a problem for the consistency of the internal NeuroConv API, the maintainability of the existing codebase and documentation, and the scalability of adding new types of arguments as features.
Also, as per the argument that users are not expected to use any of these options anyway and instead defer to defaults, that may well be true for the compression field but it is definitely not true for iteration (which we often need to adjust based on the system resources and conversion setup) and stubbing (which we actively tell users to enable the first time they try to run a pipeline).
Proposed Solutions
Original: Multiple Nested Dictionaries
Each `*_options` argument would be a dict-of-dicts with:
- `method`, which specifies the 'type' of whatever option is being used
- `method_options`, the dictionary of arguments that get passed dynamically into whatever that type is

Example: the current flat keyword arguments become a single nested dictionary per option, as sketched below.
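A hypothetical before/after sketch of this shape, with illustrative values:

```python
# Current flat keyword arguments (illustrative)...
run_conversion(
    ...,
    compression="gzip",
    compression_opts=4,
)

# ...become a single nested dictionary per option:
run_conversion(
    ...,
    compression_options=dict(
        method="gzip",                 # the 'type' of the option being used
        method_options=dict(level=4),  # arguments passed dynamically to that type
    ),
)
```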
Pro's
- Everything is captured in `{name_of_option}_options`, and you know how the sub-fields of that will be referenced (`method` and `method_options` as catch-alls).
- Similar in structure to the existing `source_data`, `metadata`, and `conversion_options`.
Con's
Heberto's Suggestion: Multiple Flat Dictionaries
From #272 (comment)
Similar to the nested dictionary, but instead of nesting, just pass any options dynamically along with the `method`. This is how Pandas does things.

Example: the same information in a flat form, as sketched below.
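A hypothetical sketch of the same example in the flat form:

```python
# The method and its options live in one flat dictionary (illustrative values).
run_conversion(
    ...,
    compression_options=dict(
        method="gzip",
        level=4,  # options sit alongside `method` rather than being nested under it
    ),
)
```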
Pro's
- Everything is captured in `{name_of_option}_options`, and you know how the sub-fields of that will be referenced (`method` and `method_options` as catch-alls).

Con's
Config Class
This is a really good idea that @weiglszonja came up with in a meeting the other day.
Similar to the nested dictionary, but rather than it being a hard-coded dictionary-of-dictionaries in the API, it's a Python dataclass (or even a Pydantic model, which is basically an enhanced version with self-validating features).

The YAML representation would appear the same as either of the above approaches.

The API would restrict available options to pre-specified literals (exposed through something like a `get_options` helper, similar to `hdf5plugin.get_available_filters`). This is similar to how scikit-learn handles classes that take complex configurations. Pydantic itself uses Config classes just like this through its core model structure.

Additionally, this means that instead of passing a series of argument pairs (as we do right now on `main`), and instead of passing multiple `*_options` arguments into our functions, each would simply take a `Config` as a single argument, and that class would specify everything the conversion needs to know. We would only have to document the rich docstring once for the single Config class, which reduces duplication of that information across the rest of the repo.
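A hypothetical sketch of the idea with Pydantic (the class names, fields, and literals are illustrative only, not an agreed-upon NeuroConv API):

```python
# Hypothetical sketch; not an agreed-upon NeuroConv API.
from typing import Literal

from pydantic import BaseModel, Field, model_validator


class CompressionConfig(BaseModel):
    method: Literal["gzip", "blosc", "lzf"] = "gzip"  # restricted to pre-specified literals
    options: dict = Field(default_factory=dict)

    @model_validator(mode="after")
    def check_options(self):
        # Self-validating: reject options that do not apply to the chosen method.
        if self.method == "gzip" and not 0 <= self.options.get("level", 4) <= 9:
            raise ValueError("gzip level must be between 0 and 9")
        return self


class ConversionConfig(BaseModel):
    compression: CompressionConfig = Field(default_factory=CompressionConfig)
    stub_test: bool = False
    iterator_options: dict = Field(default_factory=dict)


# A single object specifies everything the conversion needs to know:
config = ConversionConfig(compression=CompressionConfig(method="blosc", options={"cname": "zstd"}))
```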
Pro's
- Everything is captured in `{name_of_option}_options`, and you know how the sub-fields of that will be referenced (`method` and `method_options` as catch-alls).

Con's
Do you have any interest in helping implement the feature?
Yes.