Issues with migration to zarr 3 #2689
h5py API compatibility is not a design goal for zarr v3. There are a few reasons for this, but (imo) the biggest is that the differences between the zarr and hdf5 formats are large enough that maintaining API compatibility was a maintenance burden we didn't want to bear. Of course, the only reason people create zarr arrays is to eventually fill them with data, so we need an ergonomic path for this.
see also this issue, which is specifically about creating zarr arrays from other array-like objects.
Hi @d-v-b and thanks for the feedback,
Just to clarify: we wouldn't need full API compatibility that maps all features (which indeed sounds difficult due to different concepts like sharding), but rather equivalence of the basic functionality for reading and writing data. Without this we cannot migrate as is, since a lot of our logic depends on this basic compatibility in order to read and write data from different formats through a unified interface. Adopting v3 would therefore require one of the following:
So I would personally clearly prefer option 1, and could also implement this in a PR. Otherwise I would see if you decide to implement 2 (but I don't have the resources to contribute to this myself, as it exceeds 1 in effort and complexity), and if this does not materialize, go with 3.
Beyond changing the names of functions, we are also planning to deprecate the old h5py-compatible function signatures from zarr-python 2.x.
I just want to add my perspective that I agree with @constantinpape that 1 would be the least effort, both long term and short term. It would be completely fine to segregate this functionality into its own namespace (zarr.convenience/zarr.compatibility/zarr.grumpy_old_fogeys), but having it in its own repo adds a ton of overhead for some simple functions — keeping up to date with changes in this repo (which clearly wants to move at pace), maintaining separate GHA configurations, separate PyPI and conda-forge updates... This does not seem worthwhile.
Handling data, dtype, and shape together is really not a huge burden:
I agree in principle that overloading functionality is often undesirable. But "practicality beats purity." And, even once from_array exists, since it does not exist in zarr2, it becomes a problem to support zarr2 and zarr3 from a single code base. You can claim that it's not a design goal to be compatible with zarr2, but I think that's overstating the case — it's not a goal to have full compatibility, but it's been a goal to be compatible with a frequently used subset. I understand that supporting the full 2.x API would be far too much effort given all the (positive!) changes in zarr3. However, if a small amount of effort allows most libraries built on zarr to work with both zarr versions, I think that is a huge win for the community — especially if the effort comes from the community itself, as @constantinpape is offering. 😉
Thanks for the replies and feedback everyone! Given the fixes @tomwhite pointed out, the current version of zarr v3 would be compatible with our needs. This is not yet part of the version I currently get from conda-forge, though. I think it would help to get at least a patch release for this (i.e. 3.0.1), in order to enable proper dependency specification in conda-forge (but this is just a minor issue). I 99% ( ;) ) agree with @jni's comments here. The only point I am not so sure about is whether a separate namespace is needed.
Just to understand this point, @d-v-b: I am not quite sure I follow here. And to be clear again, I am not demanding at all that zarr functionality be limited to strict adherence to the h5py signature / functionality. That is clearly not possible and would slow down zarr development a lot. The only goal I am advocating for is to keep around the convenience functions, with some defaults, to maintain a base level of compatibility.
Thanks for this context, that gives us a pretty specific chunk of code to talk about. What is the concrete signature you would recommend we support?
I am not so familiar with the exact function signature of zarr v2; we used it with calls like the following:
```python
# Create a dataset with shape and dtype; optionally specify compression and
# chunks, which should have reasonable defaults.
group.create_dataset(name, shape=shape, dtype=dtype, [compression=compression], [chunks=chunks])

# Create a dataset from a numpy array (data), with optional compression and chunks.
group.create_dataset(name, data=data, [compression=compression], [chunks=chunks])
```
At least on our end, the only (very!) relevant zarr v3 feature is sharding. So I would expect that the function now has an additional optional parameter for specifying shards.
for reference, here is the current function signature (zarr-python/src/zarr/api/synchronous.py, lines 743 to 763 at b773b3e):
There are a few parameters that would need to change to make this compatible with h5py.
I don't see how we can write a single function that smooths over these differences without hurting the experience for users who are only using zarr. Smoothing over the zarr v2 and zarr v3 differences was hard enough, and I don't think we are really done with that job. Abstracting over v2, v3, AND hdf5 with the same function(s) seems much more difficult. The simplest thing would be to write a function that takes all the keyword arguments for zarr's `create_array` plus the h5py-compatible ones, and translates between them.
we could also consider a separate module for this
what? `create_dataset` is deprecated anyway, so having any kind of consistent interface with `create_array` is not relevant at all. Only compatibility with zarr v2 is relevant. We could (a) make the `create_dataset` signature match the old signature, translating between the arguments appropriately (I have full faith that you know how to convert a string to a list of v3 compressor specs), (b) change the warning from a deprecation warning to a pending-deprecation warning or a future warning that points users to `create_array` as the preferred method unless you need h5py compatibility, and (c) postpone the removal of `create_dataset` ~indefinitely. If you think `create_dataset` makes the object "ugly", it would be fine to hide it under `getattr`, not document it, whatever. For me the disadvantages of removing it, just in terms of community good will, far outweigh any advantages of removing it.
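Option (b) from the comment above is cheap to sketch. Here is a minimal, hypothetical illustration; the shim body and warning message are assumptions, not zarr-python code:

```python
import warnings


def create_dataset(name, **kwargs):
    """Hypothetical compatibility shim; the real method would forward to create_array."""
    warnings.warn(
        "create_dataset is kept for h5py / zarr v2 compatibility; "
        "new code should prefer create_array.",
        FutureWarning,
        stacklevel=2,
    )
    # ... translate kwargs and call create_array here ...
    return None
```

Unlike `DeprecationWarning`, a `FutureWarning` is shown to end users by default, so downstream libraries would see the nudge toward `create_array` without the function being scheduled for removal.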
I fully agree with this. How does keeping `create_dataset` around hurt users?

What I suggest then would be something like this (I am sure not all the types are correct, so please don't take this too literally, it's just to illustrate the idea; we could also not expose any arguments at all and pop the relevant arguments from `**kwargs`):

```python
def create_dataset(
    self,
    data: Optional[ArrayLike] = None,
    dtype: Optional[str | np.dtype] = None,
    shape: Optional[Tuple[int, ...]] = None,
    compression: Optional[str] = None,
    fillvalue: Optional[Number] = None,
    **kwargs,
):
    # Check that data, dtype and shape are consistent if all are passed.
    # Otherwise, ensure that either data or (dtype, shape) is passed.
    # NOTE: I am a bit confused whether you are planning to support passing
    # 'data' to 'create_array' in the future or not. If it's supported then
    # this wouldn't be needed. (But it's totally fine to add it only here.)
    _check_data_consistency(data, dtype, shape)
    # Translate the compression argument from str (zarr v2 / h5py compatible) to zarr v3.
    compressors = _translate_compression(compression)
    return create_array(data=data, shape=shape, dtype=dtype, fill_value=fillvalue, compressors=compressors, **kwargs)
```

This would address compatibility with minimal effort, except for the different treatment of filters in zarr v2 vs. zarr v3. (Filters are not part of h5py anyway, see https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset.) I think it's fine if compatibility for such an advanced feature is not supported; a very large fraction of use-cases will not need it.
I would lobby for a combination of (b) / (c): raise some future warning that points to `create_array`.
Thanks for bringing this up. At least for us this removal would be quite a headache and would mean a major effort to keep supporting zarr-python. We have ca. 5 repositories that would be affected by this heavily. So we would either need to change a lot of code to determine whether it uses zarr or hdf5 and then use the specific library rather than our wrapper code, or write a zarr v3 wrapper class (which I would like to avoid, as I think we would need to create our own entry-point function and monitor future changes in zarr-python to keep it up to date). These libraries are also used externally, by maybe around a hundred users (very hard to estimate properly). While the majority of these users are just using a napari plugin and wouldn't really be affected (only in the sense that zarr v3 was breaking installations, but this we have already patched), there is some smaller fraction that use the libraries and may very well rely on the h5py compatibility. So changing the code on our end would not fully solve this issue, and we would need the wrapper class.
sorry for any confusion -- I was talking about the signature of `create_array`.

To answer your question about hurting users: it's generally undesirable to have two functions that do the same thing but with different signatures. This is confusing, and confusion is a cost we should minimize. Zarr users who are not working with hdf5 will get no value from the compatibility function. Of course there would also be benefits to the proposal -- unbreaking your code would be a great outcome, and I'm motivated to get us there.

To expand on your example, if we made all the h5py-compatible keyword arguments explicit (which we should), and we include zarr v2 keyword arguments, then we get this (not sure if all of these types are real atm):

```python
def create_dataset(
    self,
    data: ArrayLike | None = None,
    dtype: DtypeLike | None = None,
    shape: ShapeLike | None = None,
    compression: CodecLike | str | None = None,  # h5py
    compressor: str | None = None,
    compression_opts: dict[str, object] | Iterable[object] | None = None,  # h5py
    fillvalue: object | None = None,  # h5py
    fill_value: object | None = None,
    dimension_separator: str | None = None,
    # ... plus the remaining zarr 2.x create_dataset kwargs
):
```

Since hdf5 doesn't support sharding, and as our main goal in this issue is backwards compatibility, could we perhaps constrain this function to arrays without sharding?
h5py does support virtual datasets, which allow you to put parts of a chunked array in separate files. This is a very different construction paradigm, though.
Hi @d-v-b, thanks for the follow-up and your explanation of the downsides of keeping `create_dataset` around. For the specific points:
Regarding the documentation I could see two ways to minimize confusion: not document it at all, or explicitly state that this is a legacy function for compatibility with h5py / zarr v2. I would prefer the second option, together with an appropriate future warning.
I agree that making the h5py / zarr v2 keywords explicit would be best, and your signature seems to capture the relevant ones.
Yes, I think that is fair. I would then suggest a warning that points to `create_array`.
@constantinpape that all sounds good to me! So I propose we move forward with restoring the old, pre-3.0 behavior of `create_dataset`. @normanrz and @jhamman, does this seem like a good path to you?
Hi all. Apologies for missing the start of this conversation - I'm traveling this week. Let me quote from the v3 design doc / roadmap:
We, admittedly, did not get to this before 3.0.0 or fully flesh out the design in the roadmap doc - in part because we didn't hear much interest in this during the development of v3. This issue shows me that there is still some interest in maintaining this feature set somehow. So let's have the conversation now. @constantinpape and @mkitti - would a dedicated compatibility layer suffice for your needs? Can we document in more detail what parts of the h5py API you can't live without? |
Hi zarr developers,

I have started with the migration to zarr 3, but ran into issues with the changes related to `create_dataset` / `create_array`:

`create_array` does not accept `data` as an argument directly anymore, so the data now has to be written in a separate step after creating the array. I use this regularly in existing code, and also include it in many examples, since it is quite convenient and also supported by `h5py` (see also the next point).

`create_dataset` raises a deprecation warning that it will be removed with version 3.1. Why did you make this decision? For me (and I assume others), compatibility with `h5py` is quite important to enable seamless switching between the file formats for their respective advantages. Without it I will not be able to use `zarr-python` (or rather would need to write an additional wrapper class around it, which I would like to avoid).

The same considerations apply to `require_dataset` / `require_array`.