Issues with migration to zarr 3 #2689
Open · constantinpape opened this issue Jan 12, 2025 · 18 comments

@constantinpape commented Jan 12, 2025

Hi zarr developers,
I have started with the migration to zarr 3, but ran into issues with the changes related to create_dataset / create_array:

  1. The functions don't support passing data as an argument directly anymore. So instead of
create_dataset("some_name", data=data)

it now requires

ds = create_dataset("some_name", shape=data.shape, dtype=data.dtype)
ds[:] = data

I use this regularly in existing code, and also include it in many examples, since it is quite convenient and also supported by h5py (see also the next point).

  2. create_dataset raises a deprecation warning saying that it will be removed in version 3.1. Why did you make this decision? For me (and I assume others), compatibility with h5py is quite important to enable seamless switching between the file formats for their respective advantages. Without it I will not be able to use zarr-python (or rather I would need to write an additional wrapper class around it, which I would like to avoid).

The same considerations apply to require_dataset / require_array.

@d-v-b (Contributor) commented Jan 12, 2025

h5py API compatibility is not a design goal for zarr v3. There are a few reasons for this, but (imo) the biggest is that the differences between the zarr and hdf5 formats are large enough that maintaining API compatibility was a maintenance burden we didn't want to bear.

Even if it's not in zarr-python, clearly there are lots of applications where h5py / zarr compatibility is very useful. We should figure out the best way to arrange this -- maybe a stand-alone library that depends on zarr-python and h5py? I think this would be in-scope for the zarr-developers repo. @mkitti has done interesting work creating hdf5 files that are valid zarr v3 shards, maybe we could find a place for that code in this hypothetical zarr+h5py repo.

Speaking as the author of create_array, the reason it does not take a data kwarg is that the job of that function is only to create a zarr array, which is a distinct operation from creating a zarr array and then filling it with data. Accordingly, it takes dtype and shape kwargs; if it also took a data kwarg, we would need to check that the shape and dtype of data match the shape and dtype the user provided, and since shape and dtype are redundant when the user provides data, we would also need to allow them to be None, and so on.

Of course, the only reason people create zarr arrays is to eventually fill them with data, so we need an ergonomic path for this. I think from_array, which despite the PR description is moving toward supporting arbitrary array-like objects as input, will solve this problem.
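
Something like this, for illustration (hypothetical, since the PR hasn't landed and the exact name and signature may still change):

import numpy as np
import zarr

data = np.arange(100, dtype="uint16").reshape(10, 10)
# hypothetical end state: shape and dtype are inferred from `data`,
# and the data is written as part of creation
arr = zarr.from_array("example.zarr", data=data)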

@d-v-b (Contributor) commented Jan 12, 2025

see also this issue which is specifically about creating zarr arrays from other array-like objects.

@constantinpape (Author)

Hi @d-v-b and thanks for the feedback,

h5py api compatibility is not a design goal for zarr v3. there are a few reasons for this, but (imo) the biggest is that the differences between zarr and hdf5 formats are large enough that maintaining the API compatibility was a maintenance burden we didn't want to bear.

Just to clarify, we wouldn't need full API compatibility that maps all features (which indeed sounds difficult due to different concepts like sharding), but rather equivalence of basic functionality: reading data via __getitem__ (this should be fine in the current form) and creating arrays / datasets and writing data via create_dataset / __setitem__.
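
To make this concrete, the kind of format-agnostic code we rely on looks roughly like this (a simplified sketch, not our actual code):

import h5py
import zarr

def write_volume(group, name, volume, chunks=None):
    # `group` may be an h5py.Group or a zarr Group; both accept
    # create_dataset(name, data=...) under the compatibility we rely on
    return group.create_dataset(name, data=volume, chunks=chunks)

def read_volume(group, name):
    # __getitem__ behaves the same in both libraries
    return group[name][:]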

Without this, we cannot migrate as is, since a lot of our logic depends on this basic compatibility in order to read and write data from different formats with a unified interface. Adopting v3 would thus require one of the following:

  1. Having support for the minimal compatibility in zarr-python: keeping support for create_dataset and require_dataset and adding a data parameter (it would not be required in create/require_array). This would be a thin wrapper around create/require_array, with some logic to handle passing shape, dtype and data together, and otherwise passing kwargs along.
  2. Depending on a potential stand-alone library for compatibility. This sounds good in principle, but like much more effort than 1, both for initial development and eventual maintenance.
  3. Writing our own wrapper class to provide that compatibility. (I'd rather avoid this, because it likely requires inheriting from zarr.Group to provide the respective methods and then copying zarr.open to use that class instead of the original group implementation, but I would do it if there is no other option.)

So I would personally clearly prefer 1, and could also implement this in a PR. Otherwise I would see if you decide to implement 2 (but I don't have the resources to contribute to this myself, as it exceeds 1 in effort and complexity), and if this does not materialize, go with 3.

@d-v-b (Contributor) commented Jan 12, 2025

  1. Having support for the minimal compatibility in zarr-python: keeping support for create_dataset and require_dataset and adding a data parameter (it would not be required in create/require_array). This would be a thin wrapper around create/require_array, with some logic to handle passing shape, dtype and data together, and otherwise passing kwargs along.

Beyond changing the names of functions, we are also planning on deprecating the old h5py-compatible function signatures from zarr-python 2.x (namely, create and open will be going away). So I think even if we kept a function named create_dataset, its altered function signature would still require a layer of wrapping, and I don't know if that's in-scope for zarr-python. Given that h5py compatibility is not one of our design goals, if the wrapping is simple enough to not be a maintenance burden for us, then I think it would also be simple enough to exist outside of zarr-python, which would be my recommendation. I'd be happy working on a small grant for this, if you have time for it.

@jni (Contributor) commented Jan 13, 2025

I just want to add my perspective that I agree with @constantinpape that 1 would be the least effort, both long term and short term. It would be completely fine to segregate this functionality into its own namespace (zarr.convenience/zarr.compatibility/zarr.grumpy_old_fogeys), but having it in its own repo adds a ton of overhead for some simple functions — keeping up to date with changes in this repo (which clearly wants to move at pace), maintaining separate GHA configurations, separate PyPI and conda-forge updates... This does not seem worthwhile.

the job of that function is only to create a zarr array, which is a distinct operation from creating a zarr array and then filling it with data. Accordingly it takes dtype and shape kwargs; if it also took a data kwarg, we would need to check that the shape and dtype of data match the shape and dtype that the user provided, but also shape and dtype are redundant if the user provides data, so we would need to allow shape and dtype to be None, etc.

Handling data, dtype, and shape together is really not a huge burden (a sketch follows this list):

  • if dtype is given and different from data, use it to cast the data. I think this isn't even extra lines of code because assignment already covers this?
  • if shape is given and incompatible, raise an error — or just let the writing error out.
  • if all are None, error
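
Here is that sketch (a hypothetical helper, just to show how little code is involved):

import numpy as np

def _reconcile(data=None, shape=None, dtype=None):
    # hypothetical helper: reconcile data/shape/dtype the way h5py does
    if data is None:
        if shape is None or dtype is None:
            raise TypeError("provide either data, or both shape and dtype")
        return None, tuple(shape), np.dtype(dtype)
    data = np.asarray(data, dtype=dtype)  # a given dtype simply casts the data
    if shape is not None and tuple(shape) != data.shape:
        raise ValueError(f"shape {tuple(shape)} does not match data.shape {data.shape}")
    return data, data.shape, data.dtype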

I agree in principle that overloading functionality is often undesirable. But "practicality beats purity." And, even once from_array exists, since it does not exist in zarr2, it becomes a problem to support zarr2 and zarr3 from a single code base. You can claim that it's not a design goal to be compatible with zarr2, but I think that's overstating the case — it's not a goal to have full compatibility, but it's been a goal to be compatible with a frequently used subset (hence Group.__setitem__), and for good reason, too: it is very user hostile to break backwards compatibility all at once, and forces a huge splinter in the community — those that have the resources and will to move to zarr3, and those that are by necessity stuck to zarr2 — and those libraries become mutually exclusive, to everyone's detriment.

I understand that supporting the full 2.x API would be far too much effort given all the (positive!) changes in zarr3. However, if a small amount of effort allows most libraries built on zarr to work with both zarr versions, I think that is a huge win for the community — especially if the effort comes from the community itself as @constantinpape is offering. 😉

@tomwhite (Contributor)

The functions don't support passing data as an argument directly anymore.

This was fixed in #2638 and #2668, which are in the latest v3 release.

@constantinpape (Author)

Thanks for the replies and feedback everyone!

Given the fixes @tomwhite pointed out, the current version of zarr v3 would be compatible with our needs. This is somehow not part of the version I currently get from conda-forge though. I think it would help to get at least a patch release for this (i.e. 3.0.1), in order to enable proper dependency specification in conda-forge (but this is just a minor issue).

I 99% ( ;) ) agree with @jni's comments here. The only point I am not so sure about is whether a separate zarr.convenience would be a good option. Given that create/require_dataset are methods of Group, this implies inheriting from Group in this convenience submodule and then implementing a separate entry point besides open_group / create_group to get access to it. To me that sounds more convoluted, both for maintenance and usage, than keeping these methods in the original Group. And I don't quite see how this would slow down development and maintenance much, as it is a rather thin wrapper around create/require_array.

h5py-compatible function signatures from zarr-python 2.x (namely, create and open will be going away).

Just to understand this point @d-v-b: I am not quite sure how create and open are h5py-compatible. They are afaik not part of the h5py API, which uses h5py.File as the main entry point. So the current strategy for h5py compatibility (at least how we use it) is to check whether the given file path / URI is an hdf5 file or a zarr group (by checking the file extension, plus additional checks for remote data such as S3, which is only compatible with zarr) and then using h5py.File or zarr.open accordingly. If zarr.open/create is removed in favor of zarr.open_group/create_group this would not be a problem since we'd just need to replace these functions in the code that checks for file extensions etc. The big compatibility problem comes when the signature of Group changes too much (e.g. create/require_dataset are gone), since this is used in many more places and also by other libraries / code we don't have control over.
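
Roughly, that entry-point logic on our side looks like this (a simplified sketch; the extension list and remote-data checks are abbreviated):

import h5py
import zarr

def open_container(path, mode="a"):
    # dispatch on the file extension; remote zarr data (e.g. S3 URIs)
    # is caught by additional checks omitted here
    if path.endswith((".h5", ".hdf5")):
        return h5py.File(path, mode=mode)
    return zarr.open(path, mode=mode)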

And to be clear again, I am not demanding at all that zarr functionality be limited to strict adherence to the h5py signature / functionality. This is clearly not possible and would slow down zarr development a lot. The only thing I am advocating for is to keep the convenience functions around, with some defaults, to maintain a base level of compatibility.

@d-v-b (Contributor) commented Jan 13, 2025

If zarr.open/create is removed in favor of zarr.open_group/create_group this would not be a problem since we'd just need to replace these functions in the code that checks for file extensions etc. The big compatibility problem comes when the signature of Group changes too much (e.g. create/require_dataset are gone), since this is used in many more places and also by other libraries / code we don't have control over.

Thanks for this context, that gives us a pretty specific chunk of code to talk about. If the concrete recommendation is that we have a Group.create_dataset method, would you need it to have the same function signature as it did in zarr-python 2? And if so, how would you support zarr v3 with that signature?

@constantinpape (Author) commented Jan 13, 2025

would you need it to have the same function signature as it did in zarr-python 2?

I am not so familiar with the exact function signature of zarr v2; we used it with h5py interchangeably and never ran into any issues. Specifically, there are two main use cases we have:

  1. Create a non-initialized dataset / array
# Create a dataset with shape and dtype, optionally specify compression and chunks, which should have reasonable defaults.
group.create_dataset(name, shape=shape, dtype=dtype, [compression=compression], [chunks=chunks])
  2. Create an initialized dataset / array
# Create a dataset from numpy array (data), with optional compression and chunks.
group.create_dataset(name, data=data, [compression=compression], [chunks=chunks])

And if so, how would you support zarr v3 with that signature?

At least on our end, the only (very!) relevant zarr v3 feature is sharding. So I would expect that the function now has an optional parameter shards, that defaults to no sharding.

@d-v-b (Contributor) commented Jan 13, 2025

for reference, here is the current function signature for create_array:

def create_array(
    store: str | StoreLike,
    *,
    name: str | None = None,
    shape: ShapeLike,
    dtype: npt.DTypeLike,
    chunks: ChunkCoords | Literal["auto"] = "auto",
    shards: ShardsLike | None = None,
    filters: FiltersLike = "auto",
    compressors: CompressorsLike = "auto",
    serializer: SerializerLike = "auto",
    fill_value: Any | None = None,
    order: MemoryOrder | None = None,
    zarr_format: ZarrFormat | None = 3,
    attributes: dict[str, JSON] | None = None,
    chunk_key_encoding: ChunkKeyEncoding | ChunkKeyEncodingLike | None = None,
    dimension_names: Iterable[str] | None = None,
    storage_options: dict[str, Any] | None = None,
    overwrite: bool = False,
    config: ArrayConfig | ArrayConfigLike | None = None,
) -> Array:

There are a few parameters that would need to change to make this compatible with create_dataset in h5py (ignoring keyword arguments that only h5py or only zarr uses, like shards):

  • create_dataset uses compression, create_array uses compressors, and even if the names were the same (and I do prefer compression), the keyword arguments don't take the same types of values: create_dataset will take a string for compression, but create_array will not take a string. So a caller has to know whether they are creating zarr arrays or hdf5 arrays, which defeats the whole purpose.
  • create_dataset does not take a filters keyword argument; instead, individual filters are exposed as specific keyword arguments. By contrast, create_array has a filters keyword argument.
  • create_dataset takes fillvalue, but create_array takes fill_value.

I don't see how we can write a single function that smooths over these differences without hurting the experience for users who are only using zarr. Even fill_value vs fillvalue is quite problematic -- there's no way of supporting this that isn't a mild maintenance / documentation headache and a source of confusion for the majority of users who don't use hdf5.

Smoothing over the zarr v2 and zarr v3 differences was hard enough, and I don't think we are really done with that job. Abstracting over v2, v3, AND hdf5 with the same function(s) seems much more difficult. The simplest thing would be to write a function that takes all the keyword arguments for zarr's create_array, AND h5py's create_dataset, and makes sense of them, depending on what format is being targeted. I think this would solve your problem, but I don't see where that would fit in zarr-python.

@d-v-b (Contributor) commented Jan 13, 2025

We could also consider a separate module for this.

@jni (Contributor) commented Jan 14, 2025

I don't see how we can write a single function that smooths over these differences without hurting the experience for users who are only using zarr.

what? create_dataset is deprecated anyway, so having any kind of consistent interface with create_array is not relevant at all. Only compatibility with zarr v2 is relevant. We could (a) make the create_dataset signature match the old signature, translating between the arguments appropriately (I have full faith that you know how to convert a string to a list of v3 compressor specs), (b) change the warning from a deprecation warning to a pending deprecation warning or a future warning that points users to create_array as the preferred method unless you need h5py compatibility, and (c) postpone the removal of create_dataset ~indefinitely.

If you think create_dataset makes the object "ugly", it would be fine to hide it under getattr, not document it, whatever. For me the disadvantages of removing it just in terms of community good will far outweigh any advantages of removing it.
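
For (b), the change would be tiny, something like this (sketch; the message wording is just illustrative):

import warnings

warnings.warn(
    "create_dataset is kept for h5py / zarr v2 compatibility; "
    "new code should prefer create_array",
    PendingDeprecationWarning,
    stacklevel=2,
)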

@constantinpape (Author)

what? create_dataset is deprecated anyway, so having any kind of consistent interface with create_array is not relevant at all.

I fully agree with this. How does keeping create_dataset around, with a bit of logic to translate parts of the function signature, hurt zarr users? Again, we are not demanding any changes to create_array for the sake of compatibility.

What I suggest would then be something like the following. (I am sure not all the types are correct, so please don't take this too literally; it's just to illustrate the idea. We could also not expose any arguments at all and instead pop the relevant arguments from kwargs.)

def create_dataset(
    self,
    name: str,
    data: Optional[ArrayLike] = None,
    dtype: Optional[str | np.dtype] = None,
    shape: Optional[Tuple[int, ...]] = None,
    compression: Optional[str] = None,
    fillvalue: Optional[Number] = None,
    **kwargs,
):
    # Check that data, dtype and shape are consistent if all are passed.
    # Otherwise, ensure that either data or (dtype, shape) is passed.
    # NOTE: I am a bit confused whether you are planning to support passing 'data'
    # to 'create_array' in the future or not. If it's supported then this wouldn't
    # be needed. (But it's totally fine to add it only here.)
    _check_data_consistency(data, dtype, shape)

    # Translate the compression argument from str (zarr v2 / h5py compatible) to zarr v3.
    compressors = _translate_compression(compression)

    return create_array(name, data=data, shape=shape, dtype=dtype, fill_value=fillvalue, compressors=compressors, **kwargs)

This would address compatibility with minimal effort, except for the different treatment of filters in zarr v2 vs. zarr v3. (This is not part of h5py anyway, see https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset.) I think it's fine if compatibility for such an advanced feature is not supported; a very large fraction of use-cases will not need it.
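
For instance, the hypothetical _translate_compression above could be as small as this (assuming the codec classes exposed in zarr.codecs keep their current names):

from zarr.codecs import BloscCodec, GzipCodec, ZstdCodec

def _translate_compression(compression):
    # hypothetical mapping from h5py / zarr v2 style strings to zarr v3 codecs
    if compression is None:
        return "auto"  # let create_array pick its default
    table = {"gzip": GzipCodec(), "zstd": ZstdCodec(), "blosc": BloscCodec()}
    return [table[compression]]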

We could (a) make the create_dataset signature match the old signature, translating between the arguments appropriately (I have full faith that you know how to convert a string to a list of v3 compressor specs), (b) change the warning from a deprecation warning to a pending deprecation warning or a future warning that points users to create_array as the preferred method unless you need h5py compatibility, and (c) postpone the removal of create_dataset ~indefinitely.

I would lobby for a combination of (b) / (c). Raise some future warning that points to create_array for full support of zarr v3 features and postpone removal.

For me the disadvantages of removing it just in terms of community good will far outweigh any advantages of removing it.

Thanks for bringing this up. At least for us this removal would be quite a headache and would mean a major effort to keep supporting zarr-python. We have ca. 5 repositories that would be heavily affected by this. So we would either need to change a lot of code to determine whether it uses zarr or hdf5 and then use the specific library rather than our wrapper code, or write a zarr v3 wrapper class (which I would like to avoid, as I think we would need to create our own entry point function and monitor future changes in zarr-python to keep it up-to-date).

These libraries are also used externally, by maybe around a hundred users (very hard to estimate properly). While the majority of these users are just using a napari plugin and wouldn't really be affected (only in the sense that zarr v3 broke installations, which we have already patched), there is a smaller fraction that use the libraries directly and may very well rely on the h5py compatibility. So changing the code on our end would not fully solve this issue, and we would need the wrapper class.

@d-v-b (Contributor) commented Jan 14, 2025

I fully agree with this. How does keeping create_dataset around, with a bit of logic to translate parts of the function signature, hurt zarr users? Again, we are not demanding any changes to create_array for the sake of compatibility.

sorry for any confusion -- I was talking about the signature of create_array because our current implementation of Group.create_dataset essentially uses that signature (with the addition of a data kwarg), and you are right, we can definitely decouple the signature of create_dataset from create_array.

To answer your question about hurting users: it's generally undesirable to have two functions that do the same thing, but with different signatures. This is confusing, and confusion is a cost we should minimize. Zarr users who are not working with hdf5 will get no value from create_dataset, but it will take up space in the docs, and the source code, and be a maintenance burden. The Group class is valuable API real estate, so we should be deliberate about how we allocate it. Keeping a legacy create_dataset function there to benefit people who use a non-zarr format is thus also a cost. I'm not saying these things are decisive, just explaining what factors I'm considering here.

Of course there would also be benefits to the proposal -- unbreaking your code would be a great outcome, and I'm motivated to get us there.

To expand on your example, if we made all the h5py-compatible keyword arguments explicit (which we should) and include the zarr v2 keyword arguments, then we get this (not sure if all of these types are real atm):

def create_dataset(
    self,
    data: ArrayLike | None = None,
    dtype: DtypeLike | None = None,
    shape: ShapeLike | None = None,
    compression: CodecLike | str | None = None,   # h5py
    compressor: str | None = None,
    compression_opts: dict[str, object] | Iterable[object] | None = None,   # h5py
    fillvalue: object | None = None,   # h5py
    fill_value: object | None = None,
    dimension_separator: str | None = None,
    ...  # zarr 2.x create_dataset kwargs
):

Since hdf5 doesn't support sharding, and as our main goal in this issue is backwards compatibility, could we perhaps constrain create_dataset to zarr v2 arrays only? I think this would make the problem a bit simpler.
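
Concretely, that constraint could be a small guard at the top of the method (sketch; names and message are illustrative):

def create_dataset(self, name, **kwargs):
    # sketch: the compatibility method only ever creates zarr v2 arrays
    if kwargs.setdefault("zarr_format", 2) != 2:
        raise ValueError(
            "create_dataset only supports zarr_format=2; use create_array for zarr v3"
        )
    # ... delegate to the v2 creation logic here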

@mkitti commented Jan 14, 2025

Since hdf5 doesn't support sharding, and as our main goal in this issue is backwards compatibility, could we perhaps constrain create_dataset to zarr v2 arrays only? I think this would make the problem a bit simpler.

h5py does support virtual datasets, which allows you to put parts of a chunked array in separate files. This is a very different construction paradigm, though.

https://docs.h5py.org/en/stable/vds.html

@constantinpape (Author)

Hi @d-v-b,

thanks for the follow-up and your explanation of the downsides of keeping create_dataset around. I think, though, that the fast deprecation of a core API function is far more disruptive than slightly more bloated documentation and API. And I would assume that we are not the only users who would be quite affected by this.

For the specific points:

This is confusing, and confusion is a cost we should minimize

Regarding the documentation I could see two ways to minimize confusion: not document it at all, or explicitly state that this is a legacy function for compatibility with h5py / zarr v2. I would prefer the second option, together with an appropriate future warning.

To expand on your example, if we made all the h5py-compatible keyword arguments explicit (which we should) and include the zarr v2 keyword arguments, then we get this (not sure if all of these types are real atm):

I agree that making the h5py / zarr v2 keywords explicit would be best, and your signature seems to capture the relevant ones.

Since hdf5 doesn't support sharding, and as our main goal in this issue is backwards compatibility, could we perhaps constrain create_dataset to zarr v2 arrays only? I think this would make the problem a bit simpler.

Yes, I think that is fair. I would then suggest a warning that points to create_array for zarr v3 support / general future warning.

@d-v-b (Contributor) commented Jan 14, 2025

@constantinpape that all sounds good to me! So I propose we move forward with restoring the old, pre-3.0 behavior of Group.create_dataset, with clear documentation that the design is for compatibility with zarr-python 2.x and h5py.

@normanrz and @jhamman, does this seem like a good path to you?

@jhamman (Member) commented Jan 15, 2025

Hi all. Apologies for missing the start of this conversation - I'm traveling this week. Let me quote from the v3 design doc / roadmap:

zarr.h5compat.Group – Zarr-Python 2.* made an attempt to align its API with that of h5py. With 3.0, we will relax this alignment in favor of providing an explicit compatibility module (zarr.h5py_compat). This module will expose the Group and Dataset APIs that map to Zarr-Python’s Group and Array objects.

We, admittedly, did not get to this before 3.0.0 or fully flesh out the design in the roadmap doc - in part because we didn't hear much interest in this during the development of v3. This issue shows me that there is still some interest in maintaining this feature set somehow. So let's have the conversation now.

@constantinpape and @mkitti - would a dedicated compatibility layer suffice for your needs? Can we document in more detail what parts of the h5py API you can't live without?
