refactor v3 data types #2874

Merged 164 commits on Jun 16, 2025

d-v-b
Contributor

@d-v-b d-v-b commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 with the same data structures, and confine the v2 / v3 differences to the places where they vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
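To make the idea concrete, here is a minimal sketch of what "one class per data type" could look like. It is not the code in this PR: the class names, the `name` strings, and the method names (`default_scalar`, `to_json_scalar`, `from_json_scalar`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class Int32:
    """Non-parametric type: one instance describes every int32 array."""

    name: ClassVar[str] = "int32"

    def default_scalar(self) -> int:
        return 0

    def to_json_scalar(self, value: int) -> int:
        return int(value)


@dataclass(frozen=True)
class FixedLengthString:
    """Parametric type: the length lives on the instance, not in an enum."""

    length: int
    name: ClassVar[str] = "fixed_length_string"  # hypothetical identifier

    def default_scalar(self) -> str:
        return ""

    def to_json_scalar(self, value: str) -> str:
        if len(value) > self.length:
            raise ValueError(f"{value!r} is longer than {self.length} characters")
        return value

    def from_json_scalar(self, data: object) -> str:
        return str(data)


# Two parametrizations of the same class, with no new enum variant required.
short = FixedLengthString(length=3)
longer = FixedLengthString(length=10)
```

Each class carries its own default value and its own scalar-to-JSON rules, which is the per-type variability an enum of strings cannot express.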

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 28, 2025
@d-v-b
Contributor Author

d-v-b commented Feb 28, 2025

copying a comment @nenb made in this zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc., as instances of the same class. If instead we bind a concrete numpy dtype as a class attribute, we need a separate class for U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
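A small sketch of that distinction, assuming a hypothetical wrapper class `UnicodeString` with a hypothetical `to_dtype_str` method (neither name comes from this PR). The class is bound to the character code, and the length is supplied per instance:

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class UnicodeString:
    """Hypothetical wrapper for the parametric 'U' family of dtypes."""

    length: int
    char_code: ClassVar[str] = "U"  # shared by every instance of the class

    def to_dtype_str(self) -> str:
        # Produces strings such as "U2" or "U3", which numpy.dtype accepts.
        return f"{self.char_code}{self.length}"


u2 = UnicodeString(length=2)
u3 = UnicodeString(length=3)
```

Here `u2` and `u3` are two instances of a single class. Binding a concrete dtype such as U3 to the class itself would instead require one class per length.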

@nenb

nenb commented Mar 3, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but won't be part of the dtype on disk (as required by the spec).
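The proposal can be sketched roughly as follows. This is an illustration under assumptions, not the implementation: the class `Int16` and the methods `to_json` and `to_dtype_str` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Literal

Endianness = Literal["little", "big"]


@dataclass(frozen=True)
class Int16:
    """Hypothetical dtype wrapper that tracks byte order in memory only."""

    endianness: Endianness = "little"

    def to_json(self) -> str:
        # Byte order is deliberately omitted from the serialized form;
        # in v3 it is the codecs that record it.
        return "int16"

    def to_dtype_str(self) -> str:
        # The in-memory (NumPy-style) dtype string does carry the byte order.
        prefix = "<" if self.endianness == "little" else ">"
        return f"{prefix}i2"


little = Int16("little")
big = Int16("big")
```

Both instances serialize to the same on-disk data type, while the in-memory representation still differs by byte order.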

@d-v-b
Contributor Author

d-v-b commented Mar 4, 2025

Thanks for the summary! I have implemented the proposed solution.

@d-v-b d-v-b mentioned this pull request Mar 5, 2025
6 tasks
@d-v-b
Contributor Author

d-v-b commented Jun 11, 2025

thanks @rabernat! I have a final cleanup in the works where I simplify the ZDType base class to make the from_* classmethods more type safe, and after that I will update and hit the merge button.

@ngoldbaum

Hi! I'm a NumPy maintainer who has worked a bit on the new user DType system we released in NumPy 2.0.

@nenb let me know that this PR is a thing and it's getting merged soon.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to npy files. That's still an open question.

I wonder if instead of working on improving the npy file format - which is simple on purpose and beholden to not breaking user parser implementations - we can just point people at zarr instead. Maybe even think about adding zarr-python as an optional NumPy dependency to handle more complicated serialization than what npy can handle.

This is all just me offering my personal opinion, not an opinion of anyone else involved with NumPy. Nick told me he's going on vacation for a while but he's interested in chatting more. I'm just commenting publicly because it's exciting to me 😃.

@d-v-b
Contributor Author

d-v-b commented Jun 15, 2025

Thanks for the interest @ngoldbaum! I think it would be super interesting to evaluate Zarr as a "standard" serialization scheme for NumPy arrays.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to npy files. That's still an open question.

This is something Zarr has to address for every dtype, but we operate under some constraints that make things simpler for us: all dtypes have to serialize to JSON, and as of Zarr V3 the structure of the JSON form of a dtype is well-defined (either a string or a dict with restricted keys). But I imagine NumPy might reasonably not want to commit to JSON, which could complicate the serialization story a bit.
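As a rough illustration of the "string or dict with restricted keys" constraint, the check below accepts either form. The specific key set (`name` and `configuration`) is my assumption about the restricted keys, and `is_valid_dtype_json` is a hypothetical helper, not a zarr-python function.

```python
from typing import Any

# Assumption: the dict form may only use these keys.
ALLOWED_KEYS = {"name", "configuration"}


def is_valid_dtype_json(data: Any) -> bool:
    """Return True if `data` has the shape of a serialized v3 data type."""
    if isinstance(data, str):
        return True
    if isinstance(data, dict):
        has_name = isinstance(data.get("name"), str)
        only_allowed = set(data) <= ALLOWED_KEYS
        return has_name and only_allowed
    return False


# Hypothetical examples of each form.
simple = "int32"
parametric = {"name": "example_datetime", "configuration": {"unit": "s"}}
```

A fixed shape like this is what lets every dtype, including parametric ones, round-trip through array metadata.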

@d-v-b
Contributor Author

d-v-b commented Jun 15, 2025

I did a final pass of type safety stuff, and I made a substantial change to how zarr v2 data types are serialized to and from JSON.

As a reminder, Zarr-Python 2 supported multiple distinct array data types that share the same dtype: "|O" field in array metadata. When creating an array with dtype "|O", you had to specify the "object codec" (a codec that could serialize arbitrary Python objects) explicitly. Thus, the object codec is effectively part of the Zarr V2 dtype model.

I have formalized this by making ZDType.to_json(zarr_format=2) return a dict with the form {"name": <thing you would see in .zarray under dtype>, "object_codec_id": <name of the object codec> | None}, and similarly ZDType.from_json(data, zarr_format=2) expects a dict with the same form. This keeps the ZDType API simple while capturing the essential structure of a Zarr v2 data type in a single piece of data.
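The shape of that dict can be sketched with plain functions. The helpers `to_json_v2` and `from_json_v2` and the type `DTypeJSONV2` are hypothetical stand-ins for the `ZDType` methods described above, not their actual implementation. The codec id `"vlen-utf8"` is the one numcodecs uses for `VLenUTF8`.

```python
from typing import Optional, TypedDict


class DTypeJSONV2(TypedDict):
    # What you would see in .zarray under "dtype". Kept as a string here
    # to keep the sketch short.
    name: str
    # Identifier of the object codec, or None when no object codec applies.
    object_codec_id: Optional[str]


def to_json_v2(name: str, object_codec_id: Optional[str] = None) -> DTypeJSONV2:
    return {"name": name, "object_codec_id": object_codec_id}


def from_json_v2(data: dict) -> DTypeJSONV2:
    if set(data) != {"name", "object_codec_id"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    return {"name": data["name"], "object_codec_id": data["object_codec_id"]}


vlen_string = to_json_v2("|O", "vlen-utf8")
int32 = to_json_v2("<i4")
```

For most types, such as `int32` here, `object_codec_id` is simply `None`. Only the `"|O"` family needs it to tell the underlying data types apart.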

Unfortunately, that piece of data doesn't look exactly like what you would see in .zarray, and for most data types the object_codec_id field is None and conveys very little information. If this feels clunky, that's an honest reflection of the quite clunky data type model used by Zarr V2 :).

This is an internal change that doesn't affect the basic design, so I don't think we need new reviews. Unless any new objections surface, I will merge this sometime tomorrow, and we can get started with the remaining tasks (namely, docs).

@d-v-b d-v-b merged commit 6798466 into zarr-developers:main Jun 16, 2025
30 checks passed
@TomAugspurger
Contributor

Nice work @d-v-b!

@jhamman
Member

jhamman commented Jun 16, 2025

Huge props to @d-v-b for pulling this across the finish line. This was a huge and highly-impactful lift. 🙌 🙌 🙌 🙌

For those that are following this thread, expect this to come out in Zarr 3.1.


@ngoldbaum - thanks for all your work on NumPy. We'd love to open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?

@ilan-gold
Contributor

@d-v-b Not sure if it's a bug or not, but here is something interesting.

Run this using, say, zarr 3.0.8:

import zarr, numcodecs

z = zarr.open("foo.zarr", zarr_format=2)
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()], fill_value="")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=object)

Then after this PR:

import zarr, numcodecs

z = zarr.open("foo.zarr")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())

i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.

@d-v-b
Contributor Author

d-v-b commented Jun 16, 2025

NumPy 2.0 introduced a new data type specifically for variable-length strings. That's the StringDType() you are seeing in the NumPy array. I suspect that prior to my PR, zarr-python was not using StringDType when reading variable-length strings in Zarr v2. But it should have been, IMO, so I think the current behavior is an improvement.

@ilan-gold
Contributor

Agreed. The only reason it came up was that we were implicitly relying on the nullability of the object dtype for strings, so I think we'll now transition away from that.

@d-v-b
Contributor Author

d-v-b commented Jun 16, 2025

I think the NumPy string dtype has nullability semantics; that might be worth investigating.

@dstansby
Contributor

Was this intentionally merged into main, and not the 3.1.0 branch?

@d-v-b
Contributor Author

d-v-b commented Jun 17, 2025

Was this intentionally merged into main, and not the 3.1.0 branch?

Yes, but we can revisit that decision. I was under the impression that we were going to merge the 3.1 PRs directly into main, but if we want to do a 3.0.9 release first, then we can revert this merge.

@dstansby
Contributor

👍 - I'll delete the 3.1.0 branch then, and create a 3.0.x branch from before when this was merged.

@ngoldbaum

thanks for all your work on NumPy. We'd love to open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?

Sorry for the delay on this. We can schedule a call, maybe?

We have a biweekly NumPy community call you could drop into (or we could come to one of yours as well), but unfortunately I need to miss the first half hour of that. See the NumPy community calendar for meeting links.

There's also a NumPy developer slack I can send you an invite to.
