refactor v3 data types #2874

d-v-b · 2025-02-28T11:43:49Z

As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 with the same data structures, and confine the v2 / v3 differences to the places where they vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
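As a rough illustration of the idea above, here is a minimal sketch of one class per data type, where parameters such as a string length live on the instance. All class and method names here (`Int32Sketch`, `FixedLengthStringSketch`, `default_scalar`, `to_json_scalar`, `from_json_scalar`) are hypothetical and are not taken from this PR.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class Int32Sketch:
    """A non-parametric data type: it needs no instance fields."""

    name: ClassVar[str] = "int32"  # hypothetical identifier

    def default_scalar(self) -> int:
        # Each data type class chooses its own default value.
        return 0

    def to_json_scalar(self, value: int) -> int:
        return int(value)

    def from_json_scalar(self, data: int) -> int:
        return int(data)


@dataclass(frozen=True)
class FixedLengthStringSketch:
    """A parametric data type: the length is instance state, not part of the class."""

    length: int
    name: ClassVar[str] = "fixed_length_string"  # hypothetical identifier

    def default_scalar(self) -> str:
        return ""

    def to_json_scalar(self, value: str) -> str:
        # Reject scalars that do not fit in the declared length.
        if len(value) > self.length:
            raise ValueError(f"{value!r} is longer than {self.length} characters")
        return value

    def from_json_scalar(self, data: str) -> str:
        return str(data)[: self.length]
```

With an enum, every length would need its own variant; here `FixedLengthStringSketch(3)` and `FixedLengthStringSketch(10)` are two instances of one class.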

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

d-v-b · 2025-02-28T13:23:18Z

copying a comment @nenb made in this zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc., as instances of the same class. If instead we bind a concrete numpy dtype as a class attribute, we need a separate class for U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
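To make the distinction concrete, here is a minimal sketch in which the character code is bound to the class and the length is bound to the instance. The class `UnicodeSketch` and its methods are hypothetical; the only NumPy fact assumed is that strings such as "U3" are valid inputs to `np.dtype`.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class UnicodeSketch:
    # The character code identifies the parametric family (U*).
    char_code: ClassVar[str] = "U"
    # The length selects one member of that family (U2, U3, ...).
    length: int = 1

    def dtype_str(self) -> str:
        # Build a string such as "U3" that np.dtype() would accept.
        return f"{self.char_code}{self.length}"

    @classmethod
    def from_dtype_str(cls, dtype_str: str) -> "UnicodeSketch":
        # Only handles strings with an explicit length, e.g. "U3".
        if not dtype_str.startswith(cls.char_code):
            raise ValueError(f"{dtype_str!r} does not start with {cls.char_code!r}")
        return cls(length=int(dtype_str[len(cls.char_code):]))


u2 = UnicodeSketch(length=2)
u3 = UnicodeSketch.from_dtype_str("U3")
```

Both `u2` and `u3` are instances of the same class, which is the property described above.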

nenb · 2025-03-03T13:56:00Z

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but won't be part of the dtype on disk (as required by the spec).
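A minimal sketch of the proposed solution, assuming a hypothetical `Int16Sketch` class: byte order is stored on the instance and used when building an in-memory dtype string, but it is left out of the serialized form.

```python
from dataclasses import dataclass
from typing import Literal

Endianness = Literal["little", "big"]


@dataclass(frozen=True)
class Int16Sketch:
    # In-memory implementation detail; never written to array metadata.
    endianness: Endianness = "little"

    def to_json(self) -> str:
        # The on-disk identifier deliberately omits the byte order.
        return "int16"

    def native_dtype_str(self) -> str:
        # NumPy-style dtype string: "<" for little-endian, ">" for big-endian.
        prefix = "<" if self.endianness == "little" else ">"
        return f"{prefix}i2"


little = Int16Sketch("little")
big = Int16Sketch("big")
```

The two instances differ in memory (`"<i2"` versus `">i2"`) but serialize identically, so the choice of byte order on disk stays with the codecs.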

d-v-b · 2025-03-04T11:27:15Z

Thanks for the summary! I have implemented the proposed solution.

d-v-b · 2025-06-11T16:59:13Z

thanks @rabernat! I have a final cleanup in the works where I simplify the ZDType base class to make the from_* classmethods more type safe, and after that I will update and hit the merge button.

ngoldbaum · 2025-06-13T14:31:04Z

Hi! I'm a NumPy maintainer who has worked a bit on the new user DType system we released in NumPy 2.0.

@nenb let me know that this PR is a thing and it's getting merged soon.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to npy files. That's still an open question.

I wonder if instead of working on improving the npy file format - which is simple on purpose and beholden to not breaking user parser implementations - we can just point people at zarr instead. Maybe even think about adding zarr-python as an optional NumPy dependency to handle more complicated serialization than what npy can handle.

This is all just me offering my personal opinion, not an opinion of anyone else involved with NumPy. Nick told me he's going on vacation for a while but he's interested in chatting more. I'm just commenting publicly because it's exciting to me 😃.

d-v-b · 2025-06-15T21:28:23Z

Thanks for the interest @ngoldbaum! I think it would be super interesting to evaluate Zarr as a "standard" serialization scheme for NumPy arrays.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to npy files. That's still an open question.

This is something Zarr has to address for every dtype, but we operate under some constraints that make things simpler for us: all dtypes have to serialize to JSON, and as of Zarr V3 the structure of the JSON form of a dtype is well-defined (either a string or a dict with restricted keys). But I imagine NumPy might reasonably not want to commit to JSON, which could complicate the serialization story a bit.
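The "string or dict with restricted keys" shape can be checked with a small guard function. This is a sketch only: the function name `is_dtype_json` is hypothetical, and the allowed keys (`name` and `configuration`) are an assumption about the dict form rather than something stated in this thread.

```python
# Assumed set of permitted keys for the dict form of a data type.
ALLOWED_KEYS = {"name", "configuration"}


def is_dtype_json(data: object) -> bool:
    """Return True if data looks like the JSON form of a data type."""
    if isinstance(data, str):
        # Simple data types are identified by a plain string.
        return True
    if isinstance(data, dict):
        return (
            isinstance(data.get("name"), str)
            and set(data) <= ALLOWED_KEYS
            and isinstance(data.get("configuration", {}), dict)
        )
    return False
```

A guard like this lets each data type class reject malformed metadata before attempting to parse it.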

d-v-b · 2025-06-15T21:36:31Z

I did a final pass of type safety stuff, and I made a substantial change to how zarr v2 data types are serialized to and from JSON.

As a reminder, Zarr python 2 supported multiple distinct array data types with the same dtype: "|O" field in array metadata. When creating an array with dtype "|O", you had to specify the "object codec" (a codec that could serialize arbitrary python objects) explicitly. Thus, the object codec is effectively part of the Zarr V2 dtype model.

I have formalized this by making ZDType.to_json(zarr_format=2) return a dict with the form {"name": <thing you would see in .zarray under dtype>, "object_codec_id": <name of the object codec> | None}, and similarly ZDType.from_json(data, zarr_format=2) expects a dict with the same form. This keeps the ZDType API simple while capturing the essential structure of a Zarr v2 data type in a single piece of data.
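The dict form described above can be sketched with a `TypedDict`. The helper names below are hypothetical, `name` is typed as a plain string for simplicity, and the codec id "vlen-utf8" in the usage line is an illustrative value.

```python
from typing import Optional, TypedDict


class DTypeSpecV2Sketch(TypedDict):
    # What you would see in .zarray under "dtype" (a string in this sketch).
    name: str
    # Name of the object codec, or None when the dtype does not need one.
    object_codec_id: Optional[str]


def v2_dtype_to_json(name: str, object_codec_id: Optional[str] = None) -> DTypeSpecV2Sketch:
    return {"name": name, "object_codec_id": object_codec_id}


def v2_dtype_from_json(data: DTypeSpecV2Sketch) -> tuple[str, Optional[str]]:
    # Require exactly the two documented keys.
    if set(data) != {"name", "object_codec_id"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    return data["name"], data["object_codec_id"]


plain = v2_dtype_to_json("<i4")
variable_length_string = v2_dtype_to_json("|O", object_codec_id="vlen-utf8")
```

Packing both values into one dict means the v2 and v3 serialization methods can share a signature, at the cost of a mostly empty `object_codec_id` field.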

Unfortunately, that piece of data doesn't look exactly like what you would see in .zarray, and for most data types the object_codec_id field is None and conveys very little information. If this feels clunky, that's an honest reflection of the quite clunky data type model used by Zarr V2 :).

This is an internal change that doesn't affect the basic design, so I don't think we need new reviews. Unless any new objections surface, I will merge this sometime tomorrow, and we can get started with the remaining tasks (namely, docs).

TomAugspurger · 2025-06-16T11:22:34Z

Nice work @d-v-b!

jhamman · 2025-06-16T15:17:57Z

Huge props to @d-v-b for pulling this across the finish line. This was a huge and highly-impactful lift. 🙌 🙌 🙌 🙌

For those that are following this thread, expect this to come out in Zarr 3.1.

@ngoldbaum - thanks for all your work on NumPy. We'd love to open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?

ilan-gold · 2025-06-16T17:31:02Z

@d-v-b Not sure if it's a bug or not, but here is something interesting.

Run this using, say, zarr 3.0.8:

import zarr, numcodecs

z = zarr.open("foo.zarr", zarr_format=2)
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()], fill_value="")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=object)

Then after this PR:

import zarr, numcodecs

z = zarr.open("foo.zarr")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())

i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.

d-v-b · 2025-06-16T17:36:41Z

NumPy 2.0 introduced a new data type specifically for variable-length strings. That's the StringDType() you are seeing in the NumPy array. I suspect that prior to my PR, zarr-python was not using StringDType when reading variable-length strings in zarr v2. But it should have been, IMO, so I think the current behavior is an improvement.

ilan-gold · 2025-06-16T17:51:52Z

Agreed. The only reason it came up was that we were implicitly relying on the nullability of the object dtype for strings. So we'll transition away from that now, I think.

d-v-b · 2025-06-16T18:01:39Z

I think the NumPy string dtype has nullability semantics; that might be worth investigating.

dstansby · 2025-06-17T07:33:49Z

Was this intentionally merged into main, and not the 3.1.0 branch?

d-v-b · 2025-06-17T07:40:38Z

Yes, but we can revisit that decision. I was under the impression that we were going to merge the 3.1 prs directly into main, but if we want to do a 3.0.9 release first, then we can revert this merge.

dstansby · 2025-06-17T07:59:21Z

👍 - I'll delete the 3.1.0 branch then, and create a 3.0.x branch from before when this was merged.

ngoldbaum · 2025-06-25T18:19:43Z

thanks for all your work on NumPy. We'd love to open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?

Sorry for the delay on this. We can schedule a call, maybe?

We have a biweekly NumPy community call you could drop into or we come to as well but unfortunately I need to miss the first half hour of that. See the NumPy community calendar for meeting links.

There's also a NumPy developer slack I can send you an invite to.

d-v-b added 9 commits February 21, 2025 13:43

modernize typing

f5e3f78

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

b4e71e2

…into feat/fixed-length-strings

lint

3c50f54

new dtypes

d74e7a4

rename base dtype, change type to kind

5000dcb

start working on JSON serialization

9cd5c51

get json de/serialization largely working, and start making tests pass

042fac1

tweak json type guards

556e390

fix dtype sizes, adjust fill value parsing in from_dict, fix tests

b588f70

github-actions bot added the needs release notes label Feb 28, 2025

d-v-b added 2 commits March 2, 2025 12:54

mid-refactor commit

4ed41c6

working form for dtype classes

1b2c773

d-v-b added 3 commits March 2, 2025 21:55

remove unused code

24930b3

use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…

703e0e1

…base

push into v2

3c232a4

remove endianness kwarg to methods, make it an instance variable instead

b7fe986

d-v-b mentioned this pull request Mar 4, 2025

support for datetime and timedelta dtypes (#2616) #2884

Draft

6 tasks

d-v-b added 4 commits March 4, 2025 18:10

make wrapping safe by default

d9b44b4

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

bf24d69

…at/fixed-length-strings

dtype-specific tests

c1a8566

more tests, fix void type default value logic

2868994

d-v-b mentioned this pull request Mar 5, 2025

Fix fill_value serialization issues #2802

Merged

6 tasks

fix dtype mechanics in bytescodec

9ab0b1e

sjperkins mentioned this pull request Jun 12, 2025

Zarr Python v3 and Zarr v3 casangi/xradio#355

Open

d-v-b added 7 commits June 13, 2025 18:53

refactor wrapper to allow subclasses to freely define their own type …

b069d36

…guards for native dtype and json input

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

ae36dbf

…into feat/fixed-length-strings

Merge branch 'feat/fixed-length-strings' of https://github.com/d-v-b/…

a1f2c94

…zarr-python into feat/fixed-length-strings

make method definition order consistent

b2e56c8

allow structured scalars to be np.void

d26b695

use a common function signature for from_json by packing the object_c…

49f0062

…odec_id in a typeddict for zarr v2 metadata

fix dtype doc example

70da4da

Merge branch 'main' into feat/fixed-length-strings

16b4ac6

d-v-b merged commit 6798466 into zarr-developers:main Jun 16, 2025
30 checks passed

norlandrhagen mentioned this pull request Jun 16, 2025

Refactor codebase to support a new simplified Parser->ManifestStore model. zarr-developers/VirtualiZarr#601

Merged

7 tasks

ilan-gold mentioned this pull request Jun 17, 2025

(feat): new zarr dtypes scverse/anndata#1995

Draft

3 tasks

TomNicholas mentioned this pull request Jun 17, 2025

Zarr data types refactor compatibility zarr-developers/VirtualiZarr#618

Open

7 tasks

ngoldbaum mentioned this pull request Jun 18, 2025

ENH: Add a public API for generating hashable buffers numpy/numpy#29229

Open

ilan-gold mentioned this pull request Jun 23, 2025

(chore): handle new zarr dtype API zarrs/zarrs-python#100

Draft

d-v-b mentioned this pull request Jun 26, 2025

serializing not a time (NaT) to metadata #3028

Closed
