refactor v3 data types #2874
Conversation
copying a comment @nenb made in this zulip discussion:
A feature of the character code is that it provides a way to distinguish parametric types like …
Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make …
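For context, an illustration of the difference described above (these metadata fragments are sketches, not taken from this PR): in v2 the byte order is baked into the dtype string itself, while in v3 it is configured on the `bytes` codec.

```python
# Illustrative metadata fragments (not from this PR): in v2 the byte order
# lives in the dtype string, while in v3 it is configured on the bytes codec.
v2_metadata = {"dtype": ">i4"}  # big-endian int32 in zarr v2
v3_metadata = {
    "data_type": "int32",
    "codecs": [{"name": "bytes", "configuration": {"endian": "big"}}],
}
```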
Thanks for the summary! I have implemented the proposed solution.
thanks @rabernat! I have a final cleanup in the works where I simplify the …
Hi! I'm a NumPy maintainer who has worked a bit on the new user DType system we released in NumPy 2.0. @nenb let me know that this PR is a thing and it's getting merged soon.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to …

I wonder if instead of working on improving the …

This is all just me offering my personal opinion, not an opinion of anyone else involved with NumPy. Nick told me he's going on vacation for a while but he's interested in chatting more. I'm just commenting publicly because it's exciting to me 😃.
Thanks for the interest @ngoldbaum! I think it would be super interesting to evaluate Zarr as a "standard" serialization scheme for NumPy arrays.
This is something Zarr has to address for every dtype, but we operate under some constraints that make things simpler for us: all dtypes have to serialize to JSON, and as of Zarr V3 the structure of the JSON form of a dtype is well-defined (either a string or a dict with restricted keys). But I imagine NumPy might reasonably not want to commit to JSON, which could complicate the serialization story a bit.
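For readers unfamiliar with the v3 metadata format, a small sketch of the two JSON shapes a data type can take (the datetime example is illustrative):

```python
import json

# A sketch of the two JSON shapes a Zarr V3 data type can take: a bare string
# for simple dtypes, or an object with a restricted set of keys for dtypes
# that carry configuration. The datetime example is illustrative.
simple_dtype = "int32"
parametric_dtype = {"name": "numpy.datetime64", "configuration": {"unit": "s"}}

print(json.dumps(simple_dtype))      # "int32"
print(json.dumps(parametric_dtype))  # {"name": "numpy.datetime64", "configuration": {"unit": "s"}}
```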
I did a final pass of type safety stuff, and I made a substantial change to how zarr v2 data types are serialized to and from JSON.

As a reminder, Zarr python 2 supported multiple distinct array data types with the same …

I have formalized this by making …

Unfortunately, that piece of data doesn't look exactly like what you would see in …

This is an internal change that doesn't affect the basic design, so I don't think we need new reviews. Unless any new objections surface, I will merge this sometime tomorrow, and we can get started with the remaining tasks (namely, docs).
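As a hypothetical illustration of the kind of structure described above (the field names here are my own, not necessarily the ones this PR adds): the v2 dtype string alone can be ambiguous for object arrays, so the JSON form can pair it with the id of the object codec that disambiguates it.

```python
from typing import TypedDict

# A hypothetical sketch (field names are illustrative, not necessarily the ones
# used in this PR): the v2 dtype string alone is ambiguous for object arrays,
# so the JSON form pairs it with the id of the object codec that disambiguates it.
class DTypeSpecV2(TypedDict):
    name: str                    # the numpy dtype string, e.g. "|O"
    object_codec_id: str | None  # e.g. "vlen-utf8", or None for non-object dtypes

spec: DTypeSpecV2 = {"name": "|O", "object_codec_id": "vlen-utf8"}
```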
Nice work @d-v-b!
Huge props to @d-v-b for pulling this across the finish line. This was a huge and highly impactful lift. 🙌 🙌 🙌 🙌 For those who are following this thread, expect this to come out in Zarr 3.1. @ngoldbaum - thanks for all your work on NumPy. We'd love to open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is?
@d-v-b Not sure if it's a bug or not, but here is something interesting. Run this using, say, zarr 3.0.8:

```python
import zarr, numcodecs

z = zarr.open("foo.zarr", zarr_format=2)
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()], fill_value="")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=object)
```

Then after this PR:

```python
import zarr, numcodecs

z = zarr.open("foo.zarr")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())
```

i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.
NumPy 2.0 introduced a new data type specifically for variable-length strings. That's the …
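For reference, a minimal sketch (assuming NumPy >= 2.0) of the variable-length string dtype being referred to:

```python
import numpy as np

# A minimal sketch (assuming NumPy >= 2.0): the variable-length string dtype
# is exposed as np.dtypes.StringDType.
arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())
print(arr.dtype)  # StringDType()
```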
Agreed. The only reason it came up was that we were implicitly relying on the nullability of the object dtype for strings. So we'll transition away from that now, I think.
I think the numpy string dtype has nullability semantics; that might be worth investigating.
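A minimal sketch of what that could look like (assuming NumPy >= 2.0, where `StringDType` accepts an `na_object` sentinel):

```python
import numpy as np

# A minimal sketch (assuming NumPy >= 2.0): StringDType accepts an na_object
# sentinel, which is the nullability mechanism mentioned above.
dt = np.dtypes.StringDType(na_object=np.nan)
arr = np.array(["a", np.nan, "c"], dtype=dt)
print(np.isnan(arr))  # [False  True False]
```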
Was this intentionally merged into main, and not the 3.1.0 branch?
Yes, but we can revisit that decision. I was under the impression that we were going to merge the 3.1 PRs directly into main, but if we want to do a 3.0.9 release first, then we can revert this merge.
👍 - I'll delete the 3.1.0 branch then, and create a 3.0.x branch from just before this was merged.
Sorry for the delay on this. We can schedule a call, maybe? We have a biweekly NumPy community call you could drop into, or we could come to one of yours as well, but unfortunately I need to miss the first half hour of that. See the NumPy community calendar for meeting links. There's also a NumPy developer slack I can send you an invite to.
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 with the same data structures, confining the v2 / v3 differences to the places where they vary.
In `main`, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.
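To make the proposal concrete, here is a rough sketch of the kind of class-per-dtype design described above (class and method names are illustrative, not the ones this PR ultimately adds):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

# A rough sketch of the design described above (names are illustrative, not
# the actual classes added in this PR): each data type is its own class,
# parametrized where needed, and knows its own default value and how to
# convert its scalars to and from JSON.
class DataType(ABC):
    @abstractmethod
    def default_value(self) -> Any: ...

    @abstractmethod
    def to_json_value(self, value: Any) -> Any: ...

    @abstractmethod
    def from_json_value(self, data: Any) -> Any: ...


@dataclass(frozen=True)
class FixedLengthString(DataType):
    length: int  # the parameter a flat enum of dtype strings cannot express

    def default_value(self) -> str:
        return ""

    def to_json_value(self, value: str) -> str:
        return value

    def from_json_value(self, data: str) -> str:
        return data
```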