Deterministic chunk padding #2755

Open
wants to merge 16 commits into base: main

Conversation

Member

@brokkoli71 brokkoli71 commented Jan 23, 2025

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@brokkoli71 brokkoli71 marked this pull request as ready for review January 23, 2025 15:04
Contributor

@dstansby dstansby left a comment

A couple of suggestions (in particular, explaining in a bit more detail the user facing bug this fixes), otherwise looks good!

changes/2755.bugfix.rst (outdated, resolved)
tests/test_store/test_memory.py (outdated, resolved)
Contributor

@dstansby dstansby left a comment

Marking this as request changes, since it looks like it broke some tests - I had a look at the tests, and failure seems real (requesting a fill value of "", and getting 0 instead)

@normanrz normanrz enabled auto-merge (squash) January 24, 2025 13:44
@normanrz
Member

> Marking this as request changes, since it looks like it broke some tests - I had a look at the tests, and failure seems real (requesting a fill value of "", and getting 0 instead)

Fixed it.

Contributor

@dstansby dstansby left a comment

If this now means zarr.empty defaults to filling data with zeros instead of undefined data, this is a behaviour change from zarr-python 2: https://zarr.readthedocs.io/en/v2.18.4/api/creation.html#zarr.creation.empty - I think this is fine, but we should document that change in the changelog.

The docstring of zarr.creation.empty is now also incorrect, and needs updating:

The contents of an empty Zarr array are not defined. On attempting to retrieve data from an empty Zarr array, any values may be returned, and these are not guaranteed to be stable from one access to the next.

I also had a suggested improvement to the changelog above.

Co-authored-by: David Stansby <[email protected]>
auto-merge was automatically disabled January 29, 2025 10:38

Head branch was pushed to by a user without write access

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 29, 2025
@brokkoli71 brokkoli71 requested a review from dstansby January 29, 2025 11:31
Contributor

@dstansby dstansby left a comment

Thanks for updating the docstrings!

I had another read and think about this, and this is quite a major change happening - the data being returned/created by a function is changing. I think in this case it's okay, because it's only filled data that the user hasn't specified, and it fixes a bad bug.

It also seems like this change makes empty() do the same thing as zeros() (create an array filled with zeros). Is that correct, and do these two functions do the same thing now? If that's the case, should we deprecate and eventually remove zarr.empty()? That can happen in a follow-up PR to keep this tightly scoped to the bug fix, but it's a consequence of this PR we should make sure we're happy with.

Because of the above issues I'm not 100% comfortable being the only reviewer here - I'd like someone else in @zarr-developers/python-core-devs to review and approve this (and explicitly say that they're okay with the data changing/potentially making empty() redundant).


Notes
-----
The contents of an empty Zarr array are not defined. On attempting to
Member

A bit of a random thought, but this sentence could likely stay, in the sense that "not defined" can also mean "with fill_value OR 0", if we are concerned that this isn't a constraint we want to impose in the future.

@normanrz
Member

Both zarr.empty and zarr.zeros should be deprecated in favor of zarr.create_array.

The old docs might have said that zarr.empty returns uninitialized data, but I don't think that was the case in the code. Here, entire chunks are filled with fill_value even if outside the to-be-set selection: https://github.com/zarr-developers/zarr-python/pull/2755/files#diff-44efa3ae220ba9737fa8c5443b3c3af09a7c0f9549d5c267404f7fa3ce467318
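
For illustration only, a minimal NumPy sketch of what filling a whole chunk with fill_value before writing a partial selection could look like. `pad_edge_chunk` and its parameters are hypothetical names made up for this example. This is not the code in the linked diff, just an assumption about the general idea:

```python
import numpy as np


def pad_edge_chunk(data, chunk_shape, fill_value=0):
    """Embed `data` in a full-size chunk with a deterministic padding region.

    Hypothetical helper for illustration; not part of zarr-python.
    """
    # Allocate the whole chunk with a known value instead of np.empty,
    # so the region outside `data` never holds leftover memory contents.
    chunk = np.full(chunk_shape, fill_value, dtype=data.dtype)
    # Copy the real data into the leading corner of the chunk.
    region = tuple(slice(0, n) for n in data.shape)
    chunk[region] = data
    return chunk


# A 2x3 edge region stored in a 4x4 chunk.
edge = np.arange(6, dtype="int32").reshape(2, 3)
a = pad_edge_chunk(edge, (4, 4))
b = pad_edge_chunk(edge, (4, 4))

# Two independent writes of the same data give byte-identical chunks.
assert a.tobytes() == b.tobytes()
```

With `np.empty` in place of `np.full`, the padded region would hold whatever happened to be in memory, so the final byte comparison would not be guaranteed to hold.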

@d-v-b
Contributor

d-v-b commented Jan 30, 2025

I had a quick look through the v2 code and it looks like v2 was padding with 0s to ensure consistency and compressibility:

zarr-python/zarr/core.py

Lines 2309 to 2313 in 66e2982

# N.B., use zeros here so any region beyond the array has consistent
# and compressible data
chunk = np.zeros_like(
    self._meta_array, shape=self._chunks, dtype=self._dtype, order=self._order
)
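
A rough, standard-library-only sketch of why the "consistent and compressible" comment matters. The chunk size, the payload, and the use of `os.urandom` as a stand-in for uninitialized memory are all assumptions chosen for this example, not anything taken from zarr-python:

```python
import os
import zlib

CHUNK_BYTES = 4096                 # assumed chunk size for the example
payload = bytes(range(256)) * 4    # 1024 bytes of "real" data
pad_len = CHUNK_BYTES - len(payload)

# Padding with zeros: identical bytes on every write.
zero_padded = payload + bytes(pad_len)
# Stand-in for uninitialized memory: arbitrary bytes that differ per write.
junk_padded = payload + os.urandom(pad_len)

zero_size = len(zlib.compress(zero_padded))
junk_size = len(zlib.compress(junk_padded))

# Zero padding compresses far better than arbitrary padding.
assert zero_size < junk_size

# Re-encoding the same logical data yields the same compressed bytes.
assert zlib.compress(zero_padded) == zlib.compress(payload + bytes(pad_len))
```

The second assertion is the determinism property the linked issue is about: with arbitrary padding, two writes of the same data would generally produce different stored bytes.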

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 1, 2025
Development

Successfully merging this pull request may close these issues.

Chunk padding is non-deterministic with zarr_format=2
6 participants