-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bitinfo codec #503
base: main
Are you sure you want to change the base?
Add bitinfo codec #503
Conversation
numcodecs/bitround.py
Outdated
The number of bits of the mantissa to keep. The range allowed | ||
depends on the dtype input data. If keepbits is | ||
equal to the maximum allowed for the data type, this is equivalent | ||
to no transform. | ||
to no transform. Alternatively, pass a function to determine the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an example of where such a bitinformation function may be found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Namely, xbitinfo, which was the impetus for bitround
codec. More recently, that group has advocated for computing bitinformation chunk by chunk, using something like
fn = 'air.zarr'
ds.to_compressed_zar(fn, compute=False, mode='w')
dims = ds.air.dims
len_dims = len(dims)
slices = slices_from_chunks(ds.air.chunks)
for b, block in enumerate(ds.air.data.to_delayed().ravel()):
ds_block = xr.Dataset({'air':(dims, block.compute())})
rounded_ds = bitrounding(ds_block)
rounded_ds.to_zarr(fn, region={dims[d]:s for (d,s) in enumerate(slices[b])})
Modifying the codec could make chunk-wise bitrounding workflows much simpler, and my proposed change would be the minimal first step. The passed function only applies to the encoder, so it shouldn't introduce dependencies or vulnerabilities for the decoder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was suggesting information like this should go into the docstring, so that users can act on it
Thanks, @martindurant, I'll work on example, but as I dig into this, I realize that we can't pass a function in this way: https://github.com/zarr-developers/zarr-python/blob/main/zarr/util.py#L56-64 TypeError: Object of type function is not JSON serializable I'd hoped this would work without modifying |
Ah, the default serialisation of the codecs, as dumped into the zarr metadata, looks at the |
Those are both great suggestions. I'll advance one more: add a stripped down xbitinfo to the |
A heads up that depending on custom code will impact the portability of the codec across language implementations. |
observingClouds/xbitinfo/issues/257: their preference is my original suggestion. Potentially yielding something like: from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]
encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding) |
@martindurant, I tried your suggestion about prefixing, but I'm getting the general impression this approach won't work. from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]
encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding) the object created by If that impression is correct, then reimplementing a basic bitinformation algorithm in |
Ah, sorry |
13de912
to
e39a161
Compare
Hello @thodson-usgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-04-30 21:18:13 UTC |
Rather than including this in |
87cd098
to
9b813bf
Compare
@observingClouds, @martindurant |
9fe6e66
to
36c9566
Compare
@martindurant, sorry to bother. Would you run these checks, or have I missed something? Thanks |
I have no idea why the checks aren't running :| |
Maybe it's waiting for me to tick all the boxes? Seems a bit of a catch 22, but I'll give it a try. |
36c9566
to
ac43b43
Compare
Let me know if this fixed it git commit --amend --no-edit
git push --force-with-lease otherwise, I'll open a new PR |
Windows and OSX builds failing with:
I'm not sure what happened here, but I'll try to replicate this locally. |
The Windows build failed because I had a test using |
Indeed, just osx left. |
a2faeb5
to
ff2ae77
Compare
Added a usage example and rebased. I have still not resolved why the OSX 3.10 build failed. Hopefully later this week. |
oh darn, my doc example gets tests too! 😞 |
Well, this passed just fine in a clean, local environment. But the GitHub runner is erroring with
Unless there is an obvious fix, I will keep the usage example but remove it from the doctest, then run some test builds on OSX, then resubmit. |
I suggest you remove the example, perhaps marking it as doctest skip. |
(investigating local OSX build now) |
@martindurant, the problem with OSX seems to be a broader issue. |
a21144c
to
ba38507
Compare
rebased on main |
The xbitinfo implementation uses a tolerance factor of 1.5. I lowered the tolerance to 1.1, because I was getting poor results with my test data, but it was pointed out that the problem was that my test data were quantized. Quantization is an open issue with xbitinfo too, and we should address it first there, then patch the codec.
ba38507
to
50dcea9
Compare
Add a new
bitinfo
codec, which reimplements anumpy
-based version of the bitrounding algorithm from the Julia packageBitInformation.jl
(and the Python packagexbitinfo
). When used withds.to_zarr
, the codec would compute the real information chunk-wise, which yields better results for data with variable information content.TODO: