Add bitinfo codec #503

thodson-usgs · 2024-01-28T16:57:52Z

Add a new bitinfo codec, which reimplements a numpy-based version of the bitrounding algorithm from the Julia package BitInformation.jl (and the Python package xbitinfo). When used with ds.to_zarr, the codec would compute the real information chunk-wise, which yields better results for data with variable information content.

import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")

from numcodecs import Blosc, BitInfo
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)
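
For readers unfamiliar with the rounding step itself, here is a minimal pure-Python sketch of rounding one finite float32 value to a given number of mantissa bits (round to nearest, ties to even). It is an illustration only: `bitround_float32` is a hypothetical helper, not the codec's implementation, and it does not cover the information analysis that decides how many bits to keep.

```python
import struct


def bitround_float32(value, keepbits):
    """Round a finite float32 to `keepbits` mantissa bits (hypothetical helper)."""
    if not 1 <= keepbits <= 23:
        raise ValueError("keepbits must be between 1 and 23 for float32")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    drop = 23 - keepbits  # number of trailing mantissa bits to discard
    if drop == 0:
        # Keeping the full mantissa is equivalent to no transform.
        return struct.unpack("<f", struct.pack("<I", bits))[0]
    mask = (0xFFFFFFFF >> drop) << drop
    half_minus_one = (1 << (drop - 1)) - 1
    # Adding the last kept bit implements ties-to-even before truncation.
    bits += ((bits >> drop) & 1) + half_minus_one
    bits &= mask
    return struct.unpack("<f", struct.pack("<I", bits))[0]


rounded_up = bitround_float32(1.2, keepbits=2)    # 1.25
rounded_down = bitround_float32(1.1, keepbits=2)  # 1.0
```

With two mantissa bits the representable values near 1 are 1.0, 1.25, 1.5 and 1.75, so 1.2 moves to 1.25 and 1.1 moves to 1.0. The zeroed trailing bits are what make the data more compressible afterwards.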

TODO:

Unit tests and/or doctests in docstrings
Tests pass locally
Docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
Docs build locally
GitHub Actions CI passes
Test coverage to 100% (Codecov passes)

martindurant · 2024-01-29T15:31:40Z

numcodecs/bitround.py

        The number of bits of the mantissa to keep. The range allowed
        depends on the dtype input data. If keepbits is
        equal to the maximum allowed for the data type, this is equivalent
-        to no transform.
+        to no transform. Alternatively, pass a function to determine the


Can we have an example of where such a bitinformation function may be found?

thodson-usgs · 2024-01-29

Namely, xbitinfo, which was the impetus for bitround codec. More recently, that group has advocated for computing bitinformation chunk by chunk, using something like

fn = 'air.zarr'
ds.to_compressed_zar(fn, compute=False, mode='w')
dims = ds.air.dims
len_dims = len(dims)
slices = slices_from_chunks(ds.air.chunks)
for b, block in enumerate(ds.air.data.to_delayed().ravel()):
    ds_block = xr.Dataset({'air':(dims, block.compute())})
    rounded_ds = bitrounding(ds_block)
    rounded_ds.to_zarr(fn, region={dims[d]:s for (d,s) in enumerate(slices[b])})

Modifying the codec could make chunk-wise bitrounding workflows much simpler, and my proposed change would be the minimal first step. The passed function only applies to the encoder, so it shouldn't introduce dependencies or vulnerabilities for the decoder.

martindurant · 2024-01-29

I was suggesting that information like this should go into the docstring, so that users can act on it.

thodson-usgs · 2024-01-30T04:54:41Z

Thanks, @martindurant, I'll work on an example, but as I dig into this, I realize that we can't pass a function in this way: the code at https://github.com/zarr-developers/zarr-python/blob/main/zarr/util.py#L56-64
will lead to

TypeError: Object of type function is not JSON serializable

I'd hoped this would work without modifying zarr-python, but for now, I'll go ahead and make a small modification and continue testing.

martindurant · 2024-01-30T14:24:22Z

Ah, the default serialisation of the codecs, as dumped into the zarr metadata, looks at the __dict__ of the instance, and so is catching the function. Prefixing the attribute with "_" would solve this, or you could override get_config. There is actually a good argument for wanting to serialise the full state, since you can create an array without filling it or fill only part and come back to it later. I don't know how you would approach that without storing some indicator of the function in the metadata.
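
To make the failure and the underscore workaround concrete, here is a small self-contained sketch. The classes are hypothetical stand-ins that only mimic the behaviour described above (a config built from the instance's public attributes); they are not the numcodecs base class.

```python
import json


class SketchCodec:
    """Hypothetical stand-in: the config is built from public attributes."""

    codec_id = "sketch"

    def get_config(self):
        config = {"id": self.codec_id}
        for key, value in self.__dict__.items():
            if not key.startswith("_"):  # underscore-prefixed names are skipped
                config[key] = value
        return config


class PublicFunc(SketchCodec):
    def __init__(self, func):
        self.func = func  # ends up in the config, which JSON cannot encode


class PrivateFunc(SketchCodec):
    def __init__(self, func):
        self._func = func  # hidden from the config by the underscore


def is_json_serializable(config):
    try:
        json.dumps(config)
    except TypeError:
        return False
    return True


def choose_keepbits(chunk):
    """Placeholder for a user-supplied function."""
    return 7


public_ok = is_json_serializable(PublicFunc(choose_keepbits).get_config())
private_ok = is_json_serializable(PrivateFunc(choose_keepbits).get_config())
```

Here `public_ok` is False and `private_ok` is True. The second config is serializable only because it no longer mentions the function at all, which is the trade-off raised above about storing the full state.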

thodson-usgs · 2024-01-30T16:30:19Z

Those are both great suggestions. I'll advance one more: add a stripped down xbitinfo to the bitround codec. The implementation looks fairly minimal. Let's see what comes from observingClouds/xbitinfo/issues/257 then reassess.

joshmoore · 2024-01-30T18:37:54Z

A heads up that depending on custom code will impact the portability of the codec across language implementations.

thodson-usgs · 2024-01-30T22:33:40Z

observingClouds/xbitinfo/issues/257: their preference is my original suggestion. Potentially yielding something like:

from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

thodson-usgs · 2024-01-31T20:37:51Z

@martindurant, I tried your suggestion about prefixing, but I'm getting the general impression this approach won't work.
In this example

from xbitinfo import helper_function
from numcodecs import Blosc, BitRound
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitRound(helper_function)]

encoding = {"precip": {"compressor": compressor, "filters": filters}}
ds.to_zarr(<file_name>, encoding=encoding)

the object created by BitRound(helper_function) isn't actually passed through to the filtering step, only the dictionary. The filter is recreated from that dictionary before it is applied. In other words, if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.

If that impression is correct, then reimplementing a basic bitinformation algorithm in numcodecs might be the simplest route.
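
The round trip described here can be sketched without zarr at all. The class below is a hypothetical stand-in, and rebuilding the filter through `from_config` is an assumed simplification of how a filter gets recreated from its stored dictionary; the real code path is more involved.

```python
class SketchBitRound:
    """Hypothetical stand-in for a codec that hides a callable in `_func`."""

    codec_id = "sketch_bitround"

    def __init__(self, keepbits=None, func=None):
        self.keepbits = keepbits
        self._func = func  # not part of the config

    def get_config(self):
        # Only JSON-friendly state is written to the metadata.
        return {"id": self.codec_id, "keepbits": self.keepbits}

    @classmethod
    def from_config(cls, config):
        kwargs = {k: v for k, v in config.items() if k != "id"}
        return cls(**kwargs)


original = SketchBitRound(func=lambda chunk: 7)
# The filter that actually runs is rebuilt from the dictionary alone.
rebuilt = SketchBitRound.from_config(original.get_config())
```

After the round trip `rebuilt._func` is None: anything that is not in the dictionary is gone by the time the filter is applied.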

martindurant · 2024-01-31T20:39:52Z

if I prefix with _, that object won't get passed through the dictionary, and won't be accessible at filter time.

Ah, sorry

pep8speaks · 2024-02-02T17:23:32Z

Hello @thodson-usgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file numcodecs/__init__.py:

Line 94:1: E402 module level import not at top of file

Comment last updated at 2024-04-30 21:18:13 UTC

thodson-usgs · 2024-02-02T17:32:51Z

Rather than including this in bitround, let's create a bitinfo codec that calls bitround. I've drafted the PR and solicited feedback from the xbitinfo devs. I will begin writing tests.

numcodecs/bitinfo.py

thodson-usgs · 2024-03-01T19:36:31Z

@observingClouds, @martindurant
As my last commit indicates, I've reverted the codec to match the functionality of the xbitinfo implementation (albeit all computation will be done chunk-wise when called through ds.to_zarr()). I'm happy to continue working across libraries to improve functionality and reduce duplication; however, further development needs to come from xbitinfo, not the codec. In the meantime, I do not plan any further changes to this PR other than additional tests or bug fixes as needed.

thodson-usgs · 2024-04-19T14:47:32Z

@martindurant, sorry to bother. Would you run these checks, or have I missed something? Thanks

martindurant · 2024-04-19T20:08:23Z

I have no idea why the checks aren't running :|

thodson-usgs · 2024-04-19T20:14:47Z

Maybe it's waiting for me to tick all the boxes? Seems a bit of a catch-22, but I'll give it a try.

thodson-usgs · 2024-04-20T22:15:37Z

Let me know if this fixed it

git commit --amend --no-edit
git push --force-with-lease

otherwise, I'll open a new PR

thodson-usgs · 2024-04-22T20:18:40Z

Windows and OSX builds failing with:

900+ lines of errors and warnings...

  × Building editable for numcodecs (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /usr/local/miniconda/envs/env/bin/python /usr/local/miniconda/envs/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_editable /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp4l4qy_42
  cwd: /Users/runner/work/numcodecs/numcodecs
  Building editable for numcodecs (pyproject.toml): finished with status 'error'
  ERROR: Failed building editable for numcodecs
Failed to build numcodecs
ERROR: Could not build wheels for numcodecs, which is required to install pyproject.toml-based projects
Error: Process completed with exit code 1.

I'm not sure what happened here, but I'll try to replicate this locally.

thodson-usgs · 2024-04-23T03:14:15Z

The Windows build failed because I had a test using dtype=float128, but I don't understand why OSX is failing to install numcodecs with python>=3.10. Can we run these one more time before I go digging into OSX?

martindurant · 2024-04-23T14:50:07Z

Indeed, just osx left.

thodson-usgs · 2024-04-30T17:22:14Z

Added a usage example and rebased. I have still not resolved why the OSX 3.10 build failed. Hopefully later this week.

thodson-usgs · 2024-04-30T17:34:28Z

oh darn, my doc example gets tested too! 😞

thodson-usgs · 2024-05-02T18:44:35Z

Well, this passed just fine in a clean, local environment.

But the GitHub runner is erroring with

...
043     >>> import xarray as xr
044     >>> ds = xr.tutorial.open_dataset("air_temperature")
045     >>> from numcodecs import Blosc, BitInfo
046     >>> compressor = Blosc(cname="zstd", clevel=3)
047     >>> filters = [BitInfo(info_level=0.99)]
048     >>> encoding = {"air": {"compressor": compressor, "filters": filters}}
049     >>> _ = ds.to_zarr('xbit.zarr', mode="w", encoding=encoding)
UNEXPECTED EXCEPTION: ImportError("cannot import name 'MutableMapping' from 'collections' (/usr/share/miniconda/envs/env/lib/python3.11/collections/__init__.py)")
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/env/lib/python3.11/doctest.py", line 1355, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest numcodecs.bitinfo.BitInfo[6]>", line 1, in <module>
  File "/usr/share/miniconda/envs/env/lib/python3.11/site-packages/xarray/core/dataset.py", line 2520, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...

Unless there is an obvious fix, I will keep the usage example but remove it from the doctest, then run some test builds on OSX, then resubmit.
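
For context, this kind of ImportError typically comes from code that still imports the abstract base classes from `collections`; those aliases were removed in Python 3.10, and the classes live in `collections.abc`. The log above does not show which package in the runner's environment triggered it, so the snippet below is only a generic illustration of the import that works on current Python versions, with `DictStore` as a made-up example class.

```python
from collections.abc import MutableMapping  # not `from collections import ...`


class DictStore(MutableMapping):
    """Made-up minimal mapping, used only to exercise the import."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = DictStore()
store["chunk.0"] = b"\x00\x01"
```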

martindurant · 2024-05-02T19:16:14Z

I suggest you remove the example, perhaps marking it as doctest skip.
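
A rough sketch of what marking the example as skipped could look like, using the standard `# doctest: +SKIP` directive. The docstring text here is a shortened, illustrative version of the usage example, not the text in the PR; skipped lines are parsed but never executed, so missing optional dependencies such as xarray cannot break the run.

```python
import doctest

# Illustrative docstring body: the first example runs, the rest are skipped.
USAGE = """
>>> 1 + 1
2
>>> import xarray as xr  # doctest: +SKIP
>>> ds = xr.tutorial.open_dataset("air_temperature")  # doctest: +SKIP
>>> _ = ds.to_zarr("xbit.zarr", mode="w")  # doctest: +SKIP
"""

# Parse the text into a DocTest and run it; arguments are
# (string, globs, name, filename, lineno).
test = doctest.DocTestParser().get_doctest(USAGE, {}, "usage_example", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
```

`results.failed` stays at 0 whether or not xarray is installed, while the example remains visible in the rendered documentation.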

thodson-usgs · 2024-05-17T15:40:56Z

(investigating local OSX build now)

thodson-usgs · 2024-05-17T16:05:54Z

@martindurant, the problem with OSX seems to be a broader issue.
Note the previous commit to main is also failing (#527).
Perhaps we rerun the tests and proceed with any other issues, while we wait for an OSX fix.

thodson-usgs · 2024-06-03T16:33:22Z

rebased on main

The xbitinfo implementation uses a tolerance factor of 1.5. I lowered the tolerance to 1.1, because I was getting poor results with my test data, but it was pointed out that the problem was that my test data were quantized. Quantization is an open issue with xbitinfo too, and we should address it first there, then patch the codec.

martindurant reviewed Jan 29, 2024

thodson-usgs mentioned this pull request Jan 31, 2024

Adding the python bitinfo to numcodecs observingClouds/xbitinfo#257

Open

thodson-usgs force-pushed the adaptive-bitrounding branch from 13de912 to e39a161 Compare February 2, 2024 17:23

thodson-usgs changed the title from "WIP: Enable adaptive bitrounding (seeking feedback)" to "WIP: Add bitinfo codec" Feb 2, 2024

observingClouds reviewed Feb 5, 2024

thodson-usgs changed the title from "WIP: Add bitinfo codec" to "Add bitinfo codec" Feb 5, 2024

thodson-usgs force-pushed the adaptive-bitrounding branch from 87cd098 to 9b813bf Compare February 5, 2024 18:23

thodson-usgs force-pushed the adaptive-bitrounding branch from 9fe6e66 to 36c9566 Compare March 5, 2024 23:09

thodson-usgs force-pushed the adaptive-bitrounding branch from 36c9566 to ac43b43 Compare April 20, 2024 22:13

thodson-usgs force-pushed the adaptive-bitrounding branch from a2faeb5 to ff2ae77 Compare April 30, 2024 17:21

thodson-usgs force-pushed the adaptive-bitrounding branch from a21144c to ba38507 Compare June 3, 2024 16:32

dstansby added the New codec Suggestion for a new codec label Aug 11, 2024

thodson-usgs mentioned this pull request Sep 9, 2024

simplify get_bitinformation observingClouds/xbitinfo#262

Merged

thodson-usgs and others added 15 commits September 9, 2024 09:20

Draft bitinfo codec

6d7d792

Fix PEP 8 issues

e35221b

Fixing bugs

a97f7e8

Bugfix; now it works

f884113

Fix Codec.__init__

c04a6e3

Adjust tolerance but still needs fix

03128f6

Add basic tests

11ef603

Linting

f2e2394

Remove float128 test

673d86b

Add usage example

157c57c

Add xarray to test_extra dependencies

45ab3c4

Add xarray dependencies for doctest

78f93e3

Remove usage example from doctest

385c407

Rebase and update release.rst

50dcea9

thodson-usgs force-pushed the adaptive-bitrounding branch from ba38507 to 50dcea9 Compare September 9, 2024 14:28
