Using zarr v3 in a multiprocessing context fails with JSONDecodeError #2729

Open
MariusMeyerDraeger opened this issue Jan 18, 2025 · 6 comments
Labels
bug Potential issues with the zarr-python library

Comments

@MariusMeyerDraeger

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.12.2

Operating System

Windows 11 22H2

Installation

using pip into virtual environment

Description

Hi,

I discovered zarr a few days ago, just after v3 was published, and I'm trying to use it in a multiprocessing context: one process writes numeric as well as variable-length string data into a persistent store, and a reader process reads the newly arrived data from it.
The aim is to exchange data between processes and store it persistently at the same time.
I tried to build a minimal working example (see steps to reproduce), but more often than not, reading from the zarr files fails with the following exception:

Traceback (most recent call last):
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "...\scratch_3.py", line 26, in run
    text_dset = root['text_data']
                ~~~~^^^^^^^^^^^^^
  File "...\site-packages\zarr\core\group.py", line 1783, in __getitem__
    obj = self._sync(self._async_group.getitem(path))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\site-packages\zarr\core\sync.py", line 187, in _sync
    return sync(
           ^^^^^
  File "...\site-packages\zarr\core\sync.py", line 142, in sync
    raise return_result
  File "...\site-packages\zarr\core\sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "...\site-packages\zarr\core\group.py", line 681, in getitem
    zarr_json = json.loads(zarr_json_bytes.to_bytes())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is this a bug in v3, is v3 not ready yet for multiprocessing, or am I making a mistake here?
Sadly, the v3 docs don't really describe how to use zarr in a multiprocessing context, so it's possible I'm missing something.

Steps to reproduce

import sys
import time

import zarr
import numpy as np
import logging
from multiprocessing import Process, Event

class ZarrReader(Process):
    def __init__(self, event, fname, dsetname, timeout = 2.0):
        super().__init__()
        self._event = event
        self._fname = fname
        self._dsetname = dsetname
        self._timeout = timeout

    def run(self):
        self.log = logging.getLogger('reader')
        print("Reader: Waiting for initial event")
        assert self._event.wait( self._timeout )
        self._event.clear()

        print(f"Reader: Opening file {self._fname}")
        root = zarr.open_group(self._fname, mode='r')
        dset = root[self._dsetname]
        text_dset = root['text_data']
        # monitor and read loop
        while self._event.wait( self._timeout ):
            self._event.clear()
            print("Reader: Event received")
            dset = root[self._dsetname]
            text_dset = root['text_data']
            shape = dset.shape
            print("Reader: Read dset shape: %s"%str(shape))
            print(f"Reader: Text dataset shape: {text_dset.shape}")
            for i in range(text_dset.shape[0]):
                print(text_dset[i])

class ZarrWriter(Process):
    def __init__(self, event, fname, dsetname):
        super().__init__()
        self._event = event
        self._fname = fname
        self._dsetname = dsetname

    def run(self):
        self.log = logging.getLogger('writer')
        self.log.info("Creating file %s", self._fname)
        root = zarr.group(self._fname, overwrite=True)
        arr = np.array([1,2,3,4])
        dset = root.create_array(self._dsetname, shape=(4,), chunks=(2,), dtype=np.float64, fill_value=np.nan)
        dset[:] = arr
        text_dset = root.create_array('text_data', shape=(1,), chunks=(3,), dtype=str)
        text_arr = np.array(["Sample text 0"])
        text_dset[:] = text_arr

        print("Writer: Sending initial event")
        self._event.set()
        print("Writer: Waiting for the reader-opened-file event")
        # time.sleep(1.0)
        # Write loop
        for i in range(1, 6):
            new_shape = (i * len(arr), )
            print("Writer: Resizing dset shape: %s"%str(new_shape))
            dset.resize( new_shape )
            print("Writer: Writing data")
            dset[i*len(arr):] = arr
            text_dset.resize((text_dset.shape[0] + 1,))
            new_text_arr = np.array([f"Sample text {i}" * i])
            text_dset[-1:] = new_text_arr
            #dset.write_direct( arr, np.s_[:], np.s_[i*len(arr):] )
            print("Writer: Sending event")
            self._event.set()


if __name__ == "__main__":
    logging.basicConfig(format='%(levelname)10s  %(asctime)s  %(name)10s  %(message)s',level=logging.INFO)
    fname = 'measurements.zarr'
    dsetname = 'data'
    if len(sys.argv) > 1:
        fname = sys.argv[1]
    if len(sys.argv) > 2:
        dsetname = sys.argv[2]

    event = Event()
    reader = ZarrReader(event, fname, dsetname)
    writer = ZarrWriter(event, fname, dsetname)

    logging.info("Starting reader")
    reader.start()
    logging.info("Starting writer")
    writer.start()

    logging.info("Waiting for writer to finish")
    writer.join()
    logging.info("Waiting for reader to finish")
    reader.join()

Additional output

No response

@MariusMeyerDraeger MariusMeyerDraeger added the bug Potential issues with the zarr-python library label Jan 18, 2025
@d-v-b
Contributor

d-v-b commented Jan 19, 2025

We haven't explicitly tested zarr-python 3 with multiprocessing, but I don't see any reason why there should be any particular problems, because at least with the LocalStore zarr-python doesn't rely on holding any file handles open.

That being said, I don't really understand the architecture of your program. From the error, it looks like the reader is trying to access a zarr.json document that is empty. Since the run method on your ZarrWriter class opens root with overwrite=True, while the run method on your ZarrReader class opens the same group with mode='r', it's possible that you have a race condition here. You may need to poll the state of the zarr.json document before trying to open it.

To avoid these kinds of issues, I would create your zarr hierarchy in synchronous code as much as possible (because writing some JSON documents doesn't benefit from multiprocessing anyways).
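
For illustration, a polling helper along those lines might look like the following -- a minimal sketch, assuming a LocalStore layout where the group metadata lives at <group>/zarr.json; wait_for_group_metadata is a hypothetical name, not part of zarr's API:

import json
import os
import time

def wait_for_group_metadata(path, timeout=5.0, interval=0.05):
    # Poll until the group's zarr.json exists on disk and parses as
    # valid JSON. Returns True on success, False on timeout.
    meta_path = os.path.join(path, 'zarr.json')
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(meta_path, 'rb') as f:
                json.loads(f.read())
            return True
        except (FileNotFoundError, json.JSONDecodeError):
            time.sleep(interval)
    return False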

@MariusMeyerDraeger
Author

Hi,
As you can see, the synchronisation of the two processes is done via an event.
So the order is:

  1. ZarrReader is started first and waits for the initial event from the writer before opening anything.
  2. ZarrWriter creates the zarr hierarchy with root = zarr.group(self._fname, overwrite=True).
  3. ZarrWriter sets the initial event, so ZarrReader opens the zarr directory with root = zarr.open_group(self._fname, mode='r').

My expectation is that the zarr directories and files are created and ready to be opened by the reader once the writer has created them and set the event. Am I mistaken here?

Furthermore, I'm not writing or using any JSON files myself. These are zarr's internal metadata files that cannot be read.
I have the impression you might be thinking that I'm trying to write or read JSON files, which is not the case.
The error comes from zarr itself, as can be seen in the stack trace.
So either zarr v3 is not multiprocessing compatible, or I am making a mistake in this example program.
But again, I don't see what I'm doing wrong, as the zarr files should be created and ready to be read once the writer has created them.

@d-v-b
Contributor

d-v-b commented Jan 21, 2025

Furthermore, I'm not writing or using any JSON files myself. These are zarr's internal metadata files that cannot be read.

I am talking about JSON because zarr arrays and groups use JSON for the metadata documents. Each time you create a zarr array or group, you are writing a JSON document to storage; each time you open a zarr array or group, you are reading a JSON document from storage. As your error message ended with JSONDecodeError, I think JSON is relevant here:

File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I can trigger the same error by attempting to decode an empty bytestring as JSON:

>>> json.loads(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bennettd/.pyenv/versions/3.11.9/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bennettd/.pyenv/versions/3.11.9/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bennettd/.pyenv/versions/3.11.9/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

So the question is: how is a zarr metadata document (which should be valid JSON) getting read from disk as an empty bytestring? I don't really know.

So either zarr v3 is not multiprocessing compatible or I am making a mistake in this example program.

I'm not sure what you mean by this statement -- are you referring to zarr the format, or this Python library? The file format itself provides relatively limited concurrency guarantees -- in short, it is the responsibility of users / applications to structure their programs to prevent race conditions and data corruption. zarr-python 2.x supported synchronization via file-based locking, which might be helpful for your use case, but we have not added these features to zarr-python 3 yet.
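
For reference, the zarr-python 2.x locking pattern looked roughly like this -- a sketch of the 2.x API, reusing the paths from the example above; this is not available in zarr-python 3:

# zarr-python 2.x only -- not available in zarr-python 3.
import zarr

synchronizer = zarr.ProcessSynchronizer('measurements.sync')
z = zarr.open_array('measurements.zarr/data', mode='r+',
                    synchronizer=synchronizer)
z[0:4] = 42  # chunk writes are guarded by file-based locks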

For your specific example, could you refactor it so that you create all the arrays and groups first, and then run your separate writer / reader processes, which write to / read from the already-created arrays? This might make the logic easier to follow.

@MariusMeyerDraeger
Author

MariusMeyerDraeger commented Jan 23, 2025

I am talking about JSON because zarr arrays and groups use JSON for the metadata documents.

Okay, great! Just wanted to make sure we're talking about the same thing.

For your specific example, could you refactor it so that you create all the arrays and groups first, and then run your separate writer / reader processes, which write to / read from the already-created arrays?

I tried this, but after a few iterations the same error appears:

import os
import sys
import time

import zarr
import numpy as np
import logging
from multiprocessing import Process, Event

class ZarrReader(Process):
    def __init__(self, event, fname, dsetname, timeout = 2.0):
        super().__init__()
        self._event = event
        self._fname = fname
        self._dsetname = dsetname
        self._timeout = timeout

    def run(self):
        self.log = logging.getLogger('reader')
        print("Reader: Waiting for initial event")
        assert self._event.wait( self._timeout )
        self._event.clear()

        print(f"Reader: Opening file {self._fname}")
        root = zarr.open_group(self._fname, mode='r')
        dset = root[self._dsetname]
        text_dset = root['text_data']
        # monitor and read loop
        while self._event.wait( self._timeout ):
            self._event.clear()
            print("Reader: Event received")
            dset = root[self._dsetname]
            text_dset = root['text_data']
            shape = dset.shape
            print("Reader: Read dset shape: %s"%str(shape))
            print(f"Reader: Text dataset shape: {text_dset.shape}")
            for i in range(text_dset.shape[0]):
                print(text_dset[i])

class ZarrWriter(Process):
    def __init__(self, event, fname, dsetname):
        super().__init__()
        self._event = event
        self._fname = fname
        self._dsetname = dsetname

    def run(self):
        self.log = logging.getLogger('writer')
        print(f"Writer: Creating file {self._fname}")
        root = zarr.open_group(self._fname, mode='r+')
        arr = np.array([1,2,3,4])
        dset = root.create_array(self._dsetname, shape=(4,), chunks=(2,), dtype=np.float64, fill_value=np.nan)
        dset[:] = arr
        text_dset = root.create_array('text_data', shape=(1,), chunks=(3,), dtype=str)
        text_arr = np.array(["Sample text 0"])
        text_dset[:] = text_arr

        print("Writer: Sending initial event")
        self._event.set()
        print("Writer: Waiting for the reader-opened-file event")
        # Write loop
        for i in range(1, 6):
            new_shape = (i * len(arr), )
            print("Writer: Resizing dset shape: %s"%str(new_shape))
            dset.resize( new_shape )
            print("Writer: Writing data")
            dset[i*len(arr):] = arr
            text_dset.resize((text_dset.shape[0] + 1,))
            new_text_arr = np.array([f"Sample text {i}" * i])
            text_dset[-1:] = new_text_arr
            #dset.write_direct( arr, np.s_[:], np.s_[i*len(arr):] )
            print("Writer: Sending event")
            self._event.set()


if __name__ == "__main__":
    logging.basicConfig(format='%(levelname)10s  %(asctime)s  %(name)10s  %(message)s',level=logging.INFO)
    fname = 'measurements.zarr'
    dsetname = 'data'
    if len(sys.argv) > 1:
        fname = sys.argv[1]
    if len(sys.argv) > 2:
        dsetname = sys.argv[2]

    if os.path.exists(fname):
        import shutil
        shutil.rmtree(fname)
    root = zarr.group(fname)
    event = Event()
    reader = ZarrReader(event, fname, dsetname)
    writer = ZarrWriter(event, fname, dsetname)

    logging.info("Starting reader")
    reader.start()
    logging.info("Starting writer")
    writer.start()

    logging.info("Waiting for writer to finish")
    writer.join()
    logging.info("Waiting for reader to finish")
    reader.join()
      INFO  2025-01-23 20:34:44,506        root  Starting reader
      INFO  2025-01-23 20:34:44,530        root  Starting writer
      INFO  2025-01-23 20:34:44,551        root  Waiting for writer to finish
Reader: Waiting for initial event
Writer: Creating file measurements.zarr
\site-packages\zarr\codecs\vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  return cls(**configuration_parsed)
Writer: Sending initial event
Writer: Waiting for the reader-opened-file event
Writer: Resizing dset shape: (4,)
Reader: Opening file measurements.zarr
Writer: Writing data
Writer: Sending event
Writer: Resizing dset shape: (8,)
Writer: Writing data
Reader: Event received
Writer: Sending event
Writer: Resizing dset shape: (12,)
Writer: Writing data
Reader: Read dset shape: (12,)
Reader: Text dataset shape: (4,)
Writer: Sending event
Writer: Resizing dset shape: (16,)
Writer: Writing data
\site-packages\zarr\codecs\vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  return cls(**configuration_parsed)
\site-packages\zarr\codecs\vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  return cls(**configuration_parsed)
Sample text 0
Sample text 1
Sample text 2Sample text 2
Sample text 3Sample text 3Sample text 3
Reader: Event received
Writer: Sending event
Writer: Resizing dset shape: (20,)
Writer: Writing data
Process ZarrReader-1:
Traceback (most recent call last):
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "\JetBrains\PyCharmCE2024.3\scratches\scratch_3.py", line 32, in run
    dset = root[self._dsetname]
           ~~~~^^^^^^^^^^^^^^^^
  File "\site-packages\zarr\core\group.py", line 1783, in __getitem__
    obj = self._sync(self._async_group.getitem(path))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\zarr\core\sync.py", line 187, in _sync
    return sync(
           ^^^^^
  File "\site-packages\zarr\core\sync.py", line 142, in sync
    raise return_result
  File "\site-packages\zarr\core\sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "\site-packages\zarr\core\group.py", line 681, in getitem
    zarr_json = json.loads(zarr_json_bytes.to_bytes())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Writer: Sending event
      INFO  2025-01-23 20:34:44,987        root  Waiting for reader to finish

Process finished with exit code 0

I'm not sure what you mean by this statement -- are you referring to zarr the format, or this Python library?

I'm actually not sure in which direction I should direct this report; I guess this library foremost.
I read the claims at the top of the README:

  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.

but there isn't any tutorial or guidance on how to implement that with v3.
So I tried the minimal example above to get it working, but I get the mentioned JSONDecodeErrors.
I might be mistaken as a new evaluator of this library, and I'm thankful for you taking the time to reply, but this makes me think that either the library or the new format doesn't support what is claimed in the README, especially if multiprocessing wasn't even tested.
Or am I misunderstanding, and only concurrent reading from multiple processes without another process writing (or vice versa) is possible?

Can you get my example code to work, or do you have an example showing how to correctly use zarr v3 with a writer and a reader process?

@d-v-b
Contributor

d-v-b commented Jan 25, 2025

You are resizing the arrays in your writer process. This requires writing new array metadata to disk; if the reader read the array metadata before the resize, it now holds an invalid copy of that metadata. That said, I'm not sure what exactly is causing the error you see. For what it's worth, I could not replicate it on my machine -- I saw a different error -- but the errors went away when I removed the resizing operation from the writer process.

If your application requires resizing the array from a writer process while reading from separate processes, then I would recommend synchronizing around the resizing operation -- as soon as the array is resized, any reader holding a reference to that array has invalid metadata that should be discarded. This should be feasible, but it would be much simpler not to resize the array while attempting to read from it.
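
As a sketch of what "synchronizing around the resizing operation" could look like -- assuming a multiprocessing.Lock shared between the two processes; grow and read_latest are hypothetical helpers, not zarr API:

import zarr
from multiprocessing import Lock

lock = Lock()  # pass this lock to both the writer and the reader process

def grow(dset, arr, lock):
    # Writer side: hold the lock while the on-disk metadata changes.
    old_size = dset.shape[0]
    with lock:
        dset.resize((old_size + len(arr),))
        dset[old_size:] = arr

def read_latest(fname, dsetname, lock):
    # Reader side: re-open under the same lock on every read, so the
    # metadata is never read mid-rewrite and no stale Array object
    # from before a resize is ever reused.
    with lock:
        root = zarr.open_group(fname, mode='r')
        return root[dsetname][:]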

@d-v-b
Contributor

d-v-b commented Jan 25, 2025

I'm actually not sure in which direction I should direct this report; I guess this library foremost. I read the claims at the top of the README:

  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.

To this point, we should be clearer in the docs about what degree of concurrency is supported:

  • Concurrently writing separate chunks of an array is supported, and indeed is the basic value proposition of zarr.
  • Concurrently writing to the same chunk is not supported.
  • Reading from a chunk while it is being modified is not supported.
  • Concurrently modifying array or group metadata is not supported, because the array or group metadata is just a JSON document and those can't in general be concurrently modified.

So basically: if you are writing to different chunks of the same array from multiple processes, that's going to work. If you modify array metadata, or modify the chunks of an array while it is being read from another process, that is not likely to work without some explicit synchronization between processes. A sketch of the pattern that does work is below.
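
For illustration, a minimal sketch of the supported pattern: the array and its metadata are created once up front, and each worker process writes only its own disjoint, chunk-aligned region. The file name, shape, and chunk layout here are assumptions for this example, and it assumes the zarr-python 3 top-level create_array / open_array helpers:

import numpy as np
import zarr
from multiprocessing import Process

FNAME = 'parallel.zarr'

def write_region(start, stop):
    # Each worker opens the existing array itself and writes only a
    # disjoint, chunk-aligned slice, so no chunk is shared between writers.
    arr = zarr.open_array(FNAME, mode='r+')
    arr[start:stop] = np.arange(start, stop, dtype='f8')

if __name__ == '__main__':
    # Create the array (and its metadata) once, before any worker starts.
    zarr.create_array(store=FNAME, shape=(16,), chunks=(4,),
                      dtype='f8', overwrite=True)
    # One worker per chunk: regions [0:4], [4:8], [8:12], [12:16].
    workers = [Process(target=write_region, args=(i * 4, (i + 1) * 4))
               for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(zarr.open_array(FNAME, mode='r')[:])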
