Replies: 1 comment
-
To your main question, we have found that Zarr works great with pytorch + multiprocessing (https://earthmover.io/blog/cloud-native-dataloader).
Zarr's storage container is flexible. You can use a zipfile if you want. Or you can use a directory store or cloud object store. You'll have to pick what works for your application.
-
Dear zarr-python team,

I am trying to build a PyTorch dataset for point clouds, and I was wondering whether `zarr` can natively play with PyTorch, especially when multiprocessing is used for data loading, i.e. when `num_workers > 0`. Since the datasets are generally large, I would like the storage method to have the following properties:

Currently, I am storing the point clouds as arrays in an `.npz` file, i.e. `np.savez(file, key_1=pcd_1, key_2=pcd_2, ...)`. However, I am facing the problem described here. The proposed solution is to avoid loading the `.npz` file in the main process and instead load it individually in each worker. However, if the number of arrays stored in the `.npz` file is very large, it takes a significant amount of time to load with `np.load` (I think the long loading time is due to the metadata rather than the actual arrays).

It would be trivial to do the same thing with `zarr`, since it has an interface very similar to `np.savez`; I am just not sure whether it uses `zipfile` under the hood, in which case I would run into the same problems.