Replies: 1 comment
-
To your main question, we have found that Zarr works great with pytorch + multiprocessing (https://earthmover.io/blog/cloud-native-dataloader).
Zarr's storage container is flexible. You can use a zipfile if you want. Or you can use a directory store or cloud object store. You'll have to pick what works for your application.
-
Dear zarr-python team,

I am trying to build a PyTorch dataset for point clouds, and I was wondering whether `zarr` can natively play with PyTorch, especially when multiprocessing is used for data loading, i.e. when `num_workers > 0`. Since the datasets are generally large, I would like the storage method to have the following properties:

Currently, I am storing the point clouds as arrays in an `.npz` file, i.e. `np.savez(file, key_1=pcd_1, key_2=pcd_2, ...)`. However, I am facing the problem described here. The proposed solution is to avoid loading the `.npz` file in the main process and instead load it individually in each worker. However, if the number of arrays stored in the `.npz` file is very large, it takes a significant amount of time to load with `np.load` (I think the long loading time is due to the metadata rather than the actual arrays).

It would be trivial to do the same thing with `zarr`, since it has an interface very similar to `np.savez`; I am just not sure whether it uses `zipfile` under the hood, in which case I would run into the same problems.