transfer records to another file without decompress/recompress #3070

dkokron · 2025-01-08T20:57:16Z

dkokron
Jan 8, 2025

We have a workflow that copies the last 120 of 121 records from one file to another. The data are compressed and chunked. Each record is a chunk. Profiling shows the vast majority of time is spent decompressing then recompressing the data. A faster way would avoid decompressing and recompressing in the first place. I can see needing to decompress the data if the user wants to get at the real values, but I just want copy from one file to another. I was thinking of a low level block copy. Something like 'dd' command would do. Is that possible with NetCDF?

I'm using nco-5.2.4 built using spack and running on a zen2 chip.
Example usage:
ncrcat -7 -d time,1,120 -L 4 file.in file.out
ncrcat -7 -d time,1,120 --cmp='shf|zst,4' file.in file.out

DennisHeimbigner · 2025-01-09T00:00:08Z

DennisHeimbigner
Jan 9, 2025
Collaborator

Since you are already using NCO, I assume that does not solve your problem.
Assuming the file is stored as a netcdf-4/HDF5 file, you might check the HDF group
website to see if there is some HDF5 tool that can solve the problem.

1 reply

dkokron Jan 9, 2025
Author

Charlie suggested I contact HDF5 too.
https://sourceforge.net/p/nco/discussion/9830/thread/4a675c8d79/

I'll open a discussion with HDF5.

DennisHeimbigner · 2025-01-09T00:31:49Z

DennisHeimbigner
Jan 9, 2025
Collaborator

There is also a python package that wraps HDF5. You might see what it can do.

0 replies

dkokron · 2025-01-09T14:45:28Z

dkokron
Jan 9, 2025
Author

The original workflow is using the python package. I'm exploring other options. I've opened to discussion at https://forum.hdfgroup.org/c/hdf5/8

0 replies

dkokron · 2025-01-13T15:28:25Z

dkokron
Jan 13, 2025
Author

HDF5 responded at https://forum.hdfgroup.org/t/transfer-records-to-another-file-without-decompress-recompress/12960

"H5Dread_chunk() and H5Dwrite_chunk() allow raw chunk data to be accessed and written while bypassing some or all compression filters."

0 replies

edwardhartnett · 2025-01-14T14:43:35Z

edwardhartnett
Jan 14, 2025

OK, if I understand correctly, the idea is to copy one dataset to another (in a different file or the same file or both?) without decompressing/recompressing. Sure, why not.

The problem is, we try to keep the netcdf-c API small, and this would require an extra function. Just to game this out, what would that function look like?

Are we going to always copy the whole variable/dataset? (And subsetting would require decompression, since I don't know where the data are within the compressed chunks). In that case:

int
nc_copy_var(ncid1, varid1, ncid2, varid2);

Is that what we are talking about here?

@czender if such a function existed, could NCO make good use of it? Would it be useful? Presumably NCO allows subsetting when copying, and this only would work when there is no subsetting, right?

Another alternative would be to write a straight-up HDF5 program to do this. As we all know, netCDF-4 just writes regular HDF5 datasets, which can be opened are read with HDF5 as well as netCDF4. Also files created with HDF5 can be opened by netCDF-4. So perhaps that's the easiest path to a working implementation?

Even if we agreed that the above function prototype was correct, there is still the step of convincing the netCDF developers that this is worthy of a new function in the API. It's not clear to me that it is worth it, so the case has to be made.

0 replies

dkokron · 2025-01-14T15:13:06Z

dkokron
Jan 14, 2025
Author

My motivation is helping NCEP improve their resource utilization by speeding up various workflows. One of those workflows spends ~3 hours on the task described in the first posting (copies the last 120 of 121 time records from one file to another). I can only suspect anyone using nccopy would appreciate the speedup too.

What other data/information would make the case?

1 reply

edwardhartnett Jan 14, 2025

What is this were a special feature of nccopy? In other words, an internal function which was not part of the netcdf-c API but which nccopy used when appropriate?

czender · 2025-01-14T15:40:31Z

czender
Jan 14, 2025

If this function
int nc_copy_var(ncid1, varid1, ncid2, varid2);
existed then I would modify NCO to make use of it where possible. The resulting speedup could justify the development effort for this since copying entire variables is common yet unnecessarily time-consuming due to the unavoidable compression/decompression.

However, @dkokron's specific use case above involves hyperslabs so the minimum required prototype to (potentially) eliminate compression/uncompression would be
int nc_copy_vara(ncid1, varid1, ncid2, varid2,start,count);
and I agree with @edwardhartnett that this is problematic since the chunking of the stored variable would make it necessary, in general, to de-/re-compress any chunks that crossed the hyperslab boundaries, and figuring that out would not be a cakewalk.

0 replies

dkokron · 2025-01-14T16:18:47Z

dkokron
Jan 14, 2025
Author

I've attached an ncdump of one of the files that is the subject of my optimization efforts.
dump.txt

0 replies

DennisHeimbigner · 2025-01-14T18:36:31Z

DennisHeimbigner
Jan 14, 2025
Collaborator

Some time ago, I wrote a program to show the chunking layout for HDF5 datasets.
I was using it to test my Zarr implementation. That program is currently part of the
netcdf distribution in the nczarr_test/ncdumpchunks.c program. I have also attached
a copy to this message. It might help inform this discussion. Unfortunately, it is not well documented.
ncdumpchunks.c.txt

0 replies

dkokron · 2025-01-16T18:00:52Z

dkokron
Jan 16, 2025
Author

I put together a proof-of-concept code (see attached. pardon the mess) for testing the performance benefit from using H5Dread_chunk/H5Dwrite_chunk. To transfer one variable (wspd in the ncdump output attached above) takes
~22m:30s using the normal H5Dread/H5Dwrite approach (read gzipped, write zstd). ncrcat (version 5.2.4) also took about 22 minutes for the same variable and compression swap.

The H5Dread_chunk/H5Dwrite_chunk approach took 58s. We can't change the compression strategy with this approach.

h5ex_d_chunk.c.txt

0 replies

dkokron · 2025-01-21T16:03:38Z

dkokron
Jan 21, 2025
Author

What it the thinking on this? Is there a path forward?

0 replies

WardF · 2025-01-21T18:48:06Z

WardF
Jan 21, 2025
Maintainer

Getting caught up on this, I think that there is a path forward in the technical sense, although I'll need to think through what it would look like in the absence of hdf5 (or making it a feature only available for hdf5-based file storage).

In terms of @edwardhartnett's question 'is it worthwhile adding', e.g. the overhead of implementing and maintaining. The issue re: crossing hyperslabs is also a very real and relevant one.

The overhead of adding this in a robust way feels significant, although the approach and timings you provided are interesting. It makes sense that we can't change the compression strategy, given that would require the decompressing/recompressing you are trying to avoid.

@dkokron what are your thoughts re: the hyperslab issue?

0 replies

dkokron · 2025-01-21T22:53:19Z

dkokron
Jan 21, 2025
Author

More timings. I cleared the buffer cache before each test. I did not clear the cache in previous testing.

time ncrcat -7 -v wspd -d time,1,120 -L 4 File.in File.out (this scenario does the same thing as the proof-of-concept code; no change to compression)
real 38m41.790s
user 36m41.652s
sys 1m24.463s

time ncrcat -7 -v wspd -d time,1,120 --cmp='dfl,4' File.in File.out
real 38m46.688s
user 36m42.857s
sys 1m29.966s

time ncrcat -7 -v wspd -d time,1,120 --cmp='shf|zst,4' (same as previously posted.)
real 22m15.836s
user 20m23.179s
sys 1m20.467s

proof-of-concept.x (would be faster if it didn't transfer all 121 time records)
real 1m11.791s
user 0m19.173s
sys 0m40.714s

second run of proof-of-concept.x without clearing the cache
real 0m56.025s
user 0m15.983s
sys 0m39.541s

0 replies

dkokron · 2025-01-21T22:56:40Z

dkokron
Jan 21, 2025
Author

I look at this as an optimization strategy for a particular scenario (copy entire variables from one file to another). I can't imaging using H5Dread_chunk/H5Dwrite_chunk for anything else.

0 replies

edwardhartnett · 2025-01-21T23:02:53Z

edwardhartnett
Jan 21, 2025

What if this capability was wired into nc_copy_var()? That is, when nc_copy_var() is used, if the compression has not changed, etc., then this code is used as an optimization. So there would not have to be any API changes, and that would be a lot easier.

5 replies

dkokron Jan 22, 2025
Author

That sounds like a perfect fit.

czender Jan 22, 2025

It does indeed seem like a good fit. I have never used nc_copy_var(), for which the documentation states: "This will copy a variable that is an array of primitive type and its attributes from one file to another, assuming dimensions in the output file are already defined and have same dimension IDs and length." Demanding that the dimension IDs match is a pretty limiting restriction in the wild, where people often want to subset just a few variables from a larger file. It would be a much less onerous restriction if only the dimension names (not IDs) had to match between the files. Nevertheless, it would be a great speedup in the cases where the IDs did match.

dkokron Jan 22, 2025
Author

Does nc_copy_var() allow subsetting? The scenario I want to apply this to copies only the last 120 of 121 time records from one file to the next.

czender Jan 22, 2025

Unfortunately not. Meaning it does not allow hyperslabbing. It will only copy an entire variable. According to the documentation, it does allow extracting only one variable from a file of multiple variables. But only if the dimensions in the destination file were created in the exact same order as in the source file.

dkokron Jan 22, 2025
Author

While not a solution for my scenario, I suspect it would be useful for the community.

dkokron · 2025-01-22T21:09:04Z

dkokron
Jan 22, 2025
Author

I found a bug in my proof-of-concept. I was reading all the chunks, but not updating the offset to the target file. New timing is.

real 1m44.199s
user 0m18.931s
sys 0m53.803s

Updated code.
h5ex_d_chunk.c.txt

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

transfer records to another file without decompress/recompress #3070

{{title}}

Replies: 16 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

transfer records to another file without decompress/recompress #3070

dkokron Jan 8, 2025

Replies: 16 comments · 7 replies

DennisHeimbigner Jan 9, 2025 Collaborator

dkokron Jan 9, 2025 Author

DennisHeimbigner Jan 9, 2025 Collaborator

dkokron Jan 9, 2025 Author

dkokron Jan 13, 2025 Author

edwardhartnett Jan 14, 2025

dkokron Jan 14, 2025 Author

edwardhartnett Jan 14, 2025

czender Jan 14, 2025

dkokron Jan 14, 2025 Author

DennisHeimbigner Jan 14, 2025 Collaborator

dkokron Jan 16, 2025 Author

dkokron Jan 21, 2025 Author

WardF Jan 21, 2025 Maintainer

dkokron Jan 21, 2025 Author

dkokron Jan 21, 2025 Author

edwardhartnett Jan 21, 2025

dkokron Jan 22, 2025 Author

czender Jan 22, 2025

dkokron Jan 22, 2025 Author

czender Jan 22, 2025

dkokron Jan 22, 2025 Author

dkokron Jan 22, 2025 Author

dkokron
Jan 8, 2025

Replies: 16 comments 7 replies

DennisHeimbigner
Jan 9, 2025
Collaborator

dkokron Jan 9, 2025
Author

DennisHeimbigner
Jan 9, 2025
Collaborator

dkokron
Jan 9, 2025
Author

dkokron
Jan 13, 2025
Author

edwardhartnett
Jan 14, 2025

dkokron
Jan 14, 2025
Author

czender
Jan 14, 2025

dkokron
Jan 14, 2025
Author

DennisHeimbigner
Jan 14, 2025
Collaborator

dkokron
Jan 16, 2025
Author

dkokron
Jan 21, 2025
Author

WardF
Jan 21, 2025
Maintainer

dkokron
Jan 21, 2025
Author

dkokron
Jan 21, 2025
Author

edwardhartnett
Jan 21, 2025

dkokron Jan 22, 2025
Author

dkokron Jan 22, 2025
Author

dkokron Jan 22, 2025
Author

dkokron
Jan 22, 2025
Author