During the Jura run, we have encountered multiple cases of I/O errors of the form:
# from jura healpix/main/dark/176/17625/logs/redrock-main-dark-17625.log.0
...
--- Process 0 raised an exception ---
Proc 0: Traceback (most recent call last):
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 929, in rrdesi
targets = DistTargetsDESI(args.infiles, coadd=(not args.allspec),
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 570, in __init__
hdata = hdus[extname].data[rows]
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/utils/decorators.py", line 837, in __get__
val = self.fget(obj)
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 250, in data
data = self._get_scaled_image_data(self._data_offset, self.shape)
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 809, in _get_scaled_image_data
raw_data = self._get_raw_data(shape, code, offset)
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/base.py", line 559, in _get_raw_data
return self._file.readarray(offset=offset, dtype=code, shape=shape)
Proc 0: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/file.py", line 400, in readarray
data.shape = shape
Proc 0: ValueError: cannot reshape array of size 768 into shape (2875,11,2881)
The incorrect array size varies between jobs, and the same files and code work when resubmitted, though admittedly, because of checkpoint/restart, the resubmitted jobs resume only from the previously failed step and don't exactly reproduce all prior history. A sanity-check sketch of the failing read pattern follows below.
There are other examples as well (some failing when qso_qn calls redrock, some during the original redrock run).
The infiles are read from /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/..., but it is unclear whether this is a CFS bug, an astropy installation bug, or (less likely?) some corner case in the rows slicing. Documenting it here for the search record.
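For reference, a minimal sketch of the read pattern that fails in the traceback, with an illustrative path, an assumed B_RESOLUTION HDU name, and made-up row indices; it also compares the materialized array against the header-declared shape to catch a short read like the one above:

```python
# Sketch only: path, HDU name, and rows are illustrative, not the exact failing case.
import numpy as np
from astropy.io import fits

filename = "/dvs_ro/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits"
extname = "B_RESOLUTION"  # assumed HDU name; the failing extname varies in practice
rows = np.arange(10)      # hypothetical target rows

with fits.open(filename) as hdus:
    hdr = hdus[extname].header
    # Shape declared in the header, in numpy (slow-to-fast) axis order.
    expected = tuple(hdr[f"NAXIS{i}"] for i in range(hdr["NAXIS"], 0, -1))
    data = hdus[extname].data          # lazy load; this is where the ValueError was raised
    assert data.shape == expected, (data.shape, expected)
    hdata = data[rows]
    print(hdata.shape)
```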
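If it recurs, one way to help separate a filesystem-level problem from an astropy/code problem might be to re-read the same HDU through both the read-only /dvs_ro mount and the regular /global/cfs mount and compare. A rough sketch, with the relative path and HDU name below being illustrative assumptions:

```python
# Sketch: read the same HDU via both mounts and check they agree.
import numpy as np
from astropy.io import fits

rel = "desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits"
extname = "B_RESOLUTION"  # assumed HDU name

a = fits.getdata(f"/dvs_ro/cfs/cdirs/{rel}", extname=extname)
b = fits.getdata(f"/global/cfs/cdirs/{rel}", extname=extname)
print(a.shape, b.shape, "identical" if np.array_equal(a, b) else "DIFFER")
```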
@dmargala does this sound familiar with any other reports at NERSC? I could file a NERSC ticket too, but the DESI mpi+astropy combination is so specific I'm not sure how useful that would be.
More info: I have only seen this error when DistTargetsDESI is reading the 3D data of the resolution matrix. I have not seen it when other pipeline steps read the equivalent HDUs of other upstream files, nor when Redrock reads other 2D HDUs. An intermittent I/O problem smells like something on the NERSC side, but having it isolated to a specific code reading a specific HDU smells like something on our side, or a super-corner case in how this particular code reads that particular HDU that triggers some I/O system bug.
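If useful, here is a rough re-check sketch for an affected file; the filename is illustrative and it assumes the resolution-matrix HDUs have EXTNAMEs ending in "RESOLUTION". It walks those HDUs and confirms each 3D array actually materializes at its header-declared size, i.e. that a later re-read does not hit the same short read:

```python
# Sketch: verify resolution-matrix HDUs materialize at their declared size.
import numpy as np
from astropy.io import fits

filename = "coadd-main-dark-17625.fits"  # illustrative filename

with fits.open(filename, memmap=False) as hdus:  # memmap=False forces a full read from disk
    for hdu in hdus:
        name = hdu.header.get("EXTNAME", "")
        if not name.endswith("RESOLUTION"):      # assumed naming convention
            continue
        expected = int(np.prod(hdu.shape))
        got = hdu.data.size
        status = "ok" if got == expected else f"SHORT READ: {got} != {expected}"
        print(name, hdu.shape, status)
```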
I haven't heard about other reports at NERSC that sound familiar.
My initial suspicion would be a job dependency issue. For the main-dark-17625 case (job 26153196), it looks like the timestamp on the input file is more recent than the job end time 2024-05-29T07:38:58. Is that expected?
-rw-r----- 1 desi desi 1.2G May 29 10:25 /global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits
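For what it's worth, a quick sketch of that timestamp check, using the path and job end time quoted above (naive local timestamps, good enough for an eyeball comparison):

```python
# Sketch: compare the input file's mtime against the failed job's end time.
import os
from datetime import datetime

path = "/global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits"
job_end = datetime.fromisoformat("2024-05-29T07:38:58")  # end time quoted above

mtime = datetime.fromtimestamp(os.path.getmtime(path))
relation = "is AFTER" if mtime > job_end else "is before"
print(f"file mtime {mtime} {relation} job end {job_end}")
```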