Intermittent array size I/O error while reading DistTargetsDESI #309

Open
sbailey opened this issue May 30, 2024 · 2 comments

sbailey commented May 30, 2024

During the Jura run, we have encountered multiple cases of I/O errors of the form:

# from jura healpix/main/dark/176/17625/logs/redrock-main-dark-17625.log.0
...
--- Process 0 raised an exception ---
Proc 0: Traceback (most recent call last):
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 929, in rrdesi
    targets = DistTargetsDESI(args.infiles, coadd=(not args.allspec),
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/0.20.0/lib/python3.10/site-packages/redrock/external/desi.py", line 570, in __init__
    hdata = hdus[extname].data[rows]
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/utils/decorators.py", line 837, in __get__
    val = self.fget(obj)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 250, in data
    data = self._get_scaled_image_data(self._data_offset, self.shape)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/image.py", line 809, in _get_scaled_image_data
    raw_data = self._get_raw_data(shape, code, offset)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/hdu/base.py", line 559, in _get_raw_data
    return self._file.readarray(offset=offset, dtype=code, shape=shape)
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/io/fits/file.py", line 400, in readarray
    data.shape = shape
Proc 0: ValueError: cannot reshape array of size 768 into shape (2875,11,2881)

The incorrect array size varies between jobs, and the same files and code work when resubmitted, though admittedly, because of checkpoint/restart, the resubmitted jobs resume from the previously failed step and don't exactly reproduce all prior history.

Other examples (some failing when qso_qn calls redrock, some during the original redrock run):

healpix         jobid
main-dark-17625 26153196
main-dark-17352 26153083
main-dark-20239 26153880
main-dark-8676  26151991
main-dark-26147 26154319
main-dark-7272  26151594

The infiles are read from /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/..., but it is unclear whether this is a CFS bug, an astropy installation bug, or (less likely?) some corner case in the rows slicing. Documenting it here for the search record.
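
As a possible follow-up diagnostic (just a sketch, not something in the pipeline; the filename is a placeholder and it assumes plain uncompressed HDUs): after a failure one could check whether the file on disk actually contains the number of data bytes each HDU header declares, which would distinguish a truncated or stale view of the file from a header or slicing problem.

import os
import numpy as np
from astropy.io import fits

filename = "coadd-main-dark-17625.fits"   # placeholder for one of the failing infiles
filesize = os.path.getsize(filename)

with fits.open(filename, memmap=True) as hdus:
    for i, hdu in enumerate(hdus):
        naxis = hdu.header.get("NAXIS", 0)
        if naxis == 0:
            continue  # no data unit in this HDU
        # number of data bytes the header says this HDU should occupy
        npix = np.prod([hdu.header[f"NAXIS{j}"] for j in range(1, naxis + 1)])
        expected = int(npix) * abs(hdu.header["BITPIX"]) // 8
        # where the data unit actually starts in the file
        offset = hdus.fileinfo(i)["datLoc"]
        available = filesize - offset
        if available < expected:
            print(f"HDU {i} ({hdu.name}): header declares {expected} data bytes, "
                  f"only {available} available after byte offset {offset}")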

@dmargala does this sound familiar from any other reports at NERSC? I could file a NERSC ticket too, but the DESI mpi+astropy combination is so specific I'm not sure how useful that would be.


sbailey commented May 31, 2024

More info: I have only seen this error when DistTargetsDESI is reading the 3D data of the resolution matrix. I have not seen it when other pipeline steps read the equivalent HDUs of other upstream files, nor when redrock reads other 2D HDUs. An intermittent I/O problem smells like something on the NERSC side, but having it isolated to a specific code reading a specific HDU smells like something on our side, or perhaps a super corner case in which this particular code reading that particular HDU triggers some I/O system bug.
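
To be clear, redrock does not currently do anything special here, but if this turns out to be a transient filesystem issue, one mitigation sketch would be to retry the resolution-matrix read. The hypothetical helper below stands in for the hdus[extname].data[rows] call in DistTargetsDESI; memmap=False forces a full read rather than a memory map.

import time
from astropy.io import fits

def read_hdu_rows(filename, extname, rows, retries=3, delay=5.0):
    """Read hdus[extname].data[rows], retrying on the intermittent ValueError."""
    for attempt in range(retries):
        try:
            # memmap=False reads the HDU in one go instead of memory-mapping it
            with fits.open(filename, memmap=False) as hdus:
                return hdus[extname].data[rows]
        except ValueError as err:
            if attempt == retries - 1:
                raise
            print(f"reading {extname} from {filename} failed ({err}); retrying in {delay}s")
            time.sleep(delay)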

@dmargala (Contributor) commented

I haven't heard about other reports at NERSC that sound familiar.

My initial suspicion would be a job dependency issue. For the main-dark-17625 26153196 case, it looks like the timestamp on the input file is more recent than the job end time 2024-05-29T07:38:58. Is that expected?

-rw-r----- 1 desi desi 1.2G May 29 10:25 /global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits
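
(For anyone repeating this check on the other jobs: the comparison is just the input file's mtime against the job's recorded end time. A minimal, illustrative version using the values from this example:)

import os
from datetime import datetime

# values from the main-dark-17625 / 26153196 example above
path = "/global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits"
job_end = datetime.fromisoformat("2024-05-29T07:38:58")

mtime = datetime.fromtimestamp(os.path.getmtime(path))
if mtime > job_end:
    print(f"input file modified at {mtime}, after the job ended at {job_end}")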
