From Chaskiel re: an error that recently occurred on CMU prod
The error was triggered by an operating system error (an LDAP lookup failed when tango ssh'd to the docker node). There has been only one such failure since the beginning of the semester.
ERROR|2024-02-08 10:43:41,556|TangoREST|addJob request failed: Command '['ssh', '-o', 'BatchMode=yes', '-i', '/usr/local/lib/Tango/vmms/id_rsa', '-o', 'StrictHostKeyChecking=no', '-o', 'GSSAPIAuthentication=no', '[email protected]', '(docker images)']' returned non-zero exit status 255.
For the devs:
The failure was in getImages, not in waitVM (which has retry logic) or in a later step (where retries are more complicated because you'd have to restart the process from the copyIn phase).
It may make sense for DistDocker.getImages (and maybe DistDocker.getVms) to skip backends that are unavailable.
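
A rough sketch of what "skip backends that are unavailable" could look like, not Tango's actual code: the function names `ssh_docker_images` and `collect_images`, the `hosts` list, and the timeout value are all hypothetical, and the real DistDocker methods may be structured differently. The idea is to treat a failed ssh call (such as the exit status 255 above) as "this host is unavailable right now" and carry on with the remaining hosts.

```python
import subprocess


def ssh_docker_images(host, timeout=10):
    """Run `docker images` on `host` over ssh and return the raw output.

    Hypothetical helper: the ssh options here are a reduced subset of the
    ones in the logged command, and `host` is a plain "user@hostname" string.
    """
    cmd = ["ssh", "-o", "BatchMode=yes", host, "(docker images)"]
    return subprocess.check_output(cmd, timeout=timeout, text=True)


def collect_images(hosts, run=ssh_docker_images):
    """Query every host, skipping the ones whose command fails.

    Returns (images_by_host, skipped_hosts) so the caller can log or
    alert on the skipped hosts instead of failing the whole request.
    """
    images_by_host = {}
    skipped_hosts = []
    for host in hosts:
        try:
            images_by_host[host] = run(host)
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired,
                OSError):
            # Non-zero exit (e.g. 255 from ssh), a hang, or a local
            # failure to launch ssh: mark the host as unavailable.
            skipped_hosts.append(host)
    return images_by_host, skipped_hosts
```

With something like this, the caller would probably still want to fail the request if every host ends up in `skipped_hosts`, since at that point there is no backend left to answer.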