This repository has been archived by the owner on May 8, 2024. It is now read-only.

[Bug]: One SPS deployment can accidentally execute another SPS deployment's deployed process containers #230

Open
ryanghunter opened this issue Aug 16, 2023 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@ryanghunter
Contributor

ryanghunter commented Aug 16, 2023

Checked for duplicates

Yes - I've already checked

Describe the bug
When a process with the same id and processVersion is deployed at different times to two different SPS systems, the SPS system that deployed it first can end up executing the second SPS system's process instead of its own.

This is because the two SPS systems share an image repository, so they can overwrite each other's process images. It becomes a problem when workers from the first SPS system to deploy haven't pulled the image locally before the second SPS system overwrites it in the remote image repository.
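To make the failure mode concrete, here's a minimal sketch of the sequence (the shared registry hostname is made up for illustration):

```
# Deployment A builds its version of the process and pushes it to the
# shared registry (hostname is hypothetical)
docker tag process-id:tag shared-registry.example.com/process-id:tag
docker push shared-registry.example.com/process-id:tag

# Deployment B later pushes a different build under the exact same name:tag,
# silently replacing A's image in the shared repository
docker tag process-id:tag shared-registry.example.com/process-id:tag
docker push shared-registry.example.com/process-id:tag

# Any deployment A worker that hasn't already cached the image now pulls
# B's image when it executes A's process
docker pull shared-registry.example.com/process-id:tag
```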

What did you expect?

The workers in the first SPS system should run the process deployed to the first SPS system, not the process deployed to the second SPS system.

Reproducible steps

  1. Deploy two instances of SPS, or use two existing instances: deployment A and deployment B
  2. Deploy a process, process-id:tag, through deployment A's WPS-T server
  3. Use kubectl exec to log into the WPS-T server container
  4. Run docker image ls and note the image id of the deployed process
  5. Ensure these process images don't exist on deployment A's verdi workers by using a fresh worker deployment or manually removing images with docker image rm -f {image id}
  6. Deploy the same process with the same id and tag, process-id:tag, to deployment B's WPS-T server
  7. Repeat steps 3 & 4 for the WPS-T server in deployment B, noting the image id of the deployed process
  8. Send a job execution request to the process-id:tag WPS-T process endpoint of deployment A and note the job id
  9. Find the verdi worker that executed the job by using kubectl exec and searching for the job id in the verdi worker logs
  10. Run docker image ls and see that the image id of the process image matches deployment B's image id, not deployment A's image id as it should (a consolidated command sketch follows this list)
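A minimal consolidation of steps 3-5 and 9-10, assuming hypothetical pod names; substitute the real ones from kubectl get pods:

```
# Steps 3-4: note the image id on deployment A's WPS-T server
kubectl exec <wps-t-server-pod-A> -- docker image ls | grep process-id

# Step 5: make sure deployment A's verdi workers have no cached copy
kubectl exec <verdi-worker-pod-A> -- docker image rm -f <image id>

# Step 9: find the verdi worker that executed the job
kubectl logs <verdi-worker-pod-A> | grep <job id>

# Step 10: the cached image id matches deployment B's image, not deployment A's
kubectl exec <verdi-worker-pod-A> -- docker image ls | grep process-id
```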
@ryanghunter ryanghunter added the bug Something isn't working label Aug 16, 2023
@ryanghunter
Contributor Author

I have a working proof of concept for a fix to this.

I looked into getting config through SSM params like we had talked about, but it had some complications we didn't originally consider:

To look up values from SSM, the container would need to know which deployment it's part of. For example, should the jobs database notification topic ARN be retrieved from /unity/sps/dev-deployment-ryan/jobsTopicArn or from /unity/sps/dev-deployment-drew/jobsTopicArn? So SSM param locations can't be hard-coded, and the job would still need to learn its SPS deployment name somehow. That's the same problem we're trying to solve by using SSM in the first place: getting deployment-specific config info to a job.
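To illustrate the ambiguity with the AWS CLI (the two parameter paths are the ones from the example above):

```
# The container can't choose between these without already knowing its own
# SPS deployment name -- which is exactly the config we're trying to deliver
aws ssm get-parameter --name /unity/sps/dev-deployment-ryan/jobsTopicArn
aws ssm get-parameter --name /unity/sps/dev-deployment-drew/jobsTopicArn
```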

Thinking through the options, I see two:

  1. Throw an SPS config file on the workers and mount it into the job the same way we mount /stage and /tmp, then consume it with create_cwl_yml the way we consume the baked-in env variables now
  2. Add SPS config params to the process (i.e. to the hysds job-spec) when submitting to the ADES, and add the param values to job execution requests sent to the ADES (i.e. to the hysds-io)

At the end of the day, any solution ends with the config values being added to the job's workflow input yaml. I think option 2 is a cleaner, simpler, and more transparent way of doing that.
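As a purely illustrative sketch of that end state (the file path, key names, and values below are all made up), the deployment-specific config would simply show up alongside the normal inputs in the generated workflow input YAML:

```
# Hypothetical peek at a job's generated workflow input YAML under option 2
cat /tmp/workflow_inputs.yml
# sps_deployment_name: dev-deployment-ryan
# jobs_topic_arn: arn:aws:sns:us-west-2:111122223333:dev-deployment-ryan-jobs
# ... plus the normal process inputs
```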

I wrote a proof of concept for option 2 and it's passing regression tests. If we decide we want to go this route I'll clean it up and open a PR.

@LucaCinquini
Collaborator

Thanks, Ryan. I assume this means the wrapper CWL needs to be modified? That might be OK. Let's discuss on Monday.
