jobs could request additional managed resources #51

Open
jclulow opened this issue Mar 5, 2024 · 0 comments
jclulow commented Mar 5, 2024

Sometimes a job requires additional resources beyond what can be created within a specific target environment. For example, a job may require access to create and destroy resources on an Oxide rack.

The Oxide rack requires an authentication token (representing a user with particular permissions) to use. It allows resources to be isolated inside various containers, like silos or projects. We would like to provide a pristine user account to a job, with a token that it is (relatively) safe to leak, and to clean up any mess made by the job afterwards.

In the buildomat model, the system itself creates any tokens that are required, rather than providing a generic store for "secure" strings. These tokens are created with the minimal required permissions and in a way where their use is bounded to roughly the execution period of the job itself.

Today, the execution environment for a job (called a worker) is created by a factory, based on the target requested by the job; e.g., helios-2.0 or ubuntu-22.04. The factory is responsible for creating and tearing down the computing resources (e.g., VMs, or network booted physical hosts) required to provide these environments. The core buildomat server keeps track of each worker across its life cycle, so that we don't drop any resources until they are fully cleaned up.

Rather than create more targets that happen to provide resources like a token to use the Oxide rack, we should instead create a new orthogonal concept: the resource. A resource would be notionally similar to a worker in many respects:

  • jobs would request an instance of a particular resource in their declarative configuration, like they request a target today
  • some process analogous to a factory (a well?) would be responsible for creating resources based on instructions from the core server; a resource ID (like a worker ID) would be allocated to track the life cycle
  • specific resources may require specific named privileges, like some targets do today; e.g., the lab family of targets require the target.lab privilege so only certain repositories are allowed to use them
  • if only a limited number of instances of a resource can exist, the well would be responsible for managing that back pressure, like the factory does today for execution environments
  • jobs would wait in the queued state for all of the resources they need, like they do today for workers
  • the well would provide access to details about the resource in the form of metadata, like a factory may, and it would be obtained by the job program itself through the use of the bmat command
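
As a rough illustration of the role a well might play, here is a minimal Python sketch under stated assumptions. The names `Well`, `acquire`, `metadata`, and `release`, the `oxide-rack` resource type, and the metadata keys are all hypothetical and not part of buildomat today. The sketch only models a bounded pool that allocates tracked resource IDs and declines new requests when full, which is the point at which a job would remain queued.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Well:
    """Hypothetical stand-in for the process that creates resources."""

    resource_type: str
    capacity: int  # maximum number of live instances
    _next_id: int = 0
    _live: Dict[str, dict] = field(default_factory=dict)

    def acquire(self, job_id: str) -> Optional[str]:
        """Create an instance for a job, or return None if the well is full."""
        if len(self._live) >= self.capacity:
            return None  # back pressure: the job stays queued
        self._next_id += 1
        resource_id = f"{self.resource_type}/{self._next_id}"
        # Metadata the job could read later; these keys are invented here.
        self._live[resource_id] = {"job": job_id, "type": self.resource_type}
        return resource_id

    def metadata(self, resource_id: str) -> dict:
        """Return a copy of the details recorded for a live instance."""
        return dict(self._live[resource_id])

    def release(self, resource_id: str) -> None:
        """Tear down an instance once the job is finished with it."""
        del self._live[resource_id]


# "oxide-rack" is a made-up resource type name for illustration.
well = Well(resource_type="oxide-rack", capacity=1)
first = well.acquire("J/1")
second = well.acquire("J/2")  # None: no slot is free yet
well.release(first)
third = well.acquire("J/2")   # succeeds after cleanup
```

In a real design the core server, rather than the well, would presumably own the ID allocation and life cycle records; the sketch folds them together for brevity.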

It will be important to consider the way resources will be acquired by waiting jobs, to avoid deadlock. Probably something like this policy would suffice:

  • all resources requested by a job must be acquired serially, in lexicographical order by their resource type ID
  • only once all job resources are acquired by the job can we begin to request a worker for the job to run in
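
The policy above could be sketched roughly as follows. This is an illustrative Python sketch, not buildomat code: `acquisition_order`, `try_advance`, and the resource type IDs are hypothetical, and `sorted` is used as the lexicographical ordering on the assumption that type IDs are plain ASCII strings.

```python
from typing import Callable, List


def acquisition_order(resource_types: List[str]) -> List[str]:
    """Return resource type IDs in the order they must be acquired."""
    return sorted(resource_types)


def try_advance(
    wanted: List[str],
    held: List[str],
    try_acquire: Callable[[str], bool],
) -> bool:
    """Acquire remaining resources one at a time, in order.

    Returns True only once every resource is held, which is the point at
    which a worker may be requested for the job.
    """
    for rtype in acquisition_order(wanted):
        if rtype in held:
            continue
        if not try_acquire(rtype):
            return False  # stop here; never skip ahead to a later type
        held.append(rtype)
    return True


# Hypothetical resource type IDs; "dns-zone" sorts before "oxide-rack".
available = {"oxide-rack": True, "dns-zone": False}
held: List[str] = []
ready_first = try_advance(["oxide-rack", "dns-zone"], held, lambda t: available[t])
held_after_first = list(held)  # still empty: the first type was unavailable

available["dns-zone"] = True
ready_second = try_advance(["oxide-rack", "dns-zone"], held, lambda t: available[t])
```

Note that on the first attempt the job does not take `oxide-rack` even though an instance is free, because an earlier type in the ordering is still outstanding. That is what keeps two jobs from each holding a resource the other needs.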

If we were to allow parallel acquisition of resources and workers, it would seem pretty easy to end up with a scenario like:

  • job J/1 holds an instance of resource R/1, but needs a worker for target T/1
  • job J/2 has acquired a worker for target T/1 already, but needs an instance of resource R/1
  • if the factory for T/1 and the well for R/1 each have only a single available slot, neither job can make progress, and we would experience deadlock
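
The scenario above can be checked with a small wait-for graph. This is a sketch under stated assumptions: both R/1 and T/1 are taken to have exactly one slot, and `waits_for` and `has_cycle` are hypothetical helpers, not anything in buildomat. A job waits for another job when that job holds the thing it wants, and a cycle in the graph means deadlock.

```python
from typing import Dict, Set


def waits_for(holds: Dict[str, Set[str]], wants: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each waiting job to the jobs holding the single slot it wants."""
    edges: Dict[str, Set[str]] = {}
    for job, wanted in wants.items():
        edges[job] = {
            other for other, held in holds.items()
            if other != job and wanted in held
        }
    return edges


def has_cycle(edges: Dict[str, Set[str]]) -> bool:
    """Depth-first search for a cycle in the wait-for graph."""
    visiting: Set[str] = set()
    done: Set[str] = set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True  # reached a job already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in edges.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in edges)


wants = {"J/1": "T/1", "J/2": "R/1"}

# Parallel acquisition: each job holds what the other needs.
parallel = has_cycle(waits_for({"J/1": {"R/1"}, "J/2": {"T/1"}}, wants))

# Resources first, worker last: J/2 cannot hold T/1 before it has R/1.
ordered = has_cycle(waits_for({"J/1": {"R/1"}, "J/2": set()}, wants))
```

Under the parallel case the graph contains the cycle J/1 → J/2 → J/1. Under the ordered case J/2 simply waits behind J/1, which is free to obtain the worker and eventually release R/1.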
@jclulow jclulow self-assigned this Mar 5, 2024