jobs could request additional managed resources #51

jclulow · 2024-03-05T21:25:48Z

Sometimes a job requires additional resources beyond what can be created within a specific target environment. For example, a job may need to create and destroy resources on an Oxide rack.

Using the Oxide rack requires an authentication token (representing a user with particular permissions). The rack allows resources to be isolated inside various containers, like silos or projects. We would like to provide a pristine user account to a job, with a token that it is (relatively) safe to leak, and to clean up any mess made by the job afterwards.

In the buildomat model, the system itself creates any tokens that are required, rather than providing a generic store for "secure" strings. These tokens are created with the minimal required permissions, and their use is bounded to roughly the execution period of the job itself.

Today, the execution environment for a job (called a worker) is created by a factory, based on the target requested by the job; e.g., helios-2.0 or ubuntu-22.04. The factory is responsible for creating and tearing down the computing resources (e.g., VMs, or network-booted physical hosts) required to provide these environments. The core buildomat server keeps track of each worker across its life cycle, so that we don't lose track of any resources until they are fully cleaned up.

Rather than create more targets that happen to provide resources like a token to use the Oxide rack, we should instead create a new orthogonal concept: the resource. A resource would be notionally similar to a worker in many respects:

- jobs would request an instance of a particular resource in their declarative configuration, like they request a target today
- some process analogous to a factory (a well?) would be responsible for creating resources based on instructions from the core server; a resource ID (like a worker ID) would be allocated to track the life cycle
- specific resources may require specific named privileges, like some targets do today; e.g., the lab family of targets require the target.lab privilege so only certain repositories are allowed to use them
- if only a limited number of instances of a resource can exist, the well would be responsible for managing that back pressure, like the factory does today for execution environments
- jobs would wait in the queued state for all of the resources they need, like they do today for workers
- the well would provide access to details about the resource in the form of metadata, like a factory may, and it would be obtained by the job program itself through the use of the bmat command
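
To make the shape of this concrete, here is a rough sketch of what a well might track. It is a toy model in Python rather than anything resembling the real server or factory code; the names `Well`, `Resource`, `try_acquire`, `release`, and the `resource.oxide` privilege used in the usage note are all made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Resource:
    id: int            # resource ID, analogous to a worker ID
    type_id: str       # hypothetical resource type name
    job_id: str        # the job this instance was created for
    metadata: dict = field(default_factory=dict)


class Well:
    """Hypothetical factory analogue that creates and destroys resources."""

    def __init__(self, type_id: str, capacity: int,
                 required_privilege: Optional[str] = None):
        self.type_id = type_id
        self.capacity = capacity
        self.required_privilege = required_privilege
        self._next_id = count(1)
        self._live = {}  # resource ID -> Resource, until fully cleaned up

    def try_acquire(self, job_id: str, privileges: set) -> Optional[Resource]:
        # Named privilege check, like target.lab for the lab targets.
        if self.required_privilege and self.required_privilege not in privileges:
            raise PermissionError(f"job {job_id} lacks {self.required_privilege}")
        # Back pressure: no free slot, so the job stays queued.
        if len(self._live) >= self.capacity:
            return None
        res = Resource(next(self._next_id), self.type_id, job_id,
                       metadata={"note": "details the job would read back"})
        self._live[res.id] = res
        return res

    def release(self, resource_id: int) -> None:
        # Tear down the instance and free the slot for the next job.
        self._live.pop(resource_id, None)
```

With `Well("oxide-rack", capacity=1, required_privilege="resource.oxide")`, a second job calling `try_acquire` gets `None` until the first instance is released, which is the point at which it would remain in the queued state.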

It will be important to consider the way resources will be acquired by waiting jobs, to avoid deadlock. Probably something like this policy would suffice:

- all resources requested by a job must be acquired serially, in lexicographical order by their resource type ID
- only once all job resources are acquired by the job can we begin to request a worker for the job to run in
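
The policy above could be sketched roughly as follows. This is only an illustration in Python; `next_step` and the resource type IDs in the usage note are hypothetical, and the real scheduling would live in the core server.

```python
def next_step(requested, held):
    """Return what a queued job should wait for next.

    requested: set of resource type IDs the job asked for
    held: set of resource type IDs the job has already acquired
    """
    # Acquire serially, in lexicographical order by resource type ID.
    for type_id in sorted(requested):
        if type_id not in held:
            return ("resource", type_id)
    # Only once every resource is held do we ask for a worker.
    return ("worker", None)
```

For a job that requested the types `b` and `a`, this yields `a` first, then `b`, and a worker only after both are held. Because every job walks the same global order, two jobs can never each hold the resource the other is waiting on.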

If we were to allow parallel acquisition of resources and workers, it would seem pretty easy to end up with a scenario like:

- job J/1 holds an instance of resource R/1, but needs a worker for target T/1
- job J/2 has acquired a worker for target T/1 already, but needs an instance of resource R/1
- either the factory for T/1, or the well for R/1, may only have a single available slot, at which point we would experience deadlock
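
The scenario above can be checked mechanically as a cycle in a wait-for graph. The sketch below is a toy Python model, not buildomat code: `find_deadlock` is a hypothetical helper, and it assumes the worst case where both the factory for T/1 and the well for R/1 have a single slot.

```python
def find_deadlock(holds, wants):
    """Return a cycle of jobs that block each other, or None.

    holds: job -> set of single-slot items (resources or workers) it holds
    wants: job -> the one item it is currently blocked on, or None
    """
    # With one slot per item, each held item has exactly one owner.
    owner = {item: job for job, items in holds.items() for item in items}
    for start in wants:
        seen = []
        job = start
        # Follow "job waits on item held by another job" edges.
        while job is not None and job not in seen:
            seen.append(job)
            job = owner.get(wants.get(job))  # None if the item is free
        if job is not None:
            return seen[seen.index(job):]
    return None


# Parallel acquisition: each job holds what the other needs.
parallel = find_deadlock(
    holds={"J/1": {"R/1"}, "J/2": {"T/1"}},
    wants={"J/1": "T/1", "J/2": "R/1"},
)

# Resources strictly before workers: J/2 cannot hold T/1 yet.
ordered = find_deadlock(
    holds={"J/1": {"R/1"}, "J/2": set()},
    wants={"J/1": "T/1", "J/2": "R/1"},
)
```

Here `parallel` is the cycle containing J/1 and J/2, while `ordered` is `None`: J/1 gets the free worker slot, runs, and releases R/1 for J/2.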

jclulow self-assigned this Mar 5, 2024
