Project roadmap #71

cmelone · 2024-07-31T16:56:53Z

This is a tracking issue used to document the current set of features we would like to integrate into gantry.

This thread should also be used to discuss new directions for the project.

Plan

In the pilot phase, we will implement predictions for requests only, and ensure that a predicted request is never lower than the current allocation.
If we see success in the pilot, we'll implement functionality which retries jobs with higher memory allocations if they've been shown to fail due to OOM kills.
Then, we will "drop the floor" and allow the predictor to allocate less memory than the package currently receives. At this step, requests will be fully implemented.
Limits for CPU and memory will be implemented.
Next, we want to introduce some experimentation in the system and perform a scaling study.
Finally, we will design a scheduler that decides which instance type a job should be placed on, based on cost, expected usage, and expected runtime.
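
To make the steps above more concrete, here is a rough Python sketch of three of them: the pilot-phase floor, the OOM retry, and a cost-based instance choice. Nothing here comes from the gantry codebase. The names `Allocation`, `apply_floor`, `retry_allocation`, and `pick_instance`, the 1.5x retry factor, and the instance table layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Allocation:
    """Hypothetical container for a job's resource requests."""
    cpu_request: float  # cores
    mem_request: float  # megabytes


def apply_floor(predicted: Allocation, current: Allocation) -> Allocation:
    """Pilot-phase rule: a prediction may raise a request but never lower it."""
    return Allocation(
        cpu_request=max(predicted.cpu_request, current.cpu_request),
        mem_request=max(predicted.mem_request, current.mem_request),
    )


def retry_allocation(
    failed: Allocation, oom_killed: bool, factor: float = 1.5
) -> Allocation:
    """Retry step: raise memory only when the job failed due to an OOM kill.

    The growth factor is an assumed value, not one taken from the project.
    """
    if not oom_killed:
        return failed
    return Allocation(failed.cpu_request, failed.mem_request * factor)


# Assumed instance table layout: name -> (cores, memory in MB, cost per hour).
InstanceTable = Dict[str, Tuple[float, float, float]]


def pick_instance(
    instances: InstanceTable, alloc: Allocation, runtime_hours: float
) -> Optional[Tuple[str, float]]:
    """Return the cheapest instance type that fits the job, with its expected cost.

    Returns None when no instance type is large enough.
    """
    best: Optional[Tuple[str, float]] = None
    for name, (cores, mem_mb, hourly_cost) in instances.items():
        if cores < alloc.cpu_request or mem_mb < alloc.mem_request:
            continue  # job does not fit on this instance type
        cost = hourly_cost * runtime_hours
        if best is None or cost < best[1]:
            best = (name, cost)
    return best
```

"Dropping the floor" would then amount to skipping `apply_floor` and using the prediction directly, with the retry step acting as the safety net for under-allocations.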

The success of this framework can be evaluated against a number of factors:

cmelone self-assigned this Jul 31, 2024

cmelone changed the title ~~Prioritized feature list~~ Project roadmap Jul 31, 2024

cmelone pinned this issue Aug 22, 2024