Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add limits for container resources like cpu. memory, GPU in addition to requests #3168

Closed
shalberd opened this issue Jun 27, 2023 · 9 comments · Fixed by #3202
Closed

add limits for container resources like cpu. memory, GPU in addition to requests #3168

shalberd opened this issue Jun 27, 2023 · 9 comments · Fixed by #3202
Labels
kind:enhancement New feature or request

Comments

@shalberd
Copy link
Contributor

shalberd commented Jun 27, 2023

Is your feature request related to a problem? Please describe.
@kevin-bates Currently, in the Elyra GUI, i.e. for generic pipelines and a component therein, you can set values for CPU, Memory, and GPU and GPU Vendor.

Bildschirmfoto 2023-06-27 um 16 35 51

https://elyra.readthedocs.io/en/stable/user_guide/pipelines.html#resources-cpu-gpu-and-ram

Those values are set as container request sizes, except GPU, which is set as GPU limit and used as GPU limit, as far as I see. That goes against the doc that speaks of all as "resources required".

Resources: CPU, GPU, and RAM

- Resources that the notebook or script requires. RAM takes units of gigabytes (109 bytes).
- Specify a custom Kubernetes GPU vendor, if desired. The default vendor is nvidia.com/gpu. See [this topic in the Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) for more information.
- The values are ignored when the pipeline is executed locally.
- Example: amd.com/gpu

The problem is that not setting any limits, and not giving the user any way to set limits, leads to cluster instability potentially on Kubernetes and Openshift.

Describe the solution you'd like
Add fields and properties for CPU, GPU, Memory limits in GUI, to be used later in processors and templates.
Possibly also add default factor of x1, overwritable of course, for the limits field values, based on requests fields.

Would have to add those new fields / properties to the generic properties template as well.

Describe alternatives you've considered
No real way around it if using the graphical pipeline editors with setting for runtime containers.

Additional context
Came across this when discussing Kubernetes / Openshift Resource Limits setting in an Airflow2 compatible way. Discussion quickly came to, wait a sec, no limits set ...

#3167 (comment)

@shalberd shalberd added the kind:enhancement New feature or request label Jun 27, 2023
@shalberd
Copy link
Contributor Author

shalberd commented Oct 9, 2023

@lresende what do you think? Do you know why historically, only requests, but not limits were added?

@kevin-bates
Copy link
Member

Hi @shalberd. First, I apologize for the delay in responding (new employer and different application).

I happen to be looking at this very topic for one of my other projects and have reached the conclusion that CPU limits should either not be set (assuming the container can be trusted to not go crazy with resources), or set to some value so as to prevent such a rogue action from consuming all CPU resources, but absolutely not the same as the request size. Memory request/limits, on the other hand, is different and it's probably fine to equate the two (so I would argue it could be set w/o UI/metadata changes).

This all has to do with the fact that CPU is a "compressible resource" while memory is not. Here are a couple articles I found really helpful:
https://sysdig.com/blog/kubernetes-limits-requests/
https://home.robusta.dev/blog/stop-using-cpu-limits?nocache=234

I think GPU limits are difference because, IIRC, there isn't a way to "request" GPU.

To introduce CPU limits, we'd either have to introduce a new value in the metadata (and forms) or come up with some kind of factor value (e.g., limit = 2x request) that can be set via an ENV or also exposed (although we should just expose the limit if this is appearing in the UI).

@shalberd
Copy link
Contributor Author

shalberd commented Oct 11, 2023

Hi @kevin-bates

Two new values / fields in forms / GUI would be the approach I suggest, for CPU and memory limits. What the, overridable, filled-in defaults (if even possible with the GUI technology here) are, can be discussed further, you've got good input. But for now, I'd even be happy just being able to set limits myself, at user discretion, non-required.

@lresende does Elyra use Django for its GUI framework? Or Bottle or ... I see jinja templates, but I cannot make heads or tails out of the Web Framework technology used. Maybe indirectly through some Jupyter extension mechanism.

@paloma-rebuelta
Copy link
Contributor

Hello @kevin-bates and @shalberd

I am also being affected by this issue. My kubernetes cluster has argo workflow resource defaults, which means that it will populate both request and limits if they are not specified. In the case of my Elyra pipeline, I am only able to specify the compute requests, which means that the argo controller will populate the limit defaults and the pipeline automatically fails because the requests surpasses those limits. Having the option to specify the limits (or at least have the both equate) would allows to trigger the pipelines as needed.

Thanks a lot for any help!

@kevin-bates
Copy link
Member

kevin-bates commented Dec 20, 2023

Hi @paloma-rebuelta. It is very disappointing to hear that Argo's resource limit handling is implemented that way! They should (at a minimum) provide an option to use the request value as the limit, rather than a blanket fixed limit. It is commonly recommended to NOT set limits, particularly for CPU, so that a node's resources can be fully utilized. (Memory is a bit of a different story.)

Given it appears this PR is hung up on the UI side of things, here's a suggestion that I believe could be easily implemented AND be both backward (and forward once the UI is in place) compatible.

  1. Introduce two new environment variables: ELYRA_NODE_CPU_LIMIT_FACTOR and ELYRA_NODE_MEM_LIMIT_FACTOR. These variables hold float values, default to 0, and are set prior to starting the Jupyter Lab/Elyra instance.
  2. Each env must have a value >= 1.0. Values < 1.0 will be treated as 0, which is equivalent to the variable not being set.
  3. When a node is processed, its request values are set (either from the UI or via defaults). At that time, the envs are checked for values >= 1.0. Each valid limit factor value is multiplied against the corresponding request value and the integer ceiling of the resultant value is used as the corresponding limit value.
  4. Any limit factor values < 1.0 trigger no additional logic than what is performed today. As a result, doing nothing results in today's behavior.

Once the UI portion of the changes have been done, we should fallback on the logic above to determine the limit values when the UI fields are empty. As a result, the logic above essentially becomes the default value for the limits, which could include NOT setting a limit when both the UI field and the envs are not set (or set to values < 1.0).

Would you be willing to contribute such a PR?

@shalberd
Copy link
Contributor Author

shalberd commented Dec 20, 2023

@kevin-bates I agree on your stepwise solution proposed here. If you want to, I can take a crack at it next year, let's say starting January.

@romeokienzler fyi

@paloma-rebuelta if you make a PR before me, that would be ok, but I will, if ok, start with the env var approach and PR in January. Also, since I have recently added GUI value fields myself and know now how those are handled and added I might even combine in the PR adding the new GUI fields, the env vars, and the overall new logic, as @kevin-bates described.

@romeokienzler
Copy link
Member

Thanks a lot @shalberd looking forward

@paloma-rebuelta
Copy link
Contributor

Hi @shalberd I created a PR just now, let me know if you are able to have a look! #3202

@shalberd
Copy link
Contributor Author

shalberd commented Dec 20, 2023

I'll only get to it by January, am traveling now and on holidays. Maybe the others can have a closer look, thank you very much for the work and effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants