| 1 | +# Workloads at Run:ai |
| 2 | + |
| 3 | +Run:ai enhances visibility and simplifies [management](../../Researcher/workloads/managing-workloads.md) by monitoring, presenting, and orchestrating all AI workloads in the clusters it is installed on. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists, and engineers to efficiently support the entire life cycle of an [AI initiative](../../platform-admin/aiinitiatives/overview.md). |
| 4 | + |
| 5 | +## Workloads across the AI lifecycle |
| 6 | + |
| 7 | +A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With Run:ai, research and engineering teams can host and manage all these workloads to achieve the following: |
| 8 | + |
| 9 | +* __Data preparation:__ Aggregating, cleaning, normalizing, and labeling data to prepare for training. |
| 10 | +* __Training:__ Conducting resource-intensive model development and iterative performance optimization. |
| 11 | +* __Fine-tuning:__ Adapting pre-trained models to domain-specific data sets while balancing efficiency and performance. |
| 12 | +* __Inference:__ Deploying models for real-time or batch predictions with a focus on low latency and high throughput. |
| 13 | +* __Monitoring and optimization:__ Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed. |
| 14 | + |
| 15 | +## What is a workload? |
| 16 | + |
| 17 | +A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a [batch job](workload-types.md#training-scaling-resources-for-model-development), allocating resources for [experimentation](workload-types.md#workspaces-the-experimentation-phase) in an integrated development environment (IDE)/notebook, or serving [inference](workload-types.md#inference-deploying-and-serving-models) requests in production. |
| 18 | + |
| 19 | +The workload, defined by the AI practitioner, consists of: |
| 20 | + |
| 21 | +* __Container images:__ This includes the application, its dependencies, and the runtime environment. |
| 22 | +* __Compute resources:__ CPU, GPU, and RAM to execute efficiently and address the workload’s needs. |
| 23 | +* __Data sets:__ The data needed for processing, such as training data sets or input from external databases. |
| 24 | +* __Credentials:__ Access to specific data sources or external services, ensuring proper authentication and authorization. |
| 25 | + |
| 26 | +## Workload scheduling and orchestration |
| 27 | + |
| 28 | +Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient [scheduling and orchestration](../../Researcher/scheduling/the-runai-scheduler.md) of all cluster workloads using the Run:ai Scheduler. The Scheduler prioritizes workloads across the organization's departments and projects at large scale, based on the resource distribution set by the system administrator. TBD links |
| 29 | + |
| 30 | +## Run:ai and third-party workloads |
| 31 | + |
| 32 | +* __Run:ai workloads:__ These workloads are submitted via the Run:ai platform. They are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. Run:ai workloads give administrators a complete Workload and Scheduling Policy solution to ensure that optimization, governance, and security standards are applied. |
| 33 | +* __Third-party workloads:__ These workloads are submitted via third-party applications that use the Run:ai Scheduler. The Run:ai platform manages and monitors these workloads. They enable seamless integration with external tools, giving teams and individuals flexibility. |
| 34 | + |
| 35 | +### Levels of support |
| 36 | + |
| 37 | +Different types of workloads have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in Run:ai. Run:ai workloads are fully supported with all of Run:ai's advanced features and capabilities, while third-party workloads are partially supported. The list of capabilities can change between Run:ai versions. |
| 38 | + |
| 39 | +| Functionality | Workload Type | | | | | |
| 40 | +| ----- | :---: | :---: | :---: |:----------------------:| ----- | |
| 41 | +| | Run:ai workloads | | | | Third-party workloads | |
| 42 | +| | Training - Standard | Workspace | Inference | Training - Distributed | | |
| 43 | +| [Fairness](../../Researcher/scheduling/the-runai-scheduler.md#fairness-fair-resource-distribution) | v | v | v | v | v | |
| 44 | +| [Priority and preemption](../../Researcher/scheduling/the-runai-scheduler.md#preemption) | v | v | v | v | v | |
| 45 | +| [Over quota](../../Researcher/scheduling/the-runai-scheduler.md#over-quota-priority) | v | v | v | v | v | |
| 46 | +| [Node pools](../../platform-admin/aiinitiatives/resources/node-pools.md) | v | v | v | v | v | |
| 47 | +| Bin packing / Spread | v | v | v | v | v | |
| 48 | +| Fractions | v | v | v | v | v | |
| 49 | +| Dynamic fractions | v | v | v | v | v | |
| 50 | +| Node level scheduler | v | v | v | v | v | |
| 51 | +| GPU swap | v | v | v | v | v | |
| 52 | +| Elastic scaling | NA | NA | v | v | v | |
| 53 | +| [Gang scheduling](../../Researcher/scheduling/the-runai-scheduler.md#gang-scheduling) | v | v | v | v | v | |
| 54 | +| [Monitoring](../../admin/maintenance/alert-monitoring.md) | v | v | v | v | v | |
| 55 | +| [RBAC](../../admin/authentication/authentication-overview.md#role-based-access-control-rbac-in-runai) | v | v | v | v | | |
| 56 | +| Workload awareness | v | v | v | v | | |
| 57 | +| [Workload submission](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | | |
| 58 | +| [Workload actions (stop/run)](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | | |
| 59 | +| [Workload Policies](../../platform-admin/workloads/policies/overview.md) | v | v | v | v | | |
| 60 | +| [Scheduling rules](../../platform-admin/aiinitiatives/org/scheduling-rules.md) | v | v | v | v | | |
| 61 | + |
| 62 | +!!! Note |
| 63 | + __Workload awareness__ |
| 64 | + |
| 65 | + Workload-specific visibility, in which the different pods of a workload are identified and treated as a single workload (for example, in GPU utilization, the workload view, and dashboards). |
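
The idea of workload awareness described in the note above can be illustrated with a minimal sketch. It is not Run:ai's implementation: the `Pod` class, its `workload` field, and the `summarize_by_workload` helper are hypothetical names introduced here only to show how several pods might be grouped and reported as a single workload.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod for illustration only."""
    name: str
    workload: str           # assumed: name of the workload the pod belongs to
    gpu_utilization: float  # assumed: GPU utilization in percent (0-100)


def summarize_by_workload(pods):
    """Group pods by workload and report one aggregated entry per workload."""
    grouped = defaultdict(list)
    for pod in pods:
        grouped[pod.workload].append(pod)

    summary = {}
    for workload, members in grouped.items():
        total = sum(p.gpu_utilization for p in members)
        summary[workload] = {
            "pods": len(members),
            "avg_gpu_utilization": total / len(members),
        }
    return summary


# Two pods of one distributed training workload and one workspace pod.
pods = [
    Pod("train-worker-0", "dist-train", 80.0),
    Pod("train-worker-1", "dist-train", 60.0),
    Pod("notebook-0", "workspace-a", 10.0),
]
summary = summarize_by_workload(pods)
```

In this sketch the two `dist-train` pods are reported as one entry with an averaged utilization, which is the kind of single-workload view the note refers to.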