
Commit dfd92ab

Merge pull request #1339 from run-ai/Copy_workloads_articles
Copy workloads articles
2 parents d354613 + 3c7d60d commit dfd92ab

9 files changed: +131 −73 lines changed

docs/platform-admin/workloads/overviews/managing-workloads.md renamed to docs/Researcher/workloads/managing-workloads.md

+2-2
@@ -39,7 +39,7 @@ The Workloads table consists of the following columns:
### Workload status

-The following table describes the different phases in a workload life cycle.
+The following table describes the different phases in a workload life cycle. The UI provides additional details for some of the workload statuses below, which can be viewed by clicking the icon next to the status.

| Status | Description | Entry Condition | Exit Condition |
| :---- | :---- | :---- | :---- |
@@ -97,7 +97,7 @@ Click one of the values in the Data source(s) column, to view the list of data s
* Search - Click SEARCH and type the value to search by
* Sort - Click each column header to sort by
* Column selection - Click COLUMNS and select the columns to display in the table
-* Download table - Click MORE and then click Download as CSV
+* Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.
* Refresh - Click REFRESH to update the table with the latest data
* Show/Hide details - Click to view additional information on the selected row

docs/Researcher/workloads/overviews

-1
This file was deleted.
@@ -0,0 +1,58 @@
# Run:ai workload types

In the world of machine learning (ML), the journey from raw data to actionable insights is a complex process that spans multiple stages. Each stage of the AI lifecycle requires different tools, resources, and frameworks to ensure optimal performance. Run:ai simplifies this process by offering specialized workload types tailored to each phase, facilitating a smooth transition across the various stages of the ML workflow.

The ML lifecycle usually begins with experimental work on data and the exploration of different modeling techniques to identify the best approach for accurate predictions. At this stage, resource consumption is usually moderate, as experimentation is done on a smaller scale. As confidence grows in the model's potential and accuracy, the demand for compute resources increases. This is especially true during the training phase, where vast amounts of data need to be processed, particularly with complex models such as large language models (LLMs), whose huge parameter counts often require distributed training across multiple GPUs to handle the intensive computational load.

Finally, once the model is ready, it moves to the inference stage, where it is deployed to make predictions on new, unseen data. Run:ai's workload types are designed to correspond with the natural stages of this lifecycle. They are structured to align with the specific resource and framework requirements of each phase, ensuring that AI researchers and data scientists can focus on advancing their models without worrying about infrastructure management.

Run:ai offers three workload types, each corresponding to a specific phase of the researcher's work:

* __Workspaces__ – For experimentation with data and models.
* __Training__ – For resource-intensive tasks such as model training and data preparation.
* __Inference__ – For deploying and serving the trained model.

## Workspaces: the experimentation phase

The __Workspace__ is where data scientists conduct initial research, experiment with different data sets, and test various algorithms. This is the most flexible stage in the ML lifecycle, where models and data are explored, tuned, and refined. The value of workspaces lies in the flexibility they offer, allowing the researcher to iterate quickly without being constrained by rigid infrastructure.

* __Framework flexibility__

    Workspaces support a variety of machine learning frameworks, as researchers need to experiment with different tools and methods.

* __Resource requirements__

    Workspaces are often lighter on resources compared to the training phase, but they still require significant computational power for data processing, analysis, and model iteration.

    Hence, by default, Run:ai schedules workspaces as non-preemptible workloads: once resources have been allocated, the workload cannot be preempted. However, this non-preemptible state also means that a workspace cannot utilize resources beyond the project's deserved quota.

See Running workspaces to learn more about how to submit a workspace via the Run:ai platform. For quick starts, see Running Jupyter Notebook using workspaces. TBD links

## Training: scaling resources for model development

As models mature and the need for more robust data processing and model training increases, Run:ai facilitates this shift through the Training workload. This phase is resource-intensive, often requiring distributed computing and high-performance clusters to process vast data sets and train models.

* __Training architecture__

    For training workloads, Run:ai allows you to specify the architecture: standard or distributed. The distributed architecture is relevant for larger data sets and more complex models that require multiple nodes. For the distributed architecture, Run:ai allows you to specify different configurations for the master and workers and to select which framework to use: PyTorch, XGBoost, MPI, or TensorFlow. In addition, as part of the distributed configuration, Run:ai enables researchers to schedule their distributed workloads on nodes within the same region, zone, placement group, or any other topology (see the configuration sketch after this list).

* __Resource requirements__

    Training tasks demand high memory, compute power, and storage. Run:ai ensures that the allocated resources match the scale of the task and allows these workloads to utilize more compute resources than the project's deserved quota. Note, however, that workloads running over quota can be preempted; if you do not want your training workload to be preempted, make sure to request a number of GPUs that is within your project's quota (see the preemption sketch below).

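To make the distributed options above concrete, here is a minimal Python sketch that models the choices the __Training architecture__ bullet lists: framework, separate master and worker configurations, and a topology constraint. The `DistributedConfig` structure and all field names are assumptions invented for this illustration; they are not Run:ai's actual submission schema.

```python
from dataclasses import dataclass
from enum import Enum

class Framework(Enum):
    PYTORCH = "PyTorch"
    XGBOOST = "XGBoost"
    MPI = "MPI"
    TENSORFLOW = "TensorFlow"

@dataclass
class PodSpec:
    replicas: int
    gpus_per_replica: int

@dataclass
class DistributedConfig:
    """Hypothetical model of a distributed training configuration."""
    framework: Framework
    master: PodSpec                # separate configuration for the master...
    workers: PodSpec               # ...and for the workers
    topology_constraint: str | None = None  # e.g. same region, zone, or placement group

config = DistributedConfig(
    framework=Framework.PYTORCH,
    master=PodSpec(replicas=1, gpus_per_replica=1),
    workers=PodSpec(replicas=4, gpus_per_replica=8),
    topology_constraint="same-zone",
)
print(config)
```
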
See Standard training and Distributed training to learn more about how to submit a training workload via the Run:ai UI. For quick starts, see Run your first standard training and Run your first distributed training. TBD

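The preemption rule described under __Resource requirements__ can be expressed in a few lines. The following Python sketch is a conceptual illustration of that rule only, not the Run:ai Scheduler's code; the function and its simple quota model are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class TrainingRequest:
    name: str
    gpus_requested: int

def is_preemptible(request: TrainingRequest, project_quota_gpus: int,
                   project_gpus_in_use: int) -> bool:
    """A training workload stays non-preemptible only while the project's
    total GPU allocation (including this request) fits within its deserved
    quota; anything above quota runs as preemptible, over-quota capacity."""
    total_after_allocation = project_gpus_in_use + request.gpus_requested
    return total_after_allocation > project_quota_gpus

# Example: a project with a deserved quota of 8 GPUs, 6 already in use.
job = TrainingRequest(name="resnet-train", gpus_requested=4)
print(is_preemptible(job, project_quota_gpus=8, project_gpus_in_use=6))        # True: 10 > 8
small_job = TrainingRequest(name="bert-finetune", gpus_requested=2)
print(is_preemptible(small_job, project_quota_gpus=8, project_gpus_in_use=6))  # False: 8 <= 8
```
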
## Inference: deploying and serving models

Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end users or other systems.

* __Inference-specific use cases__

    Naturally, inference workloads must change and adapt to ever-changing demands in order to meet SLAs. For example, additional replicas may be deployed, manually or automatically, to increase compute resources as part of a horizontal scaling approach, or a new version of the deployment may need to be rolled out without affecting the running services (a sketch of such a replica calculation follows this list).

* __Resource requirements__

    Inference models differ in size and purpose, leading to varying computational requirements. For example, small OCR models can run efficiently on CPUs, whereas LLMs typically require significant GPU memory for deployment and serving. Inference workloads are considered production-critical and are given the highest priority to ensure compliance with SLAs. Additionally, Run:ai ensures that inference workloads cannot be preempted, maintaining consistent performance and reliability.

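As an illustration of the horizontal scaling approach mentioned above, here is a minimal Python sketch of the kind of replica calculation an autoscaler performs. It is a generic sketch, not Run:ai's autoscaling implementation; the requests-per-second target and replica bounds are assumptions for this example.

```python
import math

def desired_replicas(current_rps: float, target_rps_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Classic horizontal-scaling rule: provision enough replicas so that
    each one handles at most the target load, clamped to configured bounds."""
    needed = math.ceil(current_rps / target_rps_per_replica)
    return max(min_replicas, min(needed, max_replicas))

# Example: traffic at 450 requests/s, each replica sized for 100 requests/s.
print(desired_replicas(current_rps=450, target_rps_per_replica=100))  # 5
```
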
See Deploy a custom inference workload to learn more about how to submit an inference workload via the Run:ai UI. For a quick start, see Deploy Llama model. TBD
@@ -0,0 +1,65 @@
# Workloads at Run:ai

Run:ai enhances visibility and simplifies [management](../../Researcher/workloads/managing-workloads.md) by monitoring, presenting, and orchestrating all AI workloads in the clusters it is installed on. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists, and engineers to efficiently support the entire life cycle of an [AI initiative](../../platform-admin/aiinitiatives/overview.md).

## Workloads across the AI lifecycle

A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

* __Data preparation:__ Aggregating, cleaning, normalizing, and labeling data to prepare for training.
* __Training:__ Conducting resource-intensive model development and iterative performance optimization.
* __Fine-tuning:__ Adapting pre-trained models to domain-specific data sets while balancing efficiency and performance.
* __Inference:__ Deploying models for real-time or batch predictions with a focus on low latency and high throughput.
* __Monitoring and optimization:__ Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

## What is a workload?

A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a [batch job](workload-types.md#training-scaling-resources-for-model-development), allocating resources for [experimentation](workload-types.md#workspaces-the-experimentation-phase) in an integrated development environment (IDE)/notebook, or serving [inference](workload-types.md#inference-deploying-and-serving-models) requests in production.

The workload, defined by the AI practitioner, consists of the following (modeled as a data structure in the sketch after this list):

* __Container images:__ This includes the application, its dependencies, and the runtime environment.
* __Compute resources:__ CPU, GPU, and RAM to execute efficiently and address the workload's needs.
* __Data sets:__ The data needed for processing, such as training data sets or input from external databases.
* __Credentials:__ Access to certain data sources or external services, ensuring proper authentication and authorization.

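As a concrete picture of these four components, the following Python sketch models a workload definition. The class and field names are invented for this illustration and are not Run:ai's actual workload schema.

```python
from dataclasses import dataclass, field

@dataclass
class ComputeResources:
    cpu_cores: float
    gpu_count: int
    memory_gib: float

@dataclass
class WorkloadDefinition:
    """Conceptual model of the four pieces an AI practitioner defines."""
    name: str
    namespace: str
    container_image: str                  # application + dependencies + runtime
    resources: ComputeResources           # CPU, GPU, and RAM needs
    data_sets: list[str] = field(default_factory=list)        # mounted data sources
    credential_refs: list[str] = field(default_factory=list)  # secrets for data/services

workload = WorkloadDefinition(
    name="jupyter-experiment",
    namespace="team-a",
    container_image="jupyter/scipy-notebook:latest",
    resources=ComputeResources(cpu_cores=4, gpu_count=1, memory_gib=16),
    data_sets=["pvc://training-data"],
    credential_refs=["s3-readonly-token"],
)
print(workload.name)
```
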
## Workload scheduling and orchestration

Run:ai's core mission is to optimize AI resource usage at scale. This is achieved through efficient [scheduling and orchestration](../../Researcher/scheduling/the-runai-scheduler.md) of all cluster workloads using the Run:ai Scheduler. The Scheduler allows the prioritization of workloads across different departments and projects within the organization at large scale, based on the resource distribution set by the system administrator. TBD links

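To give a feel for the prioritization described above, the following Python sketch orders a queue of pending workloads by in-quota status and project priority. It is a deliberately simplified illustration, not the Run:ai Scheduler's actual algorithm; the names and the two-key ordering are assumptions for this example.

```python
from dataclasses import dataclass

@dataclass
class PendingWorkload:
    name: str
    project_priority: int      # higher value = scheduled earlier
    within_quota: bool         # in-quota requests beat over-quota ones

def schedule_order(queue: list[PendingWorkload]) -> list[PendingWorkload]:
    """Sort so that in-quota workloads come first, then by descending
    project priority set by the administrator's resource distribution."""
    return sorted(queue, key=lambda w: (not w.within_quota, -w.project_priority))

queue = [
    PendingWorkload("over-quota-train", project_priority=5, within_quota=False),
    PendingWorkload("prod-inference", project_priority=9, within_quota=True),
    PendingWorkload("notebook", project_priority=5, within_quota=True),
]
for w in schedule_order(queue):
    print(w.name)  # prod-inference, notebook, over-quota-train
```
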
## Run:ai and third-party workloads

* __Run:ai workloads:__ These workloads are submitted via the Run:ai platform and are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs (see the sketch after this list). When using Run:ai workloads, a complete workload and scheduling policy solution is offered, allowing administrators to ensure that optimization, governance, and security standards are applied.
* __Third-party workloads:__ These workloads are submitted via third-party applications that use the Run:ai Scheduler. The Run:ai platform manages and monitors them, enabling seamless integration with external tools and giving teams and individuals flexibility.

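Because Run:ai workloads are represented by Kubernetes CRDs, they can in principle be inspected with standard Kubernetes tooling. The Python sketch below lists custom objects using the official `kubernetes` client; the CRD group, version, namespace, and plural shown are placeholder assumptions and should be replaced with the CRD names actually installed in your cluster.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (requires cluster access).
config.load_kube_config()

api = client.CustomObjectsApi()

# Hypothetical CRD coordinates -- substitute the group/version/plural
# of the Run:ai CRDs installed in your cluster.
workloads = api.list_namespaced_custom_object(
    group="run.ai",
    version="v2alpha1",
    namespace="runai-team-a",
    plural="trainingworkloads",
)

for item in workloads.get("items", []):
    print(item["metadata"]["name"])
```
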
### Levels of support

Different types of workloads have different levels of support, so it is important to understand which capabilities are needed before selecting the workload type to work with. The table below details the level of support for each workload type in Run:ai: Run:ai workloads are fully supported with all of Run:ai's advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between different Run:ai versions.

| Functionality | Training - Standard (Run:ai) | Workspace (Run:ai) | Inference (Run:ai) | Training - Distributed (Run:ai) | Third-party workloads |
| ----- | :---: | :---: | :---: | :---: | :---: |
| [Fairness](../../Researcher/scheduling/the-runai-scheduler.md#fairness-fair-resource-distribution) | v | v | v | v | v |
| [Priority and preemption](../../Researcher/scheduling/the-runai-scheduler.md#preemption) | v | v | v | v | v |
| [Over quota](../../Researcher/scheduling/the-runai-scheduler.md#over-quota-priority) | v | v | v | v | v |
| [Node pools](../../platform-admin/aiinitiatives/resources/node-pools.md) | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Fractions | v | v | v | v | v |
| Dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| GPU swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| [Gang scheduling](../../Researcher/scheduling/the-runai-scheduler.md#gang-scheduling) | v | v | v | v | v |
| [Monitoring](../../admin/maintenance/alert-monitoring.md) | v | v | v | v | v |
| [RBAC](../../admin/authentication/authentication-overview.md#role-based-access-control-rbac-in-runai) | v | v | v | v | |
| Workload awareness | v | v | v | v | |
| [Workload submission](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| [Workload actions (stop/run)](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| [Workload Policies](../../platform-admin/workloads/policies/overview.md) | v | v | v | v | |
| [Scheduling rules](../../platform-admin/aiinitiatives/org/scheduling-rules.md) | v | v | v | v | |

!!! Note
    __Workload awareness__

    Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example, GPU utilization, workload view, dashboards).

docs/platform-admin/aiinitiatives/overview.md

+3
@@ -117,3 +117,6 @@ The following organizational example consists of 5 optional scopes:
!!! Note
When a scope is selected, the very same unit, including all of its subordinates (both existing and any future subordinates, if added), is selected as well.

+## Next Steps
+
+Now that resources are grouped into node pools, organizational units or business initiatives are mapped into projects and departments, projects' quota parameters are set per node pool, and users are assigned to projects, you can finally [submit workloads](../../Researcher/workloads/managing-workloads.md) from a project and use compute resources to run your AI initiatives.
Binary file not shown.
