Merge pull request #1339 from run-ai/Copy_workloads_articles
Copy workloads articles
SherinDaher-Runai authored Jan 3, 2025
2 parents d354613 + 3c7d60d commit dfd92ab
Showing 9 changed files with 131 additions and 73 deletions.
Binary file added docs/Researcher/workloads/img/workload-table.png
@@ -39,7 +39,7 @@ The Workloads table consists of the following columns:

### Workload status

The following table describes the different phases in a workload life cycle.
The following table describes the different phases in a workload life cycle. The UI provides additional details for some of the workload statuses below, which can be viewed by clicking the icon next to the status.

| Status | Description | Entry Condition | Exit Condition |
| :---- | :---- | :---- | :---- |
@@ -97,7 +97,7 @@ Click one of the values in the Data source(s) column, to view the list of data s
* Search - Click SEARCH and type the value to search by
* Sort - Click each column header to sort by
* Column selection - Click COLUMNS and select the columns to display in the table
* Download table - Click MORE and then Click Download as CSV
* Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.
* Refresh - Click REFRESH to update the table with the latest data
* Show/Hide details - Click to view additional information on the selected row

1 change: 0 additions & 1 deletion docs/Researcher/workloads/overviews

This file was deleted.

58 changes: 58 additions & 0 deletions docs/Researcher/workloads/workload-types.md
@@ -0,0 +1,58 @@
# Run:ai workload types

In the world of machine learning (ML), the journey from raw data to actionable insights is a complex process that spans multiple stages. Each stage of the AI lifecycle requires different tools, resources, and frameworks to ensure optimal performance. Run:ai simplifies this process by offering specialized workload types tailored to each phase, facilitating a smooth transition across the various stages of the ML workflow.

The ML lifecycle usually begins with experimental work on data and exploration of different modeling techniques to identify the best approach for accurate predictions. At this stage, resource consumption is usually moderate as experimentation is done on a smaller scale. As confidence grows in the model's potential and accuracy, the demand for compute resources increases. This is especially true during the training phase, where vast amounts of data must be processed. Complex models such as large language models (LLMs), with their huge parameter counts, often require distributed training across multiple GPUs to handle the intensive computational load.

Finally, once the model is ready, it moves to the inference stage, where it is deployed to make predictions on new, unseen data. Run:ai's workload types are designed to correspond with the natural stages of this lifecycle. They are structured to align with the specific resource and framework requirements of each phase, ensuring that AI researchers and data scientists can focus on advancing their models without worrying about infrastructure management.

Run:ai offers three workload types, each corresponding to a specific phase of the researcher’s work:

* __Workspaces__ – For experimentation with data and models.
* __Training__ – For resource-intensive tasks such as model training and data preparation.
* __Inference__ – For deploying and serving the trained model.

## Workspaces: the experimentation phase

The __Workspace__ is where data scientists conduct initial research, experiment with different data sets, and test various algorithms. This is the most flexible stage in the ML lifecycle, where models and data are explored, tuned, and refined. The value of workspaces lies in the flexibility they offer, allowing the researcher to iterate quickly without being constrained by rigid infrastructure.

* __Framework flexibility__

Workspaces support a variety of machine learning frameworks, as researchers need to experiment with different tools and methods.

* __Resource requirements__

Workspaces are often lighter on resources compared to the training phase, but they still require significant computational power for data processing, analysis, and model iteration.

Hence, by default, Run:ai schedules workspaces as non-preemptible: once resources are allocated, the workload cannot be preempted. However, this non-preemptible state also means a workspace cannot utilize resources beyond the project’s deserved quota, as sketched below.
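
For illustration only, here is a minimal sketch using the legacy `runai submit` CLI. The exact command and flag names (`--interactive`, `--preemptible`, `--project`, `--gpu`) vary between Run:ai versions, so treat them as assumptions to verify against your CLI reference:

```bash
# Sketch: submit an interactive (workspace-style) workload with the default,
# non-preemptible behavior -- it is scheduled within the project's deserved
# quota and is not preempted once its resources are allocated.
runai submit my-workspace --interactive \
  --project team-a \
  --image jupyter/scipy-notebook \
  --gpu 1

# Sketch: the same workload marked preemptible, allowing it to borrow
# over-quota resources at the cost of possible preemption.
runai submit my-workspace-preemptible --interactive --preemptible \
  --project team-a \
  --image jupyter/scipy-notebook \
  --gpu 1
```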

See running workspaces to learn more about how to submit a workspace via the Run:ai platform. For quick starts, see Running Jupyter Notebook using workspaces. TBD links

## Training: scaling resources for model development

As models mature and the need for more robust data processing and model training increases, Run:ai facilitates this shift through the Training workload. This phase is resource-intensive, often requiring distributed computing and high-performance clusters to process vast data sets and train models.

* __Training architecture__

For training workloads, Run:ai allows you to specify the architecture - standard or distributed. The distributed architecture is relevant for larger data sets and more complex models that require multiple nodes. For the distributed architecture, Run:ai allows you to specify different configurations for the master and workers and select which framework to use - PyTorch, XGBoost, MPI, or TensorFlow. In addition, as part of the distributed configuration, Run:ai enables researchers to schedule their distributed workloads on nodes within the same region, zone, placement group, or any other topology (see the sketch after this list).

* __Resource requirements__

Training tasks demand high memory, compute power, and storage. Run:ai ensures that the allocated resources match the scale of the task and allows those workloads to utilize more compute resources than the project’s deserved quota. If you do not want your training workload to be preempted, make sure to request no more GPUs than your project’s deserved quota.
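
As a rough sketch of both points above, the snippets below use the legacy `runai submit` and `runai submit-dist` commands. The flags, the placeholder image names, and the 4-GPU quota figure are assumptions for illustration and should be checked against your Run:ai version:

```bash
# Sketch: distributed PyTorch training -- one master plus three workers,
# each worker requesting 2 GPUs (image name is a placeholder).
runai submit-dist pytorch dist-train \
  --project team-a \
  --image myregistry/pytorch-trainer:latest \
  --workers 3 \
  --gpu 2

# Sketch: standard training within a hypothetical 4-GPU project quota --
# the workload is not preempted once scheduled.
runai submit train-within-quota \
  --project team-a \
  --image myregistry/trainer:latest \
  --gpu 4

# Sketch: requesting 8 GPUs exceeds the hypothetical 4-GPU quota -- the
# extra capacity is borrowed (over quota) and the workload may be preempted
# if the lending project reclaims its resources.
runai submit train-over-quota \
  --project team-a \
  --image myregistry/trainer:latest \
  --gpu 8
```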

See Standard training and Distributed training to learn more about how to submit a training workload via the Run:ai UI. For quick starts, see Run your first standard training and Run your first distributed training. TBD

## Inference: deploying and serving models

Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end-users or other systems.

* __Inference-specific use cases__

Naturally, inference workloads must adapt to ever-changing demand in order to meet their SLAs. For example, additional replicas may be deployed, manually or automatically, to increase compute resources as part of a horizontal scaling approach, or a new version of the deployment may need to be rolled out without affecting the running services (see the sketch after this list).

* __Resource requirements__

Inference models differ in size and purpose, leading to varying computational requirements. For example, small OCR models can run efficiently on CPUs, whereas LLMs typically require significant GPU memory for deployment and serving. Inference workloads are considered production-critical and are given the highest priority to ensure compliance with SLAs. Additionally, Run:ai ensures that inference workloads cannot be preempted, maintaining consistent performance and reliability.
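
As an illustrative sketch only, an inference workload with horizontal autoscaling might be submitted through the Run:ai REST API. The endpoint path, authentication, and JSON field names below are assumptions based on the general shape of the workloads API and must be verified against your cluster's API reference:

```bash
# Hypothetical sketch -- the endpoint path and field names are assumptions,
# not a definitive schema; verify against the Run:ai API reference.
curl -X POST "https://<company-url>/api/v1/workloads/inferences" \
  -H "Authorization: Bearer $RUNAI_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "llm-serving",
        "projectId": "<project-id>",
        "clusterId": "<cluster-id>",
        "spec": {
          "image": "myregistry/llm-server:latest",
          "compute": { "gpuDevicesRequest": 1 },
          "autoscaling": { "minReplicas": 1, "maxReplicas": 4 }
        }
      }'
```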

See Deploy a custom inference workload to learn more about how to submit an inference workload via the Run:ai UI. For quick start, see Deploy Llama model. TBD

65 changes: 65 additions & 0 deletions docs/Researcher/workloads/workloads-at-runai.md
@@ -0,0 +1,65 @@
# Workloads at Run:ai

Run:ai enhances visibility and simplifies [management](../../Researcher/workloads/managing-workloads.md) by monitoring, presenting, and orchestrating all AI workloads in the clusters it is installed on. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists, and engineers to efficiently support the entire life cycle of an [AI initiative](../../platform-admin/aiinitiatives/overview.md).

## Workloads across the AI lifecycle

A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

* __Data preparation:__ Aggregating, cleaning, normalizing, and labeling data to prepare for training.
* __Training:__ Conducting resource-intensive model development and iterative performance optimization.
* __Fine-tuning:__ Adapting pre-trained models to domain-specific data sets while balancing efficiency and performance.
* __Inference:__ Deploying models for real-time or batch predictions with a focus on low latency and high throughput.
* __Monitoring and optimization:__ Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

## What is a workload?

A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a [batch job](workload-types.md#training-scaling-resources-for-model-development), allocating resources for [experimentation](workload-types.md#workspaces-the-experimentation-phase) in an integrated development environment (IDE)/notebook, or serving [inference](workload-types.md#inference-deploying-and-serving-models) requests in production.

The workload, defined by the AI practitioner, consists of:

* __Container images:__ This includes the application, its dependencies, and the runtime environment.
* __Compute resources:__ CPU, GPU, and RAM to execute efficiently and address the workload’s needs.
* __Data sets:__ The data needed for processing, such as training data sets or input from external databases.
* __Credentials:__ Access to certain data sources or external services, ensuring proper authentication and authorization.
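
A minimal sketch of how these four elements might come together in a single submission, using the legacy `runai submit` CLI. The flag names, the PVC syntax, and the credential handling shown here are assumptions that differ between versions:

```bash
# Sketch: one workload combining the four elements above.
# - container image + runtime environment: --image
# - compute resources: --gpu / --cpu / --memory
# - data set: an existing PVC mounted into the container (flag format is an
#   assumption that varies by CLI version)
# - credentials: passed here as an environment variable for brevity; in
#   practice credentials are usually attached via secrets or credential
#   assets configured in the platform
runai submit prep-data \
  --project team-a \
  --image myregistry/data-prep:latest \
  --gpu 1 --cpu 4 --memory 16G \
  --pvc training-data-pvc:/data \
  -e DB_TOKEN="$DB_TOKEN"
```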

## Workload scheduling and orchestration

Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient [scheduling and orchestrating](../../Researcher/scheduling/the-runai-scheduler.md) of all cluster workloads using the Run:ai Scheduler. The Scheduler allows the prioritization of workloads across different departments and projects within the organization at large scales, based on the resource distribution set by the system administrator. TBD links

## Run:ai and third-party workloads

* __Run:ai workloads:__ These workloads are submitted via the Run:ai platform. They are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. When using Run:ai workloads, a complete Workload and Scheduling Policy solution is offered for administrators to ensure optimizations, governance and security standards are applied.
* __Third-party workloads:__ These workloads are submitted via third-party applications that use the Run:ai Scheduler. The Run:ai platform manages and monitors these workloads. They enable seamless integrations with external tools, allowing teams and individuals flexibility.
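
For example, a third-party workload can be as simple as a plain Kubernetes Job that asks to be scheduled by the Run:ai Scheduler. The sketch below assumes the commonly documented scheduler name `runai-scheduler`, a `runai-<project>` namespace, and a project label; the exact label key and namespace convention can differ between Run:ai versions:

```bash
# Sketch: a plain Kubernetes Job handed to the Run:ai Scheduler.
# The schedulerName, namespace, and project label key are assumptions to
# verify against your Run:ai version's documentation.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: third-party-train
  namespace: runai-team-a
spec:
  template:
    metadata:
      labels:
        project: team-a
    spec:
      schedulerName: runai-scheduler
      restartPolicy: Never
      containers:
        - name: trainer
          image: myregistry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
```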

### Levels of support

Different types of workloads have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in Run:ai. Run:ai workloads are fully supported with all of Run:ai's advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between Run:ai versions.

| Functionality | Training - Standard (Run:ai) | Workspace (Run:ai) | Inference (Run:ai) | Training - distributed (Run:ai) | Third-party workloads |
| ----- | :---: | :---: | :---: | :---: | :---: |
| [Fairness](../../Researcher/scheduling/the-runai-scheduler.md#fairness-fair-resource-distribution) | v | v | v | v | v |
| [Priority and preemption](../../Researcher/scheduling/the-runai-scheduler.md#preemption) | v | v | v | v | v |
| [Over quota](../../Researcher/scheduling/the-runai-scheduler.md#over-quota-priority) | v | v | v | v | v |
| [Node pools](../../platform-admin/aiinitiatives/resources/node-pools.md) | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Fractions | v | v | v | v | v |
| Dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| GPU swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| [Gang scheduling](../../Researcher/scheduling/the-runai-scheduler.md#gang-scheduling) | v | v | v | v | v |
| [Monitoring](../../admin/maintenance/alert-monitoring.md) | v | v | v | v | v |
| [RBAC](../../admin/authentication/authentication-overview.md#role-based-access-control-rbac-in-runai) | v | v | v | v | |
| Workload awareness | v | v | v | v | |
| [Workload submission](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| [Workload actions (stop/run)](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| [Workload Policies](../../platform-admin/workloads/policies/overview.md) | v | v | v | v | |
| [Scheduling rules](../../platform-admin/aiinitiatives/org/scheduling-rules.md) | v | v | v | v | |

!!! Note
    __Workload awareness__

    Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example GPU utilization, workload view, dashboards).
3 changes: 3 additions & 0 deletions docs/platform-admin/aiinitiatives/overview.md
@@ -117,3 +117,6 @@ The following organizational example consists of 5 optional scopes:
!!! Note
    When a scope is selected, the very same unit, including all of its subordinates (both existing and any future subordinates, if added), is selected as well.

## Next Steps

Now that resources are grouped into node pools, organizational units or business initiatives are mapped into projects and departments, projects’ quota parameters are set per node pool, and users are assigned to projects, you can finally [submit workloads](../../Researcher/workloads/managing-workloads.md) from a project and use compute resources to run your AI initiatives.
Binary file not shown.
