Merge pull request #1339 from run-ai/Copy_workloads_articles
Copy workloads articles
Showing 9 changed files with 131 additions and 73 deletions.

# Run:ai workload types

In the world of machine learning (ML), the journey from raw data to actionable insights is a complex process that spans multiple stages. Each stage of the AI lifecycle requires different tools, resources, and frameworks to ensure optimal performance. Run:ai simplifies this process by offering specialized workload types tailored to each phase, facilitating a smooth transition across the stages of the ML workflow.

The ML lifecycle usually begins with experimental work on data and the exploration of different modeling techniques to identify the best approach for accurate predictions. At this stage, resource consumption is usually moderate, as experimentation is done on a smaller scale. As confidence in the model’s potential and accuracy grows, the demand for compute resources increases. This is especially true during the training phase, where vast amounts of data must be processed. Complex models such as large language models (LLMs), with their huge parameter counts, often require distributed training across multiple GPUs to handle the intensive computational load.

Finally, once the model is ready, it moves to the inference stage, where it is deployed to make predictions on new, unseen data. Run:ai's workload types are designed to correspond with the natural stages of this lifecycle. They are structured to align with the specific resource and framework requirements of each phase, so that AI researchers and data scientists can focus on advancing their models without worrying about infrastructure management.

Run:ai offers three workload types, each corresponding to a specific phase of the researcher’s work:

* __Workspaces__ – For experimentation with data and models.
* __Training__ – For resource-intensive tasks such as model training and data preparation.
* __Inference__ – For deploying and serving the trained model.

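As an informal illustration of how these three types map to concrete submissions, the minimal Python sketch below posts a workspace to the Run:ai REST API. The endpoint path, payload field names, and the `BASE_URL`/token values are assumptions for illustration only; consult the API reference for your Run:ai version for the exact schema.

```python
import requests

# Assumed values; replace with your tenant URL and a valid API token.
BASE_URL = "https://mycompany.run.ai/api/v1"
HEADERS = {"Authorization": "Bearer <TOKEN>", "Content-Type": "application/json"}

def submit(workload_kind: str, payload: dict) -> dict:
    """POST a workload of the given kind, e.g. 'workspaces', 'trainings',
    or 'inferences' (hypothetical endpoint naming)."""
    resp = requests.post(f"{BASE_URL}/workloads/{workload_kind}",
                         json=payload, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

# A workspace for interactive experimentation (hypothetical field names).
workspace = submit("workspaces", {
    "name": "data-exploration",
    "projectId": "team-a",
    "spec": {
        "image": "jupyter/scipy-notebook",       # placeholder image
        "compute": {"gpuDevicesRequest": 1},
    },
})
```
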
## Workspaces: the experimentation phase

The __Workspace__ is where data scientists conduct initial research, experiment with different data sets, and test various algorithms. This is the most flexible stage in the ML lifecycle, where models and data are explored, tuned, and refined. The value of workspaces lies in the flexibility they offer, allowing the researcher to iterate quickly without being constrained by rigid infrastructure.

* __Framework flexibility__

    Workspaces support a variety of machine learning frameworks, as researchers need to experiment with different tools and methods.

* __Resource requirements__

    Workspaces are often lighter on resources than the training phase, but they still require significant computational power for data processing, analysis, and model iteration.

Hence, by default, Run:ai schedules workspaces as non-preemptible: once resources have been allocated, the workload cannot be preempted. In return, this non-preemptible state does not allow the workload to utilize resources beyond the project’s deserved quota.

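The following is a minimal sketch of that scheduling rule, a conceptual model rather than Run:ai's actual implementation: a non-preemptible workload may only be allocated while the project stays within its deserved quota, whereas a preemptible workload may go over quota on spare capacity.

```python
def can_schedule(requested_gpus: int, project_allocated: int,
                 project_quota: int, preemptible: bool) -> bool:
    """Simplified model of the quota rule described above.

    Non-preemptible workloads (e.g. workspaces by default) must fit
    within the project's deserved quota; preemptible workloads may
    exceed it and run on spare capacity, at the risk of preemption.
    """
    within_quota = project_allocated + requested_gpus <= project_quota
    return within_quota or preemptible

# A 2-GPU workspace in a project that already uses 3 of its 4 quota GPUs:
assert can_schedule(2, project_allocated=3, project_quota=4, preemptible=False) is False
assert can_schedule(2, project_allocated=3, project_quota=4, preemptible=True) is True
```
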
See Running workspaces to learn more about how to submit a workspace via the Run:ai platform. For quick starts, see Running Jupyter Notebook using workspaces. TBD links

## Training: scaling resources for model development

As models mature and the need for more robust data processing and model training increases, Run:ai facilitates this shift through the __Training__ workload. This phase is resource-intensive, often requiring distributed computing and high-performance clusters to process vast data sets and train models.

* __Training architecture__

    For training workloads, Run:ai allows you to specify the architecture: standard or distributed. The distributed architecture is relevant for larger data sets and more complex models that require multiple nodes. For the distributed architecture, Run:ai allows you to specify different configurations for the master and the workers, and to select which framework to use: PyTorch, XGBoost, MPI, or TensorFlow. In addition, as part of the distributed configuration, Run:ai enables researchers to schedule their distributed workloads on nodes within the same region, zone, placement group, or any other topology. A sketch of the per-worker code such a workload typically runs appears after this list.

* __Resource requirements__

    Training tasks demand high memory, compute power, and storage. Run:ai ensures that the allocated resources match the scale of the task and allows these workloads to utilize more compute resources than the project’s deserved quota. If you do not want your training workload to be preempted, make sure to request a number of GPUs that is within your project’s quota.

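To make the distributed option concrete, here is a minimal PyTorch sketch of the code each worker in such a workload would run. It assumes the launcher (for example, the PyTorch training operator used by the cluster) injects the standard `torch.distributed` rendezvous environment variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) into each pod; the model and training loop are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Rendezvous details come from environment variables set by the
    # launcher (assumption: standard torch.distributed conventions).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):                 # placeholder training loop
        x = torch.randn(32, 128, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across workers here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
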
See Standard training and Distributed training to learn more about how to submit a training workload via the Run:ai UI. For quick starts, see Run your first standard training and Run your first distributed training. TBD

## Inference: deploying and serving models

Once a model is trained and validated, it moves to the __Inference__ stage, where it is deployed to make predictions, usually in a production environment. This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end users or other systems.

* __Inference-specific use cases__

    Inference workloads naturally need to adapt to changing demand in order to meet SLAs. For example, additional replicas may be deployed, manually or automatically, to increase compute resources as part of a horizontal scaling approach, or a new version of the deployment may need to be rolled out without affecting the running services. A hedged sketch of such a configuration appears after this list.

* __Resource requirements__

    Inference models differ in size and purpose, leading to varying computational requirements. For example, small OCR models can run efficiently on CPUs, whereas LLMs typically require significant GPU memory for deployment and serving. Inference workloads are considered production-critical and are given the highest priority to ensure compliance with SLAs. Additionally, Run:ai ensures that inference workloads cannot be preempted, maintaining consistent performance and reliability.

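To illustrate the horizontal scaling idea, the sketch below builds an inference submission payload with replica autoscaling. All field names (`autoscaling`, `minReplicas`, `maxReplicas`, the concurrency metric) and the endpoint path are assumptions for illustration; the actual schema depends on your Run:ai version.

```python
import requests

BASE_URL = "https://mycompany.run.ai/api/v1"   # assumed tenant URL
HEADERS = {"Authorization": "Bearer <TOKEN>", "Content-Type": "application/json"}

# Hypothetical inference payload: scale between 1 and 4 replicas based on
# how many concurrent requests each replica is handling.
payload = {
    "name": "ocr-service",
    "projectId": "team-a",
    "spec": {
        "image": "registry.example.com/ocr-server:v2",  # placeholder image
        "compute": {"gpuDevicesRequest": 1},
        "autoscaling": {                 # assumed field names
            "minReplicas": 1,
            "maxReplicas": 4,
            "metric": "concurrency",
            "metricThreshold": 8,        # target concurrent requests per replica
        },
    },
}
resp = requests.post(f"{BASE_URL}/workloads/inferences",
                     json=payload, headers=HEADERS)
resp.raise_for_status()
```
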
See Deploy a custom inference workload to learn more about how to submit an inference workload via the Run:ai UI. For a quick start, see Deploy Llama model. TBD

# Workloads at Run:ai

Run:ai enhances visibility and simplifies [management](../../Researcher/workloads/managing-workloads.md) by monitoring, presenting, and orchestrating all AI workloads in the clusters it is installed on. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists, and engineers to efficiently support the entire life cycle of an [AI initiative](../../platform-admin/aiinitiatives/overview.md).

## Workloads across the AI lifecycle

A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

* __Data preparation:__ Aggregating, cleaning, normalizing, and labeling data to prepare for training.
* __Training:__ Conducting resource-intensive model development and iterative performance optimization.
* __Fine-tuning:__ Adapting pre-trained models to domain-specific data sets while balancing efficiency and performance.
* __Inference:__ Deploying models for real-time or batch predictions with a focus on low latency and high throughput.
* __Monitoring and optimization:__ Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

## What is a workload?

A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a [batch job](workload-types.md#training-scaling-resources-for-model-development), allocating resources for [experimentation](workload-types.md#workspaces-the-experimentation-phase) in an integrated development environment (IDE)/notebook, or serving [inference](workload-types.md#inference-deploying-and-serving-models) requests in production.

The workload, defined by the AI practitioner, consists of the following (a conceptual sketch of these building blocks follows the list):

* __Container images:__ The application, its dependencies, and the runtime environment.
* __Compute resources:__ CPU, GPU, and RAM to execute efficiently and address the workload’s needs.
* __Data sets:__ The data needed for processing, such as training data sets or input from external databases.
* __Credentials:__ Access to data sources or external services, ensuring proper authentication and authorization.

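As an informal illustration of these four building blocks, here is a minimal Python sketch that models a workload definition as a data structure. It is a conceptual aid only, not Run:ai's actual schema; all class and field names are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Compute:
    cpu_cores: float        # e.g. 4.0
    gpu_devices: int        # e.g. 1
    memory_gib: float       # e.g. 16.0

@dataclass
class Workload:
    """Conceptual model of the four building blocks of a workload."""
    name: str
    namespace: str                      # every workload is namespace-scoped
    image: str                          # container image: app + runtime deps
    compute: Compute                    # resources needed to execute
    data_sets: list[str] = field(default_factory=list)    # training data, DB inputs
    credentials: list[str] = field(default_factory=list)  # secrets for data access

workspace = Workload(
    name="data-exploration",
    namespace="runai-team-a",
    image="jupyter/scipy-notebook",
    compute=Compute(cpu_cores=4.0, gpu_devices=1, memory_gib=16.0),
    data_sets=["s3://datasets/images"],
    credentials=["s3-read-only"],
)
```
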
## Workload scheduling and orchestration

Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient [scheduling and orchestration](../../Researcher/scheduling/the-runai-scheduler.md) of all cluster workloads using the Run:ai Scheduler. The Scheduler makes it possible to prioritize workloads across the organization’s departments and projects at large scale, based on the resource distribution set by the system administrator. TBD links

## Run:ai and third-party workloads

* __Run:ai workloads:__ These workloads are submitted via the Run:ai platform and are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. Run:ai workloads come with a complete workload and scheduling policy solution, letting administrators ensure that optimization, governance, and security standards are applied.
* __Third-party workloads:__ These workloads are submitted via third-party applications that use the Run:ai Scheduler. The Run:ai platform manages and monitors them, enabling seamless integration with external tools and giving teams and individuals flexibility. A hedged sketch of such a submission follows this list.

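As an illustration of the third-party path, the sketch below uses the official Kubernetes Python client to create a plain pod that requests the Run:ai Scheduler by name. Assumptions: the scheduler is installed under the name `runai-scheduler`, and the namespace and `project` label follow your cluster's Run:ai project conventions; verify these values against your installation.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="third-party-job",
        namespace="runai-team-a",          # assumed Run:ai project namespace
        labels={"project": "team-a"},      # assumed project label convention
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # hand the pod to the Run:ai Scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="main",
                image="python:3.11-slim",
                command=["python", "-c", "print('scheduled by Run:ai')"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="runai-team-a", body=pod)
```
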
### Levels of support

Different workload types have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in Run:ai. Run:ai workloads are fully supported with all of Run:ai’s advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between Run:ai versions.

| Functionality | Training - Standard (Run:ai) | Workspace (Run:ai) | Inference (Run:ai) | Training - Distributed (Run:ai) | Third-party workloads |
| ----- | :---: | :---: | :---: | :---: | :---: |
| [Fairness](../../Researcher/scheduling/the-runai-scheduler.md#fairness-fair-resource-distribution) | v | v | v | v | v |
| [Priority and preemption](../../Researcher/scheduling/the-runai-scheduler.md#preemption) | v | v | v | v | v |
| [Over quota](../../Researcher/scheduling/the-runai-scheduler.md#over-quota-priority) | v | v | v | v | v |
| [Node pools](../../platform-admin/aiinitiatives/resources/node-pools.md) | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Fractions | v | v | v | v | v |
| Dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| GPU swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| [Gang scheduling](../../Researcher/scheduling/the-runai-scheduler.md#gang-scheduling) | v | v | v | v | v |
| [Monitoring](../../admin/maintenance/alert-monitoring.md) | v | v | v | v | v |
| [RBAC](../../admin/authentication/authentication-overview.md#role-based-access-control-rbac-in-runai) | v | v | v | v | |
| Workload awareness | v | v | v | v | |
| [Workload submission](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| [Workload actions (stop/run)](../../Researcher/workloads/managing-workloads.md) | v | v | v | v | |
| [Workload Policies](../../platform-admin/workloads/policies/overview.md) | v | v | v | v | |
| [Scheduling rules](../../platform-admin/aiinitiatives/org/scheduling-rules.md) | v | v | v | v | |

!!! Note
    __Workload awareness__

    Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example, GPU utilization, workload view, dashboards).