Add tutorial for Enterprise Inference with GenAIExamples #386

129 changes: 129 additions & 0 deletions tutorial/EnterpriseInference/EnterpriseInference_Guide.rst
@@ -0,0 +1,129 @@
.. _EnterpriseInference_Guide:

Enterprise Inference Guide
##########################


Overview
********
`Intel® AI for Enterprise Inference <https://github.com/opea-project/Enterprise-Inference>`_ streamlines and enhances the deployment and management of AI inference services on Intel hardware.
Using Kubernetes orchestration, it automates LLM model deployment for faster inference, provisions compute resources, and applies optimal settings, minimizing complexity and reducing manual effort.

It supports a broad range of Intel hardware platforms, including Intel® Xeon® Scalable processors and Intel® Gaudi® AI Accelerators, ensuring flexibility and scalability to meet diverse enterprise needs.

Intel® AI for Enterprise Inference, powered by OPEA, is compatible with the standard OpenAI APIs, enabling seamless integration with enterprise applications both on premises and in cloud-native environments.
This compatibility allows businesses to leverage the full capabilities of Intel hardware while deploying AI models with ease.
With this suite, enterprises can efficiently configure and evolve their AI infrastructure, adapting to new models and growing demands effortlessly.

.. image:: assets/Enterprise-Inference-Architecture.png
:width: 800
:alt: Enterprise Inference Architecture


How It Works
************
With a single click, Intel® AI for Enterprise Inference leverages several key components to streamline AI inference deployment and management; a quick way to inspect them on a running cluster is sketched after this list:

- **Kubernetes**: Acts as the backbone of the solution, providing container orchestration to automate the deployment, scaling, and management of AI inference services. Kubernetes ensures high availability and efficient resource utilization across the cluster.

- **Intel Gaudi Base Operator**: For deployments utilizing Intel® Gaudi® AI Accelerators, this operator manages the lifecycle of Habana AI resources within the Kubernetes cluster. It optimizes the utilization of Gaudi hardware for AI workloads, ensuring peak performance. (Applicable only to Gaudi-based deployments.)

- **Ingress NGINX Controller**: Serves as a high-performance reverse proxy and load balancer, routing incoming requests to the appropriate services within the cluster. This ensures seamless access to deployed AI models and efficient traffic management.

- **Keycloak**: Provides robust identity and access management capabilities, enabling secure authentication and authorization for accessing AI services and resources within the cluster.

- **APISIX**: Functions as a cloud-native API gateway, handling API traffic with advanced features such as caching and authentication. It ensures efficient and secure access to deployed AI models.

- **Observability**: Offers comprehensive monitoring and visibility into the performance, health, and resource utilization of deployed applications and cluster components. It provides metrics, visualization, and alerting capabilities to maintain operational excellence.

- **Model Deployments**: Automates the deployment and management of AI LLM models within the Kubernetes inference cluster. This enables scalable and reliable AI inference capabilities, allowing enterprises to adapt to growing demands and new models effortlessly.
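
Once the stack is up, these components can be inspected with standard Kubernetes tooling. The commands below are a minimal sketch, assuming ``kubectl`` access to the deployment cluster; namespace and resource names depend on the configuration chosen at install time.

.. code-block:: bash

   # List all workloads across namespaces, including the model-serving,
   # gateway (Ingress NGINX / APISIX), Keycloak, and observability pods
   kubectl get pods --all-namespaces

   # Show the ingress resources that expose the model-serving endpoints
   kubectl get ingress --all-namespaces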

After deployment, the models can be accessed through an OpenAI-compatible API at an HTTPS endpoint. This endpoint and an API key are all that is needed to run inference against the deployed models.

.. image:: assets/API-Endpoints.png
:width: 800
:alt: API Endpoints
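
For example, a deployed model can be queried with ``curl`` through the standard OpenAI chat completions route. This is a minimal sketch; the endpoint, API key, and model ID are placeholders to be replaced with values from your deployment.

.. code-block:: bash

   # Send a chat completion request to a model served behind the HTTPS endpoint
   curl https://<your-https-endpoint>/v1/chat/completions \
     -H "Authorization: Bearer <your-api-key>" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "<model-id-deployed-on-remote-server>",
           "messages": [{"role": "user", "content": "What is OPEA?"}],
           "max_tokens": 128
         }'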


Setting Up a Remote Server or Cluster
*************************************
The first step is to get access to the hardware platform of choice:

- Intel® Gaudi® AI Accelerators
- Intel® Xeon® Scalable processors

This can be an on-premises machine or one provisioned by a cloud service provider.

Next, deploy `Intel® AI for Enterprise Inference <https://github.com/opea-project/Enterprise-Inference>`_ with the desired models.
Note down the HTTPS endpoint and generate an access token.
The HTTPS endpoint may look something like this: https://api.inference.example.com.
This access token will be used as the API key to securely access the deployed models.
If the inference stack is already deployed, then just note down the HTTPS endpoint and access token.
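
Before wiring the endpoint into an application, it can be sanity-checked with a quick request. The sketch below assumes the standard OpenAI-compatible ``/v1/models`` route and uses the example endpoint above; substitute your own endpoint and access token.

.. code-block:: bash

   # Confirm the endpoint is reachable and the access token is valid
   # by listing the models currently being served
   curl https://api.inference.example.com/v1/models \
     -H "Authorization: Bearer <your-access-token>"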


Using Remote Endpoints on OPEA GenAIExamples
********************************************
By default, OPEA GenAIExamples download and deploy models locally on the hardware platform.
To use remote endpoints instead, configure the application to interact with models deployed on a remote server or cluster by specifying the HTTPS endpoint and providing the API key.

For all GenAIExamples, set the following environment variables:

.. code-block:: bash

export OPENAI_API_KEY=<your-api-key>
export REMOTE_ENDPOINT=<your-https-endpoint>

The next steps differ by GenAIExample; each section below lists the GenAIExamples it applies to.

1. Endpoints with Megaservices
++++++++++++++++++++++++++++++
This section applies to the following GenAIExamples:

- ChatQnA

Set additional environment variable(s):

.. code-block:: bash

export LLM_MODEL_ID=<model-id-deployed-on-remote-server>

Run *docker compose* for the example using *compose_remote.yaml* **instead** of the default *compose.yaml*.

For example:

.. code-block:: bash

docker compose -f compose_remote.yaml up -d

This passes the additional environment variables to the **megaservice** so it can access models on the remote server.
Deployment is also quicker because the LLM microservice is not deployed locally.
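
Once the containers are up, the ChatQnA megaservice can be exercised end to end, with text generation served by the remote endpoint. This is a sketch that assumes ChatQnA's default megaservice port ``8888`` and route ``/v1/chatqna``; check the example's README if your deployment differs.

.. code-block:: bash

   # Send a test question to the ChatQnA megaservice on this host;
   # the LLM response is generated by the model on the remote endpoint
   curl http://localhost:8888/v1/chatqna \
     -H "Content-Type: application/json" \
     -d '{"messages": "What is OPEA?"}'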


2. Endpoints with Microservices
+++++++++++++++++++++++++++++++
This section applies to the following GenAIExamples:

- AgentQnA

Set additional environment variable(s) depending on the example:

.. list-table:: Environment Variables for GenAIExamples
:header-rows: 1

* - GenAIExample
- Environment Variable(s)
* - AgentQnA
- export model=<model-id-deployed-on-remote-server>

Run *docker compose* for the example by **appending** the *compose_remote.yaml* file to the original command.

For example:

.. code-block:: bash

docker compose -f compose.yaml -f compose_remote.yaml up -d

This passes the additional environment variables to the **microservices** so they can access models on the remote server.
Deployment is also quicker because the LLM microservice is not deployed locally.
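
To verify that the overlay took effect, list the running services; because text generation is delegated to the remote endpoint, no local LLM serving container (for example, vLLM or TGI) should be present. This is a sketch; service names vary by example and hardware platform.

.. code-block:: bash

   # List the services started by the combined compose files;
   # no local LLM serving container should appear in the output
   docker compose -f compose.yaml -f compose_remote.yaml ps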


Next Steps
**********
Return to the GenAIExample's tutorial or README to access the UI and interact with the application.
LLM text generation is now handled on the remote server.
Consider trying other models and using these remote endpoints to power multiple GenAI applications simultaneously.
11 changes: 6 additions & 5 deletions tutorial/index.rst
@@ -1,9 +1,9 @@
OPEA Tutorial
OPEA Tutorials
##########################

This tutorial will help user learn to deploy and use OPEA quickly.
Tutorials are created to help users learn how to deploy and use OPEA resources quickly.

Provide following tutorials to cover common user cases:
The following tutorials are provided to cover common use cases:

.. toctree::
:maxdepth: 1
@@ -17,14 +17,15 @@ Provide following tutorials to cover common user cases:
DocIndexRetriever/DocIndexRetriever_Guide
VideoQnA/VideoQnA_Guide

Provide following tutorials to cover more advanced features like OPEA Open Telemetry:
These tutorials cover more advanced features:

.. toctree::
:maxdepth: 1

EnterpriseInference/EnterpriseInference_Guide
OpenTelemetry/OpenTelemetry_OPEA_Guide

-----


If you want to learn more, please refer to :doc:`/GenAIExamples/README`.
To learn more, refer to :doc:`/GenAIExamples/README`.