[AKS] kube-scheduler static POD not running for Aliyun GPU Scheduler Extender #224

dsatizabal · 2024-05-30T17:14:51Z

Scenario
I am trying to use the Aliyun GPU scheduler extender so that multiple PODs can share a single NVIDIA T4 GPU. I have a managed AKS cluster with a default NodePool of standard VMs (Standard_D2_v3), and I added a User NodePool with Standard_NC4as_T4_v3 instances. All nodes run Ubuntu 22.04.4. I enabled the default NVIDIA driver installation, and both the driver and nvidia-smi are working.

I am following the instructions given here for the Aliyun installation. I have already enabled SSH access to the GPU nodes, placed the scheduler-policy-config.yaml file into /etc/kubernetes, and placed the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder.
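A quick way to confirm that both files landed where described is a small check run on the GPU node. This is only a sketch: the two paths come from the description above, while the function name `check_files` and the status labels are illustrative assumptions, not part of the Aliyun instructions.

```python
from pathlib import Path

# Paths taken from the description above.
EXPECTED_FILES = [
    "/etc/kubernetes/scheduler-policy-config.yaml",
    "/etc/kubernetes/manifests/kube-scheduler.yaml",
]


def check_files(paths):
    """Return {path: status}, where status is 'missing', 'empty', or 'ok'.

    Hypothetical helper for illustration only.
    """
    report = {}
    for p in paths:
        path = Path(p)
        if not path.is_file():
            report[p] = "missing"
        elif path.stat().st_size == 0:
            report[p] = "empty"
        else:
            report[p] = "ok"
    return report


if __name__ == "__main__":
    for path, status in check_files(EXPECTED_FILES).items():
        print(f"{status:8} {path}")
```

The same helper can be pointed at any other file the static POD manifest mounts from the host, since an empty file is reported separately from a missing one.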

My cluster runs Kubernetes 1.28.

Problem
My problem is that when I put the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder, the POD does not run. It remains in CrashLoopBackOff status, and its logs show an authentication failure.

I tried setting the KUBERNETES_MASTER environment variable to the cluster's DNS name, including the port, but with no luck. I can see that those variables get injected when the POD runs.

I've noticed that the /etc/kubernetes/scheduler.conf file, used to run the command in this file, is empty. I tried to produce a valid scheduler configuration file in three ways: by generating certificates, by using the token of a ServiceAccount, and by using the kubeconfig of the kubelet. All three attempts failed.
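For reference, this is roughly the shape a token-based kubeconfig takes. The sketch below builds one as a Python dictionary and prints it as JSON, which kubeconfig consumers accept because JSON is valid YAML. The helper name `build_kubeconfig`, the user and cluster names, and the placeholder server, token, and CA values are all assumptions for illustration. Nothing in this thread confirms that a file like this resolves the authentication failure on AKS, and the ServiceAccount would still need suitable RBAC permissions.

```python
import base64
import json


def build_kubeconfig(server, token, ca_pem,
                     user="scheduler-sa", cluster="aks-cluster"):
    """Build a kubeconfig dict that authenticates with a bearer token.

    Hypothetical helper: `user` and `cluster` are arbitrary labels.
    `ca_pem` is the cluster CA certificate as bytes.
    """
    context = f"{user}@{cluster}"
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster,
            "cluster": {
                "server": server,
                # kubeconfig expects the CA bundle base64-encoded here.
                "certificate-authority-data":
                    base64.b64encode(ca_pem).decode("ascii"),
            },
        }],
        "users": [{"name": user, "user": {"token": token}}],
        "contexts": [{
            "name": context,
            "context": {"cluster": cluster, "user": user},
        }],
        "current-context": context,
    }


if __name__ == "__main__":
    # Placeholder values only; substitute the real API server URL,
    # ServiceAccount token, and CA certificate.
    config = build_kubeconfig(
        server="https://<api-server-fqdn>:443",
        token="<service-account-token>",
        ca_pem=b"<ca-certificate-pem>",
    )
    print(json.dumps(config, indent=2))
```

The output could be written to the path the scheduler manifest mounts as its kubeconfig, in this case /etc/kubernetes/scheduler.conf.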

I wanted to ask whether anyone has managed to successfully install the Aliyun scheduler extender on a managed AKS cluster with User NodePools.

Thanks in advance!

dsatizabal · 2024-07-21T20:32:54Z

I wrote an article on how to get the Aliyun scheduler extender working properly in Azure Kubernetes Service (AKS):

https://medium.com/@diego.satizabal_81239/using-aliyun-gpu-share-in-an-azure-aks-717cf7392d05

Hope it helps.
