Scenario
I am trying to use the Aliyun scheduler extender to share a single NVIDIA T4 GPU among multiple Pods. I have a managed AKS cluster with a default node pool of standard VMs (Standard_D2_v3) and have added a user node pool with Standard_NC4as_T4_v3 instances, all running Ubuntu 22.04.4. I enabled the default NVIDIA driver installation, and the driver and nvidia-smi are working:
I am following the installation instructions given here for the Aliyun extender. I have already enabled SSH access to the GPU nodes, placed the scheduler-policy-config.yaml file into /etc/kubernetes, and placed the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder.
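For reference, on Kubernetes 1.23+ the legacy Policy file was replaced by a KubeSchedulerConfiguration, so on a 1.28 cluster the extender would normally be declared as below. This is only a sketch: the URL, verbs, and resource name follow the upstream gpushare-scheduler-extender examples and may not match this installation.

```yaml
# Hypothetical sketch of a scheduler config wiring in the gpushare extender.
# Values (port 32766, resource name aliyun.com/gpu-mem) are taken from the
# upstream gpushare-scheduler-extender docs, not from this cluster.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  # The kubeconfig the scheduler uses to reach the API server.
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
- urlPrefix: "http://127.0.0.1:32766/gpushare-scheduler"
  filterVerb: filter
  bindVerb: bind
  enableHTTPS: false
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
```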
My cluster runs Kubernetes 1.28:
Problem
My problem is that when I put the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder, the Pod does not run; I get authentication-failure logs from the Pod, which remains in CrashLoopBackOff status:
I tried setting the KUBERNETES_MASTER environment variable to the cluster's DNS name, including the port, but with no luck; I can see that those variables get injected when the Pod runs.
I've noticed that the /etc/kubernetes/scheduler.conf file, which the command in this manifest uses, is empty. I tried to generate certificates to build a valid scheduler configuration file, both from a ServiceAccount token and from the kubelet's kubeconfig, but failed.
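For context, scheduler.conf is an ordinary kubeconfig, so a populated token-based version would look roughly like the sketch below. Every value is a placeholder: the server FQDN, CA path, and token are assumptions, and the ServiceAccount would still need RBAC equivalent to the system:kube-scheduler role for the scheduler to authenticate successfully.

```yaml
# Hypothetical sketch of a scheduler.conf kubeconfig built from a
# ServiceAccount token; all values below are placeholders, not values
# taken from this cluster.
apiVersion: v1
kind: Config
clusters:
- name: aks-cluster
  cluster:
    # On AKS the API server is reached via the cluster's FQDN on port 443.
    server: https://<cluster-fqdn>:443
    certificate-authority: /etc/kubernetes/certs/ca.crt
contexts:
- name: kube-scheduler@aks-cluster
  context:
    cluster: aks-cluster
    user: kube-scheduler
current-context: kube-scheduler@aks-cluster
users:
- name: kube-scheduler
  user:
    # Token of a ServiceAccount granted scheduler-equivalent RBAC.
    token: <service-account-token>
```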
I wanted to ask whether anyone has managed to successfully install the Aliyun extender on a managed AKS cluster with user node pools.
Thanks in advance!