[AKS] kube-scheduler static POD not running for Aliyun GPU Scheduler Extender #224

Open
dsatizabal opened this issue May 30, 2024 · 1 comment

Comments

@dsatizabal

Scenario
I am trying to use the Aliyun scheduler extender so that multiple Pods can share an NVIDIA T4 GPU. I have a managed AKS cluster with a default node pool of standard VMs (Standard_D2_v3), and I added a user node pool of Standard_NC4as_T4_v3 instances, all running Ubuntu 22.04.4. I enabled the default NVIDIA driver installation, and the driver and nvidia-smi are running:

image

I am following the instructions given here for the Aliyun installation. I have already enabled SSH access to the GPU nodes and placed the scheduler-policy-config.yaml file in /etc/kubernetes and the kube-scheduler.yaml file in the /etc/kubernetes/manifests folder.

My cluster runs K8S 1.28:

image

Problem
My problem is that when I put the kube-scheduler.yaml file into the /etc/kubernetes/manifests folder, the Pod does not run. It stays in CrashLoopBackOff status, and its logs show authentication failures:

image

I tried setting the KUBERNETES_MASTER environment variable to the cluster's DNS name, including the port, but with no luck. I can see that those variables get injected when the Pod runs.

I've noticed that the /etc/kubernetes/scheduler.conf file, which is used to run the command in this file, is empty. I tried to produce a valid scheduler configuration file in several ways (generating certificates, using the token of a ServiceAccount, and using the kubelet's kubeconfig), but every attempt failed.
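
For anyone trying to reproduce the empty-file symptom on a node, a small check like the one below may help. It is only a sketch under stated assumptions: the function name `diagnose_kubeconfig` is hypothetical, the file is assumed to be a YAML kubeconfig, and the check is a plain text scan for the usual top-level `clusters`, `users`, and `contexts` keys rather than a real YAML parse. It does not validate certificates or tokens.

```python
from pathlib import Path

# Top-level keys that a usable YAML kubeconfig normally contains.
REQUIRED_KEYS = ("clusters:", "users:", "contexts:")


def diagnose_kubeconfig(path):
    """Return a list of human-readable problems found in the file at `path`.

    Hypothetical helper: a text-level sanity check, not a YAML parser.
    """
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file does not exist"]

    text = p.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        return [f"{path}: file is empty"]

    # Collect keys that start at column 0 (unindented lines containing ':').
    top_level = {
        line.split(":", 1)[0] + ":"
        for line in text.splitlines()
        if line and not line[0].isspace() and ":" in line
    }
    return [
        f"{path}: missing top-level key '{key}'"
        for key in REQUIRED_KEYS
        if key not in top_level
    ]
```

Calling `diagnose_kubeconfig("/etc/kubernetes/scheduler.conf")` on the GPU node (with enough permissions to read the file) returns an empty list when the basic structure is present, and a list of problem descriptions otherwise.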

I wanted to ask whether anyone has managed to successfully install the Aliyun scheduler extender on a managed AKS cluster with user node pools.

Thanks in advance!

@dsatizabal
Author

I wrote an article on how to get the Aliyun scheduler extender working in Azure Kubernetes Service (AKS):

https://medium.com/@diego.satizabal_81239/using-aliyun-gpu-share-in-an-azure-aks-717cf7392d05

Hope it helps.
