I've just started looking at setting up my own HPC cluster using your useful CycleCloud Slurm Workspace tool.
I'm trying to figure out how best to optimise the setup for my use case, specifically that my compute process requires a large Docker image. In previous HPC environments I've created custom machine images and had the compute nodes boot from these, to avoid having to download the Docker image before any compute process can run.
How might I achieve this using your toolkit?
I presume I can select a custom image during the partition creation process, which I could create by booting up a VM separately, downloading the Docker image, capturing the VM as a managed image, and then specifying it during partition creation via its resource ID. However, I imagine there are other packages I'll need to load into this VM image for it to work with the rest of the Slurm cluster, such as the slurm-installer. Which other packages might I need to install?
I've found this blog post, but I'm wondering if it's a little out of date now.
Thanks,
Noah
You don't have to install the Slurm packages, as they are automatically installed at node startup, but when building your custom image make sure to use one of our Azure HPC images as a base so that you have all the HPC components already set up (MPI, InfiniBand drivers, GPU drivers, and more).
Another option would be to use your container image directly with Slurm. Below is an example of how to run an interactive bash session in a PyTorch container stored in the NVIDIA container registry.
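As a sketch, the build-and-capture flow described above could look like the following with the Azure CLI. All resource names here are illustrative, and the marketplace URN shown (microsoft-dsvm:ubuntu-hpc:2204) is the Ubuntu 22.04 HPC image at the time of writing — worth verifying against the current marketplace listings before use. These commands require an Azure subscription, so they are a recipe rather than something runnable as-is:

```shell
# Illustrative names throughout; requires an Azure subscription.

# 1. Boot a builder VM from an Azure HPC marketplace image
#    (URN is the Ubuntu 22.04 HPC image; check the current listing).
az vm create \
  --resource-group my-rg \
  --name image-builder \
  --image microsoft-dsvm:ubuntu-hpc:2204:latest \
  --size Standard_D8s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

# 2. On the VM, pre-pull the large container so compute nodes boot with
#    it already cached, then generalize before capture:
#      ssh azureuser@<vm-ip>
#      sudo docker pull nvcr.io/nvidia/pytorch:24.03-py3
#      sudo waagent -deprovision+user

# 3. Capture the VM as a managed image.
az vm deallocate --resource-group my-rg --name image-builder
az vm generalize --resource-group my-rg --name image-builder
az image create --resource-group my-rg --name hpc-with-pytorch \
  --source image-builder
```

Pre-pulling into the image trades image size for node start-up time, which matches the goal described in the question.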
srun -N1 -p gpu --gpus-per-node=8 --mem=0 --container-image nvcr.io#nvidia/pytorch:24.03-py3 --pty bash
Here are some examples of the container naming conventions for the various registries: https://github.com/NVIDIA/enroot/blob/master/doc/cmd/import.md
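The naming convention in the srun example above is the enroot one: the first `/` separating the registry host from the repository is written as `#`. A small bash helper illustrates the conversion (the helper name is made up for this sketch, and it assumes the reference includes an explicit registry host):

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a standard Docker image reference
# (registry/repo:tag) into the enroot/pyxis form (registry#repo:tag)
# used by --container-image. Assumes an explicit registry host, since
# it replaces only the first '/' in the string.
to_enroot_uri() {
  local ref="$1"
  printf '%s\n' "${ref/\//#}"
}

to_enroot_uri "nvcr.io/nvidia/pytorch:24.03-py3"
# -> nvcr.io#nvidia/pytorch:24.03-py3
```

Note the caveat in the comments: a Docker Hub shorthand like `nvidia/cuda` has no registry host, so this naive replacement would mangle it.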
Thanks for the reply.
It seems to me like I cannot add a custom image unless it is in a compute gallery. When I go to add a custom image via its ID I get this message:
Please ensure that URNs follow the format of publisher:offer:sku:version and that image IDs follow the format of /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/{{image_name}}/versions/{{version_number}}
Would it be possible to use an image not in a compute gallery?
Thanks,
Noah
Yes, using a compute gallery provides better scalability. I will check whether we can allow your scenario. In the meantime, please create a compute gallery and upload your image to it.
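For reference, publishing a managed image into a compute gallery could be sketched as below with the Azure CLI. The gallery, definition, publisher/offer/SKU, and subscription values are all illustrative placeholders; these commands need an Azure subscription, so treat them as a recipe to adapt rather than run verbatim:

```shell
# Illustrative names; requires an Azure subscription.

# 1. Create the compute gallery (formerly "shared image gallery").
az sig create --resource-group my-rg --gallery-name mygallery

# 2. Create an image definition; publisher/offer/sku are free-form labels.
az sig image-definition create \
  --resource-group my-rg --gallery-name mygallery \
  --gallery-image-definition hpc-with-pytorch \
  --publisher myorg --offer slurm-nodes --sku ubuntu-hpc-2204 \
  --os-type Linux --hyper-v-generation V2

# 3. Publish the captured managed image as a version.
az sig image-version create \
  --resource-group my-rg --gallery-name mygallery \
  --gallery-image-definition hpc-with-pytorch \
  --gallery-image-version 1.0.0 \
  --managed-image /subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.Compute/images/hpc-with-pytorch
```

The resulting version's resource ID has the `.../galleries/{sig_name}/images/{image_name}/versions/{version_number}` shape that the validation message quoted earlier expects.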