Using custom virtual machine images for compute nodes #126

Open

noahharrison64 opened this issue Oct 14, 2024 · 3 comments

@noahharrison64 commented Oct 14, 2024

Hi

I've just started looking at setting up my own HPC cluster using your useful CycleCloud Slurm Workspace tool.
I'm trying to figure out how best to optimise the setup for my use case, which involves a compute process that requires a large Docker image. In previous HPC environments I've created custom machine images and had the compute nodes boot from them, so that the Docker image doesn't have to be downloaded before any compute process can run.

How might I achieve this using your toolkit?

I presume I can select a custom image during partition creation. I could create one by booting up a VM separately, downloading the Docker image, capturing the VM as a managed image, and then specifying it during partition creation via its resource ID. However, I imagine there are other packages I'll need to bake into this VM image for it to work with the rest of the Slurm cluster, such as the slurm installer? Which other packages might I need to install?

I've found this blog post but I'm wondering if it's a little out of date now.

Thanks,
Noah

@xpillons (Collaborator)

Hi Noah,
If you haven't already, please review the documentation on how to deploy using your own custom image here: https://learn.microsoft.com/en-us/azure/cyclecloud/qs-deploy-ccws?view=cyclecloud-8

You don't have to install the Slurm packages, as they are automatically installed at node startup. However, when building your custom image, make sure to use one of our Azure HPC images as a base so that all the HPC components (MPI, InfiniBand drivers, GPU drivers and more) are already set up.
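
As a rough sketch of that image-build flow with the Azure CLI (not something shipped with the workspace; the resource names, VM size, base image URN and Docker image below are placeholders you would need to adapt):

# Boot a build VM from an Azure HPC base image (URN is an assumption; verify with `az vm image list`)
az vm create \
  --resource-group myRG \
  --name image-builder \
  --image microsoft-dsvm:ubuntu-hpc:2204:latest \
  --size Standard_D4s_v5 \
  --admin-username azureuser \
  --generate-ssh-keys

# SSH in and pre-pull the large Docker image so nodes booted from this image already have it cached
ssh azureuser@<builder-public-ip> "sudo docker pull myregistry.azurecr.io/my-large-image:latest"

# Generalize the VM and capture it as a managed image
ssh azureuser@<builder-public-ip> "sudo waagent -deprovision+user -force"
az vm deallocate --resource-group myRG --name image-builder
az vm generalize --resource-group myRG --name image-builder
az image create --resource-group myRG --name my-hpc-node-image --source image-builder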

Another option would be to use your container image directly with Slurm. Below is an example of how to run an interactive bash session in a PyTorch container stored in the NVIDIA container registry (NGC).

srun -N1 -p gpu --gpus-per-node=8 --mem=0 --container-image nvcr.io#nvidia/pytorch:24.03-py3 --pty bash
Here are some examples of the container naming conventions for the various registries: https://github.com/NVIDIA/enroot/blob/master/doc/cmd/import.md
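
A minimal sketch of pre-importing the container with enroot instead of pulling it at srun time (the output filename is just an example, and this assumes pyxis accepts a local squashfs path for --container-image):

# Import the container from NGC into a local squashfs file (naming convention: registry#repository:tag)
enroot import --output pytorch-24.03.sqsh docker://nvcr.io#nvidia/pytorch:24.03-py3

# Then point srun at the local file instead of the remote registry
srun -N1 -p gpu --gpus-per-node=8 --mem=0 --container-image ./pytorch-24.03.sqsh --pty bash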

Best

@noahharrison64 (Author)

Hi,

Thanks for the reply.
It seems I cannot add a custom image unless it is in a compute gallery. When I try to add a custom image via its ID, I get this message:

Please ensure that URNs follow the format of publisher:offer:sku:version and that image IDs follow the format of /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/{{image_name}}/versions/{{version_number}}

Would it be possible to use an image not in a compute gallery?
Thanks,
Noah

@xpillons (Collaborator)

Yes, using a compute gallery provides better scalability. I will check whether we can allow your scenario. In the meantime, please create a compute gallery and upload your image to it.
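
In case it helps, a minimal sketch of that step with the Azure CLI (gallery, definition and image names are placeholders, and it assumes you already captured a managed image as described above):

# Create the compute gallery
az sig create --resource-group myRG --gallery-name myGallery

# Create an image definition matching the OS and generation of your managed image
az sig image-definition create \
  --resource-group myRG \
  --gallery-name myGallery \
  --gallery-image-definition my-hpc-node-image \
  --publisher myOrg --offer hpc-nodes --sku ubuntu-2204 \
  --os-type Linux --hyper-v-generation V2

# Publish a version of the managed image into the gallery
az sig image-version create \
  --resource-group myRG \
  --gallery-name myGallery \
  --gallery-image-definition my-hpc-node-image \
  --gallery-image-version 1.0.0 \
  --managed-image /subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.Compute/images/my-hpc-node-image

# The resulting image version ID matches the format the partition form expects:
# /subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.Compute/galleries/myGallery/images/my-hpc-node-image/versions/1.0.0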
