Using custom virtual machine images for compute nodes #126

Open

noahharrison64 opened this issue Oct 14, 2024 · 3 comments

@noahharrison64 commented Oct 14, 2024

Hi

I've just started looking at setting up my own HPC cluster using your useful CycleCloud Slurm Workspace tool.
I'm trying to figure out how best to optimise the setup for my use case, which involves a compute process that requires a large Docker image. In previous HPC environments I've created custom machine images and had the compute nodes boot from them, so that the Docker image doesn't have to be downloaded before any compute process can run.

How might I achieve this using your toolkit?

I presume I can select a custom image during partition creation. I could create one by booting up a VM separately, downloading the Docker image, capturing the VM as a managed image, and then specifying it during partition creation via its resource ID. However, I imagine there are other packages I'll need to bake into this VM image for it to work with the rest of the Slurm cluster, such as the slurm installer? Which other packages might I need to install?

I've found this blog post but I'm wondering if it's a little out of date now.

Thanks,
Noah

@xpillons (Collaborator)

Hi Noah,
If you haven't already, please review the documentation on how to deploy using your own custom image here: https://learn.microsoft.com/en-us/azure/cyclecloud/qs-deploy-ccws?view=cyclecloud-8

You don't have to install the Slurm packages, as they are automatically installed at node startup. However, when building your custom image, make sure to use one of our Azure HPC images as a base so that all the HPC components (MPI, InfiniBand drivers, GPU drivers and more) are already set up.
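
As a rough sketch of that image-build flow with the Azure CLI (not something shipped with the workspace; the resource names, VM size, base image URN and Docker image below are placeholders you would need to adapt):

# Boot a build VM from an Azure HPC base image (URN is an assumption; verify with `az vm image list`)
az vm create \
  --resource-group myRG \
  --name image-builder \
  --image microsoft-dsvm:ubuntu-hpc:2204:latest \
  --size Standard_D4s_v5 \
  --admin-username azureuser \
  --generate-ssh-keys

# SSH in and pre-pull the large Docker image so nodes booted from this image already have it cached
ssh azureuser@<builder-public-ip> "sudo docker pull myregistry.azurecr.io/my-large-image:latest"

# Generalize the VM and capture it as a managed image
ssh azureuser@<builder-public-ip> "sudo waagent -deprovision+user -force"
az vm deallocate --resource-group myRG --name image-builder
az vm generalize --resource-group myRG --name image-builder
az image create --resource-group myRG --name my-hpc-node-image --source image-builder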

Another option would be to use your container image directly with Slurm. Below is an example of how to run an interactive bash session in a PyTorch container stored in the NVIDIA container registry (NGC).

srun -N1 -p gpu --gpus-per-node=8 --mem=0 --container-image nvcr.io#nvidia/pytorch:24.03-py3 --pty bash
Here are some examples of the container naming conventions for the various registries: https://github.com/NVIDIA/enroot/blob/master/doc/cmd/import.md
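
A minimal sketch of pre-importing the container with enroot instead of pulling it at srun time (the output filename is just an example, and this assumes pyxis accepts a local squashfs path for --container-image):

# Import the container from NGC into a local squashfs file (naming convention: registry#repository:tag)
enroot import --output pytorch-24.03.sqsh docker://nvcr.io#nvidia/pytorch:24.03-py3

# Then point srun at the local file instead of the remote registry
srun -N1 -p gpu --gpus-per-node=8 --mem=0 --container-image ./pytorch-24.03.sqsh --pty bash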

Best

@noahharrison64 (Author)

Hi,

Thanks for the reply.
It seems I cannot add a custom image unless it is in a compute gallery. When I try to add a custom image via its ID, I get this message:

Please ensure that URNs follow the format of publisher:offer:sku:version and that image IDs follow the format of /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/{{image_name}}/versions/{{version_number}}

Would it be possible to use an image not in a compute gallery?
Thanks,
Noah

@xpillons (Collaborator)

Yes, using a compute gallery provides better scalability. I will check whether we can allow your scenario. In the meantime, please create a compute gallery and upload your image to it.
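
In case it helps, a minimal sketch of that step with the Azure CLI (gallery, definition and image names are placeholders, and it assumes you already captured a managed image as described above):

# Create the compute gallery
az sig create --resource-group myRG --gallery-name myGallery

# Create an image definition matching the OS and generation of your managed image
az sig image-definition create \
  --resource-group myRG \
  --gallery-name myGallery \
  --gallery-image-definition my-hpc-node-image \
  --publisher myOrg --offer hpc-nodes --sku ubuntu-2204 \
  --os-type Linux --hyper-v-generation V2

# Publish a version of the managed image into the gallery
az sig image-version create \
  --resource-group myRG \
  --gallery-name myGallery \
  --gallery-image-definition my-hpc-node-image \
  --gallery-image-version 1.0.0 \
  --managed-image /subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.Compute/images/my-hpc-node-image

# The resulting image version ID matches the format the partition form expects:
# /subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.Compute/galleries/myGallery/images/my-hpc-node-image/versions/1.0.0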
