Image on Docker Hub is out of date #180

Open
dpkirchner opened this issue Jan 19, 2024 · 12 comments

Comments

@dpkirchner

I'm just getting started with clearml (learning the ropes). Per the README section describing Kubernetes integration I tried using the image found on dockerhub, running it outside of k8s: docker run --gpus all -it --rm -v $HOME/clearml-agent.conf:/clearml.conf -v /var/run/docker.sock:/var/run/docker.sock --network clearml_backend --user root. (clearml_backend comes from https://github.com/allegroai/clearml-server/blob/702b6dc9c804165b192a042253ad1d1690c5f0ed/docker/docker-compose.yml` and clearml-agent.conf was created by clearml-agent init).

The output of this command is just: CLEARML_AGENT_UPDATE_VERSION = and the worker does not register. clearml-agent appears to be version 0.17.1 FWIW.

Then I noticed that the image was last updated about 3 years ago. Upgrading the clearml-agent package using pip install --upgrade clearml-agent and bind-mounting the configuration file in the /root directory resolved the problem, however I'm sure there'll be a lot of other issues when using such an old base image (e.g., old CUDA).

I think this might just be a matter of updating the Dockerfile to pin a version of nvidia/cuda (the base image) and pushing to Docker Hub.

@jkhenning
Member

Hi @dpkirchner, the link you provided does not seem to work, and I didn't quite understand which image you used.

@dpkirchner
Author

My bad, I added an extra backtick in the link: https://github.com/allegroai/clearml-server/blob/702b6dc9c804165b192a042253ad1d1690c5f0ed/docker/docker-compose.yml

The image I used was linked from here: https://github.com/allegroai/clearml-agent/blob/c9fc092f4eea9c3890d582aa2a098c3c2f39ce72/README.md#kubernetes-integration-optional (scroll down to Spin ClearML-Agent as a long-lasting service pod).

@jkhenning
Member

Oh, I see it now. Honestly, I think we should remove this option: it basically spawns tasks as processes inside the agent's pod, which is not a good pattern in k8s. I would recommend using the helm chart instead.

@dpkirchner
Author

I see, ok. I'll check out the helm chart. Thanks.

@dpkirchner
Author

dpkirchner commented Jan 26, 2024

It looks like the docker container used by the helm chart is also out of date -- it's running clearml-agent 1.2.4rc3 and using python 3.6. The image that is closest to being up to date is allegroai/clearml:1.14.0-431, however you'll need to install docker and the clearml-agent python package to use it, and it's still a bit out of date.

Through experimentation I've found that if you want to use the latest version, you can check out https://github.com/allegroai/clearml-agent, go to the docker/agent directory and edit Dockerfile, replace FROM nvidia/cuda with FROM nvidia/cuda:12.0.0-devel-ubuntu22.04 (can't use 12.3.1 because of a cuda-related bug in nvidia's image), and then build the image locally (I'm using docker build -t clearml-agent:latest . in the docker/agent directory). Following these steps will get you version 1.7.0.
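For anyone repeating the Dockerfile edit described above, here is a minimal Python sketch of the base-image pin. It assumes the Dockerfile starts from an untagged `FROM nvidia/cuda` line, as stated in this comment; the helper name `pin_base_image` and the sample Dockerfile text are hypothetical and only for illustration.

```python
import re


def pin_base_image(dockerfile_text: str, image: str, tag: str) -> str:
    """Return dockerfile_text with every `FROM <image>[:old_tag]` pinned to `<image>:<tag>`."""
    pattern = re.compile(
        # Group 1 keeps the original "FROM " prefix; an existing tag, if any, is dropped.
        r"^([ \t]*FROM[ \t]+)" + re.escape(image) + r"(?::\S+)?(?=\s|$)",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + f"{image}:{tag}", dockerfile_text)


# Stand-in Dockerfile content (hypothetical, not the real docker/agent/Dockerfile).
original = "FROM nvidia/cuda\nRUN echo placeholder\n"
pinned = pin_base_image(original, "nvidia/cuda", "12.0.0-devel-ubuntu22.04")
print(pinned.splitlines()[0])  # FROM nvidia/cuda:12.0.0-devel-ubuntu22.04
```

After writing the result back to the Dockerfile in docker/agent, the image can be built with the command mentioned in this comment (`docker build -t clearml-agent:latest .`).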

I'm reopening because I'm not sure if this is all intended -- is the allegroai/clearml-agent docker image deprecated in general?

(I should note that the clearml-agent build command run in this image does not result in a docker image, but I think that's unrelated, and something to be tracked in a different issue.)

@jkhenning
Member

Hi @dpkirchner,

The docker image used by the clearml-helm-charts/clearml-agent chart is indeed pretty old (we're supposed to update it soon); it is the allegroai/clearml-agent-k8s-base image. However, it is not related to the allegroai/clearml-agent image.

@thomsmoreau

thomsmoreau commented Apr 8, 2024

Hi @dpkirchner,
Do you have any info about the docker image update on Docker Hub? There are a lot of outdated elements in it; for example, "k8s_glue_example.py" does not accept a list of queues.

I cannot find a proper way to build the image, even with the 'docker' folder from the repository. Would it be possible to provide a README for building it locally?

@dpkirchner
Author

I wasn't able to figure out how to use clearml properly, unfortunately, so I moved on to another project.

@surya9teja
Contributor

@dpkirchner Frankly, I have been hopping between different MLOps stacks. I started with airflow + mlflow, but it lacks dataset versioning, so I moved to clearml, and we use k8s (EKS) for most of our ETL pipelines. I deployed clearml-server, which works fine, but now that I have tried to deploy clearml-agent in the cluster, it seems to have issues accessing the API server:
clearml_agent.backend_api.session.session.LoginError: Failed getting token (error 401 from https://api.clear.ml): Unauthorized (invalid credentials) (failed to locate provided credentials)
As the clearml documentation is not clear about helm chart deployment, it's really hard to understand the code and do PRs.
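One thing that may be worth checking for the 401 above: the error names https://api.clear.ml, the hosted service's API address, so the agent pod may not have received the API host and credentials of the self-hosted server. The sketch below is a hedged diagnostic, assuming the pod is configured through the `CLEARML_API_HOST`, `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` environment variables rather than a mounted clearml.conf; the helper name `missing_clearml_settings` is hypothetical.

```python
import os
from typing import List, Mapping

# Environment variables assumed to carry the API host and credentials.
REQUIRED_VARS = (
    "CLEARML_API_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_clearml_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_clearml_settings()
    if missing:
        print("Not set inside this container:", ", ".join(missing))
    else:
        print("API host and credentials are present in the environment.")
```

Running this inside the agent pod only shows whether the values reached the container; it does not verify that the credentials themselves are valid.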

@surya9teja
Contributor

@thomsmoreau As far as I can see, there is a folder k8s-glue which seems to have various versions of docker images. Depending on your cloud, you can modify the Dockerfile and update the outdated packages.

Note: During the build you have to modify or add clearml.conf with your credentials, as the Dockerfile script expects.

I am not a fan of putting credentials into the docker image build, but the helm chart values have an option to pass the credentials as a secret, and that option is not working right now.

To pass a list of queues to k8s_glue_example.py, you can pass it as 'queue1,queue2' (check here in values.json of the helm chart). Make sure there are no spaces between the strings.
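To illustrate why spaces matter in a comma-separated queue string, here is a small Python sketch. It is not the actual k8s_glue_example.py code; `parse_queues` is a hypothetical helper showing how a plain split keeps stray spaces in the queue names, and how stripping them avoids that.

```python
from typing import List


def parse_queues(value: str) -> List[str]:
    """Split a comma-separated queue string into names, dropping whitespace and empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]


# A plain split keeps the leading space, so the second name would not match a queue called "queue2".
print("queue1, queue2".split(","))    # ['queue1', ' queue2']

# The tolerant version yields the same names with or without spaces.
print(parse_queues("queue1,queue2"))   # ['queue1', 'queue2']
print(parse_queues("queue1, queue2"))  # ['queue1', 'queue2']
```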

@thomsmoreau

thomsmoreau commented May 8, 2024

@surya9teja I believed that the k8s_glue_example.py file in the docker image was up to date, but it is not. The version in the docker image provided by the chart does not take the "," separator in the string (passed via the "queue" argument) into account, so I had to update it manually: first by doing a curl on the raw link you provided (I pulled the chart and changed the templates manually), and then by building a custom image for my company in which I just changed the script, and it works fine! I did this about a month ago.

Since then I haven't checked for updates to the docker images, but I think we could get better results in terms of updated content and performance if the devs could push an update themselves.

Thanks for your message. I should have commented earlier to maybe help other people who were stuck as I was.

@thomsmoreau

@jkhenning Do you have any info about updating the chart with an up-to-date docker image?


4 participants