forked from Azure/azureml-examples
Commit
Fix deepspeed examples bugs (Azure#2044)
* add timeout for deepspeed jobs
* re format readme with black
* change timeout length
* change dockerfile to use acpt image
* add training custom env
* fix hostfile bug
* fix bash generation
* address comments
* increase number of gpus being used
* make sure deepspeed is upgraded to latest version
* write to hostfile in single process
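The hostfile-related items above ("fix hostfile bug", "write to hostfile in single process") are not shown in the visible part of this diff, so the following is only a minimal sketch of the general idea, not the code from this commit. It assumes the standard DeepSpeed hostfile layout of one `<hostname> slots=<gpu count>` line per node, and that each process knows its rank. The function names, node names, and GPU count are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def format_hostfile(hosts, slots_per_host):
    """Return hostfile text: one '<hostname> slots=<n>' line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def write_hostfile_once(path, hosts, slots_per_host, rank):
    """Write the hostfile from rank 0 only; every other rank does nothing.

    Returns True if this process wrote the file, False otherwise.
    """
    if rank != 0:
        return False
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename it into place,
    # so a reader never observes a partially written hostfile.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(format_hostfile(hosts, slots_per_host))
    os.replace(tmp, path)
    return True


if __name__ == "__main__":
    # Hypothetical node names and GPU count, for illustration only.
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "hostfile"
        write_hostfile_once(target, ["node-0", "node-1"], 8, rank=0)
        print(target.read_text(), end="")
```

Restricting the write to one process avoids several ranks on the same node appending to or truncating the same file at once, which is one plausible source of a malformed hostfile.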
1 parent e9c6241 commit 160461b
Showing 8 changed files with 61 additions and 80 deletions.
42 changes: 3 additions & 39 deletions
cli/jobs/deepspeed/deepspeed-autotuning/docker-context/Dockerfile
@@ -1,39 +1,3 @@
-FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu113-py38-torch1110:biweekly.202301.1
-RUN pip install git+https://github.com/microsoft/DeepSpeed.git@master
-
-# Install pip dependencies
-RUN pip install 'ipykernel~=6.0' \
-    'azureml-core==1.48.0' \
-    'azureml-dataset-runtime==1.48.0' \
-    'azureml-defaults==1.48.0' \
-    'azure-ml==0.0.1' \
-    'azure-ml-component==0.9.16.post2' \
-    'azureml-mlflow==1.48.0' \
-    'azureml-telemetry==1.48.0' \
-    'azureml-contrib-services==1.48.0' \
-    'torch-tb-profiler~=0.4.0' \
-    'py-spy==0.3.12' \
-    'debugpy~=1.6.3'
-
-RUN pip install \
-    azure-ai-ml==1.2.0 \
-    azureml-inference-server-http~=0.7.0 \
-    inference-schema~=1.4.2.1 \
-    MarkupSafe==2.0.1 \
-    regex \
-    pybind11
-
-# Inference requirements
-COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
-RUN /var/requirements/install_system_requirements.sh && \
-    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
-    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
-    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
-    rm -f /etc/nginx/sites-enabled/default
-ENV SVDIR=/var/runit
-ENV WORKER_TIMEOUT=400
-EXPOSE 5001 8883 8888
-
-# support Deepspeed launcher requirement of passwordless ssh login
-RUN apt-get update
-RUN apt-get install -y openssh-server openssh-client
+FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:latest
+# Need latest deepspeed version
+RUN pip install deepspeed -U
3 changes: 3 additions & 0 deletions
cli/jobs/deepspeed/deepspeed-training/docker-context/Dockerfile
@@ -0,0 +1,3 @@
+FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:latest
+# Need latest deepspeed version
+RUN pip install deepspeed -U