This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
I have tried a Tesla P4, a P100, and a GTX 1060; I can only make it work on CPU.
I have tried many configurations, setting useActiveGpu to True or False, trialGpuNumber to 1, and gpuIndices to '0'. None of them ever completed a single architecture training.
I have tried both outside and inside a Docker container.
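For reference, here is roughly what the config I was editing looks like with those settings (a sketch based on the NNI v2 config schema; key placement and the surrounding fields are from memory and may differ between NNI versions):

```yaml
# Sketch of nni/examples/trials/mnist-pytorch/config.yml with the GPU
# settings I experimented with. Values shown are one combination I tried.
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialConcurrency: 1
trialGpuNumber: 1            # request one GPU per trial
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true         # also tried false
  gpuIndices: '0'            # pin trials to GPU 0
```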
nnimanager.log
[2024-05-03 10:54:56] WARNING (pythonScript) Python command [nni.tools.nni_manager_scripts.collect_gpu_info] has stderr: Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 174, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 34, in main
print(json.dumps(data), flush=True)
File "/opt/conda/lib/python3.10/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/opt/conda/lib/python3.10/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/conda/lib/python3.10/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/opt/conda/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bytes is not JSON serializable
[2024-05-03 10:54:56] INFO (ShutdownManager) Initiate shutdown: training service initialize failed
[2024-05-03 10:54:56] ERROR (GpuInfoCollector) Failed to collect GPU info, collector output:
[2024-05-03 10:54:56] ERROR (TrainingServiceCompat) Training srevice initialize failed: Error: TaskScheduler: Failed to collect GPU info
at TaskScheduler.init (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler/scheduler.js:16:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async TaskSchedulerClient.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler_client.js:20:13)
at async Promise.all (index 0)
at async TrialKeeper.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/keeper.js:48:9)
at async LocalTrainingServiceV3.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/local_v3/local.js:28:9)
at async V3asV1.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/v3/compat.js:235:29
There, the GPU info cannot be retrieved: the collector crashes while trying to JSON-encode a bytes object.
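The root cause is visible at the bottom of the traceback: json.dumps refuses bytes values. A minimal sketch of that failure mode (the "gpu_info" payload is my guess at what the collector holds, e.g. undecoded subprocess output; the real structure is in collect_gpu_info.py):

```python
import json

# If any field of the collected data is left as raw bytes, the whole
# json.dumps() call fails with exactly the TypeError seen in the log.
data = {"gpu_info": b"raw nvidia-smi output"}

try:
    json.dumps(data)
except TypeError as exc:
    print(exc)  # Object of type bytes is not JSON serializable

# Decoding bytes to str before serializing avoids the crash:
fixed = {k: v.decode() if isinstance(v, bytes) else v for k, v in data.items()}
print(json.dumps(fixed))
```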
experiment.log
[2024-05-03 13:52:31] INFO (nni.experiment) Starting web server...
[2024-05-03 13:52:32] INFO (nni.experiment) Setting up...
[2024-05-03 13:52:33] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8081 http://10.164.0.8:8081 http://172.17.0.1:8081
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 13:53:03] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping NNI manager, if any...
[2024-05-03 13:53:23] ERROR (nni.experiment) HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/envs/nni/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/util.py", line 39, in reraise
raise value
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 539, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/experiment.py", line 171, in _stop_nni_manager
rest.delete(self.port, '/experiment', self.url_prefix)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 52, in delete
request('delete', port, api, prefix=prefix)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
[2024-05-03 13:53:23] WARNING (nni.experiment) Cannot gracefully stop experiment, killing NNI process...
There is a timeout since the data cannot be retrieved: the NNI manager never responds to the stop request, presumably because its training service failed to initialize.
Inside a Docker container
Dockerfile
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
ARG NNI_RELEASE
LABEL maintainer='Microsoft NNI Team<[email protected]>'
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update
RUN apt-get -y install \
automake \
build-essential \
cmake \
curl \
git \
openssh-server \
python3 \
python3-dev \
python3-pip \
sudo \
unzip \
wget \
zip
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
RUN ln -s python3 /usr/bin/python
RUN python3 -m pip --no-cache-dir install pip==22.0.3 setuptools==60.9.1 wheel==0.37.1
RUN python3 -m pip --no-cache-dir install \
lightgbm==3.3.2 \
numpy==1.22.2 \
pandas==1.4.1 \
scikit-learn==1.0.2 \
scipy==1.8.0
RUN python3 -m pip --no-cache-dir install \
torch==1.10.2+cu113 \
torchvision==0.11.3+cu113 \
torchaudio==0.10.2+cu113 \
-f https://download.pytorch.org/whl/cu113/torch_stable.html
RUN python3 -m pip --no-cache-dir install pytorch-lightning==1.6.1
RUN python3 -m pip --no-cache-dir install tensorflow==2.9.1
RUN python3 -m pip --no-cache-dir install azureml==0.2.7 azureml-sdk==1.38.0
# COPY dist/nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl .
# RUN python3 -m pip install nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
# RUN rm nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
ENV PATH=/root/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/sbin
WORKDIR /root
RUN pip install nni
RUN git clone https://github.com/microsoft/nni.git
RUN apt-get -y update
RUN apt-get -y install nano
In both cases, outside and inside the Docker container, I modified nni/examples/trials/mnist-pytorch/config.yml to run the trials on GPU.
Then I would run the following command so I could watch the logs directly.
Description of the issue
I cannot run any experiment on GPU.
When I'm using CPU only, I get everything I want: the WebUI, the experiment trials, and so on.
How to reproduce it?
If from a Docker container:
Then in both cases:
As a result, the WebUI wouldn't start: the request times out trying to retrieve data, since the experiment won't load on GPU.
Notes