[Question] Does HugeCtr support H800 GPU? #414
Comments
Hi, thanks for trying HugeCTR.
Yes. The problem above was solved when I changed the image version to 23.06, but training still stops at:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------

I found a possible fix in the HugeCTR README doc:

NOTE: HugeCTR uses NCCL to share data between ranks, and NCCL may require shared memory for IPC and pinned (page-locked) system memory resources. It is recommended that you increase these resources by issuing the following options in the docker run command: --shm-size=1g --ulimit memlock=-1

I have tried this method, but the problem doesn't disappear.
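For reference, a minimal sketch of a docker run invocation with the resources the README note recommends (the Merlin image path and the 23.06 tag are assumptions based on the comment above, not confirmed from this thread):

```sh
# Increase shared memory for IPC and allow unlimited pinned (page-locked)
# memory for NCCL, as recommended in the HugeCTR README note quoted above.
docker run --gpus=all --rm -it \
  --shm-size=1g \
  --ulimit memlock=-1 \
  nvcr.io/nvidia/merlin/merlin-hugectr:23.06
```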
Is there any progress on this question?
Hi @sparkling9809, which training script are you using? From the log I can tell you are trying to use …. If you want to try some sample and not require …
Thanks for your reply! The training script is as follows:
The script runs OK on 8 H800 GPUs in a single machine, but something goes wrong when the source contains more than 3 files. The exception is as follows:

[HCTR][06:01:46.761][INFO][RK0][main]: --------------------Epoch 0, source file: /root/gq/2.txt--------------------
Closed as it's a duplicate of #417.
I ran the embedding_test in HugeCTR on an H800, but it failed. The exception is as follows:
root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin# ./embedding_test
Running main() from /hugectr/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 278 tests from 28 test suites.
[----------] Global test environment set-up.
[----------] 28 tests from distributed_sparse_embedding_hash_test
[ RUN ] distributed_sparse_embedding_hash_test.fp32_sgd_1gpu
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][09:27:01.919][INFO][RK0][main]: Global seed is 1544237699
[HCTR][09:27:01.994][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 1
[HCTR][09:27:02.470][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:27:02.470][DEBUG][RK0][main]: [device 0] allocating 0.0000 GB, available 76.5250
[HCTR][09:27:02.470][INFO][RK0][main]: Start all2all warmup
[HCTR][09:27:02.471][INFO][RK0][main]: End all2all warmup
[HCTR][09:27:02.472][INFO][RK0][main]: ./data_reader_test_data/temp_dataset_0.data
[HCTR][09:27:02.757][INFO][RK0][main]: train_file_list.txt done!
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data exist
[HCTR][09:27:02.757][INFO][RK0][main]: ./data_reader_test_data/temp_dataset_0.data
[HCTR][09:27:02.828][INFO][RK0][main]: test_file_list.txt done!
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0012 GB, available 76.2593
[HCTR][09:27:02.828][DEBUG][RK0][main]: [device 0] allocating 0.0030 GB, available 76.2554
[HCTR][09:27:03.179][INFO][RK0][main]: max_vocabulary_size_per_gpu_=100000
[HCTR][09:27:03.184][ERROR][RK0][main]: CUDA RT call "cudaGetLastError()" in line 341 of file /hugectr/HugeCTR/include/hashtable/cudf/concurrent_unordered_map.cuh failed with no kernel image is available for execution on the device (209).
root@jupyuterlab-nb-1691543551529-ddf9dcb96-jr455:/usr/local/hugectr/bin#
CUDA version: 12.2
HugeCTR Docker image: merlin-hugectr:23.02
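CUDA error 209 ("no kernel image is available for execution on the device") usually means the binary was not compiled for the GPU's compute capability; the H800 is a Hopper part (sm_90), which the 23.02 image likely predates. A sketch of how one might verify this and rebuild, assuming the -DSM CMake option from the HugeCTR build instructions (90 for H800 is an assumption, not confirmed in this thread):

```sh
# Confirm the GPU's compute capability (an H800 should report 9.0);
# the compute_cap query field requires a reasonably recent driver.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Rebuild HugeCTR with Hopper (sm_90) kernels enabled. The -DSM option
# is documented in the HugeCTR build instructions; 90 is assumed here.
git clone --recurse-submodules https://github.com/NVIDIA-Merlin/HugeCTR.git
cd HugeCTR && mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DSM=90 ..
make -j"$(nproc)" && make install
```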