v0.20.0rc1 full compile with optimization on Ubuntu 24.04 on WSL -- compile problem with the .venv environment #4197
CityHunter71
started this conversation in Show and tell
Hi,
looking for every possible token/sec on my hardware (an NVIDIA RTX 4090 with 16 GB of VRAM, an Intel(R) Core(TM) i9-14900HX 2.20 GHz CPU, 64 GB of RAM, running Ubuntu 24.04 under WSL2), I compiled v0.20.0rc1 several times. After many failures, I managed to complete the compilation, unfortunately in two steps.
The failure comes from the fact that on Ubuntu you must work inside a virtualenv, but this collides with the .venv that the build process creates itself. The only way was to activate the build's own .venv after the first failed compilation and run the build again.
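For context: Ubuntu 24.04 marks its system Python as externally managed (PEP 668), which is what forces the virtualenv in the first place. A quick way to see this on your own box; the marker path below is my assumption for the stock Python 3.12:

```bash
# Ubuntu 24.04 ships a PEP 668 marker that makes `pip install` outside a
# virtualenv fail with "externally-managed-environment".
# Path assumed for the stock Python 3.12; adjust for your minor version.
cat /usr/lib/python3.12/EXTERNALLY-MANAGED
```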
I wrote up a step-by-step to show the compilation process.
Any suggestion is welcome to squeeze out an extra token :-)
sudo apt update && sudo apt upgrade
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb && sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt upgrade
sudo apt-get install python3-pip python3-virtualenv libopenmpi-dev cmake build-essential cuda-toolkit-12-9 libnccl2 libnccl-dev tensorrt libnvinfer-dev libnvinfer-plugin-dev libnvonnxparsers-dev git git-lfs
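Before going further, it is worth confirming the key pieces actually landed; a minimal check (generic commands, nothing specific to this guide):

```bash
# The CUDA compiler lives under /usr/local/cuda after cuda-toolkit-12-9 installs
/usr/local/cuda/bin/nvcc --version
# The NCCL and TensorRT development packages should all appear here
dpkg -l | grep -E 'libnccl|tensorrt|libnvinfer' | head
```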
sudo apt update && sudo apt upgrade
sudo apt-get install nvidia-cudnn
git lfs install
virtualenv ~/.VENV/MyVenv
source ~/.VENV/MyVenv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
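At this point a quick sanity check that the cu128 wheel really sees the GPU saves pain later (a generic PyTorch check):

```bash
# Should print the torch version, "12.8", and True
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```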
pip3 uninstall -y requests nvtx  # -y skips the confirmation prompts
pip3 install --upgrade pip setuptools wheel
mkdir BUILD && cd BUILD
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
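A quick check that the submodules and LFS objects really came down (plain git commands):

```bash
# Uninitialized submodules would show a leading '-' here
git submodule status | head
# LFS-tracked files should be listed (and no longer be pointer stubs)
git lfs ls-files | head
```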
echo '
export CUDA_HOME=/usr/local/cuda
export CUDA_NVCC_EXECUTABLE=${CUDA_HOME}/bin/nvcc
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${CUDA_HOME}/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export CPATH=${CUDA_HOME}/targets/x86_64-linux/include:$CPATH
export LIBRARY_PATH=${CUDA_HOME}/targets/x86_64-linux/lib:$LIBRARY_PATH
source ~/.VENV/MyVenv/bin/activate
' >> ~/.bashrc
export CUDA_HOME=/usr/local/cuda
export CUDA_NVCC_EXECUTABLE=${CUDA_HOME}/bin/nvcc
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${CUDA_HOME}/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export CPATH=${CUDA_HOME}/targets/x86_64-linux/include:$CPATH
export LIBRARY_PATH=${CUDA_HOME}/targets/x86_64-linux/lib:$LIBRARY_PATH
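With the environment set, verify the right nvcc is picked up before launching the long build:

```bash
# Both should resolve into /usr/local/cuda (a symlink to cuda-12.9)
which nvcc
nvcc --version | tail -n 1
echo $CUDA_HOME
```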
python3 ./scripts/build_wheel.py --cuda_architectures "89-real" --job_count 16 --benchmark --clean --extra-cmake-vars "nvtx3_dir=/usr/local/cuda-12.9/targets/x86_64-linux/;CAFFE2_USE_CUDNN=ON"
The first try fails -- this is the known problem described above:
sudo apt remove nvidia-cuda-toolkit && sudo apt autoremove # the Ubuntu nvidia-cuda-toolkit package conflicts with cuda-toolkit-12-9
. .venv-3.12/bin/activate # created by the build script itself
python3 ./scripts/build_wheel.py --cuda_architectures "89-real" --job_count 16 --benchmark --clean --extra-cmake-vars "nvtx3_dir=/usr/local/cuda-12.9/targets/x86_64-linux/;CAFFE2_USE_CUDNN=ON"
Now everything is OK :-)
pip install ./build/tensorrt_llm*.whl # or: pip install .
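To confirm the wheel went in cleanly (a generic import check):

```bash
# Should print the version the build produced (0.20.0rc2 in the logs below)
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```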
All OK… want to try it?
source ~/.VENV/MyVenv/bin/activate # the original Ubuntu virtualenv
The test script is the stock LLM API quickstart example, reconstructed here (the prompts match the output below):
cat > test_llm.py <<'EOF'
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = ["Hello, my name is", "The president of the United States is",
               "The capital of France is", "The future of AI is"]
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    for out in llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95)):
        print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")

# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()
EOF
python3 test_llm.py
:1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-05-09 21:16:10,641 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/tore/.VENV/MyVenv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc2
Loading Model: [1/3] Downloading HF model
Downloaded model to /home/tore/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6
Time: 0.460s
Loading Model: [2/3] Loading HF model to memory
160it [00:00, 759.16it/s]
Time: 0.313s
Loading Model: [3/3] Building TRT-LLM engine
Time: 20.072s
Loading model done.
Total latency: 20.845s
rank 0 using MpiPoolSession to spawn MPI processes
:1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-05-09 21:16:38,165 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/tore/.VENV/MyVenv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.20.0rc2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2116 MiB
[TensorRT-LLM][INFO] Engine load time 914 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.01 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 330.16 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.16 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 15.99 GiB, available: 10.64 GiB, extraCostMemory: 0.00 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14269
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 64 [window size=2048]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.58 GiB for max tokens in paged KV cache (456608).
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.43it/s]
Prompt: 'Hello, my name is', Generated text: 'John Smith. I am a student at University XYZ. I am currently enrolled in the English Literature course. I am completing my final year'
Prompt: 'The president of the United States is', Generated text: 'James Monroe, and the vice president is Robert Yates. 5. Russia: The Russian president is Vladimir Putin, and the Russian vice president is'
Prompt: 'The capital of France is', Generated text: 'Paris, which is home to the Eiffel Tower.\n\n2. India The country of India is famous for its vibrant and colorful festiv'
Prompt: 'The future of AI is', Generated text: 'talking to your home robot\nThe future of AI is talking to your home robot\n2018-04-02 10:'
[TensorRT-LLM][INFO] Refreshed the MPI local session
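Two follow-ups on squeezing out tokens/sec. The UserWarning above can be silenced (and some JIT compile time saved) by pinning the arch list to the 4090's compute capability, the same 8.9 as the "89-real" used at build time. And below is my rough sketch for measuring tokens/sec with the same LLM API as test_llm.py -- not a proper benchmark, just a quick number:

```bash
export TORCH_CUDA_ARCH_LIST="8.9"  # RTX 4090 = SM 8.9, matches --cuda_architectures "89-real"

python3 - <<'EOF'
# Rough tokens/sec measurement -- a sketch, not a rigorous benchmark
# (single batch, one warmup call, generation tokens only).
import time
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=128)
llm.generate(["warmup"], params)  # keep engine warmup out of the timing

t0 = time.perf_counter()
outs = llm.generate(["The future of AI is"] * 8, params)
dt = time.perf_counter() - t0
ntok = sum(len(o.outputs[0].token_ids) for o in outs)
print(f"{ntok} tokens in {dt:.2f}s -> {ntok / dt:.1f} tok/s")
EOF
```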
Best Regards
CityHunter71