Speeding up vLLM Start Times
last updated 2026-05-18
I came across this comment on r/LocalLLaMA about speeding up vLLM start times by:
1. mounting a directory for the Triton cache
2. mounting a directory for the CUDA cache
This applies to the vLLM Docker image. Add the volume mounts and environment variables to the docker run command with:
... \
-v ~/.cache/vllm/nv_cache:/root/nv_cache \
-v ~/.cache/vllm/triton_cache:/root/triton_cache \
--env TRITON_CACHE_DIR=/root/triton_cache \
--env CUDA_CACHE_PATH=/root/nv_cache/ComputeCache \
--env CUDA_CACHE_MAXSIZE=10737418240
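When -v points at a host path that does not exist yet, Docker creates it as a root-owned directory, so it may be worth creating the cache directories up front. The following is a minimal sketch, assuming the same host paths as the command above. CACHE_ROOT and cache_flags are names made up for illustration, and the rest of the docker run command stays elided, so the script only prints the cache-related flags.

```shell
#!/bin/sh
set -eu

# Host-side cache root; matches the ~/.cache/vllm paths in the mounts above.
# CACHE_ROOT is an illustrative variable, not something vLLM or Docker reads.
CACHE_ROOT="${CACHE_ROOT:-$HOME/.cache/vllm}"
NV_CACHE="$CACHE_ROOT/nv_cache"
TRITON_CACHE="$CACHE_ROOT/triton_cache"

# Create both directories as the current user before the first docker run.
mkdir -p "$NV_CACHE" "$TRITON_CACHE"

# Print the cache-related docker flags, one token per line, for inspection.
cache_flags() {
  printf '%s\n' \
    -v "$NV_CACHE:/root/nv_cache" \
    -v "$TRITON_CACHE:/root/triton_cache" \
    --env TRITON_CACHE_DIR=/root/triton_cache \
    --env CUDA_CACHE_PATH=/root/nv_cache/ComputeCache \
    --env CUDA_CACHE_MAXSIZE=10737418240
}

cache_flags
```

The printed flags are the same ones shown in the command above, with ~ expanded to the actual home directory.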
I plan on digging into whether, why, and how this works. The comment author claims start times of 11 seconds with these settings.
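To check a claim like this, it helps to time a cold start (empty cache directories) against a warm one. Below is a rough sketch of a timer, assuming the container runs vLLM's OpenAI-compatible server and publishes it on localhost port 8000, where a /health endpoint answers once the server is up. The function name wait_for_ready is made up for illustration, and the clock starts when the function is called rather than when the container starts, so it should be launched right after docker run.

```shell
#!/bin/sh
# Poll a URL once per second until it answers with a 2xx status.
# Prints the elapsed whole seconds on success; returns 1 on timeout.
wait_for_ready() {
  url=$1
  timeout=$2
  start=$(date +%s)
  while :; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      echo "$(( $(date +%s) - start ))"
      return 0
    fi
    if [ "$(( $(date +%s) - start ))" -ge "$timeout" ]; then
      echo "not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (assumed URL and a 10 minute limit):
#   wait_for_ready http://localhost:8000/health 600
```

The resolution is about one second, which is coarse but enough to tell a start of roughly ten seconds from one that takes minutes.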
Understanding It
At first glance, I’m guessing:
- no need to recompile the model (Triton cache)
- no need to recompute the CUDA graphs (CUDA cache)
Modal's tutorial shows a way to avoid the second one entirely: they pass --enforce-eager, which disables CUDA graph capture at the cost of inference speed. Perhaps the cached directory avoids this trade-off.
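One way to start testing these guesses is to look at what actually lands in the two mounted directories after a first run. NVIDIA documents CUDA_CACHE_PATH as the location of the driver's JIT compilation cache, so the CUDA graph guess in particular is worth checking rather than assuming. The sketch below only reports file counts and sizes. The function name cache_summary is made up for illustration, and the paths are the host paths from the mounts above.

```shell
#!/bin/sh
# Report how many files a cache directory holds and its size in KiB.
# Returns 1 if the directory does not exist.
cache_summary() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "$dir: missing"
    return 1
  fi
  files=$(find "$dir" -type f | wc -l | tr -d ' ')
  size=$(du -sk "$dir" | cut -f1)
  echo "$dir: $files files, $size KiB"
}

# Summarize both host-side cache directories; a missing one is reported, not fatal.
for d in "$HOME/.cache/vllm/triton_cache" "$HOME/.cache/vllm/nv_cache"; do
  cache_summary "$d" || true
done
```

If both directories stay empty after a cold start, the mounts or environment variables are probably not reaching the process. If they fill up on the first run and stop growing on the second, that fits the idea that compiled artifacts are being reused.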