Joe Barrow field_notes

Speeding up vLLM Start Times

last updated 2026-05-18

I came across this comment on r/LocalLLaMA about speeding up vLLM start times by:

1. mounting a directory for the Triton cache
2. mounting a directory for the CUDA cache

This is for the vLLM Docker image, adding volume mounts and environment variables with:

... \
    -v ~/.cache/vllm/nv_cache:/root/nv_cache \
    -v ~/.cache/vllm/triton_cache:/root/triton_cache \
    --env TRITON_CACHE_DIR=/root/triton_cache \
    --env CUDA_CACHE_PATH=/root/nv_cache/ComputeCache \
    --env CUDA_CACHE_MAXSIZE=10737418240
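Two small details in that command are worth spelling out. This is a minimal sketch, not part of the original comment: the function names `prepare_cache_dirs` and `cache_max_bytes` are made up for illustration. Creating the host directories up front means Docker does not have to create the missing bind-mount sources itself (in which case they would typically end up owned by root), and the arithmetic shows that the `CUDA_CACHE_MAXSIZE` value is 10 GiB in bytes.

```shell
#!/bin/sh
# Create the host-side cache directories used by the -v mounts.
# The default root matches the paths in the command above; pass a
# different root as the first argument to override it.
prepare_cache_dirs() {
    root="${1:-$HOME/.cache/vllm}"
    mkdir -p "$root/nv_cache" "$root/triton_cache"
}

# 10 GiB expressed in bytes, the value passed as CUDA_CACHE_MAXSIZE.
cache_max_bytes() {
    echo $((10 * 1024 * 1024 * 1024))
}
```

Running `prepare_cache_dirs` once before the first `docker run`, and substituting `$(cache_max_bytes)` for the literal number, should be equivalent to the command as written.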

I plan on digging into whether, why, and how this works. The comment author claims they now get 11s start times.
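To check a claim like the 11s figure, it helps to time the start-up the same way with and without the mounts. The helper below is a rough sketch: `time_until_ready` is a hypothetical name, and it simply polls whatever readiness command it is given once per second, so the result has roughly one-second resolution.

```shell
#!/bin/sh
# Poll a readiness command once per second and print the elapsed seconds.
# Usage: time_until_ready <timeout_seconds> <command...>
time_until_ready() {
    timeout_s="$1"
    shift
    start="$(date +%s)"
    while ! "$@" >/dev/null 2>&1; do
        now="$(date +%s)"
        if [ $((now - start)) -ge "$timeout_s" ]; then
            echo "timed out after ${timeout_s}s" >&2
            return 1
        fi
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}
```

Assuming the container exposes the OpenAI-compatible server on port 8000 and that its `/health` endpoint only responds once the server is up, something like `time_until_ready 600 curl -sf http://localhost:8000/health` launched right after `docker run` would give a comparable number for each configuration.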

Understanding It

At first glance, I’m guessing:

- no need to recompile the model (Triton cache)
- no need to recompute the CUDA graphs (CUDA cache)

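Modal's tutorial avoids the second one entirely by passing --enforce-eager, which disables CUDA graph capture at the cost of slower inference. Perhaps the cached directories avoid this trade-off. One caveat: CUDA_CACHE_PATH points at the driver's cache for JIT-compiled kernels, so the saving may come from skipping kernel compilation rather than graph capture.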
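A first sanity check on these guesses is whether the mounted directories actually fill up after one start. The sketch below (with `count_cache_files` as a hypothetical helper name) counts the regular files under each directory it is given; if a directory stays empty after the first run, the corresponding cache is probably not being used.

```shell
#!/bin/sh
# Print "<dir> <count>" for each argument, where count is the number of
# regular files under that directory (0 if the directory does not exist).
count_cache_files() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            n="$(find "$dir" -type f | wc -l | tr -d ' ')"
        else
            n=0
        fi
        echo "$dir $n"
    done
}
```

For the mounts above, that would be `count_cache_files ~/.cache/vllm/nv_cache ~/.cache/vllm/triton_cache`, run once before and once after the first container start.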