Abstract
This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA Hopper GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various LLMs and token lengths, with a particular focus on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results indicate that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily attributable to data transfer. For the majority of typical LLM queries, the overhead remains below 7%, with larger models and longer sequences experiencing nearly zero overhead.