Marlin: I/O-Efficient Prefix KV Cache Retrieval for Long-Prefix LLM Serving

Guifeng Wang, Shengan Zheng*, Ji Fang, Yucheng Li, Shi Shu, Weihan Kong, Cong Zhou, Kaijiang Deng, Linpeng Huang*
Published in Design Automation Conference (DAC), 2026

Abstract: As large language models (LLMs) are often deployed with long, context-rich prefixes, the prefix KV cache frequently exceeds GPU memory capacity. Although offloading the prefix KV cache to host memory or storage alleviates capacity pressure, it introduces severe I/O stalls that inflate Time-to-First-Token (TTFT). To address this challenge, we present Marlin, an I/O-efficient prefix KV cache retrieval system for long-prefix LLM inference. Marlin employs a dispersion-based token selector to precompute a compact, query-agnostic subset of important prefix tokens, and a sensitivity-guided head classifier that distinguishes prefix-sensitive from query-sensitive attention heads and assigns each class a different KV retrieval policy. An overlap-optimized attention pipeline further hides offload latency by overlapping head-specific KV transfers with attention computation. Experimental results demonstrate that Marlin significantly reduces TTFT compared to state-of-the-art methods while maintaining comparable model accuracy.
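The query-agnostic token selection described in the abstract can be illustrated with a minimal sketch. The specific dispersion metric below (per-token standard deviation of cached key values across heads and dimensions) and the function name are assumptions for illustration only; the paper's actual scoring rule may differ.

```python
import numpy as np

def select_prefix_tokens(keys: np.ndarray, budget: int) -> np.ndarray:
    """Rank prefix tokens by a dispersion score and keep the top-`budget`.

    keys: (num_heads, seq_len, head_dim) cached key vectors for the prefix.
    The dispersion metric here is an illustrative stand-in, not the
    paper's exact formulation.
    """
    # Per-token dispersion: std of key values over heads and dims.
    scores = keys.std(axis=(0, 2))  # shape: (seq_len,)
    # Indices of the top-`budget` tokens, restored to sequence order
    # so the retained KV entries preserve positional layout.
    top = np.argsort(scores)[-budget:]
    return np.sort(top)
```

Because the score depends only on the prefix's own keys, the selected subset can be precomputed offline and retrieved without waiting for the user query.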