
Cohere Unveils SnapKV to Cut Memory & Processing Time in LLMs

Researchers from Cohere, Princeton University and the University of Illinois have developed a new technique called SnapKV that efficiently compresses the key-value (KV) cache in large language models (LLMs), leading to improvements in memory efficiency and processing speed.

You can read the paper here.

The KV cache plays a crucial role in enabling LLMs to process long contexts. However, as input length increases, the KV cache grows with it, straining both memory and time efficiency.
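For a rough sense of scale, here is an illustrative calculation (not a figure from the paper), assuming a hypothetical Llama-2-7B-like configuration with 32 layers, 32 heads, head dimension 128 and fp16 storage:

```python
# Illustrative estimate of KV cache size; model dimensions are assumptions,
# not taken from the SnapKV paper.
def kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128,
                   seq_len=16_000, bytes_per_value=2):
    # Factor of 2 because both keys and values are stored at every layer.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_value

print(f"{kv_cache_bytes() / 1e9:.1f} GB per sequence at 16K tokens")  # ~8.4 GB
```

At that size, the cache for a single 16K-token sequence already rivals the memory footprint of the model weights themselves, which is why compressing it matters.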

Previous works have attempted to address this issue by evicting entries from the KV cache using various algorithms, such as StreamingLLM, Heavy-Hitter Oracle, and Adaptive KV Compression (FastGen).

However, these methods either risk discarding important information or focus solely on compressing the KV cache of generated tokens while overlooking compression of the input sequence's KV cache.

SnapKV takes a different approach by intelligently identifying and selecting the most important attention features per head to create a new KV cache. 

The researchers discovered that each attention head in the model consistently focuses on specific prompt attention features during generation, and that this robust pattern can be obtained from an ‘observation’ window located at the end of the prompt.

The SnapKV algorithm works in two steps. First, it scores the prompt's positions per attention head through a voting mechanism: the attention each position receives from the observation window is aggregated, and the highest-scoring features are selected together with their neighbouring positions so that local context is preserved. Second, the selected features are concatenated with the KV cache of the observation window itself to form the compressed cache, which is then used during generation.
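A minimal Python sketch of this selection idea is shown below. It is not the authors' implementation; the observation-window size, pooling kernel and per-head budget are illustrative assumptions, and it operates on a single layer's prompt KV cache.

```python
import torch
import torch.nn.functional as F

def snapkv_style_compress(keys, values, queries, window=32, kernel=7, budget=1024):
    """Sketch of per-head KV selection guided by an observation window.

    keys, values, queries: [num_heads, seq_len, head_dim] for one layer's prompt.
    Assumes seq_len > window; parameter values are illustrative.
    """
    num_heads, seq_len, head_dim = keys.shape
    obs_q = queries[:, -window:, :]          # observation window at the end of the prompt
    prefix_k = keys[:, :-window, :]
    prefix_v = values[:, :-window, :]

    # Attention of the observation window over the rest of the prompt.
    attn = torch.softmax(obs_q @ prefix_k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)

    # "Voting": aggregate the attention each prefix position receives across the window.
    votes = attn.sum(dim=1)                  # [num_heads, seq_len - window]

    # Pool over neighbouring positions so selected features keep their local context.
    votes = F.max_pool1d(votes.unsqueeze(1), kernel, stride=1,
                         padding=kernel // 2).squeeze(1)

    # Keep the top-scoring positions per head, plus the observation window itself.
    k = min(budget, votes.shape[-1])
    idx = votes.topk(k, dim=-1).indices.sort(dim=-1).values
    gather = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    new_k = torch.cat([prefix_k.gather(1, gather), keys[:, -window:, :]], dim=1)
    new_v = torch.cat([prefix_v.gather(1, gather), values[:, -window:, :]], dim=1)
    return new_k, new_v
```

In this sketch the compressed cache keeps at most `budget + window` positions per head regardless of prompt length, which is the source of the memory and decoding-speed savings.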

The researchers evaluated SnapKV on various LLMs and long-sequence datasets, showing that it outperforms previous compression methods while matching the accuracy of conventional, uncompressed KV caching.

In the Needle-in-a-Haystack test, SnapKV precisely retrieved small details from extremely long input contexts while achieving a 380x compression ratio.

The paper states, “Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens.”

Furthermore, the researchers integrated SnapKV with a leading retrieval-augmented generation (RAG) model, showing that the technique carries over to that setting.

The researchers also demonstrated that SnapKV could be combined orthogonally with other acceleration strategies, such as parallel decoding, to further enhance LLM efficiency.

By efficiently compressing the KV caches, this technique opens up new possibilities for the application of LLMs in real-world scenarios involving long context understanding, such as document processing and multi-round conversations.


