Awesome and Dark Tactics

Tactic: Using Storage Optimization for Efficient LLM Inference

Tactic sort: Awesome Tactic
Type: Architectural Tactic
Category: green-ml-enabled-systems
Tags: architecture, machine-learning, storage-optimization

Title

Using Storage Optimization for Efficient LLM Inference

Description

This tactic bundles a set of serving-level optimizations that make LLM inference more efficient by reducing data movement, minimizing memory fragmentation, and compressing storage. It integrates computational units into memory (in-memory compute), separates the prompt and response KV-cache buffers, and keeps embeddings and attention matrices in DRAM while dynamically loading FFN weights from flash storage. It further compresses the KV cache and applies model quantization (e.g., INT8, FP8), enabling efficient reuse of shared data and reducing redundant memory operations. Additional strategies include recomputing portions of the KV cache on the GPU, using top-r selective attention to fetch only the most relevant key components, and applying KV-Guided Grouping to avoid repeated cache reads. The approach leverages multi-tier storage (GPU memory, DRAM, SSDs) to support efficient scheduling on both modern and older-generation hardware.
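
The source describes these techniques at the design level rather than as code. As a rough illustration only, the sketch below combines two of them, multi-tier KV-cache placement and INT8 KV-cache compression, in PyTorch; the names TieredKVCache, quantize_int8, and dequantize_int8 are illustrative assumptions and do not come from the source.

```python
# Minimal sketch only: assumes PyTorch; TieredKVCache, quantize_int8 and
# dequantize_int8 are illustrative names, not from the source paper.
import torch

def quantize_int8(t: torch.Tensor):
    """Per-tensor symmetric INT8 quantization of a KV block."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).to(torch.float16)

class TieredKVCache:
    """Keeps the most recent KV blocks on the GPU in FP16 and evicts older
    blocks to CPU DRAM as INT8, trading a little dequantization work for
    lower GPU memory use and less data movement per decoding step."""

    def __init__(self, max_gpu_blocks: int, device: str = "cuda"):
        self.max_gpu_blocks = max_gpu_blocks
        self.device = device
        self.gpu_blocks = []   # list of (key, value) FP16 tensors on the GPU
        self.cpu_blocks = []   # list of (q_key, k_scale, q_value, v_scale) in DRAM

    def append(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Add the newest KV block; evict the oldest GPU block to DRAM
        (quantized to INT8) once the GPU budget is exceeded."""
        self.gpu_blocks.append((key, value))
        if len(self.gpu_blocks) > self.max_gpu_blocks:
            old_k, old_v = self.gpu_blocks.pop(0)
            qk, ks = quantize_int8(old_k.cpu())
            qv, vs = quantize_int8(old_v.cpu())
            self.cpu_blocks.append((qk, ks, qv, vs))

    def full_kv(self):
        """Reassemble the full cache for attention, dequantizing and
        re-uploading DRAM-resident blocks only when they are needed."""
        keys, values = [], []
        for qk, ks, qv, vs in self.cpu_blocks:
            keys.append(dequantize_int8(qk, ks).to(self.device))
            values.append(dequantize_int8(qv, vs).to(self.device))
        for k, v in self.gpu_blocks:
            keys.append(k)
            values.append(v)
        # Blocks are concatenated along the sequence dimension, assuming
        # a shape of (num_heads, block_len, head_dim) per block.
        return torch.cat(keys, dim=-2), torch.cat(values, dim=-2)
```

In this sketch the GPU holds only the most recent blocks in FP16; older blocks live in DRAM as INT8 and are dequantized and re-uploaded only when the full cache is reassembled, which is where the savings in GPU memory and data transfer come from. Top-r selective attention can be sketched in the same spirit: score all cached keys cheaply, then fetch and attend over only the r best-scoring entries. Again an assumption-laden sketch, not the paper's implementation:

```python
# Sketch under assumptions: a single query vector and 2-D key/value tensors;
# `r` is the number of cache entries that actually need to be fetched.
import torch

def top_r_attention(query: torch.Tensor, keys: torch.Tensor,
                    values: torch.Tensor, r: int) -> torch.Tensor:
    """Score all cached keys, then attend over only the top-r entries so
    that only those key/value rows must be read from slower storage."""
    # query: (d,), keys: (n, d), values: (n, d)
    scores = keys @ query / (query.shape[-1] ** 0.5)   # (n,)
    top = torch.topk(scores, k=min(r, scores.shape[0]))
    weights = torch.softmax(top.values, dim=-1)        # (r,)
    return weights @ values[top.indices]               # (d,)
```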

Participant

AI engineers

Related software artifact

Large Language Model (LLM)

Context

High-load inference systems requiring efficient memory and compute usage across GPU, DRAM, and SSD, particularly in settings with limited resources or legacy hardware

Software feature

KV cache handling, attention mechanism

Tactic intent

To reduce energy consumption and memory overhead during inference by minimizing data movement and optimizing cache and memory usage

Target quality attribute

Energy efficiency

Other related quality attributes

Memory efficiency

Measured impact

Memory usage, inference time

Source

Pelin R. Kuran, Improving the Environmental Sustainability of Large Language Model Inference: A Rapid Review (available at: https://drive.google.com/file/d/1jOcGP65anFemXiHKSa3bhyScSEmEcY4o/view)


Graphical representation

Contact person

  • Patricia Lago (VU Amsterdam)
  • disc at vu.nl
  • patricialago.nl

The Archive of Awesome and Dark Tactics (AADT) is an initiative of the Digital Sustainability Center (DiSC). It received funding from the VU Amsterdam Sustainability Institute and is maintained by the S2 Group of the Vrije Universiteit Amsterdam.

Initial development of the Archive of Awesome and Dark Tactics was carried out by Robin van der Wiel.