Awesome and Dark Tactics

Tactic: Using Storage Optimization for Efficient LLM Inference

Tactic sort: Awesome Tactic
Type: Architectural Tactic
Category: green-ml-enabled-systems
Tags: architecture, machine-learning, storage-optimization

Title

Using Storage Optimization for Efficient LLM Inference

Description

This tactic bundles a set of serving-level optimizations that make LLM inference more efficient by reducing data movement, minimizing memory fragmentation, and compressing storage. It integrates computational units into memory (in-memory compute), separates the prompt and response KV-cache buffers, and keeps embeddings and attention matrices in DRAM while dynamically loading FFN weights from flash storage. It further compresses the KV cache and applies model quantization (e.g., INT8, FP8), enabling efficient reuse of shared data and reducing redundant memory operations. Additional strategies include recomputing portions of the KV cache on the GPU, using top-r selective attention to fetch only the most relevant key components, and applying KV-Guided Grouping to avoid repeated cache reads. The approach leverages multi-tier storage (GPU memory, DRAM, SSDs) to support efficient scheduling on both modern and older-generation hardware.
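
The source describes these techniques at the design level rather than as code. As a rough illustration only, the sketch below combines two of them, multi-tier KV-cache placement and INT8 KV-cache compression, in PyTorch; the names TieredKVCache, quantize_int8, and dequantize_int8 are illustrative assumptions and do not come from the source.

```python
# Minimal sketch only: assumes PyTorch; TieredKVCache, quantize_int8 and
# dequantize_int8 are illustrative names, not from the source paper.
import torch

def quantize_int8(t: torch.Tensor):
    """Per-tensor symmetric INT8 quantization of a KV block."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).to(torch.float16)

class TieredKVCache:
    """Keeps the most recent KV blocks on the GPU in FP16 and evicts older
    blocks to CPU DRAM as INT8, trading a little dequantization work for
    lower GPU memory use and less data movement per decoding step."""

    def __init__(self, max_gpu_blocks: int, device: str = "cuda"):
        self.max_gpu_blocks = max_gpu_blocks
        self.device = device
        self.gpu_blocks = []   # list of (key, value) FP16 tensors on the GPU
        self.cpu_blocks = []   # list of (q_key, k_scale, q_value, v_scale) in DRAM

    def append(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """Add the newest KV block; evict the oldest GPU block to DRAM
        (quantized to INT8) once the GPU budget is exceeded."""
        self.gpu_blocks.append((key, value))
        if len(self.gpu_blocks) > self.max_gpu_blocks:
            old_k, old_v = self.gpu_blocks.pop(0)
            qk, ks = quantize_int8(old_k.cpu())
            qv, vs = quantize_int8(old_v.cpu())
            self.cpu_blocks.append((qk, ks, qv, vs))

    def full_kv(self):
        """Reassemble the full cache for attention, dequantizing and
        re-uploading DRAM-resident blocks only when they are needed."""
        keys, values = [], []
        for qk, ks, qv, vs in self.cpu_blocks:
            keys.append(dequantize_int8(qk, ks).to(self.device))
            values.append(dequantize_int8(qv, vs).to(self.device))
        for k, v in self.gpu_blocks:
            keys.append(k)
            values.append(v)
        # Blocks are concatenated along the sequence dimension, assuming
        # a shape of (num_heads, block_len, head_dim) per block.
        return torch.cat(keys, dim=-2), torch.cat(values, dim=-2)
```

In this sketch the GPU holds only the most recent blocks in FP16; older blocks live in DRAM as INT8 and are dequantized and re-uploaded only when the full cache is reassembled, which is where the savings in GPU memory and data transfer come from. Top-r selective attention can be sketched in the same spirit: score all cached keys cheaply, then fetch and attend over only the r best-scoring entries. Again an assumption-laden sketch, not the paper's implementation:

```python
# Sketch under assumptions: a single query vector and 2-D key/value tensors;
# `r` is the number of cache entries that actually need to be fetched.
import torch

def top_r_attention(query: torch.Tensor, keys: torch.Tensor,
                    values: torch.Tensor, r: int) -> torch.Tensor:
    """Score all cached keys, then attend over only the top-r entries so
    that only those key/value rows must be read from slower storage."""
    # query: (d,), keys: (n, d), values: (n, d)
    scores = keys @ query / (query.shape[-1] ** 0.5)   # (n,)
    top = torch.topk(scores, k=min(r, scores.shape[0]))
    weights = torch.softmax(top.values, dim=-1)        # (r,)
    return weights @ values[top.indices]               # (d,)
```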

Participant

AI engineers

Related software artifact

Large Language Model (LLM)

Context

High-load inference systems requiring efficient memory and compute usage across GPU, DRAM, and SSD, particularly in settings with limited resources or legacy hardware

Software feature

KV cache handling, attention mechanism

Tactic intent

To reduce energy consumption and memory overhead during inference by minimizing data movement and optimizing cache and memory usage

Target quality attribute

Energy efficiency

Other related quality attributes

Memory efficiency

Measured impact

Memory usage, inference time

Source

Pelin R. Kuran, Improving the Environmental Sustainability of Large Language Model Inference: A Rapid Review (available at: https://drive.google.com/file/d/1jOcGP65anFemXiHKSa3bhyScSEmEcY4o/view)


Graphical representation

Contact person

  • Patricia Lago (VU Amsterdam)
  • disc at vu.nl
  • patricialago.nl

The Archive of Awesome and Dark Tactics (AADT) is an initiative of the Digital Sustainability Center (DiSC). It received funding from the VU Amsterdam Sustainability Institute and is maintained by the S2 Group of the Vrije Universiteit Amsterdam.

Initial development of the Archive of Awesome and Dark Tactics was carried out by Robin van der Wiel.