Tactic: Using Storage Optimization for Efficient LLM Inference
Tactic sort: Awesome Tactic
Type: Architectural Tactic
Category: green-ml-enabled-systems
Title
Using Storage Optimization for Efficient LLM Inference
Description
This tactic bundles a set of serving-level optimizations that improve LLM inference by reducing data movement, minimizing memory fragmentation, and compressing storage. It integrates computational units into memory (in-memory compute), separates prompt and response KV-cache buffers, and keeps embeddings and attention matrices in DRAM while dynamically loading FFN weights from flash. It further compresses the KV cache and applies model quantization (e.g., INT8, FP8), enabling efficient reuse of shared data and reducing redundant memory operations. Additional strategies include recomputing portions of the KV cache on the GPU, using top-r selective attention to fetch only the most relevant key components, and KV-Guided Grouping to avoid repeated cache reads. The approach leverages multi-tier storage (GPU memory, DRAM, SSDs) to support efficient scheduling on both modern and older-generation hardware.
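To make the storage side of the tactic concrete, the sketch below combines two of the techniques listed above: symmetric INT8 quantization of KV-cache blocks and a two-tier GPU/DRAM store that spills older blocks off the GPU. It is a minimal illustrative sketch, not the implementation evaluated in any cited work; the names (`TieredKVCache`, `quantize_int8`) and the simple oldest-block eviction policy are assumptions made for illustration.

```python
# Illustrative sketch only: per-block INT8 quantization of a KV cache plus a
# two-tier (GPU / CPU-DRAM) store. Names and eviction policy are hypothetical.
import torch


def quantize_int8(t: torch.Tensor):
    """Symmetric per-tensor INT8 quantization; returns (int8 data, scale)."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
    return q, scale


def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale


class TieredKVCache:
    """Keeps the most recent KV blocks on the GPU and spills older,
    INT8-compressed blocks to CPU DRAM (a stand-in for DRAM/SSD tiers)."""

    def __init__(self, gpu_blocks: int = 4):
        self.gpu_blocks = gpu_blocks
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.hot = []   # recent blocks: (K, V) full-precision tensors on GPU
        self.cold = []  # older blocks: (Kq, k_scale, Vq, v_scale) on CPU

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.hot.append((k.to(self.device), v.to(self.device)))
        # Evict the oldest hot block once the GPU budget is exceeded.
        if len(self.hot) > self.gpu_blocks:
            k_old, v_old = self.hot.pop(0)
            kq, ks = quantize_int8(k_old.cpu())
            vq, vs = quantize_int8(v_old.cpu())
            self.cold.append((kq, ks, vq, vs))

    def full_kv(self):
        """Reassemble the full cache for attention, dequantizing cold blocks."""
        cold_k = [dequantize_int8(kq, ks).to(self.device) for kq, ks, _, _ in self.cold]
        cold_v = [dequantize_int8(vq, vs).to(self.device) for _, _, vq, vs in self.cold]
        hot_k = [k for k, _ in self.hot]
        hot_v = [v for _, v in self.hot]
        return torch.cat(cold_k + hot_k, dim=0), torch.cat(cold_v + hot_v, dim=0)


if __name__ == "__main__":
    cache = TieredKVCache(gpu_blocks=2)
    for _ in range(6):  # simulate six decode steps for a single head
        cache.append(torch.randn(1, 64), torch.randn(1, 64))
    k, v = cache.full_kv()
    print(k.shape, v.shape)  # torch.Size([6, 64]) torch.Size([6, 64])
```

Storing cold blocks in INT8 halves their footprint relative to FP16 (and quarters it relative to FP32), at the cost of dequantizing them on demand; a production system would additionally batch transfers and pick eviction candidates based on access patterns rather than simple age.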
Participant
AI engineers
Related software artifact
Large Language Model (LLM)
Context
High-load inference systems requiring efficient memory and compute usage across GPU, DRAM, and SSD, particularly in settings with limited resources or legacy hardware
Software feature
KV cache handling, attention mechanism
Tactic intent
To reduce energy consumption and memory overhead during inference by minimizing data movement and optimizing cache and memory usage
Target quality attribute
Energy efficiency
Other related quality attributes
Memory efficiency
Measured impact
Memory usage, inference time
