One GPU, Two Bottlenecks: Serving Vision-Language Models Without Wasting Silicon

You shipped a vision-language model on the same GPU stack that served your text LLM. Same vLLM config, similar parameter count, no red alarms in monitoring. Throughput still fell. Inter-token latency crept up. Batch-size tuning helped a bit; quantisation helped a bit; neither explained the gap.

The issue is not a bad deploy. Multimodal inference runs two workloads with opposite hardware appetites on one piece of silicon. Aggregate GPU utilisation looks fine because each phase underuses a different half of the chip. Users feel the slowdown anyway.

Figure: Two GPU workload phases with opposing compute and memory demands. Encode wants tensor cores. Decode wants memory bandwidth. One card rarely satisfies both.

Encode vs Decode: A Resource Inversion

A VLM pipeline is really two engines bolted together.

Image encode is matrix-heavy. Patches flow through a ViT-style tower; FLOPs dominate and memory bandwidth is barely touched. Profiling on consumer cards often shows 80%+ compute utilisation with single-digit bandwidth use.

Token decode is memory-heavy. Each new token reloads weights and the growing KV cache from HBM, does a little maths, emits one token, repeats. On datacenter GPUs, bandwidth saturates while compute idles.

Co-locate both on one card and you permanently rent capacity each phase ignores. Encode leaves HBM idle; decode leaves tensor cores idle. Neither phase runs on hardware tuned for what it actually needs.
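
The inversion can be sanity-checked with a back-of-envelope roofline calculation. The sketch below is a simplification, not a profiler replacement: it models each phase as a single square FP16 matmul, ignores batching and the KV cache, and uses assumed round-number hardware peaks (roughly A100-class). The function names and the layer widths are illustrative choices, not values from any particular deployment.

```python
# Rough roofline check: is a matmul compute-bound or memory-bound?
# Hardware peaks are ASSUMED round numbers (roughly A100-class), not measurements.
PEAK_FLOPS = 312e12      # assumed dense FP16 tensor FLOP/s
PEAK_BANDWIDTH = 2.0e12  # assumed memory bandwidth in bytes/s


def arithmetic_intensity(tokens: int, width: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one (tokens x width) @ (width x width) matmul."""
    flops = 2 * tokens * width * width
    weight_bytes = width * width * bytes_per_value
    activation_bytes = 2 * tokens * width * bytes_per_value  # input plus output
    return flops / (weight_bytes + activation_bytes)


def classify(tokens: int, width: int) -> str:
    """Compare intensity with the ridge point where the roofline bends."""
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH
    if arithmetic_intensity(tokens, width) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Encode: hundreds of patches share each weight load (width 1024 is an assumption).
print("encode, 576 patches:", classify(tokens=576, width=1024))
# Decode: one new token per step, so every weight byte does very little work.
print("decode, 1 new token:", classify(tokens=1, width=4096))
```

With these assumed peaks the ridge sits at about 156 FLOPs per byte. The encode matmul lands well above it and the single-token decode matmul far below, which is the same split the profiler shows.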

“Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound.” — Donglin Yu et al., arXiv:2603.12707

Kleppmann’s lesson from Designing Data-Intensive Applications applies here: when the bottleneck shifts between stages, a single aggregate metric hides the truth. Your dashboard averages two opposite stories into one misleadingly healthy number.

Visual Tokens Lodge in the KV Cache

Prefill builds the KV cache: per-layer key and value tensors for every token in context. Without it, autoregressive decode would have to push the full history through every layer again at each step instead of processing only the newest token.

Visual tokens join that cache at prefill and remain for the entire generation. The model has already compressed the image into embeddings, but hundreds of image tokens still occupy HBM and get re-read on every decode step. Bandwidth cost scales with output length, not with how “done” the image processing feels.
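
To put a number on that re-read cost, here is a minimal sketch. It assumes a 7B-style MHA layout (32 layers, 32 KV heads, head dim 128, FP16) and that each decode step reads every cached image token exactly once per request. Real engines may differ (prefix sharing, GQA, cache quantisation), so treat the result as an order-of-magnitude estimate. The function name is hypothetical.

```python
def image_kv_read_bytes(
    image_tokens: int,
    output_tokens: int,
    layers: int = 32,          # assumed 7B-style depth
    kv_heads: int = 32,        # assumed MHA: one KV head per query head
    head_dim: int = 128,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """Total KV bytes re-read for image tokens over a whole generation."""
    # Keys and values (the leading 2) for one token across all layers.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    # Every decode step streams the image's cached K/V once.
    return per_token * image_tokens * output_tokens


GIB = 2**30
for output_len in (64, 256, 1024):
    total = image_kv_read_bytes(image_tokens=576, output_tokens=output_len)
    print(f"{output_len:>5} output tokens -> {total / GIB:.0f} GiB of image-KV reads")
```

Under these assumptions one 576-token image costs about 288 MiB of reads per decode step, so a 256-token answer streams roughly 72 GiB of image KV alone.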

Case 1: High-Resolution Inputs Bloat Cache Before Decode Starts

A modest 336×336 image can produce ~576 visual tokens (a 24×24 grid, assuming a ViT with 14-pixel patches). Add 128 text tokens and the cache holds 704 tokens before the first answer token. For a 7B MHA model at FP16, that is roughly 350 MB per request. Eight concurrent requests: ~2.8 GB of cache before generation begins.

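// FP16 KV cache (one sequence):
// bytes ≈ 2 × layers × seq_len × kv_heads × head_dim × 2
//   (leading 2 = keys and values; trailing 2 = bytes per FP16 value)
//
// 7B MHA example: 32 layers, 32 KV heads, head_dim 128 → ~0.5 MB per token
// Text-only (128 tokens):  ~64 MB
// With one image (704):    ~350 MB
//
// Quantisation shrinks bytes per element.
// It does not remove 576 persistent image slots in the cache.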
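
The arithmetic above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: a ViT with 14-pixel patches and no token pooling, plus a 7B-style MHA layout (32 layers, 32 KV heads, head dim 128). Both helper names are hypothetical, and other models will need their own defaults.

```python
def vision_tokens(height: int, width: int, patch: int = 14) -> int:
    """Patch-grid token count for a ViT-style encoder (assumes no pooling)."""
    return (height // patch) * (width // patch)


def kv_cache_bytes(
    seq_len: int,
    layers: int = 32,          # assumed 7B-style depth
    kv_heads: int = 32,        # assumed MHA
    head_dim: int = 128,
    bytes_per_value: int = 2,  # FP16; use 1 for an 8-bit cache
) -> int:
    """KV cache size for one sequence: keys plus values across all layers."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value


MIB = 2**20
image = vision_tokens(336, 336)  # 24 x 24 patch grid
text = 128

print("visual tokens:", image)
print("text-only cache:", kv_cache_bytes(text) // MIB, "MiB")
print("image + text cache:", kv_cache_bytes(image + text) // MIB, "MiB")
print("8 concurrent requests:", 8 * kv_cache_bytes(image + text) // MIB, "MiB")
print("same request, 8-bit cache:", kv_cache_bytes(image + text, bytes_per_value=1) // MIB, "MiB")
```

The last line illustrates the quantisation point: halving the bytes per element halves the footprint, but the sequence still carries all 704 slots through every decode step.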

Case 2: Concurrency Makes It Worse Than Linear

Raising batch size under multimodal load can hurt. More requests means more fat caches fighting for the same HBM bandwidth. Inter-token latency often climbs faster than image-token count alone predicts.

// Log these together:
// - kv_cache_bytes per request vs image resolution
// - hbm_bandwidth_util during decode phase only
// - itl_p95 vs concurrent_request_count
//
// Super-linear ITL growth = bandwidth contention, not "model slowness"

Figure: KV cache growth with vision tokens. Image tokens persist in cache for the full decode loop.
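
One simple way to test for super-linear growth is to compare the marginal ITL cost of each added request at low and high concurrency. The sketch below does that with plain Python. The sample numbers are made up for illustration, the 1.25 tolerance is an arbitrary assumption, and the helper names are hypothetical.

```python
def marginal_slopes(concurrency, itl_ms):
    """ITL increase per extra concurrent request between successive samples."""
    points = sorted(zip(concurrency, itl_ms))
    return [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def looks_super_linear(concurrency, itl_ms, tolerance=1.25):
    """True if the last segment is clearly steeper than the first one."""
    slopes = marginal_slopes(concurrency, itl_ms)
    return len(slopes) >= 2 and slopes[-1] > tolerance * slopes[0]


# Hypothetical p95 ITL samples (ms) at 1, 2, 4 and 8 concurrent requests.
levels = [1, 2, 4, 8]
text_only = [20.0, 22.0, 26.0, 34.0]   # steady 2 ms per added request
multimodal = [20.0, 22.0, 30.0, 60.0]  # cost per added request keeps rising

print("text-only super-linear:", looks_super_linear(levels, text_only))
print("multimodal super-linear:", looks_super_linear(levels, multimodal))
```

Comparing segment slopes rather than fitting a log-log line avoids being fooled by the fixed per-token baseline, which makes even clearly convex curves look flat on a log scale.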

Picking the Right Split Point

Disaggregation only helps if you cut at the right boundary.

A prefill/decode split ships the entire KV cache between nodes. Payload size tracks model depth and context: hundreds of MB to GB. That demands high-bandwidth fabric (NVLink, InfiniBand). Ordinary PCIe clusters need not apply.

An encoder/decoder split moves only the vision embedding: token count × hidden size. Transformer depth L does not inflate the transfer. KV state does not exist yet, so you never ship it.

// LLaVA-7B-style numbers (576 tokens, dim 4096, FP16):
// Embedding crossing the wire: ~4.5 MB
// Full KV cache at same context (704 tokens): ~350 MB
// Ratio: ~78× less data at the encoder seam
//
// PCIe transfer: sub-millisecond vs seconds of encode work

“Partitioning here reduces transfer complexity from O(L·s_ctx) bytes (GB-scale KV caches under stage-level disaggregation) to O(N_v·d) bytes (MB-scale embeddings), an O(L) reduction where L is the transformer depth.” — Donglin Yu et al., arXiv:2603.12707

Yu et al. report 12×–196× transfer reductions across current architectures. Deeper models widen the gap: embeddings stay compact; KV migration cost grows with L.
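
The two payloads are easy to compare directly. The sketch below assumes an MHA model whose KV width equals its hidden size, FP16 values, and a hypothetical effective link rate of 25 GB/s standing in for a PCIe-class connection. None of those figures come from the paper, and the function names are illustrative.

```python
def embedding_bytes(vision_tokens: int, hidden: int, bytes_per_value: int = 2) -> int:
    """Payload at the encoder seam: one hidden vector per visual token."""
    return vision_tokens * hidden * bytes_per_value


def kv_cache_bytes(context_tokens: int, layers: int, hidden: int, bytes_per_value: int = 2) -> int:
    """Payload at the prefill/decode seam (MHA assumed: KV width == hidden)."""
    return 2 * layers * context_tokens * hidden * bytes_per_value


ASSUMED_LINK_BYTES_PER_S = 25e9  # hypothetical effective PCIe-class throughput

emb = embedding_bytes(vision_tokens=576, hidden=4096)
kv = kv_cache_bytes(context_tokens=704, layers=32, hidden=4096)

print(f"embedding: {emb / 2**20:.1f} MiB")
print(f"KV cache:  {kv / 2**20:.0f} MiB")
print(f"ratio:     {kv / emb:.0f}x")
print(f"embedding transfer: {emb / ASSUMED_LINK_BYTES_PER_S * 1e3:.2f} ms")

# Depth only inflates the KV payload, so the ratio grows linearly with layers.
for depth in (32, 48, 64):
    ratio = kv_cache_bytes(704, depth, 4096) / emb
    print(f"layers={depth}: {ratio:.0f}x")
```

With these inputs the ratio simplifies to 2 × layers × context tokens ÷ visual tokens, which is why doubling depth doubles the gap while the embedding payload stays put.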

Right-Sizing Hardware Per Phase

After the split, assignment is obvious. Encode goes on silicon chosen for FLOPs per dollar (consumer or datacenter compute cards). Decode goes on HBM-rich cards (A100/H100 class). An RTX 4090 and an A100 are in the same ballpark for peak TFLOPS; the 4090 wins on FLOPs/$; the A100 wins on memory bandwidth and VRAM capacity.

Heterogeneous deployment under phase-separable workloads showed ~40% measured cost savings versus homogeneous baselines in recent work, with no latency regression when scheduled correctly. Engine tricks (CUDA graphs, packed prefill, paged KV) still matter, but they do not replace matching silicon to phase.

Case 3: Text-Only Traffic Leaves Encoders Idle

Pure splits waste encoder capacity during text-only bursts. Work-stealing schedulers let encode pools absorb decode jobs when the vision queue is empty, recovering utilisation without fragile role swapping.
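
The core of that idea fits in a few lines. The sketch below is a toy, single-threaded illustration of the policy only, not the scheduler from any paper or serving engine. The queue names and the `next_job_for_encoder` helper are hypothetical, and a real system would need locking, admission control, and a check that the encode card has room for the decode job's KV cache.

```python
from collections import deque


def next_job_for_encoder(vision_queue: deque, decode_queue: deque):
    """Pick work for an encode-pool worker: own queue first, then steal."""
    if vision_queue:
        # Normal case: serve the oldest waiting image.
        return ("encode", vision_queue.popleft())
    if decode_queue:
        # Vision queue is empty: steal from the tail, leaving the head
        # (the oldest requests) for the dedicated decode workers.
        return ("decode", decode_queue.pop())
    return None  # nothing to do anywhere


vision = deque(["img-1"])
decode = deque(["req-a", "req-b", "req-c"])

print(next_job_for_encoder(vision, decode))  # takes the queued image
print(next_job_for_encoder(vision, decode))  # vision empty, steals a decode job
print(list(decode))                          # decode pool keeps its oldest work
```

Because the worker always checks its own queue first, it returns to encoding as soon as an image arrives. No explicit role swap is needed.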

Diagnose Before You Re-Architect

  1. Hold model and sampling fixed; vary only image size/count.
  2. Plot ITL vs vision-token count at realistic concurrency.
  3. Profile encode and decode separately (Nsight Systems).
  4. Correlate with vLLM KV-cache and queue metrics.
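
For step 2, a straight-line fit gives a quick first read on how much each visual token costs. The sketch below uses ordinary least squares in plain Python. The measurements are invented for illustration and the `fit_line` helper is hypothetical; swap in ITL numbers from your own runs at a fixed concurrency.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical runs: same model and sampling, only the image count varies.
vision_token_counts = [0, 576, 1152, 2304]  # 0, 1, 2 and 4 images
itl_p95_ms = [18.0, 20.88, 23.76, 29.52]

slope, intercept = fit_line(vision_token_counts, itl_p95_ms)
print(f"baseline ITL: {intercept:.2f} ms")
print(f"cost per visual token: {slope * 1000:.2f} microseconds of ITL")
print(f"cost per 576-token image: {slope * 576:.2f} ms of ITL")
```

If the fitted slope rises when the sweep is repeated at higher concurrency, that points at bandwidth contention rather than the model itself.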

When Monolithic Serving Is Enough

Dual pools pay off at sustained multimodal volume where a third of inference spend is on the table. Prototypes, low QPS, and teams without scheduler appetite should stay on one GPU until the maths forces a change. The same encoder/decoder asymmetry applies to audio and video towers, not just vision.

What to Check Right Now

  • KV bytes per request across image resolutions.
  • Phase-split profiling for compute vs bandwidth inversion.
  • ITL slope under load for super-linear growth.
  • Business case before investing in heterogeneous pools.

Half your GPU was on holiday during every request. The fix is giving each phase the silicon it actually uses.

nJoy 😉