Layer 6 of 9
Batching turns GPUs into factories.
Production inference piles constraints: tail latency p99, throughput per dollar, VRAM ceilings.
KV caches reuse attention keys across tokens; FlashAttention trades memory for bandwidth-aware tiling.
Continuous batching absorbs uneven prompts; speculative decoding drafts tokens then verifies cheaply.
Efficient servers coordinate paging across GPUs — Deadwood abstracts vendor knobs.
Without custodianship, your team inherits every sharp edge below.
Typical DIY cost
Opinionated APIs wire custodied data, runners, and proofs together — no boilerplate archaeology.
from deadwood import InferenceServer
server = InferenceServer(
model="mistral-7b-lora-finance",
optimization="auto",
target_latency_ms=100,
target_rps=1000,
)
results = server.batch_inference(requests)InferenceServer negotiates batch sizes, cache tiers, and quantization passes until telemetry satisfies SLAs.
Cost dashboards tie millisecond regressions to ledger lines — finance sees ops impact instantly.
| Optimization | Speedup | Cost / VRAM |
|---|---|---|
| Stock PyTorch | 1× | baseline |
| + batching | ≈8× | — |
| + KV and flash kernels | ≈3× | lower VRAM |
| + int8 quant | ≈2.5× | −50% VRAM |
| Deadwood stack | 15–20× | ≈70% savings |
Illustrative geometric means — actual uplift depends on sequence lengths.
Follow how custody chains into Evaluation & verification.
Next: Evaluation & verification