Key design: Register custom attention via HuggingFace AttentionInterface — works with any CausalLM without modifying model code.
```python
# Register custom attention (once)
AttentionInterface.register("ecah_attention", ecah_attention)

# Load model — any CausalLM works
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="ecah_attention",
)
```
Two FlashInfer Wrappers
| Wrapper | Phase |
|---|---|
| BatchPrefillWithPagedKVCache | Prefill (causal) |
| BatchDecodeWithPagedKVCache | Decode (single-token) |
Metadata (kv_indices, kv_indptr, kv_last_page_len) is computed once per batch and reused across all layers.
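As a pure-Python sketch of that metadata (function name and list layout are illustrative; the real engine builds int32 CUDA tensors in FlashInfer's CSR-style paged-KV format):

```python
def build_kv_metadata(page_tables, seq_lens, page_size):
    """Build paged-KV metadata for a batch.

    page_tables: per-request lists of KV page ids
    seq_lens:    per-request token counts
    Returns the flat page ids, CSR offsets into them, and the number
    of tokens used in each request's last page.
    """
    kv_indices = [p for pages in page_tables for p in pages]
    kv_indptr = [0]
    for pages in page_tables:
        kv_indptr.append(kv_indptr[-1] + len(pages))
    kv_last_page_len = [(n - 1) % page_size + 1 for n in seq_lens]
    return kv_indices, kv_indptr, kv_last_page_len
```

For example, two requests holding pages [0, 1] and [2] with 5 and 3 tokens at page_size=4 yield kv_indices=[0, 1, 2], kv_indptr=[0, 2, 3], kv_last_page_len=[1, 3].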
Chunked Prefill — The Problem
At high concurrency, many prefill requests arrive simultaneously.
Result: large prefill batches monopolize the GPU. All decode requests for existing users are blocked, ITL spikes for everyone, and new users wait for the entire batch to complete before getting their first token.
Chunked Prefill — How It Works
Set a token budget per prefill round (max_prefill_tokens = 2048). Iterate through queued requests, adding each one until the budget would be exceeded.
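A minimal sketch of that budgeting loop (names are hypothetical; real requests carry more state than a pending-token count):

```python
MAX_PREFILL_TOKENS = 2048  # token budget per prefill round

def collect_prefill_batch(queue):
    """Fill one prefill round up to the token budget.

    `queue` holds (request_id, pending_prefill_tokens) pairs; an
    oversized request is split into a chunk that fits, and its
    remainder waits in place for the next round.
    """
    batch, budget = [], MAX_PREFILL_TOKENS
    while queue and budget > 0:
        req_id, pending = queue[0]
        take = min(pending, budget)
        batch.append((req_id, take))
        budget -= take
        if take == pending:
            queue.pop(0)  # request fully scheduled this round
        else:
            queue[0] = (req_id, pending - take)  # remainder stays queued
    return batch
```

With a 2000-token and a 100-token request queued, the first round takes all 2000 tokens of the first plus a 48-token chunk of the second; the remaining 52 tokens wait for the next round.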
Chunked Prefill — Impact
Early chunks finish and enter the decode queue while later chunks are still prefilling.
| Concurrency | Baseline TTFT | Chunked TTFT | Change |
|---|---|---|---|
| 1 | 0.046s | 0.041s | -9% |
| 16 | 0.146s | 0.122s | -16% |
| 64 | 0.493s | 0.396s | -20% |
| 128 | 0.964s | 0.743s | -23% |
-23% TTFT at 128 concurrency (0.96s → 0.74s)
Bonus: chunked prefill also improves ITL indirectly. Smaller prefill chunks release the GPU sooner, so the decode loop is serviced more frequently. ITL at 128 concurrency: baseline 120.75 ms → chunked 43.04 ms.
CUDA Graphs — The Problem
During decode, each token = tiny GPU workload. But CPU launch overhead dominates:
Per decode step:
- Prepare tensors: ~0.2 ms
- Kernel launch: ~0.5 ms
- Actual GPU compute: ~2 ms
- Synchronization: ~0.3 ms

CPU overhead ≈ 33% of total time. At batch_size=1 the GPU sits idle more than 60% of the time, waiting on the CPU.
CUDA Graphs — The Solution
Record entire forward pass once at startup, replay with a single CPU call:
```python
class CUDAGraphDecodeWrapper:
    def warmup(self, bs, capture_stream, **inputs):
        # Static input buffers: graph replay reads from these fixed addresses
        self.static_inputs[bs] = {
            k: v.clone().cuda() for k, v in inputs.items()
        }
        # Run twice before capture to warm up kernels/allocator
        for _ in range(2):
            self.fn(**self.static_inputs[bs])
        torch.cuda.synchronize()
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g, stream=capture_stream):
            out = self.fn(**self.static_inputs[bs])
        self.graphs[bs] = g
        self.static_outputs[bs] = out  # same tensors are rewritten on each replay

    def run(self, bs, **new_inputs):
        for k, v in new_inputs.items():
            self.static_inputs[bs][k].copy_(v)
        self.graphs[bs].replay()  # single CPU call!
        return self.static_outputs[bs]
```
CUDA Graphs — Bucket Padding
CUDA Graphs require fixed tensor shapes. Batch sizes vary at runtime → round up to power-of-2 buckets and pad with dummy KV pages.
Batch of 5 real requests + 3 dummy PAD slots → bucket 8 (next power of 2) · padding results discarded
Bucket sizes: 1, 2, 4, 8, 16, 32, 64, 128
Per bucket: alloc dummy KV → plan wrapper → warmup 2x → capture graph → free dummies
At inference time: round the batch up to its bucket, copy inputs into that bucket's static buffers, and replay the captured graph.
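The bucket lookup itself is trivial; a sketch, assuming the power-of-2 bucket list above:

```python
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128]  # one captured graph per bucket

def pick_bucket(batch_size):
    """Round a runtime batch size up to the nearest captured bucket."""
    for b in BUCKETS:
        if b >= batch_size:
            return b
    raise ValueError(f"batch size {batch_size} exceeds largest bucket")
```

A batch of 5 lands in bucket 8; the three padding slots point at dummy KV pages and their outputs are discarded.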
CUDA Graphs — Impact
| Concurrency | Baseline ITL | CUDA Graph ITL | Speedup |
|---|---|---|---|
| 1 | 26.30 ms | 8.06 ms | 3.3x |
| 4 | 28.76 ms | 8.84 ms | 3.3x |
| 8 | 57.36 ms | 19.02 ms | 3.0x |
| 32 | 71.55 ms | 22.03 ms | 3.3x |
| 64 | 98.22 ms | 27.69 ms | 3.5x |
| 128 | 120.75 ms | 44.19 ms | 2.7x |
2.7–3.5x ITL improvement across all concurrency levels
10.1s → 3.1s E2E at 1 user (3.2x faster)
torch.compile
Alternative to CUDA Graphs: JIT-compile the decode function with torch.compile. It reuses the same bucketing infrastructure but is less rigid about shapes. The two options are mutually exclusive (enforced).
CUDA Stream Overlap
Operations on different CUDA streams can execute concurrently. ecahLang uses two streams.
Without overlap (single stream): plan → forward (15 ms) → plan → forward (15 ms), with the CPU tied up for the whole sequence.

With stream overlap: the default stream runs plan() and the CPU is then free, while the forward stream runs forward + sampling concurrently.
```python
fwd_stream.wait_stream(torch.cuda.current_stream())  # ensure plan() finished
with torch.cuda.stream(fwd_stream):
    output = model.forward(...)   # GPU work on the forward stream
    logits = output.logits[...]   # sampling also on the forward stream
# CPU free to collect next batch, detokenize, etc.
```
Two-Phase Batch Collection
While GPU processes current batch, CPU pre-collects the next batch from the queue.
Without overlap: the GPU idles between Forward A and Forward B while the CPU plans and collects the next batch, and the CPU idles during each forward.

With two-phase collection: Forward A and Forward B run back-to-back on the GPU, while the CPU plans and collects batch B (then C) during each forward.
```python
with torch.cuda.stream(fwd_stream):  # Phase 1: GPU working
    output = model.forward(...)
await asyncio.sleep(0)               # Phase 2: yield to event loop
while not queue.empty():             # collect next batch while GPU runs
    next_batch.append(await queue.get())
fwd_stream.synchronize()             # Phase 3: GPU done, next batch ready
# Also: tokenizer.batch_decode() runs in a thread pool via run_in_executor()
```
Pinned Sampling Buffers
Instead of a regular host-to-device transfer (allocate → stage → blocking copy) on every step, ecahLang writes into pre-allocated pinned buffers:

```python
# Startup: allocate ONCE
temp_cpu = torch.ones(max_batch, 1, dtype=torch.float32).pin_memory()
# Numpy view (zero-overhead scalar writes)
temp_np = temp_cpu.numpy()
# Persistent GPU buffer
temp_gpu = torch.ones(max_batch, 1, dtype=torch.float32, device="cuda")

# Each decode step:
# 1. Write via numpy (zero Python overhead)
temp_np[:n, 0] = temperatures
# 2. Async DMA copy (non-blocking)
temp_gpu[:n].copy_(temp_cpu[:n], non_blocking=True)
# 3. GPU reads temp_gpu during sampling
# No allocation, no staging, no blocking!
```
Multi-Step Decode
Run N decode steps in one batch round-trip instead of returning to the event loop after each token.
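A sketch of that loop (names hypothetical; a real engine would also stop early per request on EOS or max length):

```python
NUM_DECODE_STEPS = 4  # tokens generated per batch round-trip

def decode_round(batch, step_fn):
    """Run several decode steps before yielding back to the scheduler.

    step_fn(batch) performs one forward pass and returns one new token
    per request. Amortizing NUM_DECODE_STEPS steps over a single
    round-trip cuts per-token queue and scheduling overhead.
    """
    emitted = [[] for _ in batch]
    for _ in range(NUM_DECODE_STEPS):
        tokens = step_fn(batch)
        for out, tok in zip(emitted, tokens):
            out.append(tok)
    return emitted
```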
Takeaway: Chunked Prefill is the clear TTFT winner within ecahLang — 23% improvement at 128 concurrency. CUDA Graphs are decode-only and slightly hurt TTFT. SGLang/vLLM excel at high concurrency (advanced prefill-decode interleaving).
Benchmark — ITL (Inter-Token Latency)
Lower is better. Measures decode speed.
| Concurrency | Baseline | CUDA Graph | Chunked Prefill | torch.compile | SGLang | vLLM |
|---|---|---|---|---|---|---|
| 1 | 26.30 ms | 8.06 ms | 7.84 ms | 19.79 ms | 3.85 ms | 3.86 ms |
| 8 | 57.36 ms | 19.02 ms | 18.07 ms | 38.56 ms | 4.09 ms | 4.35 ms |
| 32 | 71.55 ms | 22.03 ms | 22.69 ms | 45.61 ms | 4.50 ms | 5.25 ms |
| 64 | 98.22 ms | 27.69 ms | 28.68 ms | 57.63 ms | 5.02 ms | 6.66 ms |
| 128 | 120.75 ms | 44.19 ms | 43.04 ms | 75.93 ms | 6.67 ms | 9.85 ms |
Takeaway: CUDA Graphs give 2.7–3.5x ITL improvement. Chunked Prefill nearly matches at high concurrency — indirectly helps decode. SGLang/vLLM maintain sub-10ms ITL even at 128 (custom fused kernels + advanced scheduling).
Benchmark — E2E & Throughput
E2E Latency (lower is better)
| Conc. | Baseline | Best ecahLang | SGLang |
|---|---|---|---|
| 1 | 10.12s | 3.05s | 1.58s |
| 8 | 22.07s | 7.01s | 1.61s |
| 32 | 27.67s | 8.71s | 1.84s |
| 128 | 47.25s | 17.23s | 2.89s |
Throughput (higher is better)
| Conc. | Baseline | Best ecahLang | SGLang |
|---|---|---|---|
| 1 | 38 | 126 | 243 |
| 8 | 139 | 438 | 1,903 |
| 32 | 444 | 1,411 | 6,664 |
| 128 | 1,040 | 2,835 | 16,940 |
Why the ~6x throughput gap vs SGLang/vLLM?
- Advanced schedulers: interleave prefill + decode
- Tensor parallelism (multi-GPU)
- Prefix caching (KV reuse)
- Custom fused CUDA kernels
- Speculative decoding

ecahLang trades these for simplicity & hackability (~1,500 LOC).
Future Work — LMCache Integration
Goal: KV cache reuse across requests with common token prefixes (system prompts, RAG context). Skip prefill for cached portions → reduce TTFT.
Expected Impact
- A 1000-token cached system prompt skips ~1000 tokens of prefill
- CPU memory tier with LRU eviction
- Significant TTFT reduction for repeated prefixes
Compatibility
- Chunked Prefill: the cached prefix shrinks the token count before budgeting
- CUDA Graphs: no conflict (decode-only, while LMCache touches only prefill)
- torch.compile: same as CUDA Graphs
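A toy sketch of the prefix lookup that would gate prefill (a dict keyed by token-id tuples stands in for the cache; LMCache's real API differs):

```python
def cached_prefix_len(token_ids, cache):
    """Return how many leading tokens already have KV stored.

    Only the uncached suffix would then go through (chunked) prefill;
    the cached portion's KV pages are loaded instead of recomputed.
    """
    for n in range(len(token_ids), 0, -1):
        if tuple(token_ids[:n]) in cache:
            return n
    return 0
```

A request sharing a cached 3-token system-prompt prefix would prefill only its remaining tokens.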
All Optimizations
| # | Optimization | Phase | Key Idea | Impact |
|---|---|---|---|---|
| 1 | Paged KV Cache | Both | OS-style virtual memory for KV | 128+ concurrent requests |
| 2 | FlashInfer Attention | Both | Batched paged attention kernels | Faster than SDPA |
| 3 | Chunked Prefill | Prefill | Token-budgeted batch splitting | -23% TTFT |
| 4 | CUDA Graphs | Decode | Record & replay GPU ops | 2.7-3.5x ITL |
| 5 | torch.compile | Decode | JIT compile decode path | 1.6x ITL |
| 6 | Stream Overlap | Both | Concurrent CUDA streams | ~5-10% latency |
| 7 | Pinned Buffers | Decode | DMA + numpy views | Zero-alloc per step |
| 8 | Two-Phase Batching | Both | Collect while GPU runs | Hide collection latency |
| 9 | Multi-Step Decode | Decode | N steps per round-trip | Reduce queue overhead |
| 10 | Fused Sampling | Decode | Single FlashInfer kernel | Fewer launches |
Key Takeaways
- Different phases need different optimizations:
  - Prefill is compute-bound → chunked prefill
  - Decode is memory-bandwidth-bound → CUDA graphs, stream overlap, pinned buffers
- Simplicity and hackability over maximum throughput:
  - ~1,500 LOC vs 100k+ in SGLang/vLLM · works with any HuggingFace CausalLM · OpenAI-compatible API