ecahLang

Building a Lightweight LLM Inference Engine

From naive inference to optimized continuous batching with FlashInfer

PyTorch FlashInfer HuggingFace FastAPI

~1,500 lines of code · Any HuggingFace CausalLM · OpenAI-compatible API

The Problem: LLM Inference is Slow

  • LLMs are autoregressive — one token at a time
  • 384 tokens = 384 sequential forward passes
  • CPU kernel launch overhead dominates small batches
  • At 128 users, everything compounds

Goal: minimize TTFT (Time-to-First-Token) and ITL (Inter-Token Latency)

Naive Baseline Performance
ITL (1 user): 26.3 ms/token
E2E (1 user): 10.1 seconds
ITL (128 users): 120.8 ms/token
E2E (128 users): 47.3 seconds
Throughput (128 users): 1,040 tok/s
2.7x faster
after all optimizations

Two Phases of LLM Inference

Phase 1: Prefill — prompt processing (determines TTFT). The full prompt ("What is the capital of France ?", 7 tokens) passes through all layers at once; the keys and values for all 7 tokens are stored in the KV cache for reuse during decode, and the first output token ("The") is produced.

  • Compute-bound (matrix multiplications)
  • Happens once per request
  • Processes all prompt tokens in parallel

Phase 2: Decode — token generation (determines ITL). Each step feeds in the previous token plus a KV cache lookup: "The" → "capital" → "of" → ... The KV cache grows by one token per step.

  • Memory-bandwidth bound (reads the entire KV cache)
  • Happens N times (once per output token)
  • 1 token in → 1 token out per forward pass

Different bottlenecks → different optimizations.

Architecture Overview

HTTP client → POST /chat/completions → FastAPI app (OpenAI API, SSE streaming). Requests land in two asyncio.Queues — a prefill queue and a decode queue — each drained by its own background loop: process_queue(prefill=True) drives the BatchPrefillWrapper, process_queue(prefill=False) drives the BatchDecodeWrapper. Both run the HuggingFace model through the ecah_attention() hook, backed by the AutoKVCacheManager (paged KV cache). The queue loops live in main.py, the cache manager in manager.py. Separate loops for each phase.

Continuous Batching

Static Batching

Req A: 500 tokens
Req B: 200 tokens, then idle...
Req C: 400 tokens
Req D: blocked

All must finish before new requests start

Continuous Batching (ecahLang)

Req A: 500 tokens
Req B: 200 tokens
Req C: 400 tokens
Req D: joins when B finishes

Requests join/leave dynamically — no wasted GPU

  • Two independent asyncio.Queue loops — prefill and decode run as concurrent background tasks
  • Microsleep (0.1ms) groups incoming requests · Up to 128 requests per batch
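The two-loop pattern can be sketched with plain asyncio. This is a minimal sketch, not the real main.py: `process_queue`, the toy `prefill`/`decode` handlers, and all names here are illustrative; the real handlers run the model.

```python
import asyncio

async def process_queue(queue, handler, max_batch=128, microsleep=0.0001):
    # Block until one request arrives, then hold a ~0.1 ms window so
    # concurrent arrivals are grouped into the same batch.
    while True:
        batch = [await queue.get()]
        await asyncio.sleep(microsleep)
        while not queue.empty() and len(batch) < max_batch:
            batch.append(queue.get_nowait())
        await handler(batch)

async def main():
    prefill_q, decode_q = asyncio.Queue(), asyncio.Queue()
    finished = []

    async def prefill(batch):          # toy handler: hand off to decode
        for req in batch:
            await decode_q.put(req)

    async def decode(batch):           # toy handler: mark requests done
        finished.extend(batch)

    loops = [asyncio.create_task(process_queue(prefill_q, prefill)),
             asyncio.create_task(process_queue(decode_q, decode))]
    for i in range(5):
        await prefill_q.put(f"req{i}")
    await asyncio.sleep(0.05)          # let both loops drain the queues
    for t in loops:
        t.cancel()
    return finished
```

Because both loops are ordinary tasks on one event loop, prefill and decode interleave naturally without threads or locks.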

Paged KV Cache — The Problem

KV Cache Memory per Token

Qwen2.5-3B per-token KV size: 36 layers × 4 KV heads × 128 head dim × 2 (K+V) × 2 bytes = 72 KB per token. 1,000 tokens ≈ 72 MB · 2,048 tokens ≈ 147 MB.
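The arithmetic above is quick to check (numbers are those quoted for Qwen2.5-3B in float16):

```python
# Per-token KV cache footprint for Qwen2.5-3B
layers, kv_heads, head_dim = 36, 4, 128
bytes_per_elem = 2           # float16
kv_factor = 2                # one K and one V tensor per layer

per_token_bytes = layers * kv_heads * head_dim * kv_factor * bytes_per_elem
per_token_kb = per_token_bytes // 1024   # 72 KB per token
```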

Naive Pre-allocation

Request A: 200 tokens used, 1,848 wasted
Request B: 150 tokens used, 1,898 wasted
128 reqs × 2048 max × 72KB = 18.4 GB wasted
Most requests use <10% of allocated memory

Paged KV Cache — The Solution

Physical KV cache memory on the GPU is split into fixed-size blocks. Example: Request A (35 tokens) holds blocks 0–2 (16 + 16 + 3 tokens), Request B (20 tokens) holds blocks 3–4 (16 + 4 tokens), and the remaining blocks stay free.

How it works:
  • Divide GPU memory into fixed-size blocks (16 tokens each)
  • Allocate blocks on demand as tokens are generated · free-list allocator: O(1)
  • Auto-sizing via pynvml: max_blocks = (free_gpu_mem × utilization) / (layers × per_block_bytes)
kv_cache = torch.zeros(num_layers, max_blocks, 2, block_size, num_kv_heads, head_dim)  # auto-computed from GPU memory
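A free-list allocator in this style can be sketched as follows. `BlockAllocator` and its method names are illustrative assumptions, not the actual manager.py interface:

```python
import math

class BlockAllocator:
    # Free list over fixed-size KV pages: allocate and release are
    # O(1) per block, and no per-request maximum is ever reserved.
    def __init__(self, max_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(max_blocks))   # all physical block ids

    def allocate(self, num_tokens):
        needed = math.ceil(num_tokens / self.block_size)
        if needed > len(self.free):
            raise MemoryError("paged KV cache exhausted")
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks):
        self.free.extend(blocks)     # returned pages are reusable at once

alloc = BlockAllocator(max_blocks=8)
req_a = alloc.allocate(35)   # 35 tokens -> 3 blocks (16 + 16 + 3)
req_b = alloc.allocate(20)   # 20 tokens -> 2 blocks (16 + 4)
alloc.release(req_a)         # request A finished; its pages go back
```

Freed pages return to the list immediately, so a finishing request's memory is available to the next one without any compaction.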

FlashInfer Integration

ecah_attention() — per layer:
  • query, key, value arrive as [1, H, L, D] → reshape to [L, H, D]
  • Append K, V to the paged KV cache: manager.append_paged_kv_cache()
  • FlashInfer paged attention: wrapper.run(query, kv_cache[layer])
  • Reshape the output back to [1, L, H, D]
Key design: Register custom attention via HuggingFace AttentionInterface — works with any CausalLM without modifying model code.
# Register custom attention (once)
AttentionInterface.register(
    "ecah_attention", ecah_attention
)

# Load model — any CausalLM works
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="ecah_attention",
)

Two FlashInfer Wrappers

Wrapper                        Phase
BatchPrefillWithPagedKVCache   Prefill (causal)
BatchDecodeWithPagedKVCache    Decode (single-token)

Metadata (kv_indices, kv_indptr, kv_last_page_len) pre-computed once per batch, reused across all layers.
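The CSR-style metadata can be sketched like this. `build_kv_metadata` is a hypothetical helper, but the three arrays follow the kv_indices / kv_indptr / kv_last_page_len layout named above, with page_size matching the 16-token blocks:

```python
# Build batch metadata for a paged-KV attention wrapper (sketch).
def build_kv_metadata(seq_lens, page_tables, page_size=16):
    kv_indices, kv_indptr, kv_last_page_len = [], [0], []
    for seq_len, pages in zip(seq_lens, page_tables):
        kv_indices.extend(pages)              # physical page ids, concatenated
        kv_indptr.append(len(kv_indices))     # prefix sum of page counts
        kv_last_page_len.append((seq_len - 1) % page_size + 1)
    return kv_indices, kv_indptr, kv_last_page_len

# Two sequences: 35 tokens on pages [0, 1, 2], 20 tokens on pages [3, 4]
idx, indptr, last = build_kv_metadata([35, 20], [[0, 1, 2], [3, 4]])
```

Since these arrays depend only on batch composition, not on layer index, computing them once per batch and reusing them across all 36 layers is what makes the per-layer hook cheap.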

Chunked Prefill — The Problem

At high concurrency, many prefill requests arrive simultaneously:

One giant prefill: 128 requests × ~200 tokens = 25,600 tokens in a single batch, while the decode queue sits blocked, waiting. TTFT = 0.96s at 128 concurrency.
Result: Large prefill batches monopolize the GPU. All decode requests for existing users are blocked, ITL spikes for everyone, and new users wait for the entire batch to complete before getting their first token.

Chunked Prefill — How It Works

Set a token budget per prefill round (max_prefill_tokens = 2048). Iterate through queued requests, adding each one until the budget would be exceeded.

Budget: 2,048 tokens.
  • + req1 (200 tok) → total 200 ✓ fits
  • + req2 (150 tok) → total 350 ✓ fits
  • + req3 (300 tok) → total 650 ✓ fits
  • ... requests keep filling ... → total 1,900 ✓ fits
  • + req10 (250 tok) → total 2,150 ✗ over budget → process reqs 1–9 (1,900 tokens) now
Reqs 10–128 go into next_batch and are processed on the next loop iteration (no sleep) — reqs 1–9 start decoding while the rest are still prefilling.
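The budget loop might look like this. A sketch only: `take_prefill_chunk` is an illustrative name, and queued requests are represented by their prompt-token counts:

```python
# Take one token-budgeted chunk from the pending prefill queue.
def take_prefill_chunk(pending, max_prefill_tokens=2048):
    chunk, total = [], 0
    while pending and total + pending[0] <= max_prefill_tokens:
        total += pending[0]
        chunk.append(pending.pop(0))
    return chunk, total        # whatever remains runs on the next loop

queue = [200, 150, 300] + [250] * 10   # 13 queued requests
chunk, total = take_prefill_chunk(queue)
# first 8 requests fit (1,900 tokens); 5 requests stay queued for the next round
```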

Chunked Prefill — Impact

Early chunks finish and enter the decode queue while later chunks are still prefilling.

Without chunking: one giant 25,600-token prefill (0.96s TTFT). With chunking: chunks C1–C13 run back to back, interleaved with decode rounds — chunk 1's requests enter the decode queue immediately, with no waiting for C2–C13.
Concurrency   Baseline TTFT   Chunked TTFT   Change
1             0.046s          0.041s         -9%
16            0.146s          0.122s         -16%
64            0.493s          0.396s         -20%
128           0.964s          0.743s         -23%
-23% TTFT
at 128 concurrency (0.96s → 0.74s)
Bonus: Also improves ITL indirectly. Smaller prefill chunks release the GPU sooner, so the decode loop gets serviced more frequently.
ITL at 128: baseline 120.75ms → chunked 43.04ms

CUDA Graphs — The Problem

During decode, each token = tiny GPU workload. But CPU launch overhead dominates:

CPU: prepare → launch → wait, over and over; GPU: idle → run → idle. The GPU spends most of each decode step waiting on the CPU.
Per decode step:
  • Prepare tensors: ~0.2ms
  • Kernel launch: ~0.5ms
  • Actual GPU compute: ~2ms
  • Synchronization: ~0.3ms

CPU overhead ≈ 33% of total time

GPU idle >60%
at batch_size=1, GPU waits for CPU

CUDA Graphs — The Solution

Record entire forward pass once at startup, replay with a single CPU call:

CPU: copy inputs → replay, over and over; GPU: back-to-back forward passes with no idle gaps.

Capture at startup (once per bucket), replay on every decode step: 1 CPU call = entire forward pass. No CPU–GPU round-trips, no kernel launch overhead.
class CUDAGraphDecodeWrapper:
    def __init__(self, fn):
        self.fn = fn
        self.static_inputs, self.static_outputs, self.graphs = {}, {}, {}

    def warmup(self, bs, capture_stream, **inputs):
        self.static_inputs[bs] = {
            k: v.clone().cuda() for k, v in inputs.items()
        }
        for _ in range(2):                     # warmup runs before capture
            self.fn(**self.static_inputs[bs])
        torch.cuda.synchronize()
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g, stream=capture_stream):
            out = self.fn(**self.static_inputs[bs])
        self.graphs[bs] = g
        self.static_outputs[bs] = out          # replay writes into this buffer

    def run(self, bs, **new_inputs):
        for k, v in new_inputs.items():
            self.static_inputs[bs][k].copy_(v)
        self.graphs[bs].replay()  # single CPU call = entire forward pass
        return self.static_outputs[bs]

CUDA Graphs — Bucket Padding

CUDA Graphs require fixed tensor shapes. Batch sizes vary at runtime → round up to power-of-2 buckets and pad with dummy KV pages.
Reqs 1–5: real · slots 6–8: dummy padding

Batch of 5 → bucket 8 (next power of 2) · Padding results discarded

Bucket sizes

1 · 2 · 4 · 8 · 16 · 32 · 64 · 128

Per bucket: alloc dummy KV → plan wrapper → warmup 2x → capture graph → free dummies

At inference time

Batch arrives (N=5) → bucket = 8 → copy inputs + replay graph → take the first 5 results.
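Bucket selection is small enough to spell out (`bucket_for` is an illustrative name; the cap matches the largest captured graph):

```python
def bucket_for(batch_size, max_bucket=128):
    # Round a runtime batch size up to the next power-of-2 bucket,
    # capped at the largest bucket a graph was captured for.
    b = 1
    while b < batch_size:
        b *= 2
    return min(b, max_bucket)

bucket = bucket_for(5)   # 8: pad 3 dummy slots, replay, keep first 5 outputs
```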

CUDA Graphs — Impact

Concurrency   Baseline ITL   CUDA Graph ITL   Speedup
1             26.30 ms       8.06 ms          3.3x
4             28.76 ms       8.84 ms          3.3x
8             57.36 ms       19.02 ms         3.0x
32            71.55 ms       22.03 ms         3.3x
64            98.22 ms       27.69 ms         3.5x
128           120.75 ms      44.19 ms         2.7x
2.7–3.5x ITL
across all concurrency levels
10.1s → 3.1s E2E
at 1 user (3.2x faster)

torch.compile

Alternative to CUDA Graphs: JIT-compile the decode function. Same bucketing infrastructure, less rigid.
Aspect        CUDA Graphs           torch.compile
Mechanism     Record & replay ops   JIT → optimized CUDA
Flexibility   Fixed shapes only     Dynamic shapes possible
Overhead      Near-zero replay      Some per-call overhead
Speed         Faster                Good middle ground
@torch.compiler.disable  # FlashInfer ops can't compile
def ecah_attention(...):
    ...

decode = torch.compile(decode, mode="default", dynamic=False)
            Baseline   torch.compile    CUDA Graphs
ITL @ 1     26.3 ms    19.8 ms (1.3x)   8.1 ms (3.3x)
ITL @ 128   120.8 ms   75.9 ms (1.6x)   44.2 ms (2.7x)

CUDA Graphs and torch.compile are mutually exclusive (enforced); both share the same bucketing infrastructure.

CUDA Stream Overlap

Operations on different CUDA streams execute concurrently. ecahLang uses two streams.

Without (sequential):

Single stream: plan → forward (15ms) → plan → forward (15ms), strictly sequential.

With stream overlap:

Default stream: plan → CPU free → plan → CPU free
Forward stream: forward + sampling → forward + sampling
forward + sampling
fwd_stream.wait_stream(torch.cuda.current_stream())  # ensure plan() finished
with torch.cuda.stream(fwd_stream):
    output = model.forward(...)    # GPU on forward stream
    logits = output.logits[...]    # sampling also on forward stream
# CPU free to collect next batch, detokenize, etc.

Two-Phase Batch Collection

While GPU processes current batch, CPU pre-collects the next batch from the queue.

Without overlap:

GPU: Forward A → idle → Forward B → idle
CPU: plan → idle → collect B → plan → idle → collect

With two-phase collection:

GPU: Forward Batch A → Forward Batch B (back to back)
CPU: plan → collect B (while GPU runs) → plan → collect C (while GPU runs)
with torch.cuda.stream(fwd_stream):    # Phase 1: GPU working
    output = model.forward(...)
await asyncio.sleep(0)                  # Phase 2: yield to event loop
while not queue.empty():                # collect next batch while GPU runs
    next_batch.append(await queue.get())
fwd_stream.synchronize()                # Phase 3: GPU done, next batch ready
# Also: tokenizer.batch_decode() in thread pool via run_in_executor()

Pinned Sampling Buffers

Regular Transfer

CPU RAM (allocated each step) → staging buffer (extra copy!) → GPU, blocking the CPU. Allocate each step + blocking copy.

Pinned Memory (ecahLang)

Pinned CPU RAM (written through a numpy view) → DMA → GPU, non-blocking. Pre-allocated once + direct transfer.
# Startup: allocate ONCE
temp_cpu = torch.ones(max_batch, 1,
    dtype=torch.float32).pin_memory()

# Numpy views (zero-overhead scalar writes)
temp_np = temp_cpu.numpy()

# Persistent GPU buffers
temp_gpu = torch.ones(max_batch, 1,
    dtype=torch.float32, device="cuda")


# Each decode step:
# 1. Write via numpy (zero Python overhead)
temp_np[:n, 0] = temperatures

# 2. Async DMA copy (non-blocking)
temp_gpu[:n].copy_(
    temp_cpu[:n], non_blocking=True
)

# 3. GPU reads temp_gpu during sampling
# No allocation, no staging, no blocking!

Multi-Step Decode

Run N decode steps in one batch round-trip instead of returning to the event loop after each token.

multi_step=1 (default):

loop → queue → fwd · loop → queue → fwd · loop → queue → fwd — one event-loop round-trip per token

multi_step=4:

loop → queue → fwd 1 → fwd 2 → fwd 3 → fwd 4 → detok → loop — one round-trip for four tokens

Active Sequence Tracking

Step   Seq A    Seq B     Seq C
0      "The"    "Hello"   "Yes"
1      "end"    <EOS>     ","
2      "."      skip      "it"
3      <EOS>    skip      "is"

Benefits

  • 1 queue round-trip for N tokens
  • Batch detokenization at the end
  • EOS-aware: inactive seqs skip forward passes
  • Not compatible with CUDA Graphs (enforced)
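The EOS-aware loop behind this table can be sketched as follows. `multi_step_decode` and `step` are illustrative stand-ins for the real decode path; `step` represents one forward+sample pass over the currently active sequences:

```python
# Run up to `multi_step` decode steps in one round-trip, skipping
# sequences that have already emitted EOS.
def multi_step_decode(num_seqs, step, eos="<EOS>", multi_step=4):
    active = set(range(num_seqs))
    outputs = [[] for _ in range(num_seqs)]
    for _ in range(multi_step):
        if not active:
            break                          # all sequences finished early
        idx = sorted(active)
        for i, tok in zip(idx, step(idx)):  # one pass, active seqs only
            outputs[i].append(tok)
            if tok == eos:
                active.discard(i)          # skip this seq from now on
    return outputs                         # detokenize once, after N steps

# Scripted stand-in reproducing the table above (seq B stops at step 1)
script = {0: ["The", "end", ".", "<EOS>"],
          1: ["Hello", "<EOS>"],
          2: ["Yes", ",", "it", "is"]}
pos = {0: 0, 1: 0, 2: 0}
def fake_step(idx):
    toks = [script[i][pos[i]] for i in idx]
    for i in idx:
        pos[i] += 1
    return toks

out = multi_step_decode(3, fake_step)
```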

FlashInfer-Native Sampling

Standard: 5 Kernel Launches

logits → Kernel 1: ÷ temperature → Kernel 2: top-k filter → Kernel 3: softmax → Kernel 4: top-p filter → Kernel 5: sample. 5× CPU–GPU launch overhead.

FlashInfer: 1 Fused Kernel

logits → single fused CUDA kernel (top-k + top-p + sampling, all fused on GPU) → token_id. One launch instead of five.
idx = flashinfer.sampling.top_k_top_p_sampling_from_logits(
    logits, top_k=top_k_t, top_p=top_p_t, deterministic=True
)

Benchmark Setup

Configuration

  • Model: Qwen2.5-3B-Instruct
  • Dtype: float16
  • Output: 384 tokens (ignore_eos)
  • Prompt: ~200 tokens

Concurrency

1, 2, 4, 8, 16, 32, 64, 128

All concurrent · SSE streaming

Compared Against

  • Baseline (no opts)
  • CUDA Graphs / torch.compile
  • Chunked Prefill
  • SGLang / vLLM
Metric       What it measures      Phase
TTFT         Time to First Token   Prefill speed
ITL          Inter-Token Latency   Decode speed
E2E          End-to-End Latency    Total request time
Throughput   Tokens/second         System capacity

Benchmark — TTFT (Time to First Token)

Lower is better. Measures prefill speed.

Concurrency   Baseline   CUDA Graph   Chunked Prefill   torch.compile   SGLang   vLLM
1             0.046s     0.037s       0.041s            0.051s          0.104s   0.061s
8             0.094s     0.094s       0.088s            0.164s          0.047s   0.047s
32            0.253s     0.264s       0.226s            0.271s          0.117s   0.171s
64            0.493s     0.579s       0.396s            0.520s          0.399s   0.245s
128           0.964s     1.033s       0.743s            0.942s          0.336s   0.412s
Takeaway: Chunked Prefill is the clear TTFT winner within ecahLang — 23% improvement at 128 concurrency. CUDA Graphs are decode-only and slightly hurt TTFT. SGLang/vLLM excel at high concurrency (advanced prefill-decode interleaving).

Benchmark — ITL (Inter-Token Latency)

Lower is better. Measures decode speed.

Concurrency   Baseline    CUDA Graph   Chunked Prefill   torch.compile   SGLang    vLLM
1             26.30 ms    8.06 ms      7.84 ms           19.79 ms        3.85 ms   3.86 ms
8             57.36 ms    19.02 ms     18.07 ms          38.56 ms        4.09 ms   4.35 ms
32            71.55 ms    22.03 ms     22.69 ms          45.61 ms        4.50 ms   5.25 ms
64            98.22 ms    27.69 ms     28.68 ms          57.63 ms        5.02 ms   6.66 ms
128           120.75 ms   44.19 ms     43.04 ms          75.93 ms        6.67 ms   9.85 ms
Takeaway: CUDA Graphs give 2.7–3.5x ITL improvement. Chunked Prefill nearly matches at high concurrency — indirectly helps decode. SGLang/vLLM maintain sub-10ms ITL even at 128 (custom fused kernels + advanced scheduling).

Benchmark — E2E & Throughput

E2E Latency (lower is better)

Conc.   Baseline   Best ecahLang   SGLang
1       10.12s     3.05s           1.58s
8       22.07s     7.01s           1.61s
32      27.67s     8.71s           1.84s
128     47.25s     17.23s          2.89s

Throughput (higher is better)

Conc.   Baseline   Best ecahLang   SGLang
1       38         126             243
8       139        438             1,903
32      444        1,411           6,664
128     1,040      2,835           16,940

Why the ~6x throughput gap vs SGLang/vLLM?

  • Advanced schedulers: interleave prefill + decode
  • Tensor parallelism (multi-GPU)
  • Prefix caching (KV reuse)
  • Custom fused CUDA kernels
  • Speculative decoding
  • ecahLang: simplicity & hackability (~1500 LOC)

Future Work — LMCache Integration

Goal: KV cache reuse across requests with common token prefixes (system prompts, RAG context). Skip prefill for cached portions → reduce TTFT.
Tokenize prompt → LMCache lookup: is the prefix cached?
  • Cache hit: load KV from CPU → GPU
  • Cache miss: normal full prefill
  • LMCache store (async): save KV → CPU for future reuse

Expected Impact

  • 1000-token cached system prompt = skip ~1000 tokens of prefill
  • CPU memory tier with LRU eviction
  • Significant TTFT reduction for repeated prefixes

Compatibility

  • Chunked Prefill: cache hits shrink the token count before budgeting
  • CUDA Graphs: decode-only, LMCache is prefill-only
  • torch.compile: same as CUDA Graphs

All Optimizations

#    Optimization           Phase     Key Idea                          Impact
1    Paged KV Cache         Both      OS-style virtual memory for KV    128+ concurrent requests
2    FlashInfer Attention   Both      Batched paged attention kernels   Faster than SDPA
3    Chunked Prefill        Prefill   Token-budgeted batch splitting    -23% TTFT
4    CUDA Graphs            Decode    Record & replay GPU ops           2.7-3.5x ITL
5    torch.compile          Decode    JIT compile decode path           1.6x ITL
6    Stream Overlap         Both      Concurrent CUDA streams           ~5-10% latency
7    Pinned Buffers         Decode    DMA + numpy views                 Zero-alloc per step
8    Two-Phase Batching     Both      Collect while GPU runs            Hide collection latency
9    Multi-Step Decode      Decode    N steps per round-trip            Reduce queue overhead
10   Fused Sampling         Decode    Single FlashInfer kernel          Fewer launches

Key Takeaways

Different phases need different optimizations
Prefill is compute-bound → chunked prefill
Decode is memory-bandwidth-bound → CUDA graphs, stream overlap, pinned buffers
Simplicity and hackability over maximum throughput
~1,500 LOC vs 100k+ in SGLang/vLLM · Works with any HuggingFace CausalLM · OpenAI-compatible API
2.8x ITL
@ 128 concurrency
2.7x E2E
@ 128 concurrency
2.7x Throughput
1,040 → 2,835 tok/s

github.com/Scicom-AI-Enterprise-Organization/ecahLang