⚡ WebGPU · No Framework · No TVM

Run Phi-3 in your browser
with 10 kernels / 27 WGSL files

In February 2026, Hugging Face shipped Transformers.js v4 — a C++ WebGPU runtime built with Microsoft's ONNX Runtime team — as the production answer for browser LLM inference. In April 2026, this repo shows that for Phi-3 Mini specifically, the answer is 10 kernel roles across 27 WGSL files and ~2,000 lines of TypeScript. No WebLLM, no TVM, no ONNX, no WASM runtime.

Open Chat → GitHub ↗
~40
tok/s on M2 Pro
10 / 27
kernel roles / WGSL files
3.8B
parameter model
0
ML framework deps
2.1 GB
weights, Q4F16
zero-tvm · phi-3-mini · browser
// No WebLLM. No TVM. No WASM runtime.
// 10 kernel roles · 27 WGSL files · raw WebGPU.

[local] ndarray-cache.json
[cache] params_shard_0.bin 49.3 MB
...
✓ All weights loaded · 2.1 GB · GPU ready

YOU What is the capital of Australia?

PHI-3 The capital of Australia is Canberra.
It was selected as the capital in 1908 as a
compromise between rivals Sydney and Melbourne.

88 tokens · ~40 tok/s (M2 Pro) · 100% local · zero server

How it works

Every step from prompt to streamed token — weights, tokenizer, attention, sampling — runs in the browser with no external calls during inference.

STEP 01
📦

Load weights

Fetches MLC Q4F16 shards from Hugging Face (or from the browser cache of a prior WebLLM session). Uploads directly to GPU buffers.
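The shard manifest can be navigated with a few lines of TypeScript. A minimal sketch, assuming field names that follow MLC's ndarray-cache.json layout (dataPath, byteOffset, nbytes) — treat them as assumptions, not this repo's exact API:

```typescript
// Locate one named tensor inside the MLC shard manifest so its byte
// range can be fetched and written straight into a GPU buffer.
interface TensorRecord { name: string; byteOffset: number; nbytes: number; }
interface ShardRecord { dataPath: string; records: TensorRecord[]; }

function locateTensor(
  shards: ShardRecord[],
  name: string,
): { dataPath: string; byteOffset: number; nbytes: number } | null {
  for (const shard of shards) {
    for (const t of shard.records) {
      if (t.name === name) {
        return { dataPath: shard.dataPath, byteOffset: t.byteOffset, nbytes: t.nbytes };
      }
    }
  }
  return null;
}
```

Once located, the range goes to the GPU with a single `device.queue.writeBuffer` call — no intermediate framework tensor.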

STEP 02
🔤

Tokenize

Hand-written BPE tokenizer reads tokenizer.json. Applies Metaspace pre-tokenization and Phi-3 chat template — no tokenizer WASM.
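The two pre-steps the card names can be sketched in plain TypeScript. The Metaspace rule (replace " " with "▁", prepend one) matches the Hugging Face tokenizers default, and the chat template string follows Phi-3's published format; both are assumptions about this repo's exact settings:

```typescript
// Metaspace pre-tokenization: spaces become U+2581 ("▁"), with one
// prepended, before the BPE merges run.
const META = "\u2581";

function metaspace(text: string): string {
  return META + text.replace(/ /g, META);
}

// Phi-3 chat template for a single user turn (hypothetical helper name).
function phi3ChatTemplate(userMsg: string): string {
  return `<|user|>\n${userMsg}<|end|>\n<|assistant|>\n`;
}
```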

STEP 03

GPU forward pass

32 transformer layers on the GPU. Embedding → RMSNorm → (QKV + RoPE + KV-append, fused) → Paged Attention → Fused FFN → LM Head. All WGSL, 228 dispatches per token.

STEP 04
🎯

Argmax + decode

GPU argmax over 32,064 logits selects the next token. BPE decode maps IDs back to text. Streams to the UI incrementally.
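What the argmax shader computes has a one-loop CPU reference — a sketch, not the WGSL itself:

```typescript
// CPU reference for the GPU argmax: index of the maximum logit,
// ties broken by the lower index.
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```

On the GPU the same result comes from a parallel reduction, so only the winning token ID crosses back to JavaScript.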

💬
Prompt
text
🔤
BPE Tokenizer
JS · no WASM
🔢
Embedding
shader 1
🔁
× 32 Layers
6 fused kernels
🎯
LM Head + Argmax
int4 + reduce
💬
Token
streamed

10 kernel roles. 27 WGSL files.

Every compute step is a readable WGSL file you can open and understand. No auto-generated code. No compiled blobs. The extra files are A/B variants — subgroup-reduced kernels, tiled matmuls, and an opt-in int8 KV cache path.

Q4F16 dequant f16 arithmetic paged KV cache RoPE RMSNorm SiLU
01 🔤

Embedding

Token ID lookup with Q4F16 dequantization. Group size 32, int4 packed as uint32.

embedding.wgsl
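The packing the card describes — 32 int4 weights per group, 8 nibbles per uint32 — has a short CPU reference. The zero-point of 8 is an assumption about the repo's Q4F16 scheme, not a confirmed detail:

```typescript
// Reference dequant for one group of 32 int4 weights packed into four
// uint32 words (8 nibbles per word), one f16 scale per group.
function dequantGroup(packed: Uint32Array, scale: number): Float32Array {
  const out = new Float32Array(packed.length * 8);
  for (let w = 0; w < packed.length; w++) {
    for (let n = 0; n < 8; n++) {
      const q = (packed[w] >>> (n * 4)) & 0xf; // extract one 4-bit value
      out[w * 8 + n] = (q - 8) * scale;        // assumed zero-point of 8
    }
  }
  return out;
}
```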
02 📐

RMSNorm

Root mean square layer normalization. Tree-reduce in workgroup shared memory.

rms_norm.wgsl
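The math the shader tree-reduces is small enough to state as a CPU reference; the eps of 1e-5 matches Phi-3's published config but is an assumption about this repo:

```typescript
// RMSNorm reference: out = x * weight / sqrt(mean(x^2) + eps).
function rmsNorm(x: Float32Array, weight: Float32Array, eps = 1e-5): Float32Array {
  let ss = 0;
  for (let i = 0; i < x.length; i++) ss += x[i] * x[i];
  const inv = 1 / Math.sqrt(ss / x.length + eps); // the reduced quantity
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inv * weight[i];
  return out;
}
```

The WGSL version computes the same `ss` with a workgroup tree-reduce instead of a serial loop.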
03 🌀

QKV + RoPE + KV-append (fused)

Q/K/V projection, rotary embeddings, and paged-KV write in a single dispatch per layer. TVM ships these as three separate kernels.

qkv_fused.wgsl
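The rotary step inside the fused kernel can be sketched for one head. The theta base of 10000 is standard for Phi-3; the (i, i + d/2) pairing convention is an assumption about the shader's layout:

```typescript
// Rotary position embedding for one head vector at position `pos`.
function ropeRotate(q: Float32Array, pos: number, base = 10000): Float32Array {
  const d = q.length;
  const half = d / 2;
  const out = new Float32Array(d);
  for (let i = 0; i < half; i++) {
    const theta = pos / Math.pow(base, (2 * i) / d);
    const c = Math.cos(theta), s = Math.sin(theta);
    out[i] = q[i] * c - q[i + half] * s;        // rotate the (i, i+half) pair
    out[i + half] = q[i] * s + q[i + half] * c;
  }
  return out;
}
```

Fusing this with the Q/K/V matmuls and the paged-KV write means the rotated K goes straight into its cache page, never touching an intermediate buffer.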
04 👁️

Paged Attention

Multi-head attention over the paged KV cache. Online softmax, scaled dot-product, page-table read folded in.

attention.wgsl
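Online softmax is the trick that lets the kernel stream over KV pages in one pass. A scalar CPU reference of the accumulation (a sketch of the technique, not the shader):

```typescript
// One-pass softmax-weighted sum: rescale the running numerator and
// denominator whenever a new running max appears, so no second pass
// over the scores is needed.
function onlineSoftmaxWeightedSum(scores: number[], values: number[]): number {
  let m = -Infinity, denom = 0, acc = 0;
  for (let i = 0; i < scores.length; i++) {
    const mNew = Math.max(m, scores[i]);
    const rescale = m === -Infinity ? 0 : Math.exp(m - mNew);
    const p = Math.exp(scores[i] - mNew);
    denom = denom * rescale + p;
    acc = acc * rescale + p * values[i];
    m = mNew;
  }
  return acc / denom;
}
```

In the real kernel `values[i]` is a V row and the page-table lookup decides which KV page each `i` reads from.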
05 ✖️

Int4 Matmul (O-proj)

Q4F16 output projection from attention. Same kernel family powers FFN-down and LM-head. Eight tile/subgroup variants for GPU portability.

int4_matmul*.wgsl
06

Add + RMSNorm (fused)

Residual add and normalize in one pass. Ping-pong buffers avoid read/write aliasing hazards on the GPU.

add_norm.wgsl
07

Fused FFN (Gate · Up · SiLU)

Gate and up projections with SiLU activation fused. Q4F16, 8192-dim intermediate. Tiled-subgroup variant available via ?ffnsg=1.

fused_ffn.wgsl
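The elementwise heart of the fusion is `silu(gate) * up`. A CPU reference of just that step (the shader also computes both Q4F16 matmuls in the same dispatch):

```typescript
// Gated SiLU activation: out = silu(gate) * up, silu(x) = x * sigmoid(x).
function siluGate(gate: Float32Array, up: Float32Array): Float32Array {
  const out = new Float32Array(gate.length);
  for (let i = 0; i < gate.length; i++) {
    const sig = 1 / (1 + Math.exp(-gate[i])); // sigmoid
    out[i] = gate[i] * sig * up[i];
  }
  return out;
}
```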
08 🎯

LM Head + Argmax

Projects hidden state to 32,064 vocab logits (int4 matmul), then GPU argmax selects the next token. No CPU roundtrip.

argmax.wgsl + int4_matmul.wgsl
09 🧪

Subgroup variants

Subgroup-reduced versions of attention, QKV, argmax, and matmul — same math, one hardware-accelerated reduction step instead of workgroup shared memory.

*_sg.wgsl (6 files)
10 💾

Int8 KV cache (opt-in)

Per-page int8 quantization of K and V with an attention kernel that dequantizes on read. Halves KV memory at a small accuracy cost. Gate with ?kv8=1.

kv_quantize_int8 + attention_int8
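The per-page scheme above can be sketched as a round-trip. The symmetric absmax scale (one per page) is an assumption about the shader; the halved memory and bounded rounding error follow directly from it:

```typescript
// Per-page symmetric int8 quantization: one scale per KV page,
// chosen from the page's max |value|; dequantized on read.
function quantizePage(vals: Float32Array): { q: Int8Array; scale: number } {
  let absMax = 0;
  for (const v of vals) absMax = Math.max(absMax, Math.abs(v));
  const scale = absMax / 127 || 1; // guard against an all-zero page
  const q = new Int8Array(vals.length);
  for (let i = 0; i < vals.length; i++) q[i] = Math.round(vals[i] / scale);
  return { q, scale };
}

function dequantPage(q: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = q[i] * scale;
  return out;
}
```

Each element round-trips to within half a quantization step, which is the "small accuracy cost" the card refers to.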

vs WebLLM · vs Transformers.js v4

Similar goal, very different philosophy. We trade peak speed for full transparency and zero dependencies.

                                     | Zero TVM (ours)                         | TVM / WebLLM                | Transformers.js v4 + ORT-WebGPU
Unique WGSL kernels                  | 10 roles / 27 files                     | 85 captured                 | ORT-generated (not user-visible)
WGSL LOC                             | 3,078 hand-written (incl. A/B variants) | 12,962 (compiler-generated) | 0 (runtime-generated)
Dispatches / decode token            | 228 (f16 KV) / 260 (int8 KV)            | 342 (measured)              | not measured here
Runtime language                     | TypeScript                              | Apache TVM WASM             | C++ via ONNX Runtime
JS bundle (chat page, excl. weights) | 157 kB / 33 kB gz                       | 5.9 MB / 2.1 MB gz          | ~10 MB JS + native
Decode speed (Phi-3, M2 Pro)         | ~40 tok/s                               | ~51 tok/s                   | not measured here
Model scope                          | Phi-3 Mini Q4F16_1 only                 | ~20 MLC-compiled models     | ~200 architectures
Readable source                      | ✓ 27 WGSL files                         | ✗ Compiled blob             | ✗ Generated ops
Designed for                         | "Readable in one sitting"               | General production          | General production

Transformers.js v4 is the right choice for most production work. Zero-TVM is not trying to replace it. The point of the contrast: when a stack can assume a single model and precision, most of the compiler and runtime surface is optional.

What this repo is not

  • Not a compiler. Nothing here emits WGSL — the 27 shader files are human-written.
  • Not cross-architecture. Hard-coded to Phi-3 Mini with Q4F16_1 weights. No Llama / Qwen / Gemma path without shader work.
  • Not cross-runtime. Browser WebGPU only. No Node / Bun / Deno target.
  • Not a production replacement for Transformers.js v4. It does fewer things, on purpose.
  • Not faster than WebLLM on Apple Silicon. ~22% behind on M2 Pro (~40 vs ~51 tok/s). Closing that gap is work-in-progress, not a shipped claim.

It is a minimal-surface-area demonstration that, for one fixed model shape, compiler and runtime complexity are not required for end-to-end browser LLM inference.


Every page is a milestone of the argument.

The repo is organized so each HTML file demonstrates one claim. Open any of them to inspect the stack at a different layer.



Open the chat. No install. No account.

Phi-3-mini runs entirely on your GPU. Your conversation never leaves your browser.
