Local inference · memory & throughput

Will it run?

Pick your machine and a model. The meter shows exactly how the memory budget is spent — weights, KV cache, and overhead — against what your hardware can actually hand to the model. Throughput is estimated from memory bandwidth.

Your machine

Preset

Memory (GB)

Bandwidth (GB/s)

Memory type

The model

Preset Context length 8K Detail readout for

—

tokens / sec · generation

Memory allocation

Weights — KV cache — Overhead — Usable —

Quant	Weights	Total @ ctx	tok/s	Fit

How to read it. Green fits comfortably in usable memory; amber is tight (within ~92% of usable — expect slowdowns and little headroom for longer context); red won't fit on the GPU/unified budget alone. For discrete GPUs, a red model can still run by spilling layers into system RAM, but throughput collapses — that's shown as a partial-offload estimate.

Assumptions & math

Weights = params × effective-bits ÷ 8. Effective bits use GGUF k-quant averages (Q4_K_M ≈ 4.83 bpw), so numbers track real downloaded file sizes, not the naive bit count.

KV cache = 2 × layers × kv_heads × head_dim × context × 2 bytes (fp16), computed from each model's real architecture. Custom models use a GQA-based estimate. Quantizing the KV cache to 8-bit roughly halves this.

Overhead = compute/activation buffer ≈ 0.8 GB + 5% of weights.

Usable memory — unified: 75% of total (macOS GPU cap; raisable via iogpu.wired_limit_mb); VRAM: 92%; system RAM: 80%.

Throughput ≈ bandwidth ÷ active-weights × efficiency (~0.55 GPU/unified, ~0.4 CPU). This is generation speed and a ballpark — real numbers vary with engine, batch, and prompt length. MoE models use active (not total) params for speed.