On March 31, 2026, PrismML emerged from stealth to announce 1-bit Bonsai — a family of open-weight language models the company calls the first commercially viable 1-bit LLMs. The 8B flagship fits in just 1.15 GB of memory, runs 8× faster than a standard FP16 8B model, and matches its benchmark performance. Backed by $16.25M from Khosla Ventures and built on years of mathematical research at Caltech, Bonsai makes a credible claim to being a turning point for efficient AI.
In a standard language model, each weight is stored as a 16-bit or 32-bit floating-point number. In Bonsai, every weight is a single binary value — 0 or 1 — mapped to −scale or +scale, where each group of 128 weights shares one FP16 scale factor. The effective storage is just 1.125 bits per weight. Critically, this is not post-training quantization applied to an existing full-precision model; it is a native 1-bit architecture trained from scratch on Google TPU v4 hardware.
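The 1.125-bits-per-weight figure follows directly from the grouping scheme: one sign bit per weight plus 16 bits of FP16 scale shared across 128 weights. A minimal sketch of how such grouped binary weights could be packed and decoded (the function names and layout here are illustrative assumptions, not PrismML's actual on-disk format):

```python
import numpy as np

GROUP = 128  # weights sharing one FP16 scale, per the article

def pack_weights(w):
    """Pack full-precision weights into sign bits plus one FP16 scale per group."""
    w = w.reshape(-1, GROUP)
    scales = np.abs(w).mean(axis=1).astype(np.float16)  # one scale per 128 weights
    bits = (w >= 0).astype(np.uint8)                    # keep only the sign
    return np.packbits(bits, axis=1), scales            # 128 bits -> 16 bytes

def unpack_weights(packed, scales):
    """Decode back to the ±scale values used at inference time."""
    bits = np.unpackbits(packed, axis=1)[:, :GROUP]
    signs = bits.astype(np.float32) * 2 - 1             # {0, 1} -> {-1, +1}
    return signs * scales[:, None].astype(np.float32)

# Storage check: 16 bytes of sign bits + 2 bytes of scale per 128 weights
bits_per_weight = (16 * 8 + 16) / GROUP  # = 1.125, matching the article
```

Only the signs survive packing; the per-group scale preserves the overall magnitude of each block of weights.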
The practical consequence for inference is significant: the matrix multiplications that dominate transformer compute collapse into simple additions, since multiplying by ±1 requires no FPU. This is why the energy and speed gains are so large. PrismML CEO Babak Hassibi, a Caltech professor, put it plainly: “We spent years developing the mathematical theory required to compress a neural network without losing its reasoning capabilities.”
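To make the multiply-free claim concrete, here is a sketch of a binary matrix-vector product: where a weight's sign bit is 1 the activation is added, where it is 0 the activation is subtracted, and a single scale is applied afterward. For simplicity this uses one scale per output row rather than per 128-weight group, and it is an illustration of the idea, not PrismML's kernel:

```python
import numpy as np

def binary_matvec(sign_bits, scales, x):
    """Matvec with ±scale weights: activations are added or subtracted
    according to the sign bit; no per-weight multiplications occur."""
    # sign_bits: (out_features, in_features) in {0, 1}; scales: (out_features,)
    acc = np.where(sign_bits == 1, x, -x).sum(axis=1)  # adds/subtracts only
    return acc * scales                                # one multiply per row

# Sanity check against the equivalent full-precision multiply
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(4, 8))
scales = rng.random(4).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
w = (bits * 2 - 1) * scales[:, None]  # reconstructed ±scale weight matrix
assert np.allclose(binary_matvec(bits, scales, x), w @ x, atol=1e-5)
```

On real hardware the additions are further vectorized over packed bits, which is where the large speed and energy gains come from.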
PrismML released three models simultaneously, all under the Apache 2.0 license, available on Hugging Face in GGUF and MLX formats for immediate use with llama.cpp, Ollama, and Apple Silicon.
The headline claim is a ~14× reduction in memory footprint with no meaningful accuracy regression compared to full-precision 8B models. PrismML measures this with a custom metric they call Intelligence Density — performance per gigabyte — where Bonsai 8B scores 1.06/GB compared to 0.10/GB for Qwen3 8B.
Beyond memory, the energy story is compelling. On an RTX 4090, 1-bit Bonsai 8B consumes 0.276 mWh per token compared to 1.134 mWh for a standard 8B 16-bit model — a 4× reduction. On an M4 Pro, the absolute figures shrink but the relative savings grow to roughly 5.6× (0.074 vs. 0.415 mWh/token). The 1.7B model is the only variant tested on an iPhone 17 Pro Max, since the full 8B 16-bit model does not fit in phone memory at all.
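The relative savings can be checked directly from the quoted figures; the per-device factors below are arithmetic on the article's numbers, not additional measurements:

```python
# Energy-per-token figures quoted in the article (mWh/token), 8B models
rtx_4090 = {"bonsai_8b": 0.276, "fp16_8b": 1.134}
m4_pro = {"bonsai_8b": 0.074, "fp16_8b": 0.415}

for name, d in (("RTX 4090", rtx_4090), ("M4 Pro", m4_pro)):
    factor = d["fp16_8b"] / d["bonsai_8b"]
    tokens_per_wh = 1000 / d["bonsai_8b"]  # tokens per watt-hour for Bonsai
    print(f"{name}: {factor:.1f}x less energy; ~{tokens_per_wh:,.0f} tokens/Wh")
```

This works out to about 4.1× on the RTX 4090 and about 5.6× on the M4 Pro, where Bonsai 8B generates on the order of 13,500 tokens per watt-hour.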
This arithmetic opens up hardware markets previously inaccessible to 8B-class reasoning: microcontrollers, edge devices, and battery-constrained mobile deployments. Amir Salek, founder of Google’s TPU program and an investor in PrismML, called it “a fundamental change in the power-to-compute equation.”
Microsoft’s BitNet series has been pursuing 1-bit LLMs since 2023, but prior efforts consistently fell short of matching full-precision models at practical scales. What’s different with Bonsai?
PrismML argues three things close the gap: (1) a native 1-bit training regime rather than quantization after the fact, grounded in new mathematical theory from Caltech; (2) sufficient scale — 8B parameters appears to be large enough that the 1-bit constraint no longer causes meaningful accuracy degradation; and (3) standard deployment formats (GGUF, MLX) that let the models slot directly into existing inference stacks without custom infrastructure.
Independent community benchmarking is still forthcoming — the models were released just days ago. The “Intelligence Density” metric is also PrismML’s own framing, and replication of raw benchmark numbers against established leaderboards will be the real test. That said, the combination of open weights, permissive licensing, and credible academic backing makes Bonsai worth watching closely.
