Cactus Releases Needle: A 26M Distilled Model for On-Device Tool Calling

Cactus Compute has released Needle, an open-source 26-million-parameter model distilled from Google’s Gemini 3.1 Flash Lite for single-shot function calling. Released under the MIT license, Needle quantizes to a 14 MB INT4 footprint and is designed to run AI agents entirely on phones, smartwatches, and other consumer devices — a sharp contrast to the cloud round-trips that power most tool-calling assistants today.


Figure: Needle project banner from Cactus Compute. Image credit: Cactus Compute / Needle GitHub.

An Attention-Only Architecture

Needle is built on what Cactus calls a Simple Attention Network (SAN): a transformer-style model with the feed-forward (MLP) layers removed entirely. The encoder–decoder design uses 12 encoder layers without FFNs and 8 decoder layers, each pairing masked self-attention with cross-attention over the encoder output. The hidden dimension is 512, with 8 attention heads sharing 4 key-value heads (a grouped-query layout), an 8,192-entry BPE vocabulary, RoPE positional encoding, and shared embedding weights between the encoder and the output projection.
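As a sanity check, the published shape can be turned into a rough parameter count. The sketch below is not Cactus's own accounting. It assumes a head dimension of 512 / 8 = 64, no bias terms, cross-attention that uses the same 4 key-value heads as self-attention, and it ignores norm and gate parameters. Under those assumptions the total lands close to the quoted 26 million.

```python
# Back-of-the-envelope parameter count from the figures quoted in the article.
# Assumptions (not confirmed by the source): head_dim = d_model / n_heads,
# no biases, cross-attention shares the grouped K/V layout, and norm/gate
# parameters are small enough to ignore.

D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENC_LAYERS = 12
DEC_LAYERS = 8


def attention_params(d_model, n_heads, n_kv_heads):
    """Weights in one attention sublayer: Q, K, V and output projections."""
    head_dim = d_model // n_heads
    q_and_out = 2 * d_model * d_model              # query + output projections
    k_and_v = 2 * d_model * n_kv_heads * head_dim  # narrower grouped K/V projections
    return q_and_out + k_and_v


per_block = attention_params(D_MODEL, N_HEADS, N_KV_HEADS)
encoder = ENC_LAYERS * per_block        # one self-attention sublayer per layer
decoder = DEC_LAYERS * 2 * per_block    # masked self-attention + cross-attention
embedding = VOCAB * D_MODEL             # counted once because the weights are shared
total = encoder + decoder + embedding
int4_bytes = total // 2                 # 4 bits per weight, before scales/metadata

print(per_block)    # 786432
print(total)        # 26214400
print(int4_bytes)   # 13107200
```

The raw INT4 weights come to about 13.1 MB under these assumptions. The gap to the reported 14 MB would plausibly be quantization scales and file metadata, though the article does not break that down.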

The team’s argument for dropping MLPs is task-specific: single-shot tool calling is a “retrieval-and-assembly” problem that consists of matching a user query against a list of tool definitions, extracting arguments, and emitting JSON. Softmax attention is already a non-linear routing primitive, and at under 50M parameters the FFN budget contributes less than additional attention layers would. Removing MLPs eliminates roughly two-thirds of the per-layer parameter count of a comparable transformer (the share a standard 4x-wide FFN occupies) and cuts inference latency on edge hardware.
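The two-thirds figure follows from the usual transformer block layout. The sketch below assumes standard multi-head attention with full-width Q, K, V and output projections, a two-matrix FFN with a 4x expansion, and no biases; embeddings are excluded. A gated FFN variant or grouped key-value heads would shift the ratio somewhat.

```python
from fractions import Fraction


def ffn_share(d_model, ffn_multiplier=4):
    """Fraction of one transformer block's weights that sit in the FFN."""
    attention = 4 * d_model * d_model             # Q, K, V and output projections
    ffn = 2 * ffn_multiplier * d_model * d_model  # up-projection + down-projection
    return Fraction(ffn, attention + ffn)


share = ffn_share(512)
print(share)  # 2/3
```

The ratio does not depend on the hidden size, since both terms scale with the square of `d_model`.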

Other architectural choices reinforce the small-model recipe: gated residuals (x + sigmoid(gate) · Attn(Norm(x)) with the gate initialized to zero), ZCRMSNorm applied to QK heads for training stability, a contrastive CLIP-style tool selection head for filtering relevant tools from larger sets, the Muon optimizer with an orthogonality constraint on linear projections to prevent representation collapse, and INT4 quantization-aware training applied as regularization noise every 100 steps.
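The gated residual formula can be sketched in a few lines of plain Python. This is an illustration of the quoted expression only, not Needle's implementation: the gate is treated as a single scalar (it may well be per-channel), the norm is a plain RMS norm without a learned scale, and `toy_attention` is a hypothetical stand-in for a real attention sublayer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rms_norm(x, eps=1e-6):
    """Plain RMS normalization without a learned scale (a simplification)."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms for v in x]


def gated_residual(x, sublayer, gate=0.0):
    """Return x + sigmoid(gate) * sublayer(norm(x)) for one token vector."""
    update = sublayer(rms_norm(x))
    g = sigmoid(gate)
    return [xi + g * ui for xi, ui in zip(x, update)]


def toy_attention(h):
    # Hypothetical stand-in for an attention sublayer: reverses the vector.
    return list(reversed(h))


x = [3.0, 4.0]
print(sigmoid(0.0))                                  # 0.5
print(gated_residual(x, toy_attention, gate=0.0))    # sublayer mixed in at half weight
print(gated_residual(x, toy_attention, gate=-20.0))  # gate nearly closed, output ~ x
```

One consequence worth noting: if the raw gate starts at zero, sigmoid(0) = 0.5, so each sublayer contributes at half weight at initialization rather than being switched off.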

Figure: Stylized diagram of an encoder-decoder neural network with cross-attention arcs and no feed-forward blocks (illustration generated by AI).

Training Recipe and Benchmarks

Pretraining ran on 16 TPU v6e chips for 27 hours, consuming 200 billion tokens. Post-training on a 2-billion-token synthetic function-call dataset took only 45 minutes. The fine-tuning data was generated by Gemini across 15 categories (timers, messaging, navigation, smart-home control, and similar on-device assistant tasks), making this a textbook example of using a frontier model as a data engine rather than as a runtime dependency.
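The training figures imply some useful derived numbers. The arithmetic below uses only the values quoted above plus the 26-million parameter count. It assumes the 27 hours are wall-clock time on all 16 chips, and it does not compute a per-chip rate for post-training because the article does not say what hardware that stage used.

```python
# Derived training statistics from the figures quoted in the article.
PARAMS = 26e6
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
CHIPS = 16
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45

chip_hours = CHIPS * PRETRAIN_HOURS
pretrain_tok_per_s = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
pretrain_tok_per_s_per_chip = pretrain_tok_per_s / CHIPS
posttrain_tok_per_s = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)
tokens_per_param = PRETRAIN_TOKENS / PARAMS

print(chip_hours)                          # 432
print(round(pretrain_tok_per_s))           # 2057613
print(round(pretrain_tok_per_s_per_chip))  # 128601
print(round(posttrain_tok_per_s))          # 740741
print(round(tokens_per_param))             # 7692
```

Roughly 7,700 pretraining tokens per parameter is a very high ratio, which fits the picture of a tiny model trained for a long time on a narrow task.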

On single-shot function calling, Cactus reports that Needle outperforms FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM 2.5-350M, all of them roughly 10 to 23 times larger by parameter count. On Cactus’s own runtime, the model hits 6,000 tokens/sec prefill and 1,200 tokens/sec decode. The team is careful to position Needle as a specialist: those larger competitors retain broader scope for conversational use, while Needle is optimized narrowly for the retrieve-arguments-and-emit-JSON loop.
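The reported throughput translates into a rough per-request latency. The sketch below is a simple additive model that ignores startup, tokenization, and tool-execution time. The token counts in the example (a 300-token prompt holding the query and tool definitions, and a 40-token JSON reply) are illustrative assumptions, and the article does not say which device the throughput figures were measured on.

```python
# Rough latency model from the reported throughput figures.
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def estimate_latency_ms(prompt_tokens, output_tokens):
    """Time to process the prompt plus time to generate the reply, in ms."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return 1000 * (prefill_s + decode_s)


# Hypothetical request: 300 prompt tokens in, 40 tokens of JSON out.
print(round(estimate_latency_ms(300, 40), 1))  # 83.3
```

Under these assumptions the whole call finishes in well under a tenth of a second, with the prompt side and the generation side contributing comparable amounts.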

Why It Matters

Needle is interesting on two axes. First, it pushes back on the assumption that meaningful agentic behavior requires hundreds of millions of parameters or a cloud connection. A 14 MB on-device model that can reliably parse “what’s the weather in San Francisco?” into a structured tool call opens the door to genuinely local assistants on watches and glasses, with the privacy and latency properties that implies.
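To make "structured tool call" concrete, here is a minimal sketch of what an application might do with the model's output for the weather query. Everything in it is hypothetical: the `get_weather` tool, its `location` argument, the JSON shape, and the `validate_tool_call` helper are illustrative and are not taken from Needle's actual schema or API.

```python
import json

# Hypothetical tool definition; not Needle's actual schema format.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"location": str},
    "required": ["location"],
}


def validate_tool_call(raw_output, tools):
    """Parse model output as JSON and check it against the known tools."""
    call = json.loads(raw_output)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = next((t for t in tools if t["name"] == call.get("name")), None)
    if tool is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    missing = [key for key in tool["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    for key, value in args.items():
        expected = tool["parameters"].get(key)
        if expected is None:
            raise ValueError(f"unexpected argument: {key!r}")
        if not isinstance(value, expected):
            raise TypeError(f"argument {key!r} should be {expected.__name__}")
    return call["name"], args


# Assumed model output for "what's the weather in San Francisco?"
raw = '{"name": "get_weather", "arguments": {"location": "San Francisco"}}'
name, args = validate_tool_call(raw, [WEATHER_TOOL])
print(name, args)
```

Validating on the device before dispatching the call is a cheap guard, since a small specialist model can still emit a malformed or mismatched call.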

Second, the project illustrates a clean separation between training-time and inference-time use of frontier models. Cactus used Gemini to synthesize a domain-specific dataset, then deployed a tiny open model — the API output served as the training signal, not a production dependency. That pattern is increasingly common, and it sits squarely inside the broader debate about distillation, attribution, and what frontier-model providers’ terms of service should permit (see related coverage below).

Weights are available on Hugging Face, the training and data-generation pipeline is on GitHub, and a local playground (needle playground) lets developers fine-tune the model on their own tool schemas via a web UI. In-context learning is not currently supported but is on the roadmap.

Related Coverage