SupraLabs Ships a 30M-Parameter Any-to-Any Model

SupraLabs has released Supra-A2A-Nano-Exp, an experimental ~30-million-parameter “any-to-any” model that handles text, images, and video in a single autoregressive transformer — small enough to run on a laptop CPU. Released under Apache 2.0 by the independent research group behind a string of trending ultra-small models, it is framed not as a capable generator but as a deliberately transparent, hackable reference architecture for unified multimodal tokenization.

Figure: Visualization of a single token stream blending teal text tokens and magenta image tokens flowing through a compact four-layer transformer block (illustration generated by AI).

At roughly 30M parameters, Supra-A2A-Nano-Exp is about 230 times smaller than a “small” 7B model. Its appeal is not output quality — the authors are blunt that you should “not expect coherent long-form text or photorealistic images” — but the clarity of its design. It collapses text and visual generation into one token stream, one vocabulary, and one output head, with no separate vision encoder. For anyone trying to understand how modern omni-modal models actually work, that minimalism is the point.

What “A2A” Means Here

The “A2A” in this model’s name is easy to misread. In the broader agent ecosystem, A2A usually refers to Agent2Agent, the open protocol introduced by Google (now under the Linux Foundation) that lets independent AI agents discover and delegate tasks to one another. This model has nothing to do with that. Here, A2A means “any-to-any”: a single network that can take text, an image, or a video as input and produce text, an image, or a video as output. The naming collision is unfortunate, but the architecture is purely a multimodal sequence model, not an agent-communication system.

Architecture

The model is intentionally tiny and legible. The breakdown, from the model card:

  • GPT backbone: 4 transformer blocks, pre-norm, fused QKV attention, causal masking
  • Embedding dimension: 256, with 4 attention heads (64 dims each)
  • MLP: 4× expansion (256 → 1024 → 256), GELU activation
  • Context length: 384 tokens
  • Parameters: ~29.7M for the GPT plus ~0.22M for the VQ-VAE, all in fp32
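
A back-of-the-envelope count shows where those parameters sit. The sketch below assumes learned positional embeddings, bias terms on the attention and MLP layers, and an output head that is not weight-tied to the embedding table. None of these details is confirmed by the model card, so treat the result as an estimate rather than a reproduction of the reported ~29.7M.

```python
VOCAB_SIZE = 50_520      # shared text + visual vocabulary
D_MODEL = 256            # embedding dimension
N_LAYERS = 4             # transformer blocks
D_MLP = 4 * D_MODEL      # 4x MLP expansion
CONTEXT_LENGTH = 384


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(dim):
    """Scale and shift vectors."""
    return 2 * dim


def block_params():
    """One pre-norm block: two layer norms, fused QKV attention, MLP."""
    attention = linear(D_MODEL, 3 * D_MODEL) + linear(D_MODEL, D_MODEL)
    mlp = linear(D_MODEL, D_MLP) + linear(D_MLP, D_MODEL)
    return 2 * layer_norm(D_MODEL) + attention + mlp


token_embedding = VOCAB_SIZE * D_MODEL
position_embedding = CONTEXT_LENGTH * D_MODEL           # ASSUMPTION: learned positions
output_head = linear(D_MODEL, VOCAB_SIZE, bias=False)   # ASSUMPTION: untied, no bias

total = (token_embedding + position_embedding
         + N_LAYERS * block_params()
         + layer_norm(D_MODEL) + output_head)
vocab_share = (token_embedding + output_head) / total

print(f"per block:        {block_params():,}")
print(f"estimated total:  {total:,}")
print(f"vocabulary share: {vocab_share:.0%}")
```

Under these assumptions the estimate comes out a little over 29M, slightly below the model card's figure, and most of it sits in the embedding table and output head rather than in the four transformer blocks. The remaining gap likely comes from details this sketch does not model.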

The trick that makes a single model handle multiple modalities is a shared vocabulary of 50,520 tokens: 50,264 GPT-2-style BPE text tokens (a count consistent with GPT-2’s 50,257-entry vocabulary plus the seven control tokens) and 256 visual codes. Images are discretized into those 256 “visual words” by a small 3-layer convolutional VQ-VAE that downsamples by a factor of 8, so a 64×64 image becomes an 8×8 grid of 64 visual tokens. Because text IDs and image codes live in the same embedding table and the same output head, cross-modal attention needs no extra machinery: the transformer handles the next token the same way whether it is a word piece or a visual code.
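
The shared-vocabulary arithmetic can be illustrated in a few lines. This sketch assumes the 256 visual codes are appended directly after the 50,264 text IDs; the article gives the sizes but not the ordering, and the helper names are hypothetical.

```python
TEXT_VOCAB = 50_264        # BPE text tokens, control tokens included
NUM_VISUAL_CODES = 256     # VQ-VAE codebook size
VOCAB_SIZE = TEXT_VOCAB + NUM_VISUAL_CODES
DOWNSAMPLE = 8             # VQ-VAE spatial downsampling factor
CONTEXT_LENGTH = 384


def visual_code_to_token_id(code: int) -> int:
    """Map a codebook index to an ID in the shared vocabulary."""
    if not 0 <= code < NUM_VISUAL_CODES:
        raise ValueError(f"visual code out of range: {code}")
    return TEXT_VOCAB + code   # ASSUMPTION: visual codes follow the text IDs


def token_id_to_visual_code(token_id: int) -> int:
    """Inverse mapping, valid only for IDs in the visual range."""
    if not is_visual_token(token_id):
        raise ValueError(f"not a visual token: {token_id}")
    return token_id - TEXT_VOCAB


def is_visual_token(token_id: int) -> bool:
    return TEXT_VOCAB <= token_id < VOCAB_SIZE


def visual_tokens_per_image(side: int) -> int:
    """Number of visual tokens for a square image of the given side length."""
    if side <= 0 or side % DOWNSAMPLE != 0:
        raise ValueError("side must be a positive multiple of 8")
    grid = side // DOWNSAMPLE
    return grid * grid


print(VOCAB_SIZE)
print(visual_tokens_per_image(64))
print(CONTEXT_LENGTH // visual_tokens_per_image(64))
```

Under these assumptions a 64×64 image costs 64 of the 384 context positions, so at most six such grids fit before any text or control tokens are counted, and a single 128×128 image would need 256 positions.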

How It Works

Modality boundaries are marked with control tokens embedded directly in the sequence, including <TEXT>, <IMAGE>, <VIDEO>, and <FRAME>; the text-to-image command below also uses a closing </TEXT>. To generate an image from a prompt, you feed the model a string and let it autoregress into visual codes, which the VQ-VAE decoder turns back into pixels. The repo ships a self-contained inference script:

# text completion
python run_supra_a2a.py --mode text --prompt "Once upon a time"

# text to image
python run_supra_a2a.py --mode text2image --prompt "<TEXT>a red square</TEXT><IMAGE>"

# image to text
python run_supra_a2a.py --mode image2text --image photo.png

# video to video
python run_supra_a2a.py --mode video2video --image clip.gif --frames 4
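
Scripting the text-to-image call from Python might look like the sketch below. It reuses only the script name and the --mode and --prompt flags shown in the commands above; the helper names are hypothetical and not part of the repo.

```python
import shlex
from typing import List

SCRIPT = "run_supra_a2a.py"  # script name as shown in the commands above


def build_text2image_prompt(description: str) -> str:
    """Wrap a description in the control tokens used by the text2image example."""
    return f"<TEXT>{description}</TEXT><IMAGE>"


def build_text2image_command(description: str) -> List[str]:
    """Build an argument list suitable for subprocess.run (not executed here)."""
    return [
        "python", SCRIPT,
        "--mode", "text2image",
        "--prompt", build_text2image_prompt(description),
    ]


command = build_text2image_command("a red square")
print(shlex.join(command))  # shell-quoted form, for logging or copy-paste
```

Passing the list form to subprocess.run keeps the angle brackets in the prompt away from the shell, so no extra quoting is needed.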

Sampling supports temperature, top-k, and modality-constrained decoding (so the model only emits valid visual codes inside an image block). The caveats are equally instructive: images must be square with side lengths that are multiples of 8, there is no instruction tuning or RLHF — just pure next-token training — and a few hyperparameters (attention head count, decoder activation) can’t be recovered from the weights alone and default to documented values.
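
The three sampling controls can be combined in a few lines. This is a minimal sketch, not the repo's implementation: it assumes the visual codes occupy the last 256 IDs of the 50,520-entry vocabulary, and sample_visual_token is a hypothetical name.

```python
import math
import random

TEXT_VOCAB = 50_264        # ASSUMPTION: visual codes start right after text IDs
NUM_VISUAL_CODES = 256


def sample_visual_token(logits, temperature=1.0, top_k=50, rng=random):
    """Sample one token ID, restricted to the visual-code range."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Modality constraint: only visual-code IDs are candidates,
    # so text tokens can never be emitted inside an image block.
    lo, hi = TEXT_VOCAB, TEXT_VOCAB + NUM_VISUAL_CODES
    candidates = [(logits[i] / temperature, i) for i in range(lo, hi)]
    # Top-k: keep only the k highest-scoring candidates.
    candidates.sort(reverse=True)
    candidates = candidates[:max(1, top_k)]
    # Softmax weights over the survivors (subtract the max for stability).
    peak = candidates[0][0]
    weights = [math.exp(score - peak) for score, _ in candidates]
    ids = [token_id for _, token_id in candidates]
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy logits: a text token has the highest score overall,
# but the constraint keeps it out of the candidate set.
logits = [0.0] * (TEXT_VOCAB + NUM_VISUAL_CODES)
logits[100] = 1000.0
logits[TEXT_VOCAB + 7] = 5.0
print(sample_visual_token(logits, top_k=1))
```

Lower temperatures sharpen the distribution toward the top-scoring code, and top_k=1 reduces the procedure to greedy decoding within the visual range.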

What This Means

Supra-A2A-Nano-Exp sits at the intersection of two trends RITS has been tracking: the rise of unified omni-modal architectures and the surprising momentum behind ultra-small models. SupraLabs’ earlier 50M-parameter instruct model briefly topped Hugging Face’s trending list despite fitting in under 250 MB and running on a Raspberry Pi. The pitch for that family is “localized swarm intelligence” — many tiny specialized models running locally for tasks like intent parsing, PII redaction, and query routing, where privacy and latency matter more than raw capability.

This Nano model extends that ethos to multimodality. It will not replace a frontier image generator, and it is not meant to. Its value is pedagogical: a working, end-to-end demonstration that you can fold vision and language into one autoregressive stream on consumer hardware, with every moving part small enough to read in an afternoon. For students and researchers, that transparency is often worth more than another point of benchmark performance.
