On March 10, 2026, Google DeepMind released Gemini Embedding 2 — the company’s first natively multimodal embedding model. Unlike traditional embedding models that handle only text, Gemini Embedding 2 maps text, images, video, audio, and documents into a single unified vector space, enabling cross-modal search, classification, and clustering across over 100 languages.
Embeddings convert data into numerical vectors that capture semantic meaning. Two pieces of content with similar meaning land close together in vector space, making embeddings the backbone of modern search, recommendation, and retrieval-augmented generation (RAG) systems. Until now, most production embedding models handled text only — requiring separate pipelines for images, audio, or video. Gemini Embedding 2 eliminates that fragmentation by projecting all five modalities into one shared space.
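The "close together in vector space" idea can be sketched with cosine similarity. The three toy vectors below are hand-made stand-ins, not real Gemini Embedding 2 outputs; real embeddings would have thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made stand-ins for embeddings of three pieces of content.
dog_photo   = np.array([0.9, 0.1, 0.2])   # pretend image embedding
dog_caption = np.array([0.8, 0.2, 0.3])   # pretend text embedding
invoice_pdf = np.array([0.1, 0.9, 0.7])   # pretend document embedding

# Related content scores high; unrelated content scores low.
print(cosine_similarity(dog_photo, dog_caption))
print(cosine_similarity(dog_photo, invoice_pdf))
```

In a single shared space, the photo and its caption land near each other even though they arrived through different modalities; that is what makes cross-modal search a nearest-neighbor lookup rather than a pipeline problem.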
The model generates 3,072-dimensional vectors by default but supports flexible output via Matryoshka Representation Learning (MRL). MRL packs the most critical semantic information into the earliest dimensions of the vector, so developers can truncate to 1,536 or 768 dimensions with minimal accuracy loss — Google’s benchmarks show near-peak performance even at 768 dimensions.
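Truncation under MRL is mechanically simple: keep the leading dimensions, then re-normalize before doing cosine comparisons. The vector below is a random stand-in for a real 3,072-dimensional embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(size=3072)  # stand-in for a full-size embedding

def truncate(vec, dims):
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Each truncated vector is still unit length, so cosine similarity
# computed at 768 dims is directly comparable across the index.
for dims in (3072, 1536, 768):
    v = truncate(full, dims)
    print(dims, v.shape, float(np.linalg.norm(v)))
```

Because MRL front-loads the semantic signal, a 768-dimension slice retains most of the full vector's retrieval power; the accuracy numbers cited later in this article quantify how small the loss is.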
Input limits are generous across all five modalities.
Crucially, the model accepts interleaved multimodal input — you can pass an image and text together in a single request, and the model captures the combined semantic relationship rather than treating each modality independently. Developers can also specify a task_type parameter (e.g., RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CLASSIFICATION) to optimize the vector’s mathematical properties for specific operations.
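A request with a task_type hint might be assembled roughly as follows. This is a sketch, not confirmed API surface: the model identifier comes from the announcement, but the exact parameter names and the shape of interleaved multimodal content in the preview SDK are assumptions.

```python
def build_embed_request(content, task_type="RETRIEVAL_QUERY", dims=768):
    """Assemble parameters for an embedding call (hypothetical helper)."""
    return {
        "model": "gemini-embedding-2-preview",  # preview id from the announcement
        "contents": content,
        "config": {
            "task_type": task_type,             # e.g. RETRIEVAL_DOCUMENT, CLASSIFICATION
            "output_dimensionality": dims,      # MRL lets us ask for a shorter vector
        },
    }

req = build_embed_request("How do I renew my passport?")
print(req["config"]["task_type"])

# The actual call (requires an API key) would look roughly like:
# from google import genai
# client = genai.Client()
# resp = client.models.embed_content(**req)
```

Queries and documents get different task_type values on purpose: RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT shape the vectors so that short questions match long passages well.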
Gemini Embedding 2 leads the MTEB (Massive Text Embedding Benchmark) English leaderboard with a score of 68.32, a +5.81 point margin over competitors — a substantial gap in a field where improvements are often measured in fractions of a point. On code-specific retrieval (MTEB Code), it scores 74.66, and it leads the MMTEB multilingual benchmark by +5.09 points.
Dimension flexibility barely dents accuracy: at 2,048 dimensions the model scores 68.16; at 1,536 it scores 68.17; at 768 it still manages 67.99 — meaning teams can cut storage costs significantly with negligible retrieval quality loss.
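The storage side of that trade-off is easy to estimate. Assuming float32 vectors (4 bytes per dimension) and an illustrative 10-million-document corpus:

```python
def index_size_gb(num_vectors, dims, bytes_per_dim=4):
    """Raw vector storage in GiB, assuming float32 by default."""
    return num_vectors * dims * bytes_per_dim / 1024**3

corpus = 10_000_000  # hypothetical corpus size
for dims in (3072, 1536, 768):
    print(dims, round(index_size_gb(corpus, dims), 1), "GB")
```

Truncating from 3,072 to 768 dimensions cuts raw index storage by 75% (and speeds up distance computations proportionally) for roughly a third of a point of benchmark score.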
On video retrieval tasks (VATEX, MSR-VTT, YouCook2), the model outperforms all existing alternatives by a wide margin. Early adopters have reported a 70% latency reduction and 20% recall improvement over conventional multi-model pipelines that chain separate text, image, and video embedding systems.
Gemini Embedding 2 is available now in public preview as gemini-embedding-2-preview through both the Gemini API and Vertex AI. Pricing is set at $0.25 per million tokens with a free tier included. The model integrates with popular frameworks including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Google’s Vector Search.
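At the announced rate, back-of-the-envelope costing is straightforward. The corpus and token figures below are hypothetical; real token counts come from the API.

```python
PRICE_PER_MILLION_TOKENS = 0.25  # USD, announced preview pricing

def embedding_cost_usd(total_tokens):
    """Estimated cost to embed a corpus of the given total token count."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# e.g. 50,000 documents averaging 800 tokens each (assumed figures)
print(f"${embedding_cost_usd(50_000 * 800):.2f}")  # $10.00
```

Embedding is a one-time cost per document (plus per-query cost at search time), so even large corpora stay inexpensive relative to generation workloads.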
It’s worth noting the competitive landscape: open-source models like Alibaba’s Qwen3-8B (MTEB 70.2) and NVIDIA’s NV-Embed-v2 (MTEB 69.3) achieve comparable or superior scores on text-only benchmarks — but neither matches Gemini Embedding 2’s native multimodal breadth.
