On March 10, 2026, Google DeepMind released Gemini Embedding 2 — the company’s first natively multimodal embedding model. Unlike traditional embedding models that handle only text, Gemini Embedding 2 maps text, images, video, audio, and documents into a single unified vector space, enabling cross-modal search, classification, and clustering across over 100 languages.
Embeddings convert data into numerical vectors that capture semantic meaning. Two pieces of content with similar meaning land close together in vector space, making embeddings the backbone of modern search, recommendation, and retrieval-augmented generation (RAG) systems. Until now, most production embedding models handled text only — requiring separate pipelines for images, audio, or video. Gemini Embedding 2 eliminates that fragmentation by projecting all five modalities into one shared space.
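The "close together in vector space" idea can be sketched with cosine similarity. The three toy vectors below are hand-made stand-ins, not real Gemini Embedding 2 outputs; real embeddings would have thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made stand-ins for embeddings of three pieces of content.
dog_photo   = np.array([0.9, 0.1, 0.2])   # pretend image embedding
dog_caption = np.array([0.8, 0.2, 0.3])   # pretend text embedding
invoice_pdf = np.array([0.1, 0.9, 0.7])   # pretend document embedding

# Related content scores high; unrelated content scores low.
print(cosine_similarity(dog_photo, dog_caption))
print(cosine_similarity(dog_photo, invoice_pdf))
```

In a single shared space, the photo and its caption land near each other even though they arrived through different modalities; that is what makes cross-modal search a nearest-neighbor lookup rather than a pipeline problem.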
The model generates 3,072-dimensional vectors by default but supports flexible output via Matryoshka Representation Learning (MRL). MRL packs the most critical semantic information into the earliest dimensions of the vector, so developers can truncate to 1,536 or 768 dimensions with minimal accuracy loss — Google’s benchmarks show near-peak performance even at 768 dimensions.
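Truncation under MRL is mechanically simple: keep the leading dimensions, then re-normalize before doing cosine comparisons. The vector below is a random stand-in for a real 3,072-dimensional embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(size=3072)  # stand-in for a full-size embedding

def truncate(vec, dims):
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Each truncated vector is still unit length, so cosine similarity
# computed at 768 dims is directly comparable across the index.
for dims in (3072, 1536, 768):
    v = truncate(full, dims)
    print(dims, v.shape, float(np.linalg.norm(v)))
```

Because MRL front-loads the semantic signal, a 768-dimension slice retains most of the full vector's retrieval power; the accuracy numbers cited later in this article quantify how small the loss is.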
Input limits are generous across all five modalities.
Crucially, the model accepts interleaved multimodal input — you can pass an image and text together in a single request, and the model captures the combined semantic relationship rather than treating each modality independently. Developers can also specify a task_type parameter (e.g., RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CLASSIFICATION) to optimize the vector’s mathematical properties for specific operations.
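A request with a task_type hint might be assembled roughly as follows. This is a sketch, not confirmed API surface: the model identifier comes from the announcement, but the exact parameter names and the shape of interleaved multimodal content in the preview SDK are assumptions.

```python
def build_embed_request(content, task_type="RETRIEVAL_QUERY", dims=768):
    """Assemble parameters for an embedding call (hypothetical helper)."""
    return {
        "model": "gemini-embedding-2-preview",  # preview id from the announcement
        "contents": content,
        "config": {
            "task_type": task_type,             # e.g. RETRIEVAL_DOCUMENT, CLASSIFICATION
            "output_dimensionality": dims,      # MRL lets us ask for a shorter vector
        },
    }

req = build_embed_request("How do I renew my passport?")
print(req["config"]["task_type"])

# The actual call (requires an API key) would look roughly like:
# from google import genai
# client = genai.Client()
# resp = client.models.embed_content(**req)
```

Queries and documents get different task_type values on purpose: RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT shape the vectors so that short questions match long passages well.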
Gemini Embedding 2 leads the MTEB (Massive Text Embedding Benchmark) English leaderboard with a score of 68.32, a +5.81 point margin over competitors — a substantial gap in a field where improvements are often measured in fractions of a point. On code-specific retrieval (MTEB Code), it scores 74.66, and it leads the MMTEB multilingual benchmark by +5.09 points.
Dimension flexibility barely dents accuracy: at 2,048 dimensions the model scores 68.16; at 1,536 it scores 68.17; at 768 it still manages 67.99 — meaning teams can cut storage costs significantly with negligible retrieval quality loss.
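The storage side of that trade-off is easy to estimate. Assuming float32 vectors (4 bytes per dimension) and an illustrative 10-million-document corpus:

```python
def index_size_gb(num_vectors, dims, bytes_per_dim=4):
    """Raw vector storage in GiB, assuming float32 by default."""
    return num_vectors * dims * bytes_per_dim / 1024**3

corpus = 10_000_000  # hypothetical corpus size
for dims in (3072, 1536, 768):
    print(dims, round(index_size_gb(corpus, dims), 1), "GB")
```

Truncating from 3,072 to 768 dimensions cuts raw index storage by 75% (and speeds up distance computations proportionally) for roughly a third of a point of benchmark score.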
On video retrieval tasks (VATEX, MSR-VTT, YouCook2), the model outperforms all existing alternatives by a wide margin. Early adopters have reported a 70% latency reduction and 20% recall improvement over conventional multi-model pipelines that chain separate text, image, and video embedding systems.
Gemini Embedding 2 is available now in public preview as gemini-embedding-2-preview through both the Gemini API and Vertex AI. Pricing is set at $0.25 per million tokens with a free tier included. The model integrates with popular frameworks including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Google’s Vector Search.
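At the announced rate, back-of-the-envelope costing is straightforward. The corpus and token figures below are hypothetical; real token counts come from the API.

```python
PRICE_PER_MILLION_TOKENS = 0.25  # USD, announced preview pricing

def embedding_cost_usd(total_tokens):
    """Estimated cost to embed a corpus of the given total token count."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# e.g. 50,000 documents averaging 800 tokens each (assumed figures)
print(f"${embedding_cost_usd(50_000 * 800):.2f}")  # $10.00
```

Embedding is a one-time cost per document (plus per-query cost at search time), so even large corpora stay inexpensive relative to generation workloads.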
It’s worth noting the competitive landscape: open-source models like Alibaba’s Qwen3-8B (MTEB 70.2) and NVIDIA’s NV-Embed-v2 (MTEB 69.3) achieve comparable or superior scores on text-only benchmarks — but neither matches Gemini Embedding 2’s native multimodal breadth.
