<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Embeddings on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/embeddings/</link><description>Recent content in Embeddings on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 13 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/embeddings/index.xml" rel="self" type="application/rss+xml"/><item><title>Two flavors of omni — one index to retrieve with, one framework to generate with</title><link>https://ice-ice-bear.github.io/posts/2026-05-13-multimodal-embeddings-digest/</link><pubDate>Wed, 13 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-13-multimodal-embeddings-digest/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Two flavors of omni — one index to retrieve with, one framework to generate with" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Two systems that surfaced around the same time both lead with the same word — &lt;strong&gt;&amp;ldquo;omni&amp;rdquo;&lt;/strong&gt;. One is an &lt;a class="link" href="https://en.wikipedia.org/wiki/Sentence_embedding" target="_blank" rel="noopener"
 &gt;embedding&lt;/a&gt; model that puts text, image, video, and audio into &lt;a class="link" href="https://en.wikipedia.org/wiki/Search_engine_indexing" target="_blank" rel="noopener"
 &gt;a single index&lt;/a&gt; and searches them together (&lt;a class="link" href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-omni-all-media-one-index" target="_blank" rel="noopener"
 &gt;jina-embeddings-v5-omni&lt;/a&gt;); the other is an image-generation &lt;a class="link" href="https://en.wikipedia.org/wiki/Foundation_model" target="_blank" rel="noopener"
 &gt;foundation model&lt;/a&gt; that folds high-fidelity generation and precise editing into one framework (&lt;a class="link" href="https://arxiv.org/abs/2605.10730" target="_blank" rel="noopener"
 &gt;Qwen-Image-2.0&lt;/a&gt;). Retrieval and generation point in opposite directions, yet both stand on the same design philosophy: &lt;strong&gt;drop the default of &amp;ldquo;a separate pipeline per modality&amp;rdquo; and merge it into one.&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Theme["The omni philosophy: &amp;lt;br/&amp;gt; collapse per-modality pipelines into one"]
 Theme --&gt; Retrieval["retrieval side &amp;lt;br/&amp;gt; (all-media one index)"]
 Theme --&gt; Generation["generation side &amp;lt;br/&amp;gt; (generation + editing)"]

 Retrieval --&gt; J1["jina-embeddings-v5-omni"]
 J1 --&gt; J2["cross-modal projector"]
 J1 --&gt; J3["truncatable + BBQ quantization"]
 J1 --&gt; J4["semantic_text index"]

 Generation --&gt; Q1["Qwen-Image-2.0"]
 Q1 --&gt; Q2["Qwen3-VL condition encoder"]
 Q1 --&gt; Q3["Multimodal Diffusion Transformer"]
 Q1 --&gt; Q4["1K-token instructions"]&lt;/pre&gt;&lt;h2 id="1-jina-embeddings-v5-omni--all-media-one-index"&gt;1. jina-embeddings-v5-omni — all media, one index
&lt;/h2&gt;&lt;p&gt;This is the launch post for &lt;a class="link" href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-omni-all-media-one-index" target="_blank" rel="noopener"
 &gt;jina-embeddings-v5-omni&lt;/a&gt;, published May 11, 2026 by &lt;a class="link" href="https://www.elastic.co/search-labs/author/scott-martens" target="_blank" rel="noopener"
 &gt;Scott Martens&lt;/a&gt; on &lt;a class="link" href="https://www.elastic.co/search-labs" target="_blank" rel="noopener"
 &gt;Elastic Search Labs&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="core"&gt;Core
&lt;/h3&gt;&lt;p&gt;The long-standing pain in &lt;a class="link" href="https://en.wikipedia.org/wiki/Multimedia_information_retrieval" target="_blank" rel="noopener"
 &gt;multimodal retrieval&lt;/a&gt; is that every modality runs its own indexing pipeline. Text gets a text embedder, images get something &lt;a class="link" href="https://en.wikipedia.org/wiki/Contrastive_Language-Image_Pre-training" target="_blank" rel="noopener"
 &gt;CLIP&lt;/a&gt;-shaped, audio gets yet another model — and cross-modal search is stitched together by hand. v5-omni puts text (~100 languages), images, video, and audio into &lt;strong&gt;one &lt;a class="link" href="https://en.wikipedia.org/wiki/Elasticsearch" target="_blank" rel="noopener"
 &gt;Elasticsearch&lt;/a&gt; index&lt;/strong&gt; and queries them at once.&lt;/p&gt;
&lt;h3 id="how"&gt;How
&lt;/h3&gt;&lt;p&gt;Not a full retrain but &lt;strong&gt;modular assembly&lt;/strong&gt;. The designers lift encoders straight out of pretrained models — &lt;a class="link" href="https://arxiv.org/abs/2502.14786" target="_blank" rel="noopener"
 &gt;SigLIP2&lt;/a&gt;-family vision encoders, &lt;a class="link" href="https://github.com/openai/whisper" target="_blank" rel="noopener"
 &gt;Whisper-large-v3&lt;/a&gt; for audio — and attach them as preprocessors in front of the existing jina-embeddings-v5-text backbone. The key piece is a trained &lt;strong&gt;cross-modal projector&lt;/strong&gt;: a small adapter that translates each media encoder&amp;rsquo;s output into the text model&amp;rsquo;s embedding space. For the small version, that adds roughly 5.5 million new parameters.&lt;/p&gt;
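&lt;p&gt;The post doesn&amp;rsquo;t publish the projector architecture, but the idea reduces to a small trained module that maps a media encoder&amp;rsquo;s output tokens into the text backbone&amp;rsquo;s embedding dimension. A minimal PyTorch sketch, with layer sizes chosen purely for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn as nn

class CrossModalProjector(nn.Module):
    """Illustrative adapter: maps a media encoder's output tokens into the
    text backbone's embedding space. Sizes are assumptions, not the published
    jina-embeddings-v5-omni configuration."""
    def __init__(self, encoder_dim=1152, text_dim=1024, hidden_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, encoder_tokens):
        # encoder_tokens: (batch, num_patches_or_frames, encoder_dim)
        return self.proj(encoder_tokens)

# Only the projector is trained; the media encoders and text backbone stay frozen.
projector = CrossModalProjector()
print(sum(p.numel() for p in projector.parameters()))  # a few million parameters
&lt;/code&gt;&lt;/pre&gt;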
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;small&lt;/strong&gt;: 1024-dim embeddings, 32,768-token context, 1.66B parameters with all extensions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;nano&lt;/strong&gt;: 768-dim embeddings, 8,192-token context, 1.004B parameters fully loaded&lt;/li&gt;
&lt;li&gt;Both swap in task-specific &lt;a class="link" href="https://arxiv.org/abs/2106.09685" target="_blank" rel="noopener"
 &gt;LoRA&lt;/a&gt; adapters for retrieval, clustering, classification, and semantic similarity&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="a-sense-of-storage-reality"&gt;A sense of storage reality
&lt;/h3&gt;&lt;p&gt;In large-scale &lt;a class="link" href="https://en.wikipedia.org/wiki/Vector_database" target="_blank" rel="noopener"
 &gt;vector search&lt;/a&gt;, embedding dimensionality is cost. v5-omni answers two ways. First, &lt;strong&gt;truncation&lt;/strong&gt; — in the style of &lt;a class="link" href="https://arxiv.org/abs/2205.13147" target="_blank" rel="noopener"
 &gt;Matryoshka representation learning&lt;/a&gt;, embeddings can be truncated from their native dimensionality down to as few as 32 dimensions, cutting storage by 93% to a 64-byte footprint. Second, &lt;a class="link" href="https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch" target="_blank" rel="noopener
 &gt;Better Binary Quantization&lt;/a&gt; (BBQ) compatibility — meshing with Elasticsearch&amp;rsquo;s quantization for &amp;ldquo;near-identical performance&amp;rdquo; at lower precision. And crucially, the &lt;strong&gt;text embeddings v5-omni produces are identical&lt;/strong&gt; to jina-embeddings-v5-text. An existing text index can be promoted to a multimedia index in place.&lt;/p&gt;
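&lt;p&gt;Truncation is easy to apply client-side: keep the leading dimensions and re-normalize before indexing. A rough numpy sketch; the dimensions and dtype here are illustrative, not a statement of the model&amp;rsquo;s exact export format:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def truncate_embedding(vec, keep_dims=128):
    """Matryoshka-style truncation: keep the leading dimensions, then
    re-normalize so cosine similarity still behaves sensibly."""
    truncated = vec[:keep_dims]
    return truncated / max(np.linalg.norm(truncated), 1e-12)

full = np.random.randn(1024).astype(np.float32)  # the "small" model's native 1024 dims
short = truncate_embedding(full, keep_dims=32)   # the most aggressive setting in the post

# Storage per vector before any quantization:
print(full.nbytes, short.nbytes)  # 4096 bytes vs 128 bytes at float32
&lt;/code&gt;&lt;/pre&gt;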
&lt;h3 id="benchmarks"&gt;Benchmarks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Text retrieval: top of its size class on the &lt;a class="link" href="https://github.com/embeddings-benchmark/mteb" target="_blank" rel="noopener"
 &gt;MMTEB&lt;/a&gt; suite&lt;/li&gt;
&lt;li&gt;Visual similarity: &amp;ldquo;only beaten by a model three times its size&amp;rdquo;; nano surpasses models 10-25x larger&lt;/li&gt;
&lt;li&gt;Visual document retrieval: competitive with 3-7B models while staying under 1B&lt;/li&gt;
&lt;li&gt;Audio: among the top on &lt;a class="link" href="https://huggingface.co/datasets/mteb/MAEB" target="_blank" rel="noopener"
 &gt;MAEB&lt;/a&gt; audio retrieval&lt;/li&gt;
&lt;li&gt;Video temporal grounding: 55.57 on &lt;a class="link" href="https://github.com/jiyanggao/TALL" target="_blank" rel="noopener"
 &gt;Charades-STA&lt;/a&gt; (vs ByteDance Seed 1.6 at 29.30), 58.93 on MomentSeeker&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-matters-now"&gt;Why it matters now
&lt;/h3&gt;&lt;p&gt;This is not &amp;ldquo;one more embedding model.&amp;rdquo; It &lt;strong&gt;takes one abstraction layer out of search infrastructure.&lt;/strong&gt; In Elasticsearch you create an index with &lt;code&gt;type: semantic_text&lt;/code&gt;, drop the model name into &lt;code&gt;inference_id&lt;/code&gt;, and non-text inputs get Base64-encoded into the same fields. The modality-branching logic disappears from the application layer. Anyone who has built a &lt;a class="link" href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation" target="_blank" rel="noopener
 &gt;RAG&lt;/a&gt; pipeline knows exactly which operational cost that simplification removes.&lt;/p&gt;
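&lt;p&gt;In practice that mapping change looks roughly like the following sketch with the official Python client. The index and field names are made up, and the &lt;code&gt;inference_id&lt;/code&gt; has to point at an inference endpoint you have actually configured for the model; treat this as an illustration of the flow, not the exact payload format:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local cluster

# One field, one inference endpoint, every modality (hypothetical names).
es.indices.create(
    index="media-library",
    mappings={
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": "jina-embeddings-v5-omni",  # your configured endpoint
            }
        }
    },
)

# Text goes in as-is; non-text media is Base64-encoded into the same field.
es.index(index="media-library", document={"content": "quarterly report summary"})
with open("diagram.png", "rb") as f:
    es.index(index="media-library",
             document={"content": base64.b64encode(f.read()).decode("ascii")})
&lt;/code&gt;&lt;/pre&gt;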
&lt;h2 id="2-qwen-image-20--generation-and-editing-in-one-framework"&gt;2. Qwen-Image-2.0 — generation and editing in one framework
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.10730" target="_blank" rel="noopener"
 &gt;arxiv 2605.10730&lt;/a&gt;, authored by 75 contributors from the &lt;a class="link" href="https://qwenlm.github.io/" target="_blank" rel="noopener"
 &gt;Alibaba Qwen&lt;/a&gt; team, May 11, 2026, &lt;a class="link" href="https://arxiv.org/list/cs.CV/recent" target="_blank" rel="noopener"
 &gt;cs.CV&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="core-1"&gt;Core
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Qwen-Image-2.0&lt;/strong&gt; is an omni-capable image-generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. It targets exactly where existing models stay weak — ultra-long text rendering, multilingual &lt;a class="link" href="https://en.wikipedia.org/wiki/Typography" target="_blank" rel="noopener"
 &gt;typography&lt;/a&gt;, high-resolution &lt;a class="link" href="https://en.wikipedia.org/wiki/Photorealism" target="_blank" rel="noopener"
 &gt;photorealism&lt;/a&gt;, robust instruction following, and efficient deployment — especially in text-rich, compositionally complex scenes.&lt;/p&gt;
&lt;h3 id="how-1"&gt;How
&lt;/h3&gt;&lt;p&gt;The core structure couples two parts. It uses &lt;strong&gt;&lt;a class="link" href="https://qwenlm.github.io/" target="_blank" rel="noopener"
 &gt;Qwen3-VL&lt;/a&gt; as the condition encoder&lt;/strong&gt; and stacks a &lt;strong&gt;Multimodal &lt;a class="link" href="https://arxiv.org/abs/2212.09748" target="_blank" rel="noopener"
 &gt;Diffusion Transformer&lt;/a&gt;&lt;/strong&gt; on top for joint condition-target modeling. It is a &lt;a class="link" href="https://www.wpeebles.com/DiT" target="_blank" rel="noopener"
 &gt;DiT&lt;/a&gt;-family design — a &lt;a class="link" href="https://en.wikipedia.org/wiki/Diffusion_model" target="_blank" rel="noopener"
 &gt;diffusion model&lt;/a&gt; whose denoising backbone is a transformer rather than a &lt;a class="link" href="https://en.wikipedia.org/wiki/U-Net" target="_blank" rel="noopener
 &gt;U-Net&lt;/a&gt; — backed by large-scale data curation and a customized multi-stage training pipeline. That structure keeps strong &lt;a class="link" href="https://en.wikipedia.org/wiki/Multimodal_learning" target="_blank" rel="noopener"
 &gt;multimodal understanding&lt;/a&gt; while moving flexibly between generation and editing.&lt;/p&gt;
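&lt;p&gt;The paper&amp;rsquo;s exact block design isn&amp;rsquo;t reproduced here, but &amp;ldquo;joint condition-target modeling&amp;rdquo; in a DiT usually means one concrete thing: condition tokens from the encoder and noisy image-latent tokens attend to each other inside the same transformer. A schematic PyTorch sketch of that idea, not the Qwen-Image-2.0 architecture itself:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn as nn

class JointConditionTargetBlock(nn.Module):
    """Schematic MM-DiT-style block: condition tokens (e.g. from a VLM encoder)
    and noisy image-latent tokens are concatenated and share self-attention,
    so instructions and image conditions shape every denoising step.
    Dimensions are illustrative, not Qwen-Image-2.0's published configuration."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cond_tokens, latent_tokens):
        # cond_tokens:   (batch, n_cond, dim)   outputs of the condition encoder
        # latent_tokens: (batch, n_latent, dim) noisy image latents at this timestep
        x = torch.cat([cond_tokens, latent_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # joint attention over both streams
        x = x + self.mlp(self.norm2(x))
        n_cond = cond_tokens.shape[1]
        return x[:, :n_cond], x[:, n_cond:]  # split back into condition / latent streams
&lt;/code&gt;&lt;/pre&gt;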
&lt;h3 id="contributions"&gt;Contributions
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Supports &lt;strong&gt;instructions up to 1K tokens&lt;/strong&gt; for text-rich content like slides, posters, infographics, and comics&lt;/li&gt;
&lt;li&gt;Substantially improves multilingual text fidelity and typography&lt;/li&gt;
&lt;li&gt;Strengthens photorealistic generation with richer detail, more realistic textures, and coherent lighting&lt;/li&gt;
&lt;li&gt;Follows complex prompts more reliably across diverse styles&lt;/li&gt;
&lt;li&gt;Extensive &lt;a class="link" href="https://en.wikipedia.org/wiki/Human_evaluation_of_machine_translation" target="_blank" rel="noopener"
 &gt;human evaluation&lt;/a&gt; shows it substantially outperforms previous Qwen-Image models on both generation and editing&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-it-matters-now-1"&gt;Why it matters now
&lt;/h3&gt;&lt;p&gt;The history of generative image models has been &lt;strong&gt;the separation of generation and editing&lt;/strong&gt;. You make something with &lt;a class="link" href="https://en.wikipedia.org/wiki/Stable_Diffusion" target="_blank" rel="noopener"
 &gt;Stable Diffusion&lt;/a&gt;, then fix it separately with &lt;a class="link" href="https://github.com/lllyasviel/ControlNet" target="_blank" rel="noopener"
 &gt;ControlNet&lt;/a&gt; or &lt;a class="link" href="https://en.wikipedia.org/wiki/Inpainting" target="_blank" rel="noopener"
 &gt;inpainting&lt;/a&gt; tools. Qwen-Image-2.0 puts both inside one model via joint condition-target modeling. That the condition encoder is a &lt;a class="link" href="https://en.wikipedia.org/wiki/Vision-language_model" target="_blank" rel="noopener"
 &gt;VLM&lt;/a&gt; matters too — the same encoder understands image conditions as well as text prompts, so &amp;ldquo;change this image like so&amp;rdquo; travels the same path as generation.&lt;/p&gt;
&lt;h2 id="reading-the-cluster"&gt;Reading the cluster
&lt;/h2&gt;&lt;p&gt;A retrieval model and a generation model, opposite tasks — yet the design decisions mirror each other.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Aspect&lt;/th&gt;
 &lt;th&gt;jina-embeddings-v5-omni&lt;/th&gt;
 &lt;th&gt;Qwen-Image-2.0&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Direction&lt;/td&gt;
 &lt;td&gt;multimodal → embedding (retrieval)&lt;/td&gt;
 &lt;td&gt;condition → image (generation/editing)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;What&amp;rsquo;s unified&lt;/td&gt;
 &lt;td&gt;per-modality indexing pipelines&lt;/td&gt;
 &lt;td&gt;a generation model + an editing model&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Means of unification&lt;/td&gt;
 &lt;td&gt;cross-modal projector&lt;/td&gt;
 &lt;td&gt;Qwen3-VL condition encoder&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Backbone&lt;/td&gt;
 &lt;td&gt;jina-embeddings-v5-text&lt;/td&gt;
 &lt;td&gt;Multimodal Diffusion Transformer&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Reuse strategy&lt;/td&gt;
 &lt;td&gt;pretrained encoders + small adapters&lt;/td&gt;
 &lt;td&gt;a VLM repurposed as condition encoder&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Deployment lens&lt;/td&gt;
 &lt;td&gt;truncation &amp;amp; BBQ for storage savings&lt;/td&gt;
 &lt;td&gt;up to 1K tokens, efficient deployment emphasized&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 subgraph Retrieval
 T1["text"] --&gt; P["cross-modal &amp;lt;br/&amp;gt; projector"]
 I1["image/video"] --&gt; P
 A1["audio"] --&gt; P
 P --&gt; IDX["one index"]
 end
 subgraph Generation
 TXT["text condition"] --&gt; VL["Qwen3-VL &amp;lt;br/&amp;gt; condition encoder"]
 IMG["image condition"] --&gt; VL
 VL --&gt; MMDIT["MM Diffusion &amp;lt;br/&amp;gt; Transformer"]
 MMDIT --&gt; OUT["generated/edited result"]
 end&lt;/pre&gt;&lt;p&gt;Three shared patterns. First, &lt;strong&gt;reuse of pretrained assets&lt;/strong&gt; — jina pulls in SigLIP2 and Whisper encoders, Qwen pulls in all of Qwen3-VL. Neither trains from scratch. Second, &lt;strong&gt;projection into a shared representation space&lt;/strong&gt; — jina&amp;rsquo;s projector funnels every medium into the text embedding space; Qwen&amp;rsquo;s condition encoder funnels text and image conditions into the same diffusion input. Third, &lt;strong&gt;deployment cost as a first-class design element&lt;/strong&gt; — jina with truncation and &lt;a class="link" href="https://en.wikipedia.org/wiki/Quantization_%28signal_processing%29" target="_blank" rel="noopener"
 &gt;quantization&lt;/a&gt;, Qwen with efficient deployment as an explicit goal. These are designs that assume a production system, not a research demo.&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;It is no accident that &lt;code&gt;omni&lt;/code&gt; got pinned to both systems at once. The first generation of multimodal AI was &lt;strong&gt;a dedicated model per modality&lt;/strong&gt; — CLIP for images, Whisper for audio, &lt;a class="link" href="https://en.wikipedia.org/wiki/BERT_%28language_model%29" target="_blank" rel="noopener"
 &gt;BERT&lt;/a&gt;-family for text. The second generation stitched them together with &lt;a class="link" href="https://en.wikipedia.org/wiki/Multimodal_learning" target="_blank" rel="noopener"
 &gt;late fusion&lt;/a&gt;. What we are watching now is the third generation — &lt;strong&gt;one representation space, one framework&lt;/strong&gt;. jina-v5-omni reaches that point from the retrieval side, Qwen-Image-2.0 from the generation side. The interesting part is that neither is &lt;em&gt;full unification&lt;/em&gt; but &lt;em&gt;clever reassembly&lt;/em&gt;: lift pretrained encoders, bind them with a small adapter or a joint-modeling layer. Training an omni model from scratch is still astronomically expensive, so practical omni comes from module reuse. And both cases &lt;strong&gt;bake deployment cost into the design at the research stage&lt;/strong&gt; — truncation, BBQ quantization, 1K-token instructions, efficient deployment. That is the signal that multimodal has moved past demos and become infrastructure. The next round&amp;rsquo;s question is not &amp;ldquo;more modalities&amp;rdquo; but &amp;ldquo;how cheaply and how reliably do you run this unification.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Primary sources&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.elastic.co/search-labs/blog/jina-embeddings-v5-omni-all-media-one-index" target="_blank" rel="noopener"
 &gt;One index, all media: Introducing jina-embeddings-v5-omni&lt;/a&gt; — &lt;a class="link" href="https://www.elastic.co/search-labs/author/scott-martens" target="_blank" rel="noopener"
 &gt;Scott Martens&lt;/a&gt;, &lt;a class="link" href="https://www.elastic.co/search-labs" target="_blank" rel="noopener"
 &gt;Elastic Search Labs&lt;/a&gt; (2026-05-11)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.10730" target="_blank" rel="noopener"
 &gt;Qwen-Image-2.0 Technical Report (2605.10730)&lt;/a&gt; — 75 contributors from the &lt;a class="link" href="https://qwenlm.github.io/" target="_blank" rel="noopener"
 &gt;Alibaba Qwen&lt;/a&gt; team (2026-05-11, &lt;a class="link" href="https://arxiv.org/list/cs.CV/recent" target="_blank" rel="noopener"
 &gt;cs.CV&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Models &amp;amp; components&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2502.14786" target="_blank" rel="noopener"
 &gt;SigLIP2 (2502.14786)&lt;/a&gt; — the vision encoder family jina-v5-omni borrows&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/openai/whisper" target="_blank" rel="noopener"
 &gt;Whisper&lt;/a&gt; — the audio encoder jina-v5-omni borrows&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://qwenlm.github.io/" target="_blank" rel="noopener"
 &gt;Qwen&lt;/a&gt; — home of the Qwen3-VL that Qwen-Image-2.0 uses as condition encoder&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2106.09685" target="_blank" rel="noopener"
 &gt;LoRA: Low-Rank Adaptation (2106.09685)&lt;/a&gt; — the basis of the task-specific adapters&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2205.13147" target="_blank" rel="noopener"
 &gt;Matryoshka Representation Learning (2205.13147)&lt;/a&gt; — the principle behind truncatable embeddings&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2212.09748" target="_blank" rel="noopener"
 &gt;Scalable Diffusion Models with Transformers — DiT (2212.09748)&lt;/a&gt; — the lineage of the Multimodal Diffusion Transformer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Background&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Multimedia_information_retrieval" target="_blank" rel="noopener"
 &gt;Multimedia information retrieval&lt;/a&gt; · &lt;a class="link" href="https://en.wikipedia.org/wiki/Vector_database" target="_blank" rel="noopener"
 &gt;Vector database&lt;/a&gt; · &lt;a class="link" href="https://en.wikipedia.org/wiki/Sentence_embedding" target="_blank" rel="noopener"
 &gt;Sentence embedding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Diffusion_model" target="_blank" rel="noopener"
 &gt;Diffusion model&lt;/a&gt; · &lt;a class="link" href="https://en.wikipedia.org/wiki/Vision-language_model" target="_blank" rel="noopener"
 &gt;Vision-language model&lt;/a&gt; · &lt;a class="link" href="https://en.wikipedia.org/wiki/Multimodal_learning" target="_blank" rel="noopener"
 &gt;Multimodal learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Contrastive_Language-Image_Pre-training" target="_blank" rel="noopener"
 &gt;Contrastive Language-Image Pre-training (CLIP)&lt;/a&gt; — the first-generation multimodal staple&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation" target="_blank" rel="noopener"
 &gt;Retrieval-augmented generation&lt;/a&gt; — the canonical pipeline multimodal retrieval slots into&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch" target="_blank" rel="noopener"
 &gt;Better Binary Quantization&lt;/a&gt; — Elasticsearch BBQ explainer&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/embeddings-benchmark/mteb" target="_blank" rel="noopener"
 &gt;MTEB / MMTEB&lt;/a&gt; — the embedding benchmark suite&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>