Multimodal Embeddings Tutorial
Traditional search is limited to text matching. Multimodal embeddings project text, images, and other modalities into the same high-dimensional vector space, enabling cross-modal retrieval — search a product catalog of images using text queries, or find documents that relate to an uploaded photo. This shared representation works because embedding models learn to map semantically similar content to nearby points in vector space, regardless of the original modality.
What You Will Build
A multimodal embedding pipeline that embeds both text and images into a shared vector space, then computes similarity scores across modalities.
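To make "shared vector space" concrete, the sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings. The `item` type, the `catalog` values, and the `nearest` helper are all hypothetical and exist only for illustration; a real model would produce hundreds of dimensions, but the retrieval logic would have the same shape.

```go
package main

import (
	"fmt"
	"math"
)

// item is a hypothetical catalog entry; Vector stands in for a real embedding.
type item struct {
	Name     string
	Modality string // "text" or "image"
	Vector   []float64
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest returns the item most similar to query, ignoring modality entirely.
// items must be non-empty.
func nearest(query []float64, items []item) item {
	best, bestScore := items[0], math.Inf(-1)
	for _, it := range items {
		if s := cosine(query, it.Vector); s > bestScore {
			best, bestScore = it, s
		}
	}
	return best
}

// catalog holds made-up vectors: the first axis loosely means "animal",
// the second "finance", the third "infrastructure".
var catalog = []item{
	{"dog_photo.jpg", "image", []float64{0.9, 0.1, 0.0}},
	{"q3_report.pdf", "text", []float64{0.1, 0.9, 0.1}},
	{"cluster_diagram.png", "image", []float64{0.0, 0.2, 0.9}},
}

func main() {
	// Pretend this vector came from embedding the text "dog playing outside".
	query := []float64{0.8, 0.2, 0.1}
	hit := nearest(query, catalog)
	fmt.Printf("%s (%s)\n", hit.Name, hit.Modality)
}
```

Because `nearest` never inspects the `Modality` field, a text-derived query can match an image entry as long as both vectors live in the same space.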
Prerequisites
- A Google Cloud API key with the Generative Language API enabled (or an equivalent multimodal embedding provider)
- Understanding of the RAG pipeline
The Embedder Interface
Beluga AI's embedding interface in the `rag/embedding` package separates document embedding from query embedding. This split exists because some embedding models use different strategies for documents versus queries; for example, retrieval-optimized models may add a `"query: "` prefix. The interface follows the same registry pattern as LLM providers, so embedding providers are created with `embedding.New()` and discovered with `embedding.List()`.

```go
type Embedder interface {
	EmbedDocuments(ctx context.Context, texts []string) ([][]float32, error)
	EmbedQuery(ctx context.Context, text string) ([]float32, error)
}
```

For multimodal use cases, embedding providers accept `schema.Document` objects with metadata that specifies the content type.
Step 1: Initialize the Embedding Provider
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/rag/embedding"
	// Blank import registers the google provider with the embedding registry.
	_ "github.com/lookatitude/beluga-ai/rag/embedding/providers/google"
)

func main() {
	ctx := context.Background()

	embedder, err := embedding.New("google", config.ProviderConfig{
		"api_key": os.Getenv("GOOGLE_API_KEY"),
		"model":   "text-embedding-004",
	})
	if err != nil {
		log.Fatalf("create embedder: %v", err)
	}

	// Embed text documents
	texts := []string{
		"A golden retriever playing in a park",
		"Financial quarterly report Q3 2025",
		"Kubernetes deployment configuration",
	}

	vectors, err := embedder.EmbedDocuments(ctx, texts)
	if err != nil {
		log.Fatalf("embed documents: %v", err)
	}

	for i, v := range vectors {
		fmt.Printf("Text %d: %d dimensions\n", i, len(v))
	}
}
```

Step 2: Embed Queries
Use `EmbedQuery` for search queries, which may use a different embedding strategy optimized for retrieval. The distinction between `EmbedDocuments` and `EmbedQuery` matters because asymmetric embedding models produce better retrieval results when they can distinguish between the content being indexed and the question being asked about it.

```go
// Continues inside main from Step 1.
queryVector, err := embedder.EmbedQuery(ctx, "dog playing outside")
if err != nil {
	log.Fatalf("embed query: %v", err)
}

fmt.Printf("Query vector: %d dimensions\n", len(queryVector))
```

Step 3: Batch Processing for Performance
Embedding one document at a time is slow because each call incurs HTTP round-trip overhead. Processing documents in batches amortizes this overhead and takes advantage of the embedding model's ability to process multiple inputs in a single forward pass. The batch size should balance throughput against API rate limits and memory constraints.

```go
// batchEmbed embeds texts in slices of at most batchSize and returns the
// vectors in the same order as the input.
func batchEmbed(ctx context.Context, embedder embedding.Embedder, texts []string, batchSize int) ([][]float32, error) {
	// Guard against a non-positive batch size, which would otherwise
	// make the loop below spin forever or slice out of range.
	if batchSize <= 0 {
		return nil, fmt.Errorf("batch size must be positive, got %d", batchSize)
	}

	var allVectors [][]float32

	for i := 0; i < len(texts); i += batchSize {
		end := i + batchSize
		if end > len(texts) {
			end = len(texts)
		}

		batch := texts[i:end]
		vectors, err := embedder.EmbedDocuments(ctx, batch)
		if err != nil {
			return nil, fmt.Errorf("batch %d-%d: %w", i, end, err)
		}
		allVectors = append(allVectors, vectors...)
	}

	return allVectors, nil
}
```

Step 4: Compute Cosine Similarity
Compare vectors to find semantic matches. Cosine similarity is the cosine of the angle between two vectors, producing a value between -1 and 1 where 1 means identical direction (maximum similarity). This metric is preferred over Euclidean distance for embeddings because it is invariant to vector magnitude: two vectors pointing in the same direction are similar regardless of their length.

```go
import "math"

// cosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 when the lengths differ or either vector has zero magnitude.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}

	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}

	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
```

Usage:

```go
// Compare query to each document
for i, docVec := range vectors {
	score := cosineSimilarity(queryVector, docVec)
	fmt.Printf("Similarity to doc %d: %.4f\n", i, score)
}
```

Step 5: Choosing the Right Model
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | High accuracy, general purpose |
| OpenAI text-embedding-3-small | 1536 | Cost-effective, general purpose |
| Google text-embedding-004 | 768 | General-purpose text embedding |
| Ollama nomic-embed-text | 768 | Local, privacy-sensitive, free |
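Dimension count also drives storage cost. As a rough sketch, assuming vectors are stored as raw `float32` values with no index overhead or compression, the hypothetical `storageBytes` helper below estimates the footprint of a corpus for each model in the table. The corpus size of one million vectors is an arbitrary example.

```go
package main

import "fmt"

// storageBytes estimates raw storage for n vectors of dim float32 values.
// Assumption: 4 bytes per component, no index overhead, no compression.
func storageBytes(n, dim int64) int64 {
	return n * dim * 4
}

func main() {
	const n = 1_000_000 // hypothetical corpus size
	models := []struct {
		name string
		dim  int64
	}{
		{"text-embedding-3-large", 3072},
		{"text-embedding-3-small", 1536},
		{"text-embedding-004", 768},
		{"nomic-embed-text", 768},
	}
	for _, m := range models {
		gib := float64(storageBytes(n, m.dim)) / (1 << 30)
		fmt.Printf("%-24s %6.2f GiB\n", m.name, gib)
	}
}
```

Halving the dimension count halves the raw footprint, which is one reason a smaller model can be the better choice for a large corpus.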
Always use the same model for both document indexing and query embedding. Mixing models produces incompatible vector spaces because each model learns a different mapping from text to vectors.
Verification
- Embed a list of 100 texts using batch processing and measure the time versus sequential embedding.
- Compute similarity between semantically related texts and verify that scores are high (above roughly 0.7, though the exact range depends on the model).
- Compute similarity between unrelated texts and verify that scores are lower.
Next Steps
- Fine-tuning Embedding Strategies: optimize retrieval performance
- In-memory Vector Store: store and search embeddings locally