RAG Pipeline
Hybrid retrieval combining dense vectors, BM25 sparse search, and graph traversal with Reciprocal Rank Fusion. 12+ vector stores, 8+ embedding providers, and advanced strategies like CRAG, Adaptive RAG, HyDE, and GraphRAG.
Overview
Retrieval-Augmented Generation (RAG) is how agents ground their responses in real data instead of relying solely on training knowledge. Beluga AI's RAG pipeline is built around a hybrid search default that combines dense vector similarity, BM25 sparse keyword matching, and optional graph traversal — fused with Reciprocal Rank Fusion (k=60). This three-signal approach consistently outperforms any single retrieval method alone.
The pipeline is modular at every stage. Choose from 8+ embedding providers (OpenAI, Google, Cohere, Voyage, Jina, and more), 12+ vector stores (pgvector, Qdrant, Pinecone, Milvus, and others), and 8+ document loaders for ingesting content from web pages, PDFs, APIs, and cloud storage. Each component implements a clean interface and is swappable via the registry pattern.
Beyond basic retrieval, Beluga AI includes 5 advanced strategies for production RAG systems: CRAG for relevance-aware fallback, Adaptive RAG for query-complexity routing, HyDE for zero-shot retrieval, SEAL-RAG for self-aligned generation, and GraphRAG for knowledge-graph-enhanced answers. These strategies compose with the base pipeline, letting you start simple and add sophistication as your requirements evolve.
Capabilities
Hybrid Search Default
Every retrieval query runs through a three-stage pipeline by default. First, BM25 sparse search returns approximately 200 keyword-matched candidates. Second, dense vector search narrows to the top 100 by semantic similarity. Finally, cross-encoder reranking selects the top 10 most relevant chunks. Results from sparse and dense stages are combined with Reciprocal Rank Fusion (k=60).
retriever := rag.NewHybridRetriever(
    rag.WithSparse(bm25Index),            // BM25 keyword matching
    rag.WithDense(vectorStore, embedder), // Dense vector similarity
    rag.WithReranker(crossEncoder),       // Cross-encoder precision
    rag.WithRRF(60),                      // Reciprocal Rank Fusion
    rag.WithTopK(10),                     // Final result count
)
Embedding Providers
Eight embedding providers covering proprietary and open-source models. Each implements the Embedder interface with batch embedding support and automatic dimension handling.
- OpenAI — text-embedding-3-small/large, ada-002
- Google — text-embedding-004, Gecko
- Ollama — Local embedding models (nomic-embed, mxbai)
- Cohere — embed-v3, multilingual
- Voyage — voyage-3, code-optimized embeddings
- Jina — jina-embeddings-v3, multilingual and cross-lingual
- Mistral — mistral-embed
- Sentence Transformers — Local ONNX-based inference
embedder, _ := embedding.New("openai", embedding.Config{
    Model:      "text-embedding-3-large",
    Dimensions: 1536,
})
vectors, err := embedder.EmbedBatch(ctx, documents)
Vector Store Providers
Twelve vector store backends ranging from lightweight embedded options to distributed cloud-scale systems. All implement the same VectorStore interface with support for metadata filtering, namespace isolation, and batch operations.
- pgvector — PostgreSQL extension, HNSW/IVFFlat indexes
- Qdrant — Purpose-built, advanced filtering, hybrid search
- Pinecone — Managed cloud, serverless option
- ChromaDB — Developer-friendly, embedded or client-server
- Weaviate — Graph + vector, hybrid BM25
- Milvus — Distributed, billion-scale
- Turbopuffer — Serverless, cost-optimized
- Redis — In-memory speed, RediSearch integration
- Elasticsearch — Full-text + vector, existing infrastructure
- SQLite-vec — Embedded, zero-dependency local
- MongoDB — Atlas Vector Search, document store integration
- Vespa — Hybrid serving, real-time indexing
store, _ := vectorstore.New("pgvector", vectorstore.Config{
    ConnectionString: "postgres://localhost/beluga",
    Collection:       "documents",
    Dimensions:       1536,
})
Advanced Retrieval Strategies
Five strategies for production RAG systems that go beyond basic retrieve-and-generate:
- CRAG (Corrective RAG) — Evaluates retrieved document relevance; falls back to web search when confidence is below threshold.
- Adaptive RAG — Routes by query complexity: no retrieval for simple factual questions, single-step for straightforward lookups, multi-step for complex reasoning chains.
- HyDE (Hypothetical Document Embeddings) — Generates a hypothetical answer first, then uses its embedding for retrieval. Enables zero-shot retrieval without training data.
- SEAL-RAG — Self-Aligned RAG that iteratively refines retrieval and generation.
- GraphRAG — Builds a knowledge graph with community summaries (Microsoft approach) for complex multi-hop questions.
retriever := rag.NewAdaptiveRetriever(
    rag.WithSimpleHandler(directLLM),      // No retrieval needed
    rag.WithSingleStep(hybridRetriever),   // Standard RAG
    rag.WithMultiStep(iterativeRetriever), // Multi-hop reasoning
    rag.WithComplexityClassifier(classifier),
)
Contextual Retrieval Ingestion
During document ingestion, each chunk is enriched with document-level context before embedding. An LLM prepends a brief summary describing how the chunk fits within the larger document, significantly improving retrieval accuracy for chunks that would otherwise lack sufficient context on their own.
pipeline := rag.NewIngestionPipeline(
    rag.WithLoader(loader),
    rag.WithSplitter(splitter),
    rag.WithContextualRetrieval(model), // Prepend doc-level context
    rag.WithEmbedder(embedder),
    rag.WithStore(vectorStore),
)
Document Loaders
Eight document loaders for ingesting content from diverse sources. Each returns a stream of Document objects with metadata preserved for downstream filtering.
- Firecrawl — Web scraping with JavaScript rendering
- Unstructured.io — PDF, DOCX, PPTX, HTML parsing
- Docling — Advanced document understanding
- Confluence — Atlassian wiki pages and spaces
- Notion — Notion pages and databases
- GitHub — Repository files and README content
- Google Drive — Docs, Sheets, and file content
- S3/GCS — Cloud object storage files
loader, _ := loader.New("firecrawl", loader.Config{
    APIKey: os.Getenv("FIRECRAWL_API_KEY"),
})
docs, err := loader.Load(ctx, "https://example.com/docs")
Text Splitters
Three splitting strategies to divide documents into chunks optimized for embedding and retrieval. Recursive character splits by hierarchy (paragraph, sentence, word) with configurable overlap. Semantic splits at topic boundaries detected by embedding similarity. Token-based splits by exact token count for precise context budget control.
splitter := splitter.NewRecursive(
    splitter.WithChunkSize(512),
    splitter.WithChunkOverlap(50),
    splitter.WithSeparators([]string{"\n\n", "\n", ". ", " "}),
)
chunks := splitter.Split(documents)
Cross-Encoder Reranking
The final stage of the retrieval pipeline uses cross-encoder models for maximum precision. Unlike bi-encoder embeddings that encode query and document independently, cross-encoders process the query-document pair together, capturing fine-grained relevance signals that dramatically improve top-k accuracy.
reranker, _ := reranker.New("cross-encoder", reranker.Config{
    Model: "cross-encoder/ms-marco-MiniLM-L-12-v2",
    TopK:  10,
})
reranked, err := reranker.Rerank(ctx, query, candidates)
Architecture
Providers & Implementations
Embedding Providers
| Provider | Priority | Key Differentiator |
|---|---|---|
| OpenAI | Core | text-embedding-3-small/large, industry standard |
| Google | Core | text-embedding-004, Gecko, multimodal |
| Ollama | Core | Local inference, nomic-embed, mxbai-embed |
| Cohere | Extended | embed-v3, multilingual, search-optimized |
| Voyage | Extended | voyage-3, code-optimized, high-quality retrieval |
| Jina | Extended | jina-embeddings-v3, multilingual, cross-lingual |
| Mistral | Extended | mistral-embed, EU-hosted |
| Sentence Transformers | Community | Local ONNX inference, no API dependency |
Vector Stores
| Provider | Priority | Key Differentiator |
|---|---|---|
| pgvector | Core | PostgreSQL extension, HNSW/IVFFlat, use existing Postgres |
| Qdrant | Core | Purpose-built, advanced filtering, hybrid search native |
| Pinecone | Core | Managed cloud, serverless option, zero ops |
| ChromaDB | Extended | Developer-friendly, embedded or client-server mode |
| Weaviate | Extended | Graph + vector hybrid, built-in BM25 |
| Milvus | Extended | Distributed, billion-scale, GPU-accelerated |
| Turbopuffer | Extended | Serverless, cost-optimized storage |
| Redis | Extended | In-memory speed, RediSearch integration |
| Elasticsearch | Extended | Full-text + vector, leverage existing infrastructure |
| SQLite-vec | Community | Embedded, zero-dependency, local development |
| MongoDB | Community | Atlas Vector Search, document store integration |
| Vespa | Community | Hybrid serving engine, real-time indexing |
Document Loaders
| Loader | Priority | Key Differentiator |
|---|---|---|
| Firecrawl | Core | Web scraping with JavaScript rendering and crawling |
| Unstructured.io | Core | PDF, DOCX, PPTX, HTML, images — multi-format parsing |
| Docling | Extended | Advanced document understanding and layout analysis |
| Confluence | Extended | Atlassian wiki pages, spaces, and attachments |
| Notion | Extended | Pages, databases, and rich content blocks |
| GitHub | Extended | Repository files, READMEs, issues, and PRs |
| Google Drive | Community | Docs, Sheets, Slides, and file storage |
| S3/GCS | Community | Cloud object storage with prefix filtering |
Full Example
A complete RAG pipeline: load documents, split, embed, store, retrieve, and stream an answer:
package main

import (
    "context"
    "fmt"
    "os"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/rag/embedding"
    "github.com/lookatitude/beluga-ai/rag/loader"
    "github.com/lookatitude/beluga-ai/rag/retriever"
    "github.com/lookatitude/beluga-ai/rag/splitter"
    "github.com/lookatitude/beluga-ai/rag/vectorstore"
    "github.com/lookatitude/beluga-ai/schema"
)

func main() {
    ctx := context.Background()

    // 1. Load documents from a website
    webLoader, _ := loader.New("firecrawl", loader.Config{
        APIKey: os.Getenv("FIRECRAWL_API_KEY"),
    })
    docs, _ := webLoader.Load(ctx, "https://docs.example.com")

    // 2. Split into chunks with overlap
    chunks := splitter.NewRecursive(
        splitter.WithChunkSize(512),
        splitter.WithChunkOverlap(50),
    ).Split(docs)

    // 3. Embed and store in pgvector
    embedder, _ := embedding.New("openai", embedding.Config{
        Model: "text-embedding-3-large",
    })
    store, _ := vectorstore.New("pgvector", vectorstore.Config{
        ConnectionString: os.Getenv("DATABASE_URL"),
        Collection:       "docs",
    })
    store.AddDocuments(ctx, chunks, embedder)

    // 4. Build a hybrid retriever
    ret := retriever.NewHybrid(
        retriever.WithDense(store, embedder),
        retriever.WithReranker(retriever.CrossEncoder("ms-marco-MiniLM")),
        retriever.WithTopK(5),
    )

    // 5. Retrieve and generate a streaming answer
    model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})
    results, _ := ret.Retrieve(ctx, "How do I configure authentication?")

    // Build context from retrieved chunks
    contextText := ""
    for _, doc := range results {
        contextText += doc.Content + "\n---\n"
    }

    for event, err := range model.Stream(ctx, []schema.Message{
        {Role: "system", Content: "Answer using the provided context:\n" + contextText},
        {Role: "user", Content: "How do I configure authentication?"},
    }) {
        if err != nil {
            break
        }
        fmt.Print(event.Text())
    }
}