RAG Pipeline Guide

Language models generate answers from their training data, but they cannot access your private documents, recent data, or domain-specific knowledge. Retrieval-Augmented Generation (RAG) solves this by fetching relevant documents at query time and injecting them into the LLM’s context window. The model then generates answers grounded in your actual data rather than relying on potentially outdated or hallucinated information.

The rag/ package provides a complete, modular pipeline for building RAG systems. Each stage — loading, splitting, embedding, storing, and retrieving — is a separate package with its own interface, registry, and providers. This decomposition lets you swap any component independently: change your vector database without touching your embedding logic, or upgrade your retrieval strategy without modifying your document pipeline.

graph LR
  subgraph Indexing
    A[Documents] --> B[Loader] --> C[Splitter] --> D[Embedder] --> E[VectorStore]
  end
  subgraph Query
    F[Query] --> G[Embedder] --> H[Retriever] --> I[Relevant Docs] --> J[LLM] --> K[Response]
  end
  E -.-> H

The five stages and their packages:

Package          Interface       Purpose
rag/loader       DocumentLoader  Load content from files, URLs, APIs
rag/splitter     TextSplitter    Chunk documents for embedding
rag/embedding    Embedder        Convert text to vectors
rag/vectorstore  VectorStore     Store and search embeddings
rag/retriever    Retriever       Find relevant documents

The first step in any RAG pipeline is getting your data into a structured format. Document loaders read content from various sources — files, URLs, APIs — and produce schema.Document values with both content and metadata. The registry pattern means you can add new loader types (databases, cloud storage, custom APIs) without modifying existing code.

Load documents from various sources:

import (
	"log"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/rag/loader"
)

// Load a text file
textLoader, err := loader.New("text", config.ProviderConfig{})
if err != nil {
	log.Fatal(err)
}
docs, err := textLoader.Load(ctx, "/path/to/document.txt")
if err != nil {
	log.Fatal(err)
}

Loader    Format      Description
text      Plain text  Simple file loading
json      JSON        Configurable path extraction
csv       CSV         One document per row
markdown  Markdown    Structure-aware loading

After loading, you often need to enrich documents with additional metadata for filtering and auditing. Transformers let you add source attribution, timestamps, or any custom metadata before documents enter the splitting stage. This metadata is preserved through splitting and stored alongside embeddings, enabling filtered searches later.

// Add metadata to every document
addSource := loader.TransformerFunc(func(ctx context.Context, doc schema.Document) (schema.Document, error) {
	if doc.Metadata == nil {
		doc.Metadata = make(map[string]any)
	}
	doc.Metadata["source"] = "internal-docs"
	doc.Metadata["loaded_at"] = time.Now().Format(time.RFC3339)
	return doc, nil
})

Embedding models have token limits — typically 512 to 8192 tokens depending on the model. Documents that exceed these limits must be split into smaller chunks. But splitting is not just about fitting token budgets: smaller, focused chunks improve retrieval precision because each chunk’s embedding captures a narrower semantic meaning, making it easier to match against specific queries.

The chunk_overlap parameter controls how many characters overlap between adjacent chunks. Overlap prevents information loss at split boundaries — without it, a sentence that spans two chunks would be cut in half, and neither chunk would contain the complete thought. An overlap of 10-20% of the chunk size is typically sufficient to preserve context.
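To make the overlap mechanics concrete, here is a minimal character-based chunker, not the library's recursive splitter, just an illustrative sketch (the `chunk` helper is hypothetical): each new chunk starts `overlap` characters before the previous one ended, so text spanning a boundary appears whole in at least one chunk.

```go
package main

import "fmt"

// chunk splits text into fixed-size pieces; each chunk after the first
// re-includes the last `overlap` characters of its predecessor.
// overlap must be smaller than size for the window to advance.
func chunk(text string, size, overlap int) []string {
	var chunks []string
	step := size - overlap
	if step <= 0 {
		step = size
	}
	for start := 0; start < len(text); start += step {
		end := start + size
		if end > len(text) {
			end = len(text)
		}
		chunks = append(chunks, text[start:end])
		if end == len(text) {
			break
		}
	}
	return chunks
}

func main() {
	// With size 4 and overlap 2, adjacent chunks share two characters.
	for _, c := range chunk("abcdefghij", 4, 2) {
		fmt.Println(c) // abcd, cdef, efgh, ghij
	}
}
```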

Split documents into chunks optimized for embedding:

import (
	"log"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/rag/splitter"
)

s, err := splitter.New("recursive", config.ProviderConfig{
	Options: map[string]any{
		"chunk_size":    1000,
		"chunk_overlap": 200,
	},
})
if err != nil {
	log.Fatal(err)
}

// Split raw text
chunks, err := s.Split(ctx, longText)

// Or split documents (preserves metadata)
chunkedDocs, err := s.SplitDocuments(ctx, docs)

SplitDocuments preserves the original metadata and adds chunk_index, chunk_total, and parent_id to each chunk.

Splitter   Strategy                        Best For
recursive  Recursive character boundaries  General-purpose text
markdown   Heading hierarchy               Markdown documents
token      Token-based boundaries          Precise token-budget chunks

Embeddings convert text into dense vector representations where semantically similar texts are close together in vector space. This is the core mechanism that enables semantic search — finding documents by meaning rather than exact keyword matches. The embedding model you choose affects both the quality of retrieval and the dimensionality (and therefore storage cost) of your vectors.

Convert text to vector representations:

import (
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/rag/embedding"
	_ "github.com/lookatitude/beluga-ai/rag/embedding/providers/openai"
)

embedder, err := embedding.New("openai", embedding.ProviderConfig{
	APIKey: os.Getenv("OPENAI_API_KEY"),
	Model:  "text-embedding-3-small",
})
if err != nil {
	log.Fatal(err)
}

// Embed a batch of texts
vectors, err := embedder.Embed(ctx, []string{"hello world", "goodbye world"})

// Embed a single text
vec, err := embedder.EmbedSingle(ctx, "search query")

// Check dimensions
fmt.Println("Dimensions:", embedder.Dimensions())

Provider               Import Path                                    Models
OpenAI                 rag/embedding/providers/openai                 text-embedding-3-small, text-embedding-3-large
Google                 rag/embedding/providers/google                 text-embedding-004
Cohere                 rag/embedding/providers/cohere                 embed-english-v3.0
Voyage                 rag/embedding/providers/voyage                 voyage-3
Mistral                rag/embedding/providers/mistral                mistral-embed
Jina                   rag/embedding/providers/jina                   jina-embeddings-v3
Ollama                 rag/embedding/providers/ollama                 Local models
Sentence Transformers  rag/embedding/providers/sentence_transformers  Local models
In-Memory              rag/embedding/providers/inmemory               Test/dev (random vectors)
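To make "close together in vector space" concrete, similarity between embeddings is typically measured with cosine similarity: the dot product of two vectors divided by the product of their magnitudes. The sketch below is illustrative, not a library API (the `cosine` helper is hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors:
// dot(a,b) / (|a|·|b|). 1 means same direction, 0 means orthogonal.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosine([]float64{1, 0}, []float64{1, 0})) // 1: identical
	fmt.Println(cosine([]float64{1, 0}, []float64{0, 1})) // 0: unrelated
}
```

Real embeddings have hundreds or thousands of dimensions, which is why higher-dimensional models cost more to store and search.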

Vector stores persist embeddings and support efficient similarity search over them. When a query arrives, it is embedded using the same model, and the vector store finds the nearest neighbors — the documents most semantically similar to the query. Different backends offer different trade-offs between latency, scalability, filtering capabilities, and operational complexity.

Store and search embeddings:

import (
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/rag/vectorstore"
	_ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/pgvector"
)

store, err := vectorstore.New("pgvector", vectorstore.ProviderConfig{
	ConnectionString: os.Getenv("DATABASE_URL"),
})
if err != nil {
	log.Fatal(err)
}

// Add documents with embeddings
err = store.Add(ctx, chunkedDocs, vectors)

// Search for similar documents
queryVec, err := embedder.EmbedSingle(ctx, "What is Go?")
results, err := store.Search(ctx, queryVec, 10,
	vectorstore.WithThreshold(0.7),
	vectorstore.WithFilter(map[string]any{"source": "internal-docs"}),
)

Option            Description
WithThreshold(t)  Minimum similarity score (0.0–1.0)
WithFilter(meta)  Match metadata key-value pairs
WithStrategy(s)   Distance metric: Cosine, DotProduct, Euclidean

Provider       Import Path                              Type
In-Memory      rag/vectorstore/providers/inmemory       Development/testing
pgvector       rag/vectorstore/providers/pgvector       PostgreSQL extension
Pinecone       rag/vectorstore/providers/pinecone       Managed cloud
Qdrant         rag/vectorstore/providers/qdrant         Open-source
Weaviate       rag/vectorstore/providers/weaviate       Open-source
Milvus         rag/vectorstore/providers/milvus         Open-source
Chroma         rag/vectorstore/providers/chroma         Open-source
Redis          rag/vectorstore/providers/redis          Redis Stack
Elasticsearch  rag/vectorstore/providers/elasticsearch  Elastic
MongoDB        rag/vectorstore/providers/mongodb        Atlas Vector Search
SQLite-vec     rag/vectorstore/providers/sqlitevec      Embedded
Vespa          rag/vectorstore/providers/vespa          Enterprise search
Turbopuffer    rag/vectorstore/providers/turbopuffer    Serverless

The Retriever interface abstracts the search step, decoupling your application from specific vector store implementations and search strategies. Retrievers can combine multiple backends, apply reranking, or implement advanced strategies like CRAG and HyDE. This abstraction is where the most impactful RAG quality improvements happen — choosing the right retrieval strategy often matters more than choosing the right embedding model.

import "github.com/lookatitude/beluga-ai/rag/retriever"

docs, err := r.Retrieve(ctx, "What is quantum computing?",
	retriever.WithTopK(5),
	retriever.WithThreshold(0.7),
	retriever.WithMetadata(map[string]any{"topic": "physics"}),
)

Strategy  Description                          When to Use
vector    Pure vector similarity search        Simple use cases
hybrid    Vector + BM25 with RRF fusion        Recommended default
crag      Corrective RAG with quality grading  Quality-critical applications
hyde      Hypothetical Document Embeddings     Sparse-data domains
adaptive  Adjusts strategy based on query      Variable query patterns
ensemble  Combines multiple retriever outputs  Maximum recall

Pure vector search excels at finding semantically similar content but can miss documents that contain the exact keywords a user is looking for. Conversely, BM25 keyword matching finds exact term matches but misses paraphrases and synonyms. Hybrid search combines both signals using Reciprocal Rank Fusion (RRF), which merges the ranked results from each method into a single list. This is the recommended default because it handles both precise keyword queries (“error code 404”) and conceptual queries (“how to handle missing pages”) effectively.

hybridRetriever, err := retriever.New("hybrid", retriever.ProviderConfig{
	Options: map[string]any{
		"vector_store":  store,
		"embedder":      embedder,
		"bm25_weight":   0.3,
		"vector_weight": 0.7,
	},
})
docs, err := hybridRetriever.Retrieve(ctx, "Go concurrency patterns",
	retriever.WithTopK(10),
)
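RRF itself is simple enough to sketch. In the standalone example below (the `rrfFuse` helper is hypothetical, not a library API), each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60 to damp the influence of top ranks:

```go
package main

import (
	"fmt"
	"sort"
)

// rrfFuse merges several ranked result lists with Reciprocal Rank Fusion:
// a document at rank r (1-based) in a list contributes 1/(k+r) to its score.
// Documents appearing in multiple lists accumulate score and rise.
func rrfFuse(rankings [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	return ids
}

func main() {
	vector := []string{"docA", "docB", "docC"} // vector-search ranking
	bm25 := []string{"docC", "docA", "docD"}   // keyword (BM25) ranking
	// docA and docC appear in both lists, so they outrank docB and docD.
	fmt.Println(rrfFuse([][]string{vector, bm25}, 60)) // [docA docC docB docD]
}
```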

A fundamental problem with naive RAG is that retrieved documents may be irrelevant to the query. When an LLM receives irrelevant context, it often generates plausible-sounding but incorrect answers — a form of hallucination. Corrective RAG addresses this by using an LLM to grade each retrieved document for relevance before passing it to the generation step. Documents below the confidence threshold are discarded, and if too few relevant documents remain, CRAG can trigger a web search as a fallback. This quality-gating step significantly reduces hallucination in production systems.

cragRetriever, err := retriever.New("crag", retriever.ProviderConfig{
	Options: map[string]any{
		"base_retriever": baseRetriever,
		"grader_llm":     model,
		"threshold":      0.6,
	},
})

Short or vague user queries often produce poor embeddings because there is not enough semantic content to capture the user’s intent. For example, the query “auth” generates a very different embedding than a paragraph explaining authentication flows. HyDE solves this by first asking an LLM to generate a hypothetical document that would answer the query, then embedding that hypothetical answer instead of the raw query. The hypothetical document’s embedding is much closer in vector space to the actual relevant documents, dramatically improving recall for sparse-data domains and terse queries.

hydeRetriever, err := retriever.New("hyde", retriever.ProviderConfig{
	Options: map[string]any{
		"base_retriever": baseRetriever,
		"llm":            model,
		"embedder":       embedder,
	},
})
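The HyDE flow reduces to three steps: generate, embed, search. This sketch wires those steps with stub functions so it runs standalone; the `hydeRetrieve` helper and all stubs are illustrative, while the real pipeline uses the configured LLM, embedder, and base retriever:

```go
package main

import (
	"fmt"
	"strings"
)

// hydeRetrieve sketches the HyDE flow: draft a hypothetical answer to the
// query, embed the draft instead of the raw query, then search with that
// vector. The draft's embedding sits closer to real answer documents.
func hydeRetrieve(
	query string,
	generate func(string) string,
	embed func(string) []float64,
	search func([]float64) []string,
) []string {
	hypothetical := generate(query)
	vec := embed(hypothetical)
	return search(vec)
}

func main() {
	// Stub components standing in for the LLM, embedder, and retriever.
	generate := func(q string) string { return "A document answering the question: " + q }
	embed := func(t string) []float64 { return []float64{float64(len(strings.Fields(t)))} }
	search := func(v []float64) []string { return []string{fmt.Sprintf("nearest doc to vector %v", v)} }

	fmt.Println(hydeRetrieve("auth", generate, embed, search))
}
```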

The following example demonstrates the full RAG pipeline from end to end: loading a text file, splitting it into chunks, embedding the chunks, storing them in an in-memory vector database, retrieving relevant context for a query, and generating an answer with an LLM. In production, you would replace the in-memory store with a persistent backend like pgvector or Pinecone.

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/llm"
	"github.com/lookatitude/beluga-ai/rag/embedding"
	"github.com/lookatitude/beluga-ai/rag/loader"
	"github.com/lookatitude/beluga-ai/rag/splitter"
	"github.com/lookatitude/beluga-ai/rag/vectorstore"
	"github.com/lookatitude/beluga-ai/schema"

	_ "github.com/lookatitude/beluga-ai/llm/providers/openai"
	_ "github.com/lookatitude/beluga-ai/rag/embedding/providers/openai"
	_ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/inmemory"
)

func main() {
	ctx := context.Background()

	// 1. Load documents
	l, err := loader.New("text", config.ProviderConfig{})
	if err != nil {
		log.Fatal(err)
	}
	docs, err := l.Load(ctx, "knowledge-base.txt")
	if err != nil {
		log.Fatal(err)
	}

	// 2. Split into chunks
	s, err := splitter.New("recursive", config.ProviderConfig{
		Options: map[string]any{"chunk_size": 500, "chunk_overlap": 50},
	})
	if err != nil {
		log.Fatal(err)
	}
	chunks, err := s.SplitDocuments(ctx, docs)
	if err != nil {
		log.Fatal(err)
	}

	// 3. Embed chunks
	emb, err := embedding.New("openai", embedding.ProviderConfig{
		APIKey: os.Getenv("OPENAI_API_KEY"),
		Model:  "text-embedding-3-small",
	})
	if err != nil {
		log.Fatal(err)
	}
	texts := make([]string, len(chunks))
	for i, c := range chunks {
		texts[i] = c.Content
	}
	vectors, err := emb.Embed(ctx, texts)
	if err != nil {
		log.Fatal(err)
	}

	// 4. Store in vector database
	store, err := vectorstore.New("inmemory", vectorstore.ProviderConfig{})
	if err != nil {
		log.Fatal(err)
	}
	if err = store.Add(ctx, chunks, vectors); err != nil {
		log.Fatal(err)
	}

	// 5. Retrieve relevant context
	query := "How does error handling work?"
	queryVec, err := emb.EmbedSingle(ctx, query)
	if err != nil {
		log.Fatal(err)
	}
	relevant, err := store.Search(ctx, queryVec, 5)
	if err != nil {
		log.Fatal(err)
	}

	// 6. Generate answer with context
	model, err := llm.New("openai", llm.ProviderConfig{
		APIKey: os.Getenv("OPENAI_API_KEY"),
		Model:  "gpt-4o",
	})
	if err != nil {
		log.Fatal(err)
	}
	contextStr := ""
	for _, doc := range relevant {
		contextStr += doc.Content + "\n\n"
	}
	msgs := []schema.Message{
		schema.NewSystemMessage("Answer the question using the provided context. If unsure, say so."),
		schema.NewHumanMessage(fmt.Sprintf("Context:\n%s\nQuestion: %s", contextStr, query)),
	}
	resp, err := model.Generate(ctx, msgs)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Text())
}

Beluga AI uses the hooks pattern across all subsystems for lifecycle observation without wrapping. Retriever hooks let you log queries, measure latency, audit which documents were retrieved, and track reranking behavior. Hooks are optional function fields — any nil hook is simply skipped, so you only pay for the observation you need.

hooks := retriever.Hooks{
	BeforeRetrieve: func(ctx context.Context, query string) error {
		log.Printf("Retrieving for: %q", query)
		return nil
	},
	AfterRetrieve: func(ctx context.Context, docs []schema.Document, err error) {
		log.Printf("Found %d documents", len(docs))
	},
	OnRerank: func(ctx context.Context, query string, before, after []schema.Document) {
		log.Printf("Reranked: %d -> %d documents", len(before), len(after))
	},
}