# RAG Pipeline Guide
Language models generate answers from their training data, but they cannot access your private documents, recent data, or domain-specific knowledge. Retrieval-Augmented Generation (RAG) solves this by fetching relevant documents at query time and injecting them into the LLM’s context window. The model then generates answers grounded in your actual data rather than relying on potentially outdated or hallucinated information.
The rag/ package provides a complete, modular pipeline for building RAG systems. Each stage — loading, splitting, embedding, storing, and retrieving — is a separate package with its own interface, registry, and providers. This decomposition lets you swap any component independently: change your vector database without touching your embedding logic, or upgrade your retrieval strategy without modifying your document pipeline.
## Pipeline Architecture
```mermaid
graph LR
    subgraph Indexing
        A[Documents] --> B[Loader] --> C[Splitter] --> D[Embedder] --> E[VectorStore]
    end
    subgraph Query
        F[Query] --> G[Embedder] --> H[Retriever] --> I[Relevant Docs] --> J[LLM] --> K[Response]
    end
    E -.-> H
```
Each stage is a separate package with its own interface, registry, and providers:
| Package | Interface | Purpose |
|---|---|---|
| rag/loader | DocumentLoader | Load content from files, URLs, APIs |
| rag/splitter | TextSplitter | Chunk documents for embedding |
| rag/embedding | Embedder | Convert text to vectors |
| rag/vectorstore | VectorStore | Store and search embeddings |
| rag/retriever | Retriever | Find relevant documents |
## Document Loading
The first step in any RAG pipeline is getting your data into a structured format. Document loaders read content from various sources — files, URLs, APIs — and produce schema.Document values with both content and metadata. The registry pattern means you can add new loader types (databases, cloud storage, custom APIs) without modifying existing code.
Load documents from various sources:
import ( "github.com/lookatitude/beluga-ai/config" "github.com/lookatitude/beluga-ai/rag/loader")
// Load a text filetextLoader, err := loader.New("text", config.ProviderConfig{})if err != nil { log.Fatal(err)}
docs, err := textLoader.Load(ctx, "/path/to/document.txt")if err != nil { log.Fatal(err)}Built-in Loaders
| Loader | Format | Description |
|---|---|---|
| text | Plain text | Simple file loading |
| json | JSON | Configurable path extraction |
| csv | CSV | One document per row |
| markdown | Markdown | Structure-aware loading |
## Document Transformers
After loading, you often need to enrich documents with additional metadata for filtering and auditing. Transformers let you add source attribution, timestamps, or any custom metadata before documents enter the splitting stage. This metadata is preserved through splitting and stored alongside embeddings, enabling filtered searches later.
```go
// Add metadata to every document
addSource := loader.TransformerFunc(func(ctx context.Context, doc schema.Document) (schema.Document, error) {
    if doc.Metadata == nil {
        doc.Metadata = make(map[string]any)
    }
    doc.Metadata["source"] = "internal-docs"
    doc.Metadata["loaded_at"] = time.Now().Format(time.RFC3339)
    return doc, nil
})
```

## Text Splitting
Embedding models have token limits — typically 512 to 8192 tokens depending on the model. Documents that exceed these limits must be split into smaller chunks. But splitting is not just about fitting token budgets: smaller, focused chunks improve retrieval precision because each chunk’s embedding captures a narrower semantic meaning, making it easier to match against specific queries.
The chunk_overlap parameter controls how many characters overlap between adjacent chunks. Overlap prevents information loss at split boundaries — without it, a sentence that spans two chunks would be cut in half, and neither chunk would contain the complete thought. An overlap of 10-20% of the chunk size is typically sufficient to preserve context.
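To make the overlap arithmetic concrete, here is a minimal character-based sketch of overlapping windows. It is not the rag/splitter implementation (the recursive splitter also respects natural text boundaries); it only illustrates how each chunk carries the tail of the previous one:

```go
package main

import "fmt"

// splitWithOverlap slices text into windows of chunkSize characters,
// advancing by chunkSize-overlap so adjacent chunks share context.
func splitWithOverlap(text string, chunkSize, overlap int) []string {
    var chunks []string
    step := chunkSize - overlap
    for start := 0; start < len(text); start += step {
        end := start + chunkSize
        if end > len(text) {
            end = len(text)
        }
        chunks = append(chunks, text[start:end])
        if end == len(text) {
            break
        }
    }
    return chunks
}

func main() {
    // With chunk_size=10 and chunk_overlap=2, the last 2 characters of
    // each chunk reappear at the start of the next.
    for _, c := range splitWithOverlap("abcdefghijklmnopqrstuvwxyz", 10, 2) {
        fmt.Println(c)
    }
}
```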
Split documents into chunks optimized for embedding:
import ( "github.com/lookatitude/beluga-ai/config" "github.com/lookatitude/beluga-ai/rag/splitter")
s, err := splitter.New("recursive", config.ProviderConfig{ Options: map[string]any{ "chunk_size": 1000, "chunk_overlap": 200, },})if err != nil { log.Fatal(err)}
// Split raw textchunks, err := s.Split(ctx, longText)
// Or split documents (preserves metadata)chunkedDocs, err := s.SplitDocuments(ctx, docs)SplitDocuments preserves the original metadata and adds chunk_index, chunk_total, and parent_id to each chunk.
## Built-in Splitters
| Splitter | Strategy | Best For |
|---|---|---|
| recursive | Recursive character boundaries | General-purpose text |
| markdown | Heading hierarchy | Markdown documents |
| token | Token-based boundaries | Precise token-budget chunks |
## Embeddings
Embeddings convert text into dense vector representations where semantically similar texts are close together in vector space. This is the core mechanism that enables semantic search — finding documents by meaning rather than exact keyword matches. The embedding model you choose affects both the quality of retrieval and the dimensionality (and therefore storage cost) of your vectors.
Convert text to vector representations:
import ( "github.com/lookatitude/beluga-ai/rag/embedding" _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/openai")
embedder, err := embedding.New("openai", embedding.ProviderConfig{ APIKey: os.Getenv("OPENAI_API_KEY"), Model: "text-embedding-3-small",})if err != nil { log.Fatal(err)}
// Embed a batch of textsvectors, err := embedder.Embed(ctx, []string{"hello world", "goodbye world"})
// Embed a single textvec, err := embedder.EmbedSingle(ctx, "search query")
// Check dimensionsfmt.Println("Dimensions:", embedder.Dimensions())Embedding Providers
| Provider | Import Path | Models |
|---|---|---|
| OpenAI | rag/embedding/providers/openai | text-embedding-3-small, text-embedding-3-large |
| Google | rag/embedding/providers/google | text-embedding-004 |
| Cohere | rag/embedding/providers/cohere | embed-english-v3.0 |
| Voyage | rag/embedding/providers/voyage | voyage-3 |
| Mistral | rag/embedding/providers/mistral | mistral-embed |
| Jina | rag/embedding/providers/jina | jina-embeddings-v3 |
| Ollama | rag/embedding/providers/ollama | Local models |
| Sentence Transformers | rag/embedding/providers/sentence_transformers | Local models |
| In-Memory | rag/embedding/providers/inmemory | Test/dev (random vectors) |
## Vector Store
Vector stores persist embeddings and support efficient similarity search over them. When a query arrives, it is embedded using the same model, and the vector store finds the nearest neighbors — the documents most semantically similar to the query. Different backends offer different trade-offs between latency, scalability, filtering capabilities, and operational complexity.
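Conceptually, nearest-neighbor search reduces to “score the query against every stored vector and keep the k best.” The brute-force sketch below shows that idea; production backends avoid the full scan with approximate indexes (HNSW, IVF, and similar), which is a key part of the trade-offs mentioned above. The doc IDs and two-dimensional vectors are invented for illustration:

```go
package main

import (
    "fmt"
    "sort"
)

type hit struct {
    id    string
    score float64
}

// dot is a stand-in similarity score; on unit-normalized vectors the
// dot product equals cosine similarity.
func dot(a, b []float64) float64 {
    var s float64
    for i := range a {
        s += a[i] * b[i]
    }
    return s
}

// topK scans every stored vector and returns the k highest-scoring IDs.
func topK(store map[string][]float64, query []float64, k int) []hit {
    hits := make([]hit, 0, len(store))
    for id, vec := range store {
        hits = append(hits, hit{id, dot(query, vec)})
    }
    sort.Slice(hits, func(i, j int) bool { return hits[i].score > hits[j].score })
    if len(hits) > k {
        hits = hits[:k]
    }
    return hits
}

func main() {
    store := map[string][]float64{
        "doc-a": {1, 0},
        "doc-b": {0.7, 0.7},
        "doc-c": {0, 1},
    }
    for _, h := range topK(store, []float64{1, 0}, 2) {
        fmt.Printf("%s %.2f\n", h.id, h.score)
    }
}
```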
Store and search embeddings:
import ( "github.com/lookatitude/beluga-ai/rag/vectorstore" _ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/pgvector")
store, err := vectorstore.New("pgvector", vectorstore.ProviderConfig{ ConnectionString: os.Getenv("DATABASE_URL"),})if err != nil { log.Fatal(err)}
// Add documents with embeddingserr = store.Add(ctx, chunkedDocs, vectors)
// Search for similar documentsqueryVec, err := embedder.EmbedSingle(ctx, "What is Go?")results, err := store.Search(ctx, queryVec, 10, vectorstore.WithThreshold(0.7), vectorstore.WithFilter(map[string]any{"source": "internal-docs"}),)Search Options
| Option | Description |
|---|---|
| WithThreshold(t) | Minimum similarity score (0.0–1.0) |
| WithFilter(meta) | Match metadata key-value pairs |
| WithStrategy(s) | Distance metric: Cosine, DotProduct, Euclidean |
## Vector Store Providers
| Provider | Import Path | Type |
|---|---|---|
| In-Memory | rag/vectorstore/providers/inmemory | Development/testing |
| pgvector | rag/vectorstore/providers/pgvector | PostgreSQL extension |
| Pinecone | rag/vectorstore/providers/pinecone | Managed cloud |
| Qdrant | rag/vectorstore/providers/qdrant | Open-source |
| Weaviate | rag/vectorstore/providers/weaviate | Open-source |
| Milvus | rag/vectorstore/providers/milvus | Open-source |
| Chroma | rag/vectorstore/providers/chroma | Open-source |
| Redis | rag/vectorstore/providers/redis | Redis Stack |
| Elasticsearch | rag/vectorstore/providers/elasticsearch | Elastic |
| MongoDB | rag/vectorstore/providers/mongodb | Atlas Vector Search |
| SQLite-vec | rag/vectorstore/providers/sqlitevec | Embedded |
| Vespa | rag/vectorstore/providers/vespa | Enterprise search |
| Turbopuffer | rag/vectorstore/providers/turbopuffer | Serverless |
## Retriever
The Retriever interface abstracts the search step, decoupling your application from specific vector store implementations and search strategies. Retrievers can combine multiple backends, apply reranking, or implement advanced strategies like CRAG and HyDE. This abstraction is where the most impactful RAG quality improvements happen — choosing the right retrieval strategy often matters more than choosing the right embedding model.
import "github.com/lookatitude/beluga-ai/rag/retriever"
docs, err := r.Retrieve(ctx, "What is quantum computing?", retriever.WithTopK(5), retriever.WithThreshold(0.7), retriever.WithMetadata(map[string]any{"topic": "physics"}),)Retrieval Strategies
| Strategy | Description | When to Use |
|---|---|---|
| vector | Pure vector similarity search | Simple use cases |
| hybrid | Vector + BM25 with RRF fusion | Recommended default |
| crag | Corrective RAG with quality grading | Quality-critical applications |
| hyde | Hypothetical Document Embeddings | Sparse-data domains |
| adaptive | Adjusts strategy based on query | Variable query patterns |
| ensemble | Combines multiple retriever outputs | Maximum recall |
## Hybrid Search (Recommended)
Pure vector search excels at finding semantically similar content but can miss documents that contain the exact keywords a user is looking for. Conversely, BM25 keyword matching finds exact term matches but misses paraphrases and synonyms. Hybrid search combines both signals using Reciprocal Rank Fusion (RRF), which merges the ranked results from each method into a single list. This is the recommended default because it handles both precise keyword queries (“error code 404”) and conceptual queries (“how to handle missing pages”) effectively.
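RRF itself is a small amount of arithmetic: in each ranked list, a document at rank r earns a score of 1/(k + r), with k conventionally set around 60, and the scores are summed across lists. The sketch below shows the fusion step in isolation; it is an illustration of the algorithm, not the hybrid retriever's internal code, and the doc IDs are made up:

```go
package main

import (
    "fmt"
    "sort"
)

// rrf fuses ranked result lists with Reciprocal Rank Fusion: each
// document earns 1/(k+rank) per list it appears in, so items ranked
// reasonably well by both signals beat items ranked high by only one.
func rrf(k float64, lists ...[]string) []string {
    scores := make(map[string]float64)
    for _, list := range lists {
        for rank, id := range list {
            scores[id] += 1.0 / (k + float64(rank+1))
        }
    }
    fused := make([]string, 0, len(scores))
    for id := range scores {
        fused = append(fused, id)
    }
    sort.Slice(fused, func(i, j int) bool { return scores[fused[i]] > scores[fused[j]] })
    return fused
}

func main() {
    vectorHits := []string{"doc-3", "doc-1", "doc-7"} // semantic ranking
    bm25Hits := []string{"doc-1", "doc-4", "doc-3"}   // keyword ranking

    // doc-1 wins: it appears near the top of both lists.
    fmt.Println(rrf(60, vectorHits, bm25Hits))
}
```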
```go
hybridRetriever, err := retriever.New("hybrid", retriever.ProviderConfig{
    Options: map[string]any{
        "vector_store":  store,
        "embedder":      embedder,
        "bm25_weight":   0.3,
        "vector_weight": 0.7,
    },
})

docs, err := hybridRetriever.Retrieve(ctx, "Go concurrency patterns",
    retriever.WithTopK(10),
)
```

## CRAG (Corrective RAG)
A fundamental problem with naive RAG is that retrieved documents may be irrelevant to the query. When an LLM receives irrelevant context, it often generates plausible-sounding but incorrect answers — a form of hallucination. Corrective RAG addresses this by using an LLM to grade each retrieved document for relevance before passing it to the generation step. Documents below the confidence threshold are discarded, and if too few relevant documents remain, CRAG can trigger a web search as a fallback. This quality-gating step significantly reduces hallucination in production systems.
```go
cragRetriever, err := retriever.New("crag", retriever.ProviderConfig{
    Options: map[string]any{
        "base_retriever": baseRetriever,
        "grader_llm":     model,
        "threshold":      0.6,
    },
})
```

## HyDE (Hypothetical Document Embeddings)
Short or vague user queries often produce poor embeddings because there is not enough semantic content to capture the user’s intent. For example, the query “auth” generates a very different embedding than a paragraph explaining authentication flows. HyDE solves this by first asking an LLM to generate a hypothetical document that would answer the query, then embedding that hypothetical answer instead of the raw query. The hypothetical document’s embedding is much closer in vector space to the actual relevant documents, dramatically improving recall for sparse-data domains and terse queries.
```go
hydeRetriever, err := retriever.New("hyde", retriever.ProviderConfig{
    Options: map[string]any{
        "base_retriever": baseRetriever,
        "llm":            model,
        "embedder":       embedder,
    },
})
```

## Complete Pipeline Example
The following example demonstrates the full RAG pipeline from end to end: loading a text file, splitting it into chunks, embedding the chunks, storing them in an in-memory vector database, retrieving relevant context for a query, and generating an answer with an LLM. In production, you would replace the in-memory store with a persistent backend like pgvector or Pinecone.
```go
package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/lookatitude/beluga-ai/config"
    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/rag/embedding"
    "github.com/lookatitude/beluga-ai/rag/loader"
    "github.com/lookatitude/beluga-ai/rag/splitter"
    "github.com/lookatitude/beluga-ai/rag/vectorstore"
    "github.com/lookatitude/beluga-ai/schema"

    _ "github.com/lookatitude/beluga-ai/llm/providers/openai"
    _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/openai"
    _ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/inmemory"
)

func main() {
    ctx := context.Background()

    // 1. Load documents
    l, err := loader.New("text", config.ProviderConfig{})
    if err != nil {
        log.Fatal(err)
    }
    docs, err := l.Load(ctx, "knowledge-base.txt")
    if err != nil {
        log.Fatal(err)
    }

    // 2. Split into chunks
    s, err := splitter.New("recursive", config.ProviderConfig{
        Options: map[string]any{"chunk_size": 500, "chunk_overlap": 50},
    })
    if err != nil {
        log.Fatal(err)
    }
    chunks, err := s.SplitDocuments(ctx, docs)
    if err != nil {
        log.Fatal(err)
    }

    // 3. Embed chunks
    emb, err := embedding.New("openai", embedding.ProviderConfig{
        APIKey: os.Getenv("OPENAI_API_KEY"),
        Model:  "text-embedding-3-small",
    })
    if err != nil {
        log.Fatal(err)
    }
    texts := make([]string, len(chunks))
    for i, c := range chunks {
        texts[i] = c.Content
    }
    vectors, err := emb.Embed(ctx, texts)
    if err != nil {
        log.Fatal(err)
    }

    // 4. Store in vector database
    store, err := vectorstore.New("inmemory", vectorstore.ProviderConfig{})
    if err != nil {
        log.Fatal(err)
    }
    if err := store.Add(ctx, chunks, vectors); err != nil {
        log.Fatal(err)
    }

    // 5. Retrieve relevant context
    query := "How does error handling work?"
    queryVec, err := emb.EmbedSingle(ctx, query)
    if err != nil {
        log.Fatal(err)
    }
    relevant, err := store.Search(ctx, queryVec, 5)
    if err != nil {
        log.Fatal(err)
    }

    // 6. Generate answer with context
    model, err := llm.New("openai", llm.ProviderConfig{
        APIKey: os.Getenv("OPENAI_API_KEY"),
        Model:  "gpt-4o",
    })
    if err != nil {
        log.Fatal(err)
    }

    contextStr := ""
    for _, doc := range relevant {
        contextStr += doc.Content + "\n\n"
    }

    msgs := []schema.Message{
        schema.NewSystemMessage("Answer the question using the provided context. If unsure, say so."),
        schema.NewHumanMessage(fmt.Sprintf("Context:\n%s\nQuestion: %s", contextStr, query)),
    }

    resp, err := model.Generate(ctx, msgs)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Text())
}
```

## Retriever Hooks
Beluga AI uses the hooks pattern across all subsystems for lifecycle observation without wrapping. Retriever hooks let you log queries, measure latency, audit which documents were retrieved, and track reranking behavior. Hooks are optional function fields — any nil hook is simply skipped, so you only pay for the observation you need.
```go
hooks := retriever.Hooks{
    BeforeRetrieve: func(ctx context.Context, query string) error {
        log.Printf("Retrieving for: %q", query)
        return nil
    },
    AfterRetrieve: func(ctx context.Context, docs []schema.Document, err error) {
        log.Printf("Found %d documents", len(docs))
    },
    OnRerank: func(ctx context.Context, query string, before, after []schema.Document) {
        log.Printf("Reranked: %d → %d documents", len(before), len(after))
    },
}
```

## Next Steps
- Working with LLMs — The ChatModel that generates answers
- Memory System — Persistent memory with vector search
- Tools & MCP — Give agents retrieval as a tool
- Monitoring & Observability — Trace RAG pipeline performance