# Document Ingestion Recipes
Document ingestion is the first stage of any RAG pipeline: getting documents from their source format into chunks that can be embedded and stored. The quality of ingestion directly affects retrieval quality downstream. Poorly split documents produce poor embeddings; missing metadata means no filtering capability; sequential loading wastes time on large collections.
## Problem

You need to load documents from various sources, split them into chunks, and prepare them for embedding and storage in a RAG pipeline. Different file types, directory structures, and processing strategies require flexible ingestion patterns.
## Solution

Use Beluga AI’s document loaders and text splitters to build composable ingestion pipelines. The framework provides `DocumentLoader` and `TextSplitter` interfaces with multiple implementations, each configurable via the `WithX()` functional options pattern. Combine directory loading, extension filtering, concurrency, and type-aware splitting to handle diverse document sources efficiently.
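For orientation, these are the shapes of the two interfaces as this recipe uses them. This is a minimal sketch inferred from the calls below, not the framework’s authoritative definitions; in particular, the element type of the `LazyLoad` channel is inferred from the type switch in the lazy-loading example:

```go
// Minimal sketch of the two interfaces as exercised in this recipe.
// The authoritative definitions live in the documentloaders and
// textsplitters packages; only the methods used below are shown.
type DocumentLoader interface {
    // Load reads all matching documents eagerly.
    Load(ctx context.Context) ([]schema.Document, error)
    // LazyLoad streams schema.Document or error values one at a time.
    LazyLoad(ctx context.Context) (<-chan any, error)
}

type TextSplitter interface {
    // SplitDocuments chunks each document, carrying metadata through.
    SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
}
```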
## Code Example

### Basic Directory Loading

Load all text files from a directory. The `WithExtensions` option filters by file type, avoiding binary files that would produce garbage chunks:
```go
loader, err := documentloaders.NewDirectoryLoader(
    os.DirFS("./data"),
    documentloaders.WithExtensions(".txt", ".md"),
)
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}

docs, err := loader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load: %v", err)
}
```
### Recursive Loading with Depth Limit

Limit directory traversal depth to avoid accidentally traversing deep dependency trees (e.g., `node_modules` or `.git` directories):
```go
loader, err := documentloaders.NewDirectoryLoader(
    os.DirFS("./docs"),
    documentloaders.WithMaxDepth(3),
    documentloaders.WithExtensions(".md"),
)
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}

docs, err := loader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load: %v", err)
}
```
### Filtering by File Size

Skip files larger than a threshold. Very large files (logs, data dumps) can overwhelm the splitter and produce too many chunks. Setting a size limit keeps ingestion predictable:
```go
loader, err := documentloaders.NewDirectoryLoader(
    os.DirFS("./data"),
    documentloaders.WithMaxFileSize(10*1024*1024), // 10MB limit
)
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}

docs, err := loader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load: %v", err)
}
```
### Concurrent Loading

Use multiple workers for faster loading. Document loading is I/O-bound, so concurrency significantly improves throughput on large directories by overlapping disk reads:
```go
loader, err := documentloaders.NewDirectoryLoader(
    os.DirFS("./large_dataset"),
    documentloaders.WithConcurrency(8),
)
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}

docs, err := loader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load: %v", err)
}
```
### Lazy Loading for Large Datasets

Stream documents one at a time to avoid loading the entire collection into memory at once. This is important when the total document size exceeds available RAM:
```go
ch, err := loader.LazyLoad(ctx)
if err != nil {
    log.Fatalf("Failed to lazy load: %v", err)
}

var docs []schema.Document
for item := range ch {
    switch v := item.(type) {
    case schema.Document:
        docs = append(docs, v)
    case error:
        log.Printf("Error: %v", v)
    }
}
```
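Note that appending every streamed document to a slice, as above, rebuilds the full collection in memory and gives up most of the benefit. For truly bounded memory, handle each document as it arrives so it can be garbage-collected afterwards; `processDoc` below is a hypothetical per-document step (for example: split, embed, and store one document):

```go
ch, err := loader.LazyLoad(ctx)
if err != nil {
    log.Fatalf("Failed to lazy load: %v", err)
}

for item := range ch {
    switch v := item.(type) {
    case schema.Document:
        // Process one document at a time; nothing accumulates.
        // processDoc is hypothetical: split, embed, and store.
        if err := processDoc(ctx, v); err != nil {
            log.Printf("Failed to process %s: %v", v.Metadata["source"], err)
        }
    case error:
        log.Printf("Load error: %v", v)
    }
}
```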
### Basic Text Splitting

Split documents into chunks. The `RecursiveCharacterTextSplitter` tries increasingly fine-grained separators (paragraphs, then sentences, then words) to find the best split point near the target size:
```go
splitter, err := textsplitters.NewRecursiveCharacterTextSplitter(
    textsplitters.WithRecursiveChunkSize(1000),
    textsplitters.WithRecursiveChunkOverlap(200),
)
if err != nil {
    log.Fatalf("Failed to create splitter: %v", err)
}

chunks, err := splitter.SplitDocuments(ctx, docs)
if err != nil {
    log.Fatalf("Failed to split: %v", err)
}
```
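To make the separator cascade concrete, here is a stripped-down illustration of the core idea. This is not the framework’s implementation: a real splitter also keeps separators, merges undersized pieces, and applies the configured overlap:

```go
// splitRecursive tries the coarsest separator first; any piece still
// over the limit is re-split with the next, finer separator. When the
// separators run out, the piece is returned as-is (a real splitter
// would fall back to a hard character cut).
func splitRecursive(text string, seps []string, chunkSize int) []string {
    if len(text) <= chunkSize || len(seps) == 0 {
        return []string{text}
    }
    var out []string
    for _, piece := range strings.Split(text, seps[0]) {
        if len(piece) > chunkSize {
            out = append(out, splitRecursive(piece, seps[1:], chunkSize)...)
        } else if piece != "" {
            out = append(out, piece)
        }
    }
    return out
}

// Typical cascade: paragraphs, then lines, then words.
pieces := splitRecursive(text, []string{"\n\n", "\n", " "}, 1000)
```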
### Markdown-Aware Splitting

Preserve markdown structure by splitting at heading boundaries. This ensures each chunk corresponds to a logical section, making retrieval results more coherent and preserving the document’s organizational hierarchy in the chunk metadata:
```go
splitter, err := textsplitters.NewMarkdownTextSplitter(
    textsplitters.WithMarkdownChunkSize(500),
    textsplitters.WithHeadersToSplitOn("#", "##", "###"),
)
if err != nil {
    log.Fatalf("Failed to create splitter: %v", err)
}

chunks, err := splitter.SplitDocuments(ctx, markdownDocs)
if err != nil {
    log.Fatalf("Failed to split: %v", err)
}
```
### Token-Based Splitting

Use token counting for accurate chunk sizing. Character counts are a poor proxy for tokens because different words have different tokenization lengths. When your embedding model has a strict token limit, counting tokens directly prevents truncation errors:
```go
tokenizer := func(text string) int {
    return len(strings.Fields(text)) // Simple word count
}

splitter, err := textsplitters.NewRecursiveCharacterTextSplitter(
    textsplitters.WithRecursiveLengthFunction(tokenizer),
    textsplitters.WithRecursiveChunkSize(100),
    textsplitters.WithRecursiveChunkOverlap(20),
)
if err != nil {
    log.Fatalf("Failed to create splitter: %v", err)
}

chunks, err := splitter.SplitDocuments(ctx, docs)
if err != nil {
    log.Fatalf("Failed to split: %v", err)
}
```
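The word-count function above is only a rough stand-in. For models tokenized with OpenAI-style BPE you can plug in a real token counter the same way; this sketch assumes the third-party `github.com/pkoukk/tiktoken-go` package, which is an external dependency, not part of Beluga AI:

```go
import "github.com/pkoukk/tiktoken-go"

// cl100k_base matches OpenAI's recent embedding models; choose the
// encoding that matches your model.
enc, err := tiktoken.GetEncoding("cl100k_base")
if err != nil {
    log.Fatalf("Failed to load encoding: %v", err)
}

// Drop-in replacement for the word-count tokenizer above.
tokenizer := func(text string) int {
    return len(enc.Encode(text, nil, nil))
}
```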
### Complete Ingestion Pipeline

Load, split, and prepare for RAG. This three-stage pipeline (load, split, embed/store) is the standard pattern for document ingestion:
```go
// 1. Load
loader, err := documentloaders.NewDirectoryLoader(
    os.DirFS("./data"),
    documentloaders.WithExtensions(".txt", ".md"),
    documentloaders.WithMaxDepth(2),
)
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}
docs, err := loader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load: %v", err)
}

// 2. Split
splitter, err := textsplitters.NewRecursiveCharacterTextSplitter(
    textsplitters.WithRecursiveChunkSize(1000),
    textsplitters.WithRecursiveChunkOverlap(200),
)
if err != nil {
    log.Fatalf("Failed to create splitter: %v", err)
}
chunks, err := splitter.SplitDocuments(ctx, docs)
if err != nil {
    log.Fatalf("Failed to split: %v", err)
}

// 3. Ready for embedding and storage
```
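What step 3 looks like depends on your embedder and vector store. As a purely hypothetical sketch (`store` and its `AddDocuments` method are placeholders, not APIs confirmed by this recipe):

```go
// 3. Embed and store (hypothetical). `store` stands in for whatever
// vector store the pipeline uses; AddDocuments is a common entry
// point for stores that embed internally, but it is a placeholder
// here, not a confirmed Beluga AI API.
ids, err := store.AddDocuments(ctx, chunks)
if err != nil {
    log.Fatalf("Failed to store chunks: %v", err)
}
log.Printf("Ingested %d chunks", len(ids))
```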
### Loading Multiple Sources

Combine documents from different sources into a single collection for uniform splitting and embedding. This is common when your knowledge base spans multiple directories, individual files, or external sources:
```go
var allDocs []schema.Document

// Load from directory
dirLoader, err := documentloaders.NewDirectoryLoader(os.DirFS("./docs"))
if err != nil {
    log.Fatalf("Failed to create dir loader: %v", err)
}
dirDocs, err := dirLoader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load dir: %v", err)
}
allDocs = append(allDocs, dirDocs...)

// Load single file
fileLoader, err := documentloaders.NewTextLoader("./important.txt")
if err != nil {
    log.Fatalf("Failed to create file loader: %v", err)
}
fileDocs, err := fileLoader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load file: %v", err)
}
allDocs = append(allDocs, fileDocs...)

// Process all together
splitter, err := textsplitters.NewRecursiveCharacterTextSplitter()
if err != nil {
    log.Fatalf("Failed to create splitter: %v", err)
}
chunks, err := splitter.SplitDocuments(ctx, allDocs)
if err != nil {
    log.Fatalf("Failed to split: %v", err)
}
```
## Explanation

These recipes demonstrate the core document ingestion patterns in Beluga AI:
- **Directory loading with functional options** — Use `NewDirectoryLoader` with `WithX()` options to control file discovery. Options like `WithExtensions`, `WithMaxDepth`, and `WithMaxFileSize` allow fine-grained control over which files are loaded. This follows Beluga AI’s standard functional options pattern for configuration.
- **Concurrent loading** — For large datasets, `WithConcurrency` distributes file loading across multiple goroutines. This significantly improves throughput for I/O-bound loading because Go’s goroutine scheduler efficiently multiplexes concurrent reads.
- **Type-aware splitting** — Different document types benefit from different splitting strategies. Markdown documents should use `MarkdownTextSplitter` to preserve heading structure (which becomes metadata), while plain text uses `RecursiveCharacterTextSplitter`. Code files benefit from the language-aware `CodeSplitter`. Choosing the right splitter for each content type is one of the highest-leverage decisions in a RAG pipeline.
- **Token-based sizing** — When embedding models have token limits (most do), use a length function that counts tokens rather than characters for accurate chunk sizing. This prevents silent truncation where the embedding model only sees part of each chunk, degrading embedding quality without any error signal.
## Variations

### Error Handling Pattern

Handle errors gracefully during loading:
```go
docs, err := loader.Load(ctx)
if err != nil {
    if loaderErr, ok := err.(*documentloaders.LoaderError); ok {
        switch loaderErr.Code {
        case documentloaders.ErrCodeFileTooLarge:
            log.Printf("Skipped large file: %s", loaderErr.Path)
        case documentloaders.ErrCodeBinaryFile:
            log.Printf("Skipped binary file: %s", loaderErr.Path)
        default:
            log.Printf("Error loading: %v", loaderErr)
        }
    }
    return err
}
```
### Processing Different File Types

Handle different document types with appropriate splitters:
```go
var textDocs, markdownDocs []schema.Document
for _, doc := range docs {
    if strings.HasSuffix(doc.Metadata["source"], ".md") {
        markdownDocs = append(markdownDocs, doc)
    } else {
        textDocs = append(textDocs, doc)
    }
}

textSplitter, err := textsplitters.NewRecursiveCharacterTextSplitter()
if err != nil {
    log.Fatalf("Failed to create text splitter: %v", err)
}
textChunks, err := textSplitter.SplitDocuments(ctx, textDocs)
if err != nil {
    log.Fatalf("Failed to split text docs: %v", err)
}

markdownSplitter, err := textsplitters.NewMarkdownTextSplitter()
if err != nil {
    log.Fatalf("Failed to create markdown splitter: %v", err)
}
markdownChunks, err := markdownSplitter.SplitDocuments(ctx, markdownDocs)
if err != nil {
    log.Fatalf("Failed to split markdown docs: %v", err)
}
```
### Chunk Metadata Preservation

Access chunk metadata after splitting:
```go
chunks, err := splitter.SplitDocuments(ctx, docs)
if err != nil {
    log.Fatalf("Failed to split: %v", err)
}

for _, chunk := range chunks {
    source := chunk.Metadata["source"]
    index := chunk.Metadata["chunk_index"]
    total := chunk.Metadata["chunk_total"]

    fmt.Printf("Chunk %s/%s from %s\n", index, total, source)
}
```
### Batch Processing

Process documents in batches to control memory usage:
```go
const batchSize = 100
for i := 0; i < len(docs); i += batchSize {
    end := i + batchSize
    if end > len(docs) {
        end = len(docs)
    }

    batch := docs[i:end]
    chunks, err := splitter.SplitDocuments(ctx, batch)
    if err != nil {
        log.Printf("Failed to split batch %d: %v", i/batchSize, err)
        continue
    }

    processBatch(chunks)
}
```
### Registry Pattern

Create loaders dynamically using the registry, which allows selecting the loader type from configuration without hardcoding the implementation:
```go
registry := documentloaders.GetRegistry()

loader, err := registry.Create("directory", map[string]any{
    "max_depth":   2,
    "extensions":  []string{".txt", ".md"},
    "concurrency": 4,
})
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}
docs, err := loader.Load(ctx)
if err != nil {
    log.Fatalf("Failed to load: %v", err)
}
```
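To select the loader entirely from configuration, the options map can come straight out of a config file. A minimal sketch, assuming a hypothetical `loader.json` layout (only `registry.Create` comes from the recipe above; note that `encoding/json` decodes numbers in a `map[string]any` as `float64`, so the registry is assumed to coerce numeric options):

```go
// loader.json (hypothetical layout):
// {"type": "directory", "options": {"max_depth": 2, "extensions": [".txt", ".md"]}}
type loaderConfig struct {
    Type    string         `json:"type"`
    Options map[string]any `json:"options"`
}

data, err := os.ReadFile("loader.json")
if err != nil {
    log.Fatalf("Failed to read config: %v", err)
}
var cfg loaderConfig
if err := json.Unmarshal(data, &cfg); err != nil {
    log.Fatalf("Failed to parse config: %v", err)
}

loader, err := registry.Create(cfg.Type, cfg.Options)
if err != nil {
    log.Fatalf("Failed to create loader: %v", err)
}
```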
Section titled “Related Recipes”- Parallel File Loading — Parallel directory traversal with worker pools
- Corrupt Document Handling — Graceful error handling for corrupt documents
- Sentence-Aware Splitting — Sentence-boundary-aware text splitting
- Code Splitting — Tree-sitter-based code splitting