Document Loader Providers
Beluga AI v2 provides a unified loader.DocumentLoader interface for loading content from diverse sources (files, APIs, cloud storage, and web pages) and converting it into schema.Document slices for use in RAG pipelines. All providers register themselves via init() and are instantiated through the global registry.
Interface
```go
type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]schema.Document, error)
}
```

The source parameter is provider-specific: it may be a file path, URL, page ID, or cloud storage URI depending on the loader.
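To make the contract concrete, here is a minimal sketch of a type that satisfies an interface of this shape. It is not part of Beluga: Document and DocumentLoader are local stand-ins mirroring the definitions on this page so the example compiles on its own, and memoryLoader is a hypothetical loader whose source string is a key into an in-memory map.

```go
package main

import (
	"context"
	"fmt"
)

// Document and DocumentLoader are local stand-ins for schema.Document and
// loader.DocumentLoader, copied from the shapes shown on this page.
type Document struct {
	ID       string
	Content  string
	Metadata map[string]any
}

type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
}

// memoryLoader is a hypothetical loader: source is a key into the map.
type memoryLoader map[string]string

func (m memoryLoader) Load(ctx context.Context, source string) ([]Document, error) {
	// Respect cancellation before doing any work.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	content, ok := m[source]
	if !ok {
		return nil, fmt.Errorf("memory loader: source %q not found", source)
	}
	return []Document{{
		ID:      source,
		Content: content,
		Metadata: map[string]any{
			"source": source,   // the original source string
			"loader": "memory", // the loader name
		},
	}}, nil
}

// demo loads one entry through the interface rather than the concrete type.
func demo() ([]Document, error) {
	var l DocumentLoader = memoryLoader{"notes/intro": "hello"}
	return l.Load(context.Background(), "notes/intro")
}

func main() {
	docs, err := demo()
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	for _, d := range docs {
		fmt.Printf("%s: %q\n", d.ID, d.Content)
	}
}
```

Because callers depend only on the Load method, the same calling code can work with any provider regardless of how it interprets the source string.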
Registry Usage
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/rag/loader"

	// Register the provider you need via blank import
	_ "github.com/lookatitude/beluga-ai/rag/loader/providers/github"
)

func main() {
	l, err := loader.New("github", config.ProviderConfig{
		APIKey: os.Getenv("GITHUB_TOKEN"),
	})
	if err != nil {
		log.Fatal(err)
	}

	docs, err := l.Load(context.Background(), "owner/repo/README.md")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Loaded %d documents\n", len(docs))
}
```

Available Providers
| Provider | Registry Name | Source Format | Description |
|---|---|---|---|
| Cloud Storage | cloudstorage | s3://, gs://, az:// URIs | AWS S3, Google Cloud Storage, Azure Blob |
| Confluence | confluence | Page ID or SPACE/page-id | Atlassian Confluence wiki pages |
| Docling | docling | File path or URL | IBM Docling document conversion (PDF, DOCX, images) |
| Firecrawl | firecrawl | URL | Web scraping with JavaScript rendering |
| Google Drive | gdrive | File ID | Google Drive files and Google Workspace exports |
| GitHub | github | owner/repo/path | GitHub repository files via the Contents API |
| Notion | notion | Page ID | Notion pages via the Notion API |
| Unstructured | unstructured | File path | Unstructured.io document extraction |
Built-in Loaders
The loader package also includes four built-in loaders that require no external dependencies:
| Loader | Registry Name | Description |
|---|---|---|
| Text | text | Plain text files |
| JSON | json | JSON files with configurable path extraction |
| CSV | csv | CSV files (one document per row) |
| Markdown | markdown | Markdown files |
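The registry names in the tables above are resolved through a global registry that providers populate from init(). The following is a rough sketch of that general pattern, not Beluga's actual implementation: the Factory signature, the Register, New, and List functions, and stubLoader are all simplified assumptions written for illustration, and the Document types are local stand-ins.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// Local stand-ins mirroring the shapes shown on this page.
type Document struct {
	ID       string
	Content  string
	Metadata map[string]any
}

type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
}

// Factory is a simplified, assumed constructor signature.
type Factory func(apiKey string) (DocumentLoader, error)

var (
	mu        sync.RWMutex
	factories = map[string]Factory{}
)

// Register adds a factory under a registry name.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// New looks up a factory by name and builds a loader with it.
func New(name, apiKey string) (DocumentLoader, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("loader %q is not registered", name)
	}
	return f(apiKey)
}

// List returns the registered names in sorted order.
func List() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(factories))
	for n := range factories {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// stubLoader is a hypothetical loader used only to populate the registry.
type stubLoader struct{ name string }

func (s stubLoader) Load(ctx context.Context, source string) ([]Document, error) {
	return []Document{{
		ID:       source,
		Metadata: map[string]any{"source": source, "loader": s.name},
	}}, nil
}

// In a real provider package, init would sit next to the implementation and
// run when the package is blank-imported.
func init() {
	for _, name := range []string{"text", "csv", "json"} {
		name := name // per-iteration copy for Go versions before 1.22
		Register(name, func(string) (DocumentLoader, error) {
			return stubLoader{name: name}, nil
		})
	}
}

func main() {
	fmt.Println(List()) // prints [csv json text]
	if _, err := New("notion", ""); err != nil {
		fmt.Println(err) // prints loader "notion" is not registered
	}
}
```

In this sketch a provider that was never imported simply does not appear in the map, which is why a lookup for an unregistered name fails with an error instead of panicking.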
Provider Discovery
List all registered providers at runtime:

```go
names := loader.List()
// Returns sorted list: ["cloudstorage", "confluence", "csv", "docling", ...]
```

Transformer Pipeline
Loaders can be combined with transformers in a pipeline. The pipeline runs all loaders, concatenates their results, and applies transformers to each document:
```go
import "github.com/lookatitude/beluga-ai/rag/loader"

pipeline := loader.NewPipeline(
	loader.WithLoader(githubLoader),
	loader.WithLoader(notionLoader),
	loader.WithTransformer(loader.TransformerFunc(func(ctx context.Context, doc schema.Document) (schema.Document, error) {
		doc.Metadata["processed"] = true
		return doc, nil
	})),
)

docs, err := pipeline.Load(ctx, "owner/repo/docs/")
if err != nil {
	log.Fatal(err)
}
```

Document Structure
All loaders return schema.Document values with consistent metadata:
```go
type Document struct {
	ID       string
	Content  string
	Metadata map[string]any
}
```

Every loader sets at minimum:
- source: the original source string passed to Load
- loader: the loader name (e.g., "github", "confluence")
Additional metadata fields are provider-specific and documented on each provider’s page.
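As an illustration of how the pipeline behaviour and the minimum metadata fit together, the sketch below runs two loaders, concatenates their output, applies a transformer to each document, and then groups the results by the loader metadata key. Everything here is a stand-in written for this example: runPipeline, staticLoader, and countByLoader are hypothetical, and the error handling (stop at the first failure) and loader ordering are assumptions, not documented behaviour of loader.NewPipeline.

```go
package main

import (
	"context"
	"fmt"
)

// Local stand-ins mirroring the shapes shown on this page.
type Document struct {
	ID       string
	Content  string
	Metadata map[string]any
}

type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
}

// Transformer mirrors the per-document callback shape shown above.
type Transformer func(ctx context.Context, doc Document) (Document, error)

// staticLoader is a hypothetical loader that returns fixed contents and sets
// the two minimum metadata keys.
type staticLoader struct {
	name     string
	contents []string
}

func (s staticLoader) Load(ctx context.Context, source string) ([]Document, error) {
	docs := make([]Document, 0, len(s.contents))
	for i, c := range s.contents {
		docs = append(docs, Document{
			ID:       fmt.Sprintf("%s-%d", s.name, i),
			Content:  c,
			Metadata: map[string]any{"source": source, "loader": s.name},
		})
	}
	return docs, nil
}

// runPipeline runs every loader against source, concatenates the results in
// loader order, then applies each transformer to each document.
func runPipeline(ctx context.Context, source string, loaders []DocumentLoader, transformers []Transformer) ([]Document, error) {
	var all []Document
	for _, l := range loaders {
		docs, err := l.Load(ctx, source)
		if err != nil {
			return nil, err // assumption: stop at the first failing loader
		}
		all = append(all, docs...)
	}
	for i := range all {
		for _, t := range transformers {
			doc, err := t(ctx, all[i])
			if err != nil {
				return nil, err
			}
			all[i] = doc
		}
	}
	return all, nil
}

// countByLoader groups documents by the "loader" key. Metadata values are
// typed as any, so a checked type assertion is used.
func countByLoader(docs []Document) map[string]int {
	counts := map[string]int{}
	for _, d := range docs {
		name, ok := d.Metadata["loader"].(string)
		if !ok {
			name = "unknown"
		}
		counts[name]++
	}
	return counts
}

func demo() ([]Document, error) {
	loaders := []DocumentLoader{
		staticLoader{name: "github", contents: []string{"readme", "guide"}},
		staticLoader{name: "notion", contents: []string{"meeting notes"}},
	}
	markProcessed := func(ctx context.Context, doc Document) (Document, error) {
		doc.Metadata["processed"] = true
		return doc, nil
	}
	return runPipeline(context.Background(), "owner/repo/docs/", loaders, []Transformer{markProcessed})
}

func main() {
	docs, err := demo()
	if err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	fmt.Println(len(docs), countByLoader(docs)) // prints 3 map[github:2 notion:1]
}
```

Relying on the loader key for grouping or filtering works only because every loader is documented to set it; the fallback to "unknown" covers documents built by hand or by code that does not follow that convention.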