
Document Loader Providers

Beluga AI v2 provides a unified loader.DocumentLoader interface for loading content from diverse sources — files, APIs, cloud storage, and web pages — and converting them into schema.Document slices for use in RAG pipelines. All providers register via init() and are instantiated through the global registry.

type DocumentLoader interface {
    Load(ctx context.Context, source string) ([]schema.Document, error)
}

The source parameter is provider-specific: it may be a file path, URL, page ID, or cloud storage URI depending on the loader.

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/lookatitude/beluga-ai/config"
    "github.com/lookatitude/beluga-ai/rag/loader"
    // Register the provider you need via blank import.
    _ "github.com/lookatitude/beluga-ai/rag/loader/providers/github"
)

func main() {
    l, err := loader.New("github", config.ProviderConfig{
        APIKey: os.Getenv("GITHUB_TOKEN"),
    })
    if err != nil {
        log.Fatal(err)
    }
    docs, err := l.Load(context.Background(), "owner/repo/README.md")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Loaded %d documents\n", len(docs))
}
| Provider | Registry Name | Source Format | Description |
|---|---|---|---|
| Cloud Storage | cloudstorage | s3://, gs://, az:// URIs | AWS S3, Google Cloud Storage, Azure Blob |
| Confluence | confluence | Page ID or SPACE/page-id | Atlassian Confluence wiki pages |
| Docling | docling | File path or URL | IBM Docling document conversion (PDF, DOCX, images) |
| Firecrawl | firecrawl | URL | Web scraping with JavaScript rendering |
| Google Drive | gdrive | File ID | Google Drive files and Google Workspace exports |
| GitHub | github | owner/repo/path | GitHub repository files via the Contents API |
| Notion | notion | Page ID | Notion pages via the Notion API |
| Unstructured | unstructured | File path | Unstructured.io document extraction |

The loader package also includes four built-in loaders that require no external dependencies:

| Loader | Registry Name | Description |
|---|---|---|
| Text | text | Plain text files |
| JSON | json | JSON files with configurable path extraction |
| CSV | csv | CSV files (one document per row) |
| Markdown | markdown | Markdown files |

List all registered providers at runtime:

names := loader.List()
// Returns sorted list: ["cloudstorage", "confluence", "csv", "docling", ...]

Loaders can be combined with transformers in a pipeline. The pipeline runs all loaders, concatenates their results, and applies transformers to each document:

import (
    "context"
    "log"

    "github.com/lookatitude/beluga-ai/rag/loader"
    "github.com/lookatitude/beluga-ai/schema"
)

// githubLoader and notionLoader are DocumentLoaders constructed earlier,
// e.g. via loader.New.
pipeline := loader.NewPipeline(
    loader.WithLoader(githubLoader),
    loader.WithLoader(notionLoader),
    loader.WithTransformer(loader.TransformerFunc(func(ctx context.Context, doc schema.Document) (schema.Document, error) {
        doc.Metadata["processed"] = true
        return doc, nil
    })),
)
docs, err := pipeline.Load(ctx, "owner/repo/docs/")
if err != nil {
    log.Fatal(err)
}

All loaders return schema.Document values with consistent metadata:

type Document struct {
    ID       string
    Content  string
    Metadata map[string]any
}

Every loader sets at minimum:

  • source — the original source string passed to Load
  • loader — the loader name (e.g., "github", "confluence")

Additional metadata fields are provider-specific and documented on each provider’s page.
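Because the source and loader keys are guaranteed, downstream code can branch on them without checking which provider produced a document. A self-contained sketch (groupByLoader is a hypothetical helper, not part of the library) that buckets documents by origin, e.g. to apply per-source chunking rules:

```go
package main

import "fmt"

// Local stand-in for schema.Document.
type Document struct {
	ID       string
	Content  string
	Metadata map[string]any
}

// groupByLoader buckets documents by the guaranteed "loader" metadata key.
func groupByLoader(docs []Document) map[string][]Document {
	groups := make(map[string][]Document)
	for _, d := range docs {
		name, _ := d.Metadata["loader"].(string)
		groups[name] = append(groups[name], d)
	}
	return groups
}

func main() {
	docs := []Document{
		{ID: "1", Metadata: map[string]any{"source": "owner/repo/README.md", "loader": "github"}},
		{ID: "2", Metadata: map[string]any{"source": "page-123", "loader": "notion"}},
		{ID: "3", Metadata: map[string]any{"source": "owner/repo/docs/a.md", "loader": "github"}},
	}
	groups := groupByLoader(docs)
	fmt.Println(len(groups["github"]), len(groups["notion"])) // 2 1
}
```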