Document Loader API — Files, Cloud, APIs

import "github.com/lookatitude/beluga-ai/rag/loader"

Package loader provides document loading capabilities for the RAG pipeline. It defines the DocumentLoader interface for reading content from various sources (files, URLs, APIs) and converting them into [schema.Document] slices.

The core interface is DocumentLoader:

type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]schema.Document, error)
}

The Transformer interface allows post-load enrichment:

type Transformer interface {
	Transform(ctx context.Context, doc schema.Document) (schema.Document, error)
}
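
As a rough illustration of how a type can satisfy these two interfaces, the sketch below uses local stand-ins so that it compiles without the Beluga module. The `Document` struct, `staticLoader`, `wordCounter`, and `demo` are hypothetical names introduced for this example; the real `schema.Document` may carry additional fields.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Document is a local stand-in for schema.Document (assumed shape).
type Document struct {
	Content  string
	Metadata map[string]any
}

// DocumentLoader and Transformer mirror the interfaces described above.
type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
}

type Transformer interface {
	Transform(ctx context.Context, doc Document) (Document, error)
}

// staticLoader (hypothetical) returns one document per non-empty line of text.
type staticLoader struct{ text string }

func (l staticLoader) Load(ctx context.Context, source string) ([]Document, error) {
	// Respect cancellation before doing any work.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	var docs []Document
	for _, line := range strings.Split(l.text, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		docs = append(docs, Document{
			Content:  line,
			Metadata: map[string]any{"source": source},
		})
	}
	return docs, nil
}

// wordCounter (hypothetical) enriches a document with its word count.
type wordCounter struct{}

func (wordCounter) Transform(_ context.Context, doc Document) (Document, error) {
	doc.Metadata["words"] = len(strings.Fields(doc.Content))
	return doc, nil
}

// Compile-time checks that both types satisfy the interfaces.
var (
	_ DocumentLoader = staticLoader{}
	_ Transformer    = wordCounter{}
)

// demo loads two lines and enriches each resulting document.
func demo() ([]Document, error) {
	ctx := context.Background()
	docs, err := staticLoader{text: "first line\n\nsecond longer line"}.Load(ctx, "memory://demo")
	if err != nil {
		return nil, err
	}
	for i := range docs {
		if docs[i], err = (wordCounter{}).Transform(ctx, docs[i]); err != nil {
			return nil, err
		}
	}
	return docs, nil
}

func main() {
	docs, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range docs {
		fmt.Printf("%q words=%v\n", d.Content, d.Metadata["words"])
	}
}
```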

The package follows Beluga’s registry pattern. Providers register via init() and are instantiated with New:

l, err := loader.New("text", config.ProviderConfig{})
if err != nil {
	log.Fatal(err)
}
docs, err := l.Load(ctx, "/path/to/file.txt")

Use List to discover all registered loader names. The following loaders are built in:

  • “text” — plain text files
  • “json” — JSON files with configurable path extraction
  • “csv” — CSV files (one document per row)
  • “markdown” — Markdown files
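
To make "one document per row" concrete, here is a minimal sketch using only the standard library. It assumes the first row is a header and renders each data row as `column: value` lines; the actual "csv" loader may format content differently. `Document` and `rowsToDocuments` are hypothetical names used only for this example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Document is a local stand-in for schema.Document (assumed shape).
type Document struct {
	Content  string
	Metadata map[string]any
}

// rowsToDocuments turns every data row of a CSV payload into one document.
// The first record is treated as the header (an assumption of this sketch).
func rowsToDocuments(data string) ([]Document, error) {
	// ReadAll rejects rows whose field count differs from the first record,
	// so indexing header[j] below is safe.
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, fmt.Errorf("parse csv: %w", err)
	}
	if len(records) < 2 {
		return nil, nil // header only, or empty input
	}
	header := records[0]
	docs := make([]Document, 0, len(records)-1)
	for i, row := range records[1:] {
		parts := make([]string, len(row))
		for j, value := range row {
			parts[j] = header[j] + ": " + value
		}
		docs = append(docs, Document{
			Content:  strings.Join(parts, "\n"),
			Metadata: map[string]any{"row": i + 1},
		})
	}
	return docs, nil
}

func main() {
	docs, err := rowsToDocuments("name,role\nAda,engineer\nLin,editor\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range docs {
		fmt.Printf("row %v: %q\n", d.Metadata["row"], d.Content)
	}
}
```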

Available as provider imports:

  • “cloudstorage” — S3, GCS, Azure Blob Storage
  • “confluence” — Atlassian Confluence pages
  • “docling” — IBM Docling document understanding (PDFs, DOCX, images)
  • “firecrawl” — Firecrawl web scraping and crawling
  • “gdrive” — Google Drive files
  • “github” — GitHub repository files
  • “notion” — Notion pages
  • “unstructured” — Unstructured.io document extraction

LoaderPipeline chains multiple loaders and transformers. Loaders are invoked in order and their results concatenated, then transformers are applied to each document:

p := loader.NewPipeline(
	loader.WithLoader(textLoader),
	loader.WithTransformer(loader.TransformerFunc(
		func(ctx context.Context, doc schema.Document) (schema.Document, error) {
			doc.Metadata["processed"] = true
			return doc, nil
		},
	)),
)
docs, err := p.Load(ctx, "/path/to/files")
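
The order of operations described above (run every loader, concatenate the results, then run every transformer over each document) can be sketched with local stand-in types. This is not the package's implementation: `Pipeline`, `LoaderFunc`, `fixed`, and `demo` are hypothetical, and `TransformerFunc` is re-declared here only so the example is self-contained.

```go
package main

import (
	"context"
	"fmt"
)

// Document is a local stand-in for schema.Document (assumed shape).
type Document struct {
	Content  string
	Metadata map[string]any
}

type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
}

type Transformer interface {
	Transform(ctx context.Context, doc Document) (Document, error)
}

// LoaderFunc and TransformerFunc adapt plain functions to the interfaces.
type LoaderFunc func(ctx context.Context, source string) ([]Document, error)

func (f LoaderFunc) Load(ctx context.Context, source string) ([]Document, error) {
	return f(ctx, source)
}

type TransformerFunc func(ctx context.Context, doc Document) (Document, error)

func (f TransformerFunc) Transform(ctx context.Context, doc Document) (Document, error) {
	return f(ctx, doc)
}

// Pipeline (hypothetical) chains loaders and transformers.
type Pipeline struct {
	loaders      []DocumentLoader
	transformers []Transformer
}

// Load invokes loaders in order, concatenates their output, then applies
// every transformer to every document.
func (p *Pipeline) Load(ctx context.Context, source string) ([]Document, error) {
	var docs []Document
	for _, l := range p.loaders {
		loaded, err := l.Load(ctx, source)
		if err != nil {
			return nil, fmt.Errorf("load %q: %w", source, err)
		}
		docs = append(docs, loaded...)
	}
	for i := range docs {
		for _, t := range p.transformers {
			out, err := t.Transform(ctx, docs[i])
			if err != nil {
				return nil, fmt.Errorf("transform document %d: %w", i, err)
			}
			docs[i] = out
		}
	}
	return docs, nil
}

// fixed returns a loader that always yields a single document.
func fixed(content string) DocumentLoader {
	return LoaderFunc(func(_ context.Context, source string) ([]Document, error) {
		return []Document{{Content: content, Metadata: map[string]any{"source": source}}}, nil
	})
}

func demo() ([]Document, error) {
	p := &Pipeline{
		loaders: []DocumentLoader{fixed("alpha"), fixed("beta")},
		transformers: []Transformer{
			TransformerFunc(func(_ context.Context, doc Document) (Document, error) {
				doc.Metadata["processed"] = true
				return doc, nil
			}),
		},
	}
	return p.Load(context.Background(), "/path/to/files")
}

func main() {
	docs, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range docs {
		fmt.Println(d.Content, d.Metadata["processed"])
	}
}
```

Failing fast on the first loader or transformer error is a choice made for this sketch; the source does not say how the real pipeline handles partial failures.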

To add a custom document loader:

func init() {
	loader.Register("custom", func(cfg config.ProviderConfig) (loader.DocumentLoader, error) {
		return &myLoader{apiKey: cfg.APIKey}, nil
	})
}
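
For readers unfamiliar with the registry pattern, the following sketch shows one common way such a registry can be built: a mutex-guarded map from name to factory, populated from init(). It is an illustration under assumptions, not Beluga's source; `Loader`, `Config`, and `Factory` are simplified stand-ins, and the lowercase loader types are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Loader is a simplified stand-in for loader.DocumentLoader.
type Loader interface{ Name() string }

// Config is a simplified stand-in for config.ProviderConfig.
type Config struct{ APIKey string }

// Factory builds a Loader from configuration.
type Factory func(cfg Config) (Loader, error)

var (
	mu       sync.RWMutex
	registry = map[string]Factory{}
)

// Register adds a factory under a name; a later call with the same name
// replaces the earlier one in this sketch.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	registry[name] = f
}

// New looks up a factory by name and instantiates the loader.
func New(name string, cfg Config) (Loader, error) {
	mu.RLock()
	f, ok := registry[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("loader %q is not registered", name)
	}
	return f(cfg)
}

// List returns all registered names in sorted order.
func List() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

type textLoader struct{}

func (textLoader) Name() string { return "text" }

type myLoader struct{ apiKey string }

func (*myLoader) Name() string { return "custom" }

// Providers register themselves when their package is imported.
func init() {
	Register("text", func(Config) (Loader, error) { return textLoader{}, nil })
	Register("custom", func(cfg Config) (Loader, error) {
		return &myLoader{apiKey: cfg.APIKey}, nil
	})
}

func main() {
	fmt.Println(List())
	if _, err := New("pdf", Config{}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because registration happens in init(), a provider only becomes available once its package is imported, which is why the provider examples use a blank import.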

import "github.com/lookatitude/beluga-ai/rag/loader/providers/cloudstorage"

Package cloudstorage provides a DocumentLoader that loads files from cloud storage services (S3, GCS, Azure Blob). It detects the provider by URL prefix (s3://, gs://, az://) and uses direct HTTP calls with pre-signed URLs or service-specific APIs.

The provider registers as “cloudstorage” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/cloudstorage"

l, err := loader.New("cloudstorage", config.ProviderConfig{
	APIKey: "your-access-key",
	Options: map[string]any{
		"secret_key": "your-secret-key",
		"region":     "us-east-1",
	},
})
docs, err := l.Load(ctx, "s3://bucket/path/to/file.txt")

Supported URL prefixes:

  • S3 — URLs starting with “s3://”
  • GCS — URLs starting with “gs://”
  • Azure Blob — URLs starting with “az://”
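
Prefix-based provider detection can be sketched as below. The `location` type and `parseCloudURL` function are hypothetical, and splitting the remainder into a bucket (or container) and an object key is an assumption about the URL layout, not a statement about how the provider parses it.

```go
package main

import (
	"fmt"
	"strings"
)

// location (hypothetical) describes where an object lives.
type location struct {
	Provider string
	Bucket   string
	Key      string
}

// parseCloudURL maps the URL scheme to a provider and splits the rest into
// bucket and object key.
func parseCloudURL(raw string) (location, error) {
	providers := map[string]string{
		"s3": "S3",
		"gs": "GCS",
		"az": "Azure Blob",
	}
	scheme, rest, ok := strings.Cut(raw, "://")
	if !ok {
		return location{}, fmt.Errorf("missing scheme in %q", raw)
	}
	provider, known := providers[scheme]
	if !known {
		return location{}, fmt.Errorf("unsupported scheme %q", scheme)
	}
	bucket, key, _ := strings.Cut(rest, "/")
	if bucket == "" || key == "" {
		return location{}, fmt.Errorf("expected %s://bucket/key, got %q", scheme, raw)
	}
	return location{Provider: provider, Bucket: bucket, Key: key}, nil
}

func main() {
	for _, raw := range []string{
		"s3://bucket/path/to/file.txt",
		"gs://bucket/report.pdf",
		"ftp://host/file",
	} {
		loc, err := parseCloudURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s bucket=%s key=%s\n", loc.Provider, loc.Bucket, loc.Key)
	}
}
```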

ProviderConfig fields:

  • APIKey — access key (required)
  • Options[“secret_key”] — secret key
  • Options[“region”] — cloud region

import "github.com/lookatitude/beluga-ai/rag/loader/providers/confluence"

Package confluence provides a DocumentLoader that loads pages from Atlassian Confluence via its REST API. It implements the [loader.DocumentLoader] interface.

The provider registers as “confluence” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/confluence"

l, err := loader.New("confluence", config.ProviderConfig{
	APIKey:  "your-api-token",
	BaseURL: "https://your-domain.atlassian.net/wiki",
	Options: map[string]any{"user": "user@example.com"},
})
docs, err := l.Load(ctx, "12345") // page ID

ProviderConfig fields:

  • APIKey — Confluence API token (required)
  • BaseURL — Confluence wiki base URL (required)
  • Options[“user”] — username for basic auth
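
To show how these three fields could combine into an authenticated request, the sketch below builds (but does not send) a request using the standard library. The `/rest/api/content/{id}` path with `expand=body.storage` is Confluence's content endpoint, but whether this provider calls exactly that endpoint is an assumption; `newPageRequest` and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newPageRequest builds a GET request for a single page, authenticating with
// HTTP basic auth (user name plus API token as the password).
func newPageRequest(ctx context.Context, baseURL, user, token, pageID string) (*http.Request, error) {
	if baseURL == "" || token == "" {
		return nil, errors.New("base URL and API token are required")
	}
	endpoint := strings.TrimRight(baseURL, "/") +
		"/rest/api/content/" + url.PathEscape(pageID) +
		"?expand=body.storage"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.SetBasicAuth(user, token)
	req.Header.Set("Accept", "application/json")
	return req, nil
}

// demo uses the placeholder values from the configuration example above.
func demo() (*http.Request, error) {
	return newPageRequest(
		context.Background(),
		"https://your-domain.atlassian.net/wiki",
		"user@example.com",
		"your-api-token",
		"12345",
	)
}

func main() {
	req, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```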

import "github.com/lookatitude/beluga-ai/rag/loader/providers/docling"

Package docling provides a DocumentLoader that uses the IBM Docling API to convert documents (PDFs, DOCX, images, etc.) into structured content.

Docling (https://github.com/DS4SD/docling) is IBM’s document understanding service that extracts text, tables, and layout from documents.

The provider registers as “docling” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/docling"

l, err := loader.New("docling", config.ProviderConfig{
	BaseURL: "http://localhost:5001",
})
docs, err := l.Load(ctx, "/path/to/document.pdf")

ProviderConfig fields:

  • BaseURL — Docling API server URL (required)

import "github.com/lookatitude/beluga-ai/rag/loader/providers/firecrawl"

Package firecrawl provides a DocumentLoader that uses Firecrawl to crawl websites and extract their content as markdown.

Firecrawl (https://firecrawl.dev) is a web scraping service that handles JavaScript rendering, anti-bot detection, and returns clean markdown.

The provider registers as “firecrawl” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/firecrawl"

l, err := loader.New("firecrawl", config.ProviderConfig{
	APIKey: "fc-...",
})
docs, err := l.Load(ctx, "https://example.com")

ProviderConfig fields:

  • APIKey — Firecrawl API key (required)

import "github.com/lookatitude/beluga-ai/rag/loader/providers/gdrive"

Package gdrive provides a DocumentLoader that loads files from Google Drive via the Google Drive REST API. It implements the [loader.DocumentLoader] interface using direct HTTP calls.

The provider registers as “gdrive” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/gdrive"

l, err := loader.New("gdrive", config.ProviderConfig{
	APIKey: "your-api-key-or-oauth-token",
})
docs, err := l.Load(ctx, "file-id-here")

ProviderConfig fields:

  • APIKey — Google API key or OAuth token (required)

import "github.com/lookatitude/beluga-ai/rag/loader/providers/github"

Package github provides a DocumentLoader that loads files from GitHub repositories via the GitHub API. It implements the [loader.DocumentLoader] interface using direct HTTP calls.

The provider registers as “github” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/github"

l, err := loader.New("github", config.ProviderConfig{
	APIKey: "ghp_...",
})
docs, err := l.Load(ctx, "owner/repo/path/to/file.go")

ProviderConfig fields:

  • APIKey — GitHub personal access token (required)
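
The source string in the example packs the owner, repository, and file path into one value. A plausible way to split it, and to derive the matching GitHub repository contents URL, is sketched below. `fileRef`, `parseSource`, and `contentsURL` are hypothetical helpers, and the source does not confirm which endpoint the provider actually calls.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// fileRef (hypothetical) identifies a file inside a repository.
type fileRef struct {
	Owner string
	Repo  string
	Path  string
}

// parseSource splits "owner/repo/path/to/file" into its three parts.
func parseSource(source string) (fileRef, error) {
	parts := strings.SplitN(strings.Trim(source, "/"), "/", 3)
	if len(parts) < 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return fileRef{}, fmt.Errorf("source %q must look like owner/repo/path", source)
	}
	return fileRef{Owner: parts[0], Repo: parts[1], Path: parts[2]}, nil
}

// contentsURL builds the repository contents URL, escaping each path segment
// individually so that slashes between segments are preserved.
func (r fileRef) contentsURL() string {
	segments := strings.Split(r.Path, "/")
	for i, s := range segments {
		segments[i] = url.PathEscape(s)
	}
	return "https://api.github.com/repos/" +
		url.PathEscape(r.Owner) + "/" + url.PathEscape(r.Repo) +
		"/contents/" + strings.Join(segments, "/")
}

func main() {
	ref, err := parseSource("owner/repo/path/to/file.go")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ref.contentsURL())
}
```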

import "github.com/lookatitude/beluga-ai/rag/loader/providers/notion"

Package notion provides a DocumentLoader that loads pages from Notion via its API. It implements the [loader.DocumentLoader] interface using direct HTTP calls to the Notion API.

The provider registers as “notion” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/notion"

l, err := loader.New("notion", config.ProviderConfig{
	APIKey: "ntn_...",
})
docs, err := l.Load(ctx, "page-id-here")

ProviderConfig fields:

  • APIKey — Notion integration token (required)

import "github.com/lookatitude/beluga-ai/rag/loader/providers/unstructured"

Package unstructured provides a DocumentLoader that uses the Unstructured.io API to extract structured content from files (PDFs, DOCX, images, etc.).

The loader uploads files to the Unstructured.io partition API and returns the extracted elements as documents.

The provider registers as “unstructured” in the loader registry:

import _ "github.com/lookatitude/beluga-ai/rag/loader/providers/unstructured"

l, err := loader.New("unstructured", config.ProviderConfig{
	APIKey:  "key-...",
	BaseURL: "https://api.unstructured.io",
})
docs, err := l.Load(ctx, "/path/to/document.pdf")

ProviderConfig fields: