Docling Document Loader
The Docling loader implements the loader.DocumentLoader interface using the IBM Docling API to convert documents (PDFs, DOCX, images, and more) into structured text. Docling extracts text, tables, and layout information and returns the content as Markdown or plain text.
Choose Docling when you need structured document conversion that preserves tables and layout information as Markdown. It can be self-hosted via Docker when documents must not leave your infrastructure. For a broader range of file types with element-level extraction, consider Unstructured. For web scraping, consider Firecrawl.
Installation
```bash
go get github.com/lookatitude/beluga-ai/rag/loader/providers/docling
```

Quick Start

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/rag/loader"
	// Blank import registers the "docling" provider with the loader registry.
	_ "github.com/lookatitude/beluga-ai/rag/loader/providers/docling"
)

func main() {
	l, err := loader.New("docling", config.ProviderConfig{
		BaseURL: "http://localhost:5001",
	})
	if err != nil {
		log.Fatal(err)
	}

	docs, err := l.Load(context.Background(), "/path/to/document.pdf")
	if err != nil {
		log.Fatal(err)
	}
	// Load returns no documents when the API response has no content,
	// so guard before indexing.
	if len(docs) == 0 {
		log.Fatal("docling returned no content")
	}
	fmt.Printf("Content: %s\n", docs[0].Content)
}
```

Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| BaseURL | string | http://localhost:5001 | Docling API endpoint |
| APIKey | string | "" | Optional Bearer token for authentication |
| Timeout | time.Duration | 0 (no timeout) | HTTP request timeout |
Source Types
The loader accepts two types of sources:
Local Files
File paths are uploaded to the Docling API as multipart form data:

```go
docs, err := l.Load(ctx, "/path/to/document.pdf")
```

URLs

HTTP/HTTPS URLs are passed to the Docling API as a JSON body for server-side download:

```go
docs, err := l.Load(ctx, "https://example.com/report.pdf")
```

Content Output
The Docling API returns both Markdown and plain text representations. The loader prefers Markdown content when available, falling back to plain text:

- Markdown content (`md_content`) is used if present
- Plain text content (`text_content`) is used as fallback
- If both are empty, `nil` is returned (no documents)
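The fallback order above can be sketched as a small pure function. This is an illustration of the selection rule only, not the loader's actual source: the `response` struct and `pickContent` are hypothetical, and only the two field names `md_content` and `text_content` come from the description above.

```go
package main

import "fmt"

// response is a hypothetical, trimmed-down view of a conversion result.
// Only the md_content and text_content names are taken from the text.
type response struct {
	MDContent   string // corresponds to md_content
	TextContent string // corresponds to text_content
}

// pickContent returns the preferred content and whether any was found:
// Markdown first, then plain text, otherwise nothing.
func pickContent(r response) (string, bool) {
	if r.MDContent != "" {
		return r.MDContent, true
	}
	if r.TextContent != "" {
		return r.TextContent, true
	}
	return "", false
}

func main() {
	cases := []response{
		{MDContent: "# Title", TextContent: "Title"},
		{TextContent: "Title"},
		{},
	}
	for _, r := range cases {
		content, ok := pickContent(r)
		fmt.Printf("found=%v content=%q\n", ok, content)
	}
}
```

The empty case is the reason callers should check the length of the returned slice before indexing into it.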
Document Metadata
| Field | Type | Description |
|---|---|---|
| source | string | Original file path or URL |
| format | string | Always "docling" |
| loader | string | Always "docling" |
Supported File Types
Docling supports a wide range of document formats including:
- PDF documents
- Microsoft Word (DOCX)
- Microsoft PowerPoint (PPTX)
- Images (PNG, JPG, TIFF)
- HTML pages
Refer to the Docling documentation for the complete list of supported formats.
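A caller that wants to skip obviously unsuitable files before uploading could check the extension first. The following is a minimal sketch under stated assumptions: `listedExts` and `isListedFormat` are hypothetical, and the extension set mirrors only the formats named on this page, so it is narrower than what Docling actually accepts.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// listedExts holds only the formats named on this page. It is an
// illustrative allow-list, not Docling's complete list of formats.
var listedExts = map[string]bool{
	".pdf":  true,
	".docx": true,
	".pptx": true,
	".png":  true,
	".jpg":  true,
	".tiff": true,
	".html": true,
}

// isListedFormat reports whether path ends in one of the listed
// extensions, ignoring case.
func isListedFormat(path string) bool {
	return listedExts[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"/path/to/document.pdf", "slides.PPTX", "notes.xyz"} {
		fmt.Printf("%s listed=%v\n", p, isListedFormat(p))
	}
}
```

A check like this is advisory only. The API remains the authority on what it can convert, so a file that passes may still be rejected and a file that fails may still be convertible.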
Self-Hosted Deployment
Docling can be run as a local service using Docker:

```bash
docker run -p 5001:5001 ds4sd/docling-serve
```

Once running, configure the loader to point to your local instance:

```go
l, err := loader.New("docling", config.ProviderConfig{
	BaseURL: "http://localhost:5001",
})
```

Error Handling

```go
docs, err := l.Load(ctx, "/path/to/document.pdf")
if err != nil {
	// Possible errors:
	// - "docling: source is required" (empty source)
	// - "docling: open file: ..." (local file not found)
	// - "docling: API error (status 422): ..." (unsupported format)
	// - "docling: request: ..." (connection failure)
	log.Fatal(err)
}
```