Unstructured Document Loader
The Unstructured loader implements the loader.DocumentLoader interface using the Unstructured.io API to extract structured content from a wide range of file types (PDFs, DOCX, images, HTML, and more). It uploads files to the Unstructured partition API and returns the extracted elements as a single consolidated document.
Choose Unstructured when you need to process a wide variety of document formats (PDFs, DOCX, images with OCR, emails, and more) through a single loader. Unstructured provides element-level extraction with metadata about document structure. It can be self-hosted via Docker for data privacy. For Markdown-focused document conversion, consider Docling. For web scraping, consider Firecrawl.
Installation
    go get github.com/lookatitude/beluga-ai/rag/loader/providers/unstructured

Quick Start
    package main

    import (
    	"context"
    	"fmt"
    	"log"
    	"os"

    	"github.com/lookatitude/beluga-ai/config"
    	"github.com/lookatitude/beluga-ai/rag/loader"
    	// Blank import registers the "unstructured" provider with the loader registry.
    	_ "github.com/lookatitude/beluga-ai/rag/loader/providers/unstructured"
    )

    func main() {
    	// Create the loader, reading the API key from the environment.
    	l, err := loader.New("unstructured", config.ProviderConfig{
    		APIKey:  os.Getenv("UNSTRUCTURED_API_KEY"),
    		BaseURL: "https://api.unstructured.io",
    	})
    	if err != nil {
    		log.Fatal(err)
    	}

    	// Upload the file and receive the consolidated document.
    	docs, err := l.Load(context.Background(), "/path/to/document.pdf")
    	if err != nil {
    		log.Fatal(err)
    	}
    	// Guard against an empty result before indexing into the slice.
    	if len(docs) == 0 {
    		log.Fatal("no documents returned")
    	}
    	fmt.Printf("Elements extracted: %v\n", docs[0].Metadata["elements"])
    	fmt.Printf("Content: %s\n", docs[0].Content)
    }

Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| APIKey | string | "" | Unstructured API key (set via unstructured-api-key header) |
| BaseURL | string | https://api.unstructured.io | API endpoint |
| Timeout | time.Duration | 0 (no timeout) | HTTP request timeout |
Source Format
The source parameter is a local file path. The file is uploaded to the Unstructured API as multipart form data:

    docs, err := l.Load(ctx, "/path/to/document.pdf")

Content Extraction
The Unstructured API returns an array of structured elements (titles, narrative text, tables, etc.). The loader combines all text elements into a single document, separated by double newlines. Empty elements are skipped.
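The consolidation step described above can be sketched in plain Go. This is a minimal illustration, not the loader's actual implementation: the `element` type, the `combineElements` helper, and the JSON field names `type` and `text` are assumptions about the response shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// element holds the two fields of an API element that this sketch uses.
// The JSON field names "type" and "text" are assumptions about the response.
type element struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

// combineElements decodes a JSON array of elements and joins the non-empty
// texts with a blank line between them. It also returns the total number of
// elements in the array, including the empty ones that were skipped.
func combineElements(raw []byte) (string, int, error) {
	var elems []element
	if err := json.Unmarshal(raw, &elems); err != nil {
		return "", 0, fmt.Errorf("decode elements: %w", err)
	}
	parts := make([]string, 0, len(elems))
	for _, e := range elems {
		if e.Text == "" {
			continue // skip empty elements
		}
		parts = append(parts, e.Text)
	}
	return strings.Join(parts, "\n\n"), len(elems), nil
}

func main() {
	raw := []byte(`[
		{"type": "Title", "text": "Quarterly Report"},
		{"type": "PageBreak", "text": ""},
		{"type": "NarrativeText", "text": "Revenue grew in Q3."}
	]`)
	content, n, err := combineElements(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d elements\n%s\n", n, content)
}
```

In this sketch the count covers every element in the array, while the joined content contains only the two non-empty texts.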
Document Metadata
| Field | Type | Description |
|---|---|---|
| source | string | Original file path |
| format | string | Always "unstructured" |
| loader | string | Always "unstructured" |
| filename | string | Base filename extracted from the path |
| elements | int | Number of elements returned by the API |
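To make the table concrete, here is a small sketch of how such a metadata map could be assembled for one file. The `buildMetadata` helper and the `map[string]any` type are hypothetical and only mirror the fields listed above; the loader's real internals may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// buildMetadata assembles the fields from the metadata table for one file.
// The function name and the map type are hypothetical.
func buildMetadata(source string, elementCount int) map[string]any {
	return map[string]any{
		"source":   source,                // original file path
		"format":   "unstructured",        // constant value
		"loader":   "unstructured",        // constant value
		"filename": filepath.Base(source), // last path component
		"elements": elementCount,          // number of elements from the API
	}
}

func main() {
	meta := buildMetadata("/path/to/document.pdf", 12)
	fmt.Println(meta["filename"], meta["elements"])
}
```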
Supported File Types
Unstructured supports a comprehensive set of document formats:
- PDF, DOCX, DOC, PPTX, PPT, XLSX
- HTML, XML, Markdown, RST
- Plain text, CSV, TSV
- Images (PNG, JPG, TIFF) with OCR
- Email formats (EML, MSG)
- EPUB
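Because unsupported formats are only rejected after the upload, a cheap local check on the file extension can save a round trip. The sketch below is one way to do that; the `looksSupported` helper is hypothetical, and the extension spellings (for example `.md` for Markdown or `.jpeg` alongside `.jpg`) are assumptions derived from the list above, not an official list.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExts maps the formats listed above to common file extensions.
// The extension spellings are assumptions, not taken from Unstructured's docs.
var supportedExts = map[string]bool{
	".pdf": true, ".docx": true, ".doc": true, ".pptx": true, ".ppt": true, ".xlsx": true,
	".html": true, ".xml": true, ".md": true, ".rst": true,
	".txt": true, ".csv": true, ".tsv": true,
	".png": true, ".jpg": true, ".jpeg": true, ".tiff": true,
	".eml": true, ".msg": true,
	".epub": true,
}

// looksSupported reports whether the path's extension, compared
// case-insensitively, appears in supportedExts.
func looksSupported(path string) bool {
	return supportedExts[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"report.PDF", "notes.md", "archive.zip"} {
		fmt.Printf("%s: %v\n", p, looksSupported(p))
	}
}
```

A check like this is only a heuristic: the API remains the authority on what it accepts, so a file that passes locally can still be rejected.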
Self-Hosted Deployment
You can run Unstructured locally using Docker:

    docker run -p 8000:8000 quay.io/unstructured-io/unstructured-api:latest

Then point the loader to your local instance:

    l, err := loader.New("unstructured", config.ProviderConfig{
    	BaseURL: "http://localhost:8000",
    })

When self-hosting, the APIKey is optional.
Error Handling
    docs, err := l.Load(ctx, "/path/to/document.pdf")
    if err != nil {
    	// Possible errors:
    	// - "unstructured: source file path is required" (empty source)
    	// - "unstructured: open file: ..." (file not found)
    	// - "unstructured: API error (status 422): ..." (unsupported format)
    	// - "unstructured: request: ..." (connection failure)
    	log.Fatal(err)
    }