Unstructured Document Loader

The Unstructured loader implements the loader.DocumentLoader interface using the Unstructured.io API to extract structured content from a wide range of file types (PDFs, DOCX, images, HTML, and more). It uploads files to the Unstructured partition API and returns the extracted elements as a single consolidated document.

Choose Unstructured when you need to process a wide variety of document formats (PDFs, DOCX, images with OCR, emails, and more) through a single loader. Unstructured provides element-level extraction with metadata about document structure. It can be self-hosted via Docker for data privacy. For Markdown-focused document conversion, consider Docling. For web scraping, consider Firecrawl.

go get github.com/lookatitude/beluga-ai/rag/loader/providers/unstructured

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/config"
	"github.com/lookatitude/beluga-ai/rag/loader"
	// Blank import so the provider is available under the name "unstructured".
	_ "github.com/lookatitude/beluga-ai/rag/loader/providers/unstructured"
)

func main() {
	// Create the loader by provider name.
	l, err := loader.New("unstructured", config.ProviderConfig{
		APIKey:  os.Getenv("UNSTRUCTURED_API_KEY"),
		BaseURL: "https://api.unstructured.io",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Upload the file and receive the consolidated document.
	docs, err := l.Load(context.Background(), "/path/to/document.pdf")
	if err != nil {
		log.Fatal(err)
	}
	if len(docs) == 0 {
		log.Fatal("no documents returned")
	}

	fmt.Printf("Elements extracted: %v\n", docs[0].Metadata["elements"])
	fmt.Printf("Content: %s\n", docs[0].Content)
}

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| APIKey | string | "" | Unstructured API key (sent in the unstructured-api-key header) |
| BaseURL | string | https://api.unstructured.io | API endpoint |
| Timeout | time.Duration | 0 (no timeout) | HTTP request timeout |

The source parameter is a local file path. The file is uploaded to the Unstructured API as multipart form data:

docs, err := l.Load(ctx, "/path/to/document.pdf")

The Unstructured API returns an array of structured elements (titles, narrative text, tables, etc.). The loader combines all text elements into a single document, separated by double newlines. Empty elements are skipped.
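
The consolidation step can be pictured with a minimal sketch like the one below. The element struct and joinElements function are hypothetical names, the struct keeps only two of the fields the API returns, and treating whitespace-only text as empty is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

// element is a hypothetical, trimmed-down view of one API element.
type element struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

// joinElements concatenates non-empty element texts with a blank line
// between them, skipping elements that carry no text.
func joinElements(elems []element) string {
	parts := make([]string, 0, len(elems))
	for _, e := range elems {
		if strings.TrimSpace(e.Text) == "" {
			continue // assumption: whitespace-only counts as empty
		}
		parts = append(parts, e.Text)
	}
	return strings.Join(parts, "\n\n")
}

func main() {
	// Illustrative response body, not captured from a real API call.
	raw := `[
		{"type": "Title", "text": "Quarterly Report"},
		{"type": "PageBreak", "text": ""},
		{"type": "NarrativeText", "text": "Revenue grew in Q3."}
	]`

	var elems []element
	if err := json.Unmarshal([]byte(raw), &elems); err != nil {
		log.Fatal(err)
	}
	fmt.Println(joinElements(elems))
}
```

Because the result is one string, element boundaries and types are no longer available downstream; only the element count survives, in the metadata.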

| Field | Type | Description |
|-------|------|-------------|
| source | string | Original file path |
| format | string | Always "unstructured" |
| loader | string | Always "unstructured" |
| filename | string | Base filename extracted from the path |
| elements | int | Number of elements returned by the API |
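
Metadata values are read back with type assertions. The sketch below mirrors the fields in the table above in a plain map and reads two of them defensively. The buildMetadata helper is hypothetical, and the assumption that the metadata is a map[string]any is not confirmed by this page.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// buildMetadata is a hypothetical stand-in that produces the same fields
// the table lists, so the reading code below has something to work on.
func buildMetadata(source string, elementCount int) map[string]any {
	return map[string]any{
		"source":   source,
		"format":   "unstructured",
		"loader":   "unstructured",
		"filename": filepath.Base(source),
		"elements": elementCount,
	}
}

func main() {
	meta := buildMetadata("/path/to/document.pdf", 12)

	// Use the two-value form so a missing key or unexpected type
	// does not panic.
	if name, ok := meta["filename"].(string); ok {
		fmt.Println("filename:", name)
	}
	if n, ok := meta["elements"].(int); ok {
		fmt.Println("elements:", n)
	}
}
```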

Unstructured supports a broad set of document formats, including:

  • PDF, DOCX, DOC, PPTX, PPT, XLSX
  • HTML, XML, Markdown, RST
  • Plain text, CSV, TSV
  • Images (PNG, JPG, TIFF) with OCR
  • Email formats (EML, MSG)
  • EPUB
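
A client-side extension check can catch obviously unsupported files before they are uploaded. The sketch below is one way to do that. The isLikelySupported helper is hypothetical, and the extension set is an assumption derived from the list above (for example .md for Markdown and .txt for plain text); the API remains the authority on what it accepts.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedExt is an assumed mapping from the formats listed in the docs
// to common file extensions. It is deliberately not exhaustive.
var supportedExt = map[string]bool{
	".pdf": true, ".docx": true, ".doc": true, ".pptx": true, ".ppt": true, ".xlsx": true,
	".html": true, ".xml": true, ".md": true, ".rst": true,
	".txt": true, ".csv": true, ".tsv": true,
	".png": true, ".jpg": true, ".tiff": true,
	".eml": true, ".msg": true,
	".epub": true,
}

// isLikelySupported reports whether the path's extension appears in the
// assumed set, ignoring case.
func isLikelySupported(path string) bool {
	return supportedExt[strings.ToLower(filepath.Ext(path))]
}

func main() {
	for _, p := range []string{"Report.PDF", "notes.md", "archive.zip"} {
		fmt.Printf("%s supported: %t\n", p, isLikelySupported(p))
	}
}
```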

You can run Unstructured locally using Docker:

docker run -p 8000:8000 quay.io/unstructured-io/unstructured-api:latest

Then point the loader to your local instance:

l, err := loader.New("unstructured", config.ProviderConfig{
	BaseURL: "http://localhost:8000",
})

When self-hosting, the APIKey is optional.

docs, err := l.Load(ctx, "/path/to/document.pdf")
if err != nil {
	// Possible errors:
	// - "unstructured: source file path is required" (empty source)
	// - "unstructured: open file: ..." (file not found)
	// - "unstructured: API error (status 422): ..." (unsupported format)
	// - "unstructured: request: ..." (connection failure)
	log.Fatal(err)
}