Ollama Local Embeddings

Cloud embedding APIs send your text to external servers, which may be unacceptable for sensitive data (medical records, financial documents, proprietary code). Ollama runs embedding models entirely on your own hardware, keeping data local with zero external network calls and no per-token costs.

Choose Ollama when you need data privacy, operate in air-gapped environments, want to eliminate API costs for high-volume workloads, or need embeddings during development without cloud credentials.

Beluga AI’s Embedder interface in the rag/embedding package provides a uniform API for all embedding providers. Ollama registers as "ollama" in the global registry and is instantiated via the standard embedding.New factory. The Ollama provider communicates with a local Ollama server over HTTP.

The recommended model is nomic-embed-text, which produces 768-dimensional vectors and provides strong general-purpose embeddings.

Prerequisites:

  • Go 1.23 or later
  • Beluga AI framework installed
  • Ollama installed and running (ollama.com)

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On macOS, you can also install via Homebrew or download the application from ollama.com.

Start the Ollama server:

ollama serve

In a separate terminal, pull an embedding model:

ollama pull nomic-embed-text

Verify the server is reachable:

curl http://localhost:11434/api/tags

This should return a JSON response listing the installed models.

All embedding providers implement:

type Embedder interface {
    Embed(ctx context.Context, texts []string) ([][]float32, error)
    EmbedSingle(ctx context.Context, text string) ([]float32, error)
    Dimensions() int
}

Create an Ollama embedder via the registry and generate embeddings:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/lookatitude/beluga-ai/config"
    "github.com/lookatitude/beluga-ai/rag/embedding"

    // Register the Ollama provider
    _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/ollama"
)

func main() {
    ctx := context.Background()

    emb, err := embedding.New("ollama", config.ProviderConfig{
        Model:   "nomic-embed-text",
        BaseURL: "http://localhost:11434",
    })
    if err != nil {
        log.Fatal(err)
    }

    texts := []string{
        "The capital of France is Paris.",
        "Go is a statically typed programming language.",
    }

    vectors, err := emb.Embed(ctx, texts)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Generated %d embeddings of dimension %d\n", len(vectors), emb.Dimensions())
}

To embed a single query, use EmbedSingle:

vector, err := emb.EmbedSingle(ctx, "What is the capital of France?")
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Query vector dimension: %d\n", len(vector))

The following complete example embeds a small document set and ranks it against a query by cosine similarity:

package main

import (
    "context"
    "fmt"
    "log"
    "math"

    "github.com/lookatitude/beluga-ai/config"
    "github.com/lookatitude/beluga-ai/rag/embedding"

    _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/ollama"
)

func main() {
    ctx := context.Background()

    emb, err := embedding.New("ollama", config.ProviderConfig{
        Model:   "nomic-embed-text",
        BaseURL: "http://localhost:11434",
    })
    if err != nil {
        log.Fatal(err)
    }

    // Index some documents
    docs := []string{
        "Paris is the capital and most populous city of France.",
        "Berlin is the capital of Germany.",
        "Go was designed at Google by Robert Griesemer, Rob Pike, and Ken Thompson.",
    }
    docVecs, err := emb.Embed(ctx, docs)
    if err != nil {
        log.Fatal(err)
    }

    // Query
    queryVec, err := emb.EmbedSingle(ctx, "capital of France")
    if err != nil {
        log.Fatal(err)
    }

    // Find the most similar document
    for i, dv := range docVecs {
        sim := cosineSimilarity(queryVec, dv)
        fmt.Printf("Doc %d (%.4f): %s\n", i, sim, docs[i])
    }
}

func cosineSimilarity(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

Available embedding models:

Model                   Dimensions   Size      Use Case
nomic-embed-text        768          ~274 MB   General purpose, recommended
mxbai-embed-large       1024         ~670 MB   Higher accuracy, larger vectors
all-minilm              384          ~45 MB    Lightweight, fast inference
snowflake-arctic-embed  1024         ~670 MB   High-quality retrieval

Pull additional models as needed:

ollama pull mxbai-embed-large
ollama pull all-minilm

Point to a remote Ollama instance or a non-default port:

emb, err := embedding.New("ollama", config.ProviderConfig{
    Model:   "nomic-embed-text",
    BaseURL: "http://gpu-server.internal:11434",
    Timeout: 60 * time.Second, // requires the "time" import
})

Monitor local embedding performance:

import "log/slog"

hooks := embedding.Hooks{
    BeforeEmbed: func(ctx context.Context, texts []string) error {
        slog.Info("embedding locally", "count", len(texts), "model", "nomic-embed-text")
        return nil
    },
    AfterEmbed: func(ctx context.Context, embeddings [][]float32, err error) {
        if err != nil {
            slog.Error("local embedding failed", "error", err)
        } else {
            slog.Info("local embedding complete", "vectors", len(embeddings))
        }
    },
}

Connect Ollama embeddings to a vector store for similarity search:

import (
    "log"
    "os"

    "github.com/lookatitude/beluga-ai/config"
    "github.com/lookatitude/beluga-ai/rag/embedding"
    "github.com/lookatitude/beluga-ai/rag/vectorstore"

    _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/ollama"
    _ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/pgvector"
)

emb, err := embedding.New("ollama", config.ProviderConfig{
    Model:   "nomic-embed-text",
    BaseURL: "http://localhost:11434",
})
if err != nil {
    log.Fatal(err)
}

store, err := vectorstore.New("pgvector", config.ProviderConfig{
    Options: map[string]any{
        "connection_string": os.Getenv("PGVECTOR_URL"),
        "dimensions":        768.0, // Must match nomic-embed-text dimensions
        "collection":        "local_docs",
    },
})
if err != nil {
    log.Fatal(err)
}

Ollama processes texts sequentially on most hardware. For large document sets, consider batching to provide progress feedback:

batchSize := 50
for i := 0; i < len(texts); i += batchSize {
    end := min(i+batchSize, len(texts))
    batch := texts[i:end]

    vectors, err := emb.Embed(ctx, batch)
    if err != nil {
        return fmt.Errorf("batch %d: %w", i/batchSize, err)
    }

    log.Printf("embedded batch %d/%d (%d texts)",
        i/batchSize+1, (len(texts)+batchSize-1)/batchSize, len(batch))
    // Store vectors...
}

Configuration options:

Option    Description         Default                  Required
Model     Ollama model name   -                        Yes
BaseURL   Ollama server URL   http://localhost:11434   No
Timeout   Request timeout     30s                      No

Provider-specific options can be passed via the Options map in config.ProviderConfig.

Troubleshooting:

“connection refused” — The Ollama server is not running. Start it with ollama serve. Verify it is accessible at the configured URL with curl http://localhost:11434/api/tags.

“model not found” — The requested model has not been pulled. Download it with ollama pull <model-name>. List installed models with ollama list.

Slow inference — Local embedding speed depends on hardware. GPU acceleration significantly improves throughput. Ensure Ollama detects your GPU with ollama ps. For CPU-only systems, use smaller models like all-minilm.

High memory usage — Embedding models are loaded into memory when first used. The nomic-embed-text model requires approximately 600 MB of RAM. Monitor with ollama ps and unload unused models with ollama stop <model>.

Deployment considerations:

  • Ollama is designed for local and development use. For production deployments, evaluate the security posture of the Ollama server carefully
  • Run the Ollama server in an isolated network segment, not exposed to the public internet
  • Monitor resource usage — embedding models consume significant GPU memory or CPU resources
  • Use GPU acceleration (NVIDIA, AMD, or Apple Silicon) for acceptable throughput on large document sets
  • Consider running Ollama in a container with resource limits to prevent runaway memory usage
  • For multi-tenant deployments, run separate Ollama instances per tenant to isolate model loading and memory