Skip to content
Docs

Production pgvector Sharding

As vector datasets grow beyond 1 million records, single-table scans become too slow and HNSW indexes may not fit in RAM. Table partitioning (sharding) combined with proper indexing provides the foundation for production-scale vector search. This approach leverages PostgreSQL’s built-in partitioning, which means you retain full SQL capabilities (joins, transactions, aggregations) alongside vector search — a significant advantage over purpose-built vector databases that sacrifice query flexibility.

A production-grade pgvector setup with HNSW indexing, table partitioning by tenant, and query performance optimization.

Connect Beluga AI to PostgreSQL with pgvector. The connection uses the same vectorstore.New() registry pattern as all other vector store providers, so switching from in-memory to pgvector requires only a provider name and configuration change.

package main
import (
"context"
"fmt"
"os"
"github.com/lookatitude/beluga-ai/config"
"github.com/lookatitude/beluga-ai/rag/vectorstore"
_ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/pgvector"
)
func main() {
ctx := context.Background()
store, err := vectorstore.New("pgvector", config.ProviderConfig{
"connection_string": os.Getenv("DATABASE_URL"),
"table_name": "documents",
})
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Println("Connected to pgvector")
_ = ctx
_ = store
}

Set up the table and HNSW index. HNSW (Hierarchical Navigable Small World) is preferred over IVFFlat for most workloads because it provides better recall with comparable latency and does not require a separate training step. The vector_cosine_ops operator class matches the cosine similarity metric used by most embedding models.

-- Enable the vector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create the documents table
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text NOT NULL,
metadata jsonb DEFAULT '{}',
embedding vector(1536) -- Match your embedder's dimension
);
-- Create HNSW index for cosine similarity
-- m: max connections per layer (16-64, higher = better recall, more memory)
-- ef_construction: build-time list size (64-512, higher = better recall, slower build)
CREATE INDEX idx_documents_embedding
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

HNSW parameters control the trade-off between recall, latency, and memory. The defaults (m=16, ef_construction=64) provide a good balance for most workloads. Increase these values when recall is more important than build time or memory usage — for example, in applications where returning the wrong document has high cost (medical, legal, financial).

ParameterRangeEffect
m16-64Higher = better recall, more memory per node
ef_construction64-512Higher = better index quality, slower build
ef_search40-500Higher = better recall at query time, slower queries

Set ef_search at query time:

-- Increase recall for important queries
SET hnsw.ef_search = 200;
SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

For multi-tenant applications, partition by tenant to reduce search scope. Partitioning confines each query to a single partition’s index rather than scanning the entire table, which reduces both latency and resource consumption. Each partition has its own HNSW index, so adding a new tenant creates a small, fast index rather than expanding a single large one.

-- Create partitioned table
CREATE TABLE documents_partitioned (
id bigserial,
tenant_id int NOT NULL,
content text NOT NULL,
metadata jsonb DEFAULT '{}',
embedding vector(1536),
PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);
-- Create partitions for each tenant
CREATE TABLE documents_tenant_1
PARTITION OF documents_partitioned
FOR VALUES IN (1);
CREATE TABLE documents_tenant_2
PARTITION OF documents_partitioned
FOR VALUES IN (2);
-- Create HNSW index on each partition
CREATE INDEX ON documents_tenant_1
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX ON documents_tenant_2
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

In Beluga AI, route queries to specific partitions using dynamic table names or metadata filtering. Creating a per-tenant store instance is the simplest approach — it maps directly to the partitioned table and ensures complete data isolation between tenants.

// Per-tenant store
tenantStore, err := vectorstore.New("pgvector", config.ProviderConfig{
"connection_string": os.Getenv("DATABASE_URL"),
"table_name": fmt.Sprintf("documents_tenant_%d", tenantID),
})

Load the HNSW index into shared buffers at startup. This eliminates cold-start latency for the first queries after a database restart, which is important in auto-scaling environments where new database instances may serve traffic immediately.

-- Requires pg_prewarm extension
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('idx_documents_embedding');

Run regular maintenance on vector tables. VACUUM ANALYZE updates the statistics that the query planner uses to choose efficient execution plans, and REINDEX rebuilds the HNSW graph after large batch inserts that may have degraded its structure.

-- Update statistics for the query planner
VACUUM ANALYZE documents;
-- Reindex periodically after large batch inserts
REINDEX INDEX CONCURRENTLY idx_documents_embedding;

Use a connection pooler (PgBouncer, pgcat) in front of PostgreSQL. Vector similarity queries can be CPU-intensive — limit concurrent queries to avoid overloading the database.

Dataset SizeRecommendation
< 100k vectorsSingle table, HNSW index
100k - 1MHNSW with tuned parameters
1M - 10MTable partitioning by tenant or category
> 10MHorizontal sharding across PostgreSQL instances
  1. Load 10,000 vectors into a table without an index — measure query time.
  2. Create the HNSW index — measure query time again (should be < 10ms).
  3. Create tenant partitions — verify queries only scan the relevant partition.