Conversational AI with Persistent Memory

Traditional chatbots lose context between sessions, forcing users to repeat themselves. Every interaction starts from scratch — the assistant does not remember the user’s name, preferences, past questions, or the decisions made in previous conversations. This creates a frustrating experience that feels more like filling out a form than talking to an intelligent assistant.

The fundamental challenge is that LLM context windows are finite. You cannot simply concatenate all past conversations into the prompt — it would quickly exceed token limits and degrade response quality. A conversational AI assistant with persistent memory solves this by maintaining three tiers of memory, inspired by the MemGPT architecture: core context always in the prompt, searchable conversation history, and long-term archival storage backed by vector search. This tiered approach keeps the most important information always available while making everything else retrievable on demand.

Beluga AI implements a MemGPT-inspired 3-tier memory system. The three tiers map to different access patterns and latency requirements:

  • Core memory: Always present in the context window. Contains the persona definition and key facts about the user. Self-editable — the agent can update its understanding of the user over time. This tier occupies a fixed budget of the context window (typically 2-4K tokens) and is optimized for prompt cache hits by placing it first in the message sequence.
  • Recall memory: Searchable conversation history. Stores full messages and retrieves relevant past exchanges by semantic similarity. This tier handles the “what did we discuss last time?” use case without loading all past conversations into context.
  • Archival memory: Long-term vector storage for facts, preferences, and knowledge extracted from conversations. This tier handles the “what do I know about this user?” use case, retrieving specific facts across potentially thousands of past interactions.

The MemGPT pattern is chosen over simpler approaches (buffer memory, window memory) because it explicitly manages the tradeoff between context window size and information availability. Buffer memory grows until it overflows the context window, and window memory silently drops older turns along with the facts they contained. MemGPT's three tiers keep the context window lean while ensuring nothing important becomes unreachable.

┌──────────────────────────────────────┐
│            Context Window            │
│                                      │
│  ┌────────────────────────────────┐  │
│  │ Core Memory (always present)   │  │
│  │  - Persona: "Helpful advisor"  │  │
│  │  - Human: "Prefers concise     │  │
│  │    answers, works in finance"  │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌────────────────────────────────┐  │
│  │ Recall Memory (recent turns)   │  │
│  │  - Last N messages             │  │
│  │  - Relevant past exchanges     │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌────────────────────────────────┐  │
│  │ Current Conversation           │  │
│  │  - User message                │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
                   │ Search
                   ▼
┌──────────────────────────────────────┐
│   Archival Memory (vector store)     │
│   - Extracted facts & preferences    │
│   - Past conversation summaries      │
│   - Domain knowledge                 │
└──────────────────────────────────────┘

package main

import (
    "context"
    "fmt"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/memory"
    "github.com/lookatitude/beluga-ai/rag/embedding"
    "github.com/lookatitude/beluga-ai/rag/vectorstore"

    _ "github.com/lookatitude/beluga-ai/memory/stores/inmemory"
    _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/openai"
    _ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/pgvector"
)

type ConversationAssistant struct {
    core     *memory.Core
    recall   *memory.Recall
    archival *memory.Archival
    // archCfg keeps direct handles to the embedder and vector store so that
    // fact extraction (archiveIfRelevant below) can write to archival storage.
    archCfg memory.ArchivalConfig
    model   llm.ChatModel
}

func NewConversationAssistant(ctx context.Context) (*ConversationAssistant, error) {
    // Core memory: always in context, self-editable
    core := memory.NewCore(memory.CoreConfig{
        PersonaLimit: 2000, // Max chars for persona block
        HumanLimit:   2000, // Max chars for human info block
        SelfEditable: true, // Agent can update its understanding
    })
    core.SetPersona("You are a helpful personal assistant. You remember " +
        "past conversations and user preferences to provide personalized help.")

    // Recall memory: searchable conversation history
    messageStore, err := memory.NewMessageStore("inmemory", nil)
    if err != nil {
        return nil, fmt.Errorf("create message store: %w", err)
    }
    recall := memory.NewRecall(messageStore)

    // Archival memory: long-term vector storage
    embedder, err := embedding.New("openai", nil)
    if err != nil {
        return nil, fmt.Errorf("create embedder: %w", err)
    }
    store, err := vectorstore.New("pgvector", nil)
    if err != nil {
        return nil, fmt.Errorf("create vector store: %w", err)
    }
    archCfg := memory.ArchivalConfig{
        VectorStore: store,
        Embedder:    embedder,
    }
    archival, err := memory.NewArchival(archCfg)
    if err != nil {
        return nil, fmt.Errorf("create archival memory: %w", err)
    }

    model, err := llm.New("openai", nil)
    if err != nil {
        return nil, fmt.Errorf("create model: %w", err)
    }

    return &ConversationAssistant{
        core:     core,
        recall:   recall,
        archival: archival,
        archCfg:  archCfg,
        model:    model,
    }, nil
}

Each turn assembles context from all three memory tiers, generates a response, and saves the exchange back into memory. The context assembly order matters: core memory goes first to maximize prompt cache hits (static content first, per Beluga AI’s prompt cache optimization pattern), then recall memory, then archival results. This ordering means the persona and user facts — which rarely change — can be cached across requests.

func (ca *ConversationAssistant) Chat(ctx context.Context, userMessage string) (string, error) {
    // 1. Build context from all memory tiers
    msgs := ca.buildContext(ctx, userMessage)

    // 2. Add the current user message
    humanMsg := &schema.HumanMessage{Parts: []schema.ContentPart{
        schema.TextPart{Text: userMessage},
    }}
    msgs = append(msgs, humanMsg)

    // 3. Generate response
    resp, err := ca.model.Generate(ctx, msgs)
    if err != nil {
        return "", fmt.Errorf("generate: %w", err)
    }
    responseText := resp.Parts[0].(schema.TextPart).Text

    // 4. Save to recall memory
    if err := ca.recall.Save(ctx, humanMsg, resp); err != nil {
        return "", fmt.Errorf("save recall: %w", err)
    }

    // 5. Extract and archive important facts
    ca.archiveIfRelevant(ctx, userMessage, responseText)

    return responseText, nil
}

func (ca *ConversationAssistant) buildContext(ctx context.Context, query string) []schema.Message {
    var msgs []schema.Message

    // Core memory: always first (optimizes prompt caching)
    msgs = append(msgs, ca.core.ToMessages()...)

    // Recall memory: recent conversation history
    recent, err := ca.recall.Load(ctx, query)
    if err == nil {
        msgs = append(msgs, recent...)
    }

    // Archival memory: relevant long-term facts
    archived, err := ca.archival.Search(ctx, query, 3)
    if err == nil && len(archived) > 0 {
        var archiveContext string
        for _, doc := range archived {
            archiveContext += "- " + doc.Content + "\n"
        }
        msgs = append(msgs, &schema.SystemMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: "Relevant facts from past conversations:\n" + archiveContext},
        }})
    }

    return msgs
}

The assistant can update its core memory as it learns about the user. This self-updating capability is the key differentiator of the MemGPT pattern: the agent uses structured output (llm.NewStructured[Facts]) to extract facts from each exchange, then stores them in archival memory and promotes fundamental facts to core memory. Over time, the assistant builds a rich understanding of the user without any manual configuration.

func (ca *ConversationAssistant) archiveIfRelevant(ctx context.Context, userMsg, response string) {
    // Use the LLM to decide if this exchange contains important facts
    msgs := []schema.Message{
        &schema.SystemMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: "Extract any new facts about the user from this exchange. " +
                "Return a JSON array of facts, or an empty array if none."},
        }},
        &schema.HumanMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: fmt.Sprintf("User: %s\nAssistant: %s", userMsg, response)},
        }},
    }

    type Facts struct {
        Items []string `json:"items"`
    }
    structured := llm.NewStructured[Facts](ca.model)
    facts, err := structured.Generate(ctx, msgs)
    if err != nil || len(facts.Items) == 0 {
        return
    }

    // Store extracted facts in archival memory, using the embedder and vector
    // store captured in archCfg during construction.
    for _, fact := range facts.Items {
        doc := schema.Document{
            Content:  fact,
            Metadata: map[string]any{"source": "conversation", "timestamp": time.Now().Unix()},
        }
        vec, err := ca.archCfg.Embedder.EmbedSingle(ctx, fact)
        if err != nil {
            continue
        }
        ca.archCfg.VectorStore.Add(ctx, []schema.Document{doc}, [][]float32{vec})
    }

    // Update core memory if we learned something fundamental about the user
    for _, fact := range facts.Items {
        if isCoreFact(fact) {
            current := ca.core.GetHuman()
            ca.core.SetHuman(current + "\n- " + fact)
        }
    }
}
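
The isCoreFact helper referenced above is application-defined. A minimal sketch, assuming the standard strings package is imported, is a keyword heuristic over stable identity and preference markers; an LLM classification call works just as well:

// isCoreFact is a placeholder heuristic: promote a fact to core memory when it
// looks like stable identity or preference information. The marker list is
// illustrative; tune it for your domain or replace it with a model call.
func isCoreFact(fact string) bool {
    markers := []string{"name is", "works in", "works as", "prefers", "lives in", "role is"}
    lower := strings.ToLower(fact)
    for _, m := range markers {
        if strings.Contains(lower, m) {
            return true
        }
    }
    return false
}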

Stream responses token by token for a natural conversational feel:

func (ca *ConversationAssistant) StreamChat(ctx context.Context, userMessage string) iter.Seq2[schema.StreamChunk, error] {
    msgs := ca.buildContext(ctx, userMessage)
    msgs = append(msgs, &schema.HumanMessage{Parts: []schema.ContentPart{
        schema.TextPart{Text: userMessage},
    }})
    return ca.model.Stream(ctx, msgs)
}
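
A caller consumes the stream with Go 1.23 range-over-func iteration. The chunk.Delta accessor below is a placeholder; use whatever field or method schema.StreamChunk actually exposes for the streamed text:

// Range over the returned iterator, handling each chunk and error in turn.
for chunk, err := range ca.StreamChat(ctx, "What did we decide last time?") {
    if err != nil {
        fmt.Println("\nstream error:", err)
        break
    }
    fmt.Print(chunk.Delta) // placeholder accessor for the chunk's text delta
}
fmt.Println()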

For production deployments, use durable stores so conversations survive restarts:

import (
    _ "github.com/lookatitude/beluga-ai/memory/stores/postgres"
    _ "github.com/lookatitude/beluga-ai/memory/stores/redis"
)

// Redis for recall memory (fast read/write)
messageStore, err := memory.NewMessageStore("redis", config.ProviderConfig{
    "addr":   "localhost:6379",
    "prefix": "user:" + userID,
})

// PostgreSQL + pgvector for archival memory (persistent, searchable)
store, err := vectorstore.New("pgvector", config.ProviderConfig{
    "connection_string": os.Getenv("DATABASE_URL"),
    "table_name":        "archival_memory",
})

Core memory consumes a fixed portion of the context window. Monitor and manage it:

// Check core memory size before adding facts
persona := ca.core.GetPersona()
human := ca.core.GetHuman()
if len(persona)+len(human) > 3000 {
    // Summarize the human profile to fit within limits
    summarized, err := summarizeProfile(ctx, ca.model, human)
    if err == nil {
        ca.core.SetHuman(summarized)
    }
}
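
summarizeProfile is left to the application. A minimal sketch, reusing the Generate call and message types from earlier, asks the model to compress the profile while keeping durable facts:

// summarizeProfile asks the model to rewrite the accumulated human profile
// within the core memory budget, keeping identity, role, and preferences.
func summarizeProfile(ctx context.Context, model llm.ChatModel, profile string) (string, error) {
    msgs := []schema.Message{
        &schema.SystemMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: "Rewrite the following user profile in under 1500 characters. " +
                "Keep names, roles, and stated preferences; drop anything transient."},
        }},
        &schema.HumanMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: profile},
        }},
    }
    resp, err := model.Generate(ctx, msgs)
    if err != nil {
        return "", fmt.Errorf("summarize profile: %w", err)
    }
    return resp.Parts[0].(schema.TextPart).Text, nil
}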

Track memory operations, context sizes, and retrieval quality:

span.SetAttributes(
    attribute.Int("memory.core_size", len(ca.core.GetPersona())+len(ca.core.GetHuman())),
    attribute.Int("memory.recall_messages", len(recent)),
    attribute.Int("memory.archival_results", len(archived)),
    attribute.Int("memory.total_context_tokens", estimateTokens(msgs)),
)
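
The estimateTokens helper is not part of Beluga AI. A minimal sketch, using only the message and part types shown above, is a characters-divided-by-four heuristic; swap in a real tokenizer where accuracy matters:

// estimateTokens applies a rough ~4 characters-per-token heuristic over the
// text parts of each message. The type switch covers the concrete message
// types used in this guide; extend it if you add other message kinds.
func estimateTokens(msgs []schema.Message) int {
    chars := 0
    for _, m := range msgs {
        var parts []schema.ContentPart
        switch v := m.(type) {
        case *schema.SystemMessage:
            parts = v.Parts
        case *schema.HumanMessage:
            parts = v.Parts
        }
        for _, p := range parts {
            if t, ok := p.(schema.TextPart); ok {
                chars += len(t.Text)
            }
        }
    }
    return chars / 4
}
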
Memory stores hold personal data, so treat them as sensitive:

  • Encrypt memory data at rest (database-level encryption for PostgreSQL) and in transit (TLS for Redis)
  • Implement per-user memory isolation — each user has their own core, recall, and archival stores
  • Provide a Clear() method so users can delete their data (GDPR right to erasure; see the sketch after this list)
  • Set TTLs on recall memory to automatically expire old conversations
  • Never store PII in core memory — use the guard pipeline to screen before saving
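
A sketch of the erasure path referenced above: only the core-memory reset uses an API shown in this guide, while deleteRecallHistory and deleteArchivalDocuments are hypothetical stand-ins for whatever delete operations your message store and vector store actually provide:

// Clear is the per-user erasure entry point. Wire the two hypothetical helper
// calls below to the delete operations of your chosen stores.
func (ca *ConversationAssistant) Clear(ctx context.Context) error {
    // Core memory: wipe the self-edited user profile but keep the persona.
    ca.core.SetHuman("")

    // Recall memory: delete this user's conversation history (hypothetical call).
    if err := deleteRecallHistory(ctx, ca.recall); err != nil {
        return fmt.Errorf("clear recall memory: %w", err)
    }

    // Archival memory: delete this user's documents and vectors (hypothetical call).
    if err := deleteArchivalDocuments(ctx, ca.archCfg.VectorStore); err != nil {
        return fmt.Errorf("clear archival memory: %w", err)
    }
    return nil
}
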
In production, match each tier to a store that fits its access pattern:

  • Core memory: In-memory per session, persisted on session end. Lightweight.
  • Recall memory: Redis for sub-millisecond lookups. Shard by user ID.
  • Archival memory: pgvector with an HNSW index for fast approximate nearest-neighbor search at scale.
  • Deploy the assistant as a stateless service; all state lives in the memory stores (see the handler sketch below).
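
A handler sketch for the stateless-service point above, using only the standard library plus the Chat method defined earlier. It assumes net/http and encoding/json are imported; newAssistantForUser is a hypothetical per-request constructor that wires the user-scoped stores from the durable-store example:

// No per-user state lives in process memory: every request rebuilds the
// assistant from the durable stores, so any replica can serve any user.
http.HandleFunc("/chat", func(w http.ResponseWriter, r *http.Request) {
    userID := r.Header.Get("X-User-ID")
    ca, err := newAssistantForUser(r.Context(), userID) // hypothetical constructor
    if err != nil {
        http.Error(w, "assistant init failed", http.StatusInternalServerError)
        return
    }

    var req struct {
        Message string `json:"message"`
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    reply, err := ca.Chat(r.Context(), req.Message)
    if err != nil {
        http.Error(w, "chat failed", http.StatusInternalServerError)
        return
    }
    json.NewEncoder(w).Encode(map[string]string{"reply": reply})
})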