Conversational AI with Persistent Memory
Traditional chatbots lose context between sessions, forcing users to repeat themselves. Every interaction starts from scratch — the assistant does not remember the user’s name, preferences, past questions, or the decisions made in previous conversations. This creates a frustrating experience that feels more like filling out a form than talking to an intelligent assistant.
The fundamental challenge is that LLM context windows are finite. You cannot simply concatenate all past conversations into the prompt — it would quickly exceed token limits and degrade response quality. A conversational AI assistant with persistent memory solves this by maintaining three tiers of memory, inspired by the MemGPT architecture: core context always in the prompt, searchable conversation history, and long-term archival storage backed by vector search. This tiered approach keeps the most important information always available while making everything else retrievable on demand.
Solution Architecture
Beluga AI implements a MemGPT-inspired 3-tier memory system. The three tiers map to different access patterns and latency requirements:
- Core memory: Always present in the context window. Contains the persona definition and key facts about the user. Self-editable — the agent can update its understanding of the user over time. This tier occupies a fixed budget of the context window (typically 2-4K tokens) and is optimized for prompt cache hits by placing it first in the message sequence.
- Recall memory: Searchable conversation history. Stores full messages and retrieves relevant past exchanges by semantic similarity. This tier handles the “what did we discuss last time?” use case without loading all past conversations into context.
- Archival memory: Long-term vector storage for facts, preferences, and knowledge extracted from conversations. This tier handles the “what do I know about this user?” use case, retrieving specific facts across potentially thousands of past interactions.
The MemGPT pattern is chosen over simpler approaches (buffer memory, window memory) because it explicitly manages the tradeoff between context window size and information availability. Buffer memory loses old context, window memory loses specific facts — MemGPT’s three tiers ensure nothing important is lost while keeping the context window lean.
┌──────────────────────────────────────┐
│ Context Window                       │
│                                      │
│  ┌────────────────────────────────┐  │
│  │ Core Memory (always present)   │  │
│  │ - Persona: "Helpful advisor"   │  │
│  │ - Human: "Prefers concise      │  │
│  │   answers, works in finance"   │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌────────────────────────────────┐  │
│  │ Recall Memory (recent turns)   │  │
│  │ - Last N messages              │  │
│  │ - Relevant past exchanges      │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌────────────────────────────────┐  │
│  │ Current Conversation           │  │
│  │ - User message                 │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
                   │
                   │ Search
                   ▼
┌──────────────────────────────────────┐
│ Archival Memory (vector store)       │
│ - Extracted facts & preferences      │
│ - Past conversation summaries        │
│ - Domain knowledge                   │
└──────────────────────────────────────┘
Setting Up 3-Tier Memory
package main

import (
    "context"
    "fmt"

    "github.com/lookatitude/beluga-ai/llm" // chat model interface and constructors
    "github.com/lookatitude/beluga-ai/memory"
    "github.com/lookatitude/beluga-ai/rag/embedding"
    "github.com/lookatitude/beluga-ai/rag/vectorstore"

    _ "github.com/lookatitude/beluga-ai/memory/stores/inmemory"
    _ "github.com/lookatitude/beluga-ai/rag/embedding/providers/openai"
    _ "github.com/lookatitude/beluga-ai/rag/vectorstore/providers/pgvector"
)

type ConversationAssistant struct {
    core     *memory.Core
    recall   *memory.Recall
    archival *memory.Archival
    model    llm.ChatModel
}

func NewConversationAssistant(ctx context.Context) (*ConversationAssistant, error) {
    // Core memory: always in context, self-editable
    core := memory.NewCore(memory.CoreConfig{
        PersonaLimit: 2000, // Max chars for persona block
        HumanLimit:   2000, // Max chars for human info block
        SelfEditable: true, // Agent can update its understanding
    })
    core.SetPersona("You are a helpful personal assistant. You remember " +
        "past conversations and user preferences to provide personalized help.")

    // Recall memory: searchable conversation history
    messageStore, err := memory.NewMessageStore("inmemory", nil)
    if err != nil {
        return nil, fmt.Errorf("create message store: %w", err)
    }
    recall := memory.NewRecall(messageStore)

    // Archival memory: long-term vector storage
    embedder, err := embedding.New("openai", nil)
    if err != nil {
        return nil, fmt.Errorf("create embedder: %w", err)
    }

    store, err := vectorstore.New("pgvector", nil)
    if err != nil {
        return nil, fmt.Errorf("create vector store: %w", err)
    }

    archival, err := memory.NewArchival(memory.ArchivalConfig{
        VectorStore: store,
        Embedder:    embedder,
    })
    if err != nil {
        return nil, fmt.Errorf("create archival memory: %w", err)
    }

    model, err := llm.New("openai", nil)
    if err != nil {
        return nil, fmt.Errorf("create model: %w", err)
    }

    return &ConversationAssistant{
        core:     core,
        recall:   recall,
        archival: archival,
        model:    model,
    }, nil
}
Conversation Flow
Each turn assembles context from all three memory tiers, generates a response, and saves the exchange back into memory. The context assembly order matters: core memory goes first to maximize prompt cache hits (static content first, per Beluga AI’s prompt cache optimization pattern), then recall memory, then archival results. This ordering means the persona and user facts — which rarely change — can be cached across requests.
func (ca *ConversationAssistant) Chat(ctx context.Context, userMessage string) (string, error) {
    // 1. Build context from all memory tiers
    msgs := ca.buildContext(ctx, userMessage)

    // 2. Add the current user message
    humanMsg := &schema.HumanMessage{Parts: []schema.ContentPart{
        schema.TextPart{Text: userMessage},
    }}
    msgs = append(msgs, humanMsg)

    // 3. Generate response
    resp, err := ca.model.Generate(ctx, msgs)
    if err != nil {
        return "", fmt.Errorf("generate: %w", err)
    }

    responseText := resp.Parts[0].(schema.TextPart).Text

    // 4. Save to recall memory
    if err := ca.recall.Save(ctx, humanMsg, resp); err != nil {
        return "", fmt.Errorf("save recall: %w", err)
    }

    // 5. Extract and archive important facts
    ca.archiveIfRelevant(ctx, userMessage, responseText)

    return responseText, nil
}

func (ca *ConversationAssistant) buildContext(ctx context.Context, query string) []schema.Message {
    var msgs []schema.Message

    // Core memory: always first (optimizes prompt caching)
    msgs = append(msgs, ca.core.ToMessages()...)

    // Recall memory: recent conversation history
    recent, err := ca.recall.Load(ctx, query)
    if err == nil {
        msgs = append(msgs, recent...)
    }

    // Archival memory: relevant long-term facts
    archived, err := ca.archival.Search(ctx, query, 3)
    if err == nil && len(archived) > 0 {
        var archiveContext string
        for _, doc := range archived {
            archiveContext += "- " + doc.Content + "\n"
        }
        msgs = append(msgs, &schema.SystemMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: "Relevant facts from past conversations:\n" + archiveContext},
        }})
    }

    return msgs
}
Self-Updating Memory
The assistant can update its core memory as it learns about the user. This self-updating capability is the key differentiator of the MemGPT pattern: the agent uses structured output (llm.NewStructured[Facts]) to extract facts from each exchange, then stores them in archival memory and promotes fundamental facts to core memory. Over time, the assistant builds a rich understanding of the user without any manual configuration.
func (ca *ConversationAssistant) archiveIfRelevant(ctx context.Context, userMsg, response string) {
    // Use the LLM to decide if this exchange contains important facts
    msgs := []schema.Message{
        &schema.SystemMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: "Extract any new facts about the user from this exchange. " +
                "Return a JSON array of facts, or an empty array if none."},
        }},
        &schema.HumanMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: fmt.Sprintf("User: %s\nAssistant: %s", userMsg, response)},
        }},
    }

    type Facts struct {
        Items []string `json:"items"`
    }

    structured := llm.NewStructured[Facts](ca.model)
    facts, err := structured.Generate(ctx, msgs)
    if err != nil || len(facts.Items) == 0 {
        return
    }

    // Store extracted facts in archival memory.
    // (Assumes the archival tier embeds and stores documents itself via Add;
    // adjust to the actual Archival API.)
    for _, fact := range facts.Items {
        doc := schema.Document{
            Content:  fact,
            Metadata: map[string]any{"source": "conversation", "timestamp": time.Now().Unix()},
        }
        if err := ca.archival.Add(ctx, doc); err != nil {
            continue
        }
    }

    // Update core memory if we learned something fundamental about the user
    for _, fact := range facts.Items {
        if isCoreFact(fact) {
            current := ca.core.GetHuman()
            ca.core.SetHuman(current + "\n- " + fact)
        }
    }
}
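The isCoreFact helper used above is not part of the framework. A minimal sketch of one possible heuristic follows, treating short, stable statements about identity and preferences as core facts; the keyword list is illustrative and requires the standard strings package:

// isCoreFact decides whether an extracted fact is fundamental enough to
// promote into core memory. An LLM-based classifier or a stricter rule set
// could replace this keyword heuristic.
func isCoreFact(fact string) bool {
    keywords := []string{"name is", "prefers", "works in", "lives in", "always", "never"}
    lower := strings.ToLower(fact)
    for _, kw := range keywords {
        if strings.Contains(lower, kw) {
            return true
        }
    }
    return false
}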
Streaming Responses
Stream responses token by token for a natural conversational feel:
func (ca *ConversationAssistant) StreamChat(ctx context.Context, userMessage string) iter.Seq2[schema.StreamChunk, error] {
    msgs := ca.buildContext(ctx, userMessage)
    msgs = append(msgs, &schema.HumanMessage{Parts: []schema.ContentPart{
        schema.TextPart{Text: userMessage},
    }})

    return ca.model.Stream(ctx, msgs)
}
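Because StreamChat returns an iter.Seq2, callers can consume it with a plain range loop (Go 1.23+). A minimal usage sketch, assuming assistant is a *ConversationAssistant and that each schema.StreamChunk exposes its text delta through a Text field:

// Print each streamed delta as it arrives.
for chunk, err := range assistant.StreamChat(ctx, "What did we decide about the budget last week?") {
    if err != nil {
        fmt.Println("stream error:", err)
        break
    }
    fmt.Print(chunk.Text) // Text field is an assumption; adapt to the actual chunk type
}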
Multi-Session Persistence
For production deployments, use durable stores so conversations survive restarts:
import (
    _ "github.com/lookatitude/beluga-ai/memory/stores/redis"
    _ "github.com/lookatitude/beluga-ai/memory/stores/postgres"
)

// Redis for recall memory (fast read/write)
messageStore, err := memory.NewMessageStore("redis", config.ProviderConfig{
    "addr":   "localhost:6379",
    "prefix": "user:" + userID,
})

// PostgreSQL + pgvector for archival memory (persistent, searchable)
store, err := vectorstore.New("pgvector", config.ProviderConfig{
    "connection_string": os.Getenv("DATABASE_URL"),
    "table_name":        "archival_memory",
})
Production Considerations
Context Window Management
Core memory consumes a fixed portion of the context window. Monitor and manage it:
// Check core memory size before adding facts
persona := ca.core.GetPersona()
human := ca.core.GetHuman()

if len(persona)+len(human) > 3000 {
    // Summarize the human profile to fit within limits
    summarized, err := summarizeProfile(ctx, ca.model, human)
    if err == nil {
        ca.core.SetHuman(summarized)
    }
}
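summarizeProfile is not provided by the framework. A minimal sketch that asks the chat model to compress the profile, assuming (as in the Chat method above) that the first response part is a schema.TextPart:

// summarizeProfile asks the model to compress the human profile block so it
// fits back within the core memory budget.
func summarizeProfile(ctx context.Context, model llm.ChatModel, profile string) (string, error) {
    msgs := []schema.Message{
        &schema.SystemMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: "Summarize the following user profile into a concise bullet list. " +
                "Keep every stable fact and preference; drop redundant or outdated details."},
        }},
        &schema.HumanMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: profile},
        }},
    }

    resp, err := model.Generate(ctx, msgs)
    if err != nil {
        return "", fmt.Errorf("summarize profile: %w", err)
    }
    return resp.Parts[0].(schema.TextPart).Text, nil
}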
Observability
Track memory operations, context sizes, and retrieval quality:
// Attributes recorded on the tracing span that covers a single chat turn
span.SetAttributes(
    attribute.Int("memory.core_size", len(ca.core.GetPersona())+len(ca.core.GetHuman())),
    attribute.Int("memory.recall_messages", len(recent)),
    attribute.Int("memory.archival_results", len(archived)),
    attribute.Int("memory.total_context_tokens", estimateTokens(msgs)),
)
Privacy and Data Retention
Section titled “Privacy and Data Retention”- Encrypt memory stores at rest (database-level encryption for PostgreSQL, TLS for Redis)
- Implement per-user memory isolation — each user has their own core, recall, and archival stores
- Provide a Clear() method for users to delete their data (GDPR right to erasure); a sketch follows this list
- Set TTLs on recall memory to automatically expire old conversations
- Never store PII in core memory — use the guard pipeline to screen before saving
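A minimal sketch of a per-user erasure helper. The Clear methods on the recall and archival tiers are assumptions standing in for whatever deletion APIs the underlying stores expose:

// EraseUser deletes everything the assistant has remembered about a user,
// across all three memory tiers (GDPR right to erasure).
// The Clear methods are assumed; substitute the actual store APIs.
func (ca *ConversationAssistant) EraseUser(ctx context.Context) error {
    ca.core.SetHuman("") // drop the human profile block from core memory

    if err := ca.recall.Clear(ctx); err != nil { // delete stored conversation history
        return fmt.Errorf("clear recall memory: %w", err)
    }
    if err := ca.archival.Clear(ctx); err != nil { // delete archived facts and summaries
        return fmt.Errorf("clear archival memory: %w", err)
    }
    return nil
}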
Scaling
- Core memory: In-memory per session, persisted on session end. Lightweight.
- Recall memory: Redis for sub-millisecond lookups. Shard by user ID.
- Archival memory: pgvector with HNSW index for fast approximate nearest neighbor search at scale.
- Deploy the assistant as a stateless service; all state lives in the memory stores (see the sketch below)
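A minimal sketch of the stateless pattern using net/http: each request rebuilds the assistant from user-scoped stores, so any replica can serve any user. NewConversationAssistantForUser is a hypothetical constructor that would wire in the Redis recall store and pgvector archival store shown earlier, keyed by the user ID:

// chatHandler holds no conversation state between requests; everything lives
// in the durable, per-user memory stores.
func chatHandler(w http.ResponseWriter, r *http.Request) {
    userID := r.Header.Get("X-User-ID")
    message := r.FormValue("message")

    // Hypothetical: like NewConversationAssistant, but backed by user-scoped stores.
    assistant, err := NewConversationAssistantForUser(r.Context(), userID)
    if err != nil {
        http.Error(w, "assistant unavailable", http.StatusInternalServerError)
        return
    }

    reply, err := assistant.Chat(r.Context(), message)
    if err != nil {
        http.Error(w, "chat failed", http.StatusInternalServerError)
        return
    }
    fmt.Fprint(w, reply)
}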
Related Resources
- Memory System Guide for detailed memory configuration
- Building Your First Agent for combining memory with planning strategies
- Voice AI Applications for adding voice interfaces to the assistant