Summary and Window Memory Patterns

LLMs have limited context windows. Sending the entire history of a year-long conversation will exceed token limits, increase cost, and dilute the model’s attention on recent context. Effective memory management is about keeping the most relevant information in the context window while discarding or compressing the rest. This tutorial covers three strategies with different trade-offs: sliding window for simplicity, summarization for infinite history, and a hybrid approach that combines the strengths of both. Beluga AI’s memory system draws from the MemGPT 3-tier model (Core, Recall, Archival), and these patterns map to the Core and Recall tiers.

You will build three memory management strategies, each suited to different use cases: sliding window (last N messages), summarization (running summary), and a hybrid (summary plus a buffer of recent messages).

The simplest memory strategy: keep only the last N messages and discard everything older. This works because LLMs tend to pay most attention to recent messages, and for short tasks like Q&A, the last few exchanges contain all the context needed. The system message is stored separately and always prepended, ensuring the agent’s persona and instructions are never evicted.

The trade-off is clear: predictable token usage and zero latency overhead, but complete loss of older context. If a user says “My name is Alice” in message 5 and the window size is 10, the name is evicted as soon as message 15 arrives, that is, after 10 more messages (about five user/assistant exchanges).

package main

import (
	"github.com/lookatitude/beluga-ai/schema"
)

// WindowMemory retains only the last windowSize messages.
type WindowMemory struct {
	messages   []schema.Message
	system     *schema.SystemMessage // always retained
	windowSize int
}

// NewWindowMemory creates a window that holds at most windowSize messages.
// A negative windowSize is treated as 0 so that trimming never slices out of range.
func NewWindowMemory(systemPrompt string, windowSize int) *WindowMemory {
	if windowSize < 0 {
		windowSize = 0
	}
	return &WindowMemory{
		system:     schema.NewSystemMessage(systemPrompt),
		windowSize: windowSize,
	}
}

// AddMessage appends msg and evicts the oldest messages beyond the window.
func (m *WindowMemory) AddMessage(msg schema.Message) {
	m.messages = append(m.messages, msg)
	// Trim to window size
	if len(m.messages) > m.windowSize {
		m.messages = m.messages[len(m.messages)-m.windowSize:]
	}
}

// GetMessages returns the system message followed by the windowed messages.
func (m *WindowMemory) GetMessages() []schema.Message {
	// System message is always first
	result := make([]schema.Message, 0, 1+len(m.messages))
	result = append(result, m.system)
	result = append(result, m.messages...)
	return result
}

Trade-offs:

  • Predictable token usage
  • Loses earlier context (“My name is Alice” from 20 messages ago)
  • Simple to implement
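
The eviction behaviour described above can be checked with a small self-contained sketch. It assumes a simplified stand-in for WindowMemory that stores plain strings instead of schema.Message values; the names window, add, contains, and nameSurvives are hypothetical and not part of Beluga AI.

```go
package main

// window is a simplified stand-in for WindowMemory. It stores plain strings
// (an assumption made so the sketch runs without the Beluga AI packages).
type window struct {
	messages []string
	size     int
}

// add appends a message and evicts the oldest ones beyond size.
func (w *window) add(msg string) {
	w.messages = append(w.messages, msg)
	if len(w.messages) > w.size {
		w.messages = w.messages[len(w.messages)-w.size:]
	}
}

// contains reports whether msg is still inside the window.
func (w *window) contains(msg string) bool {
	for _, m := range w.messages {
		if m == msg {
			return true
		}
	}
	return false
}

// nameSurvives adds the name message, then `later` further messages, and
// reports whether the name is still present in a window of the given size.
func nameSurvives(size, later int) bool {
	w := &window{size: size}
	w.add("My name is Alice")
	for i := 0; i < later; i++ {
		w.add("filler message")
	}
	return w.contains("My name is Alice")
}
```

With a window of 10, nameSurvives(10, 9) reports true and nameSurvives(10, 10) reports false: the tenth later message pushes the name out.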

A running summary preserves key facts from the entire conversation by periodically using an LLM to compress the buffer into a text summary. When the buffer reaches maxBuffer messages, the summarizer generates an updated summary that incorporates both the previous summary and the new messages, then flushes the buffer.

This approach enables infinite conversation length — no matter how many messages are exchanged, the context window only contains the system prompt, the summary, and the current buffer. The cost is an extra LLM call each time the buffer fills up, and the inherent lossiness of summarization (nuance, exact wording, and minor details may be lost). The summarizer prompt explicitly instructs the model to preserve names, preferences, and key decisions, which are the facts most likely to be needed later.

import (
	"context"
	"fmt"

	"github.com/lookatitude/beluga-ai/llm"
	"github.com/lookatitude/beluga-ai/schema"
)

// SummaryMemory maintains a running summary of the conversation.
type SummaryMemory struct {
	summary    string
	summarizer llm.ChatModel
	system     *schema.SystemMessage
	buffer     []schema.Message // Recent unsummarized messages
	maxBuffer  int
}

// NewSummaryMemory creates a memory that summarizes every maxBuffer messages.
func NewSummaryMemory(systemPrompt string, summarizer llm.ChatModel, maxBuffer int) *SummaryMemory {
	return &SummaryMemory{
		system:     schema.NewSystemMessage(systemPrompt),
		summarizer: summarizer,
		maxBuffer:  maxBuffer,
	}
}

// AddMessage buffers msg and triggers summarization when the buffer is full.
func (m *SummaryMemory) AddMessage(ctx context.Context, msg schema.Message) error {
	m.buffer = append(m.buffer, msg)
	// When the buffer reaches maxBuffer messages, summarize and flush
	if len(m.buffer) >= m.maxBuffer {
		if err := m.summarize(ctx); err != nil {
			return err
		}
	}
	return nil
}

// summarize folds the buffered messages into the running summary.
// On error the buffer is left untouched, so the next AddMessage retries.
func (m *SummaryMemory) summarize(ctx context.Context) error {
	// Build the text to summarize
	var conversation string
	for _, msg := range m.buffer {
		role := msg.GetRole()
		// Use the first text part of each message; other parts are skipped
		text := ""
		for _, part := range msg.GetContent() {
			if tp, ok := part.(schema.TextPart); ok {
				text = tp.Text
				break
			}
		}
		conversation += fmt.Sprintf("[%s]: %s\n", role, text)
	}
	prompt := fmt.Sprintf(`Current summary:
%s
New conversation:
%s
Write an updated summary that captures all important information, including names, preferences, and key decisions.`, m.summary, conversation)
	msgs := []schema.Message{
		schema.NewSystemMessage("You are a conversation summarizer. Produce a concise summary preserving key facts."),
		schema.NewHumanMessage(prompt),
	}
	resp, err := m.summarizer.Generate(ctx, msgs, llm.WithMaxTokens(500))
	if err != nil {
		return fmt.Errorf("summarize: %w", err)
	}
	m.summary = resp.Text()
	m.buffer = nil // Clear the buffer
	return nil
}

// GetMessages returns the system prompt, the summary (if any), and the buffer.
func (m *SummaryMemory) GetMessages() []schema.Message {
	result := []schema.Message{m.system}
	if m.summary != "" {
		result = append(result,
			schema.NewSystemMessage(fmt.Sprintf("Summary of previous conversation:\n%s", m.summary)),
		)
	}
	result = append(result, m.buffer...)
	return result
}

Trade-offs:

  • Infinite conversation length
  • Preserves key facts across the entire history
  • Loses detail and nuance (summarization is lossy)
  • Adds latency (requires LLM call for each summarization)
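
To make the latency trade-off concrete, the sketch below counts how often the summarizer runs. It is a simplified stand-in, not the Beluga AI implementation: messages are plain strings, and summarizeFunc, keepNames, and run are hypothetical names. The fake summarizer keeps only messages that state a name, which stands in for an LLM following the "preserve names" instruction.

```go
package main

import (
	"context"
	"strings"
)

// summarizeFunc is a hypothetical stand-in for the llm.ChatModel call. It
// receives the previous summary and the batch to fold into it.
type summarizeFunc func(ctx context.Context, prev string, batch []string) (string, error)

// summaryMemory mirrors the shape of SummaryMemory with plain strings.
type summaryMemory struct {
	summary   string
	buffer    []string
	maxBuffer int
	summarize summarizeFunc
	calls     int // number of summarizer invocations (the extra latency and cost)
}

// add buffers msg and summarizes once the buffer reaches maxBuffer.
func (m *summaryMemory) add(ctx context.Context, msg string) error {
	m.buffer = append(m.buffer, msg)
	if len(m.buffer) < m.maxBuffer {
		return nil
	}
	next, err := m.summarize(ctx, m.summary, m.buffer)
	if err != nil {
		return err // buffer is kept, so a later add retries
	}
	m.calls++
	m.summary = next
	m.buffer = nil
	return nil
}

// keepNames is a fake summarizer: it carries the previous summary forward
// and appends any message that states a name.
func keepNames(_ context.Context, prev string, batch []string) (string, error) {
	facts := []string{}
	if prev != "" {
		facts = append(facts, prev)
	}
	for _, msg := range batch {
		if strings.HasPrefix(msg, "My name is") {
			facts = append(facts, msg)
		}
	}
	return strings.Join(facts, "; "), nil
}

// run feeds n messages into the memory, with the name in the first one.
func run(n, maxBuffer int) (*summaryMemory, error) {
	m := &summaryMemory{maxBuffer: maxBuffer, summarize: keepNames}
	ctx := context.Background()
	if err := m.add(ctx, "My name is Alice"); err != nil {
		return nil, err
	}
	for i := 1; i < n; i++ {
		if err := m.add(ctx, "filler message"); err != nil {
			return nil, err
		}
	}
	return m, nil
}
```

With run(25, 10), the summarizer is invoked twice, five messages remain in the buffer, and the name survives in the summary. In general the number of extra calls is the message count divided by maxBuffer, rounded down.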

The production-recommended approach combines the strengths of both strategies. The last K messages are kept verbatim for full-fidelity recent context (the current topic, exact phrasing, tool call details), while everything older is compressed into a running summary that preserves long-term facts (user preferences, decisions, identities).

The hybrid memory summarizes the older half of the buffer when it overflows, rather than the entire buffer. This ensures that the most recent messages are never summarized prematurely: they remain verbatim in the buffer, where the model can reference exact details. The summary grows incrementally. Each summarization call receives the previous summary together with the batch being evicted, so earlier facts carry forward as long as the summarizer keeps them.

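// HybridMemory combines a running summary with a recent message buffer.
type HybridMemory struct {
	summary    string
	summarizer llm.ChatModel
	system     *schema.SystemMessage
	recent     []schema.Message // Recent messages (verbatim)
	maxRecent  int              // Max recent messages before summarizing
}

func NewHybridMemory(systemPrompt string, summarizer llm.ChatModel, maxRecent int) *HybridMemory {
	return &HybridMemory{
		system:     schema.NewSystemMessage(systemPrompt),
		summarizer: summarizer,
		maxRecent:  maxRecent,
	}
}

func (m *HybridMemory) AddMessage(ctx context.Context, msg schema.Message) error {
	m.recent = append(m.recent, msg)
	// When the recent buffer overflows, summarize the older half
	if len(m.recent) > m.maxRecent {
		half := len(m.recent) / 2
		if err := m.summarizeMessages(ctx, m.recent[:half]); err != nil {
			return err // nothing is dropped; the next AddMessage retries
		}
		// Trim only after the summary has been updated successfully
		m.recent = m.recent[half:]
	}
	return nil
}

// summarizeMessages folds msgs into the running summary. The tutorial does
// not show this helper; this version is an assumption that mirrors
// SummaryMemory.summarize.
func (m *HybridMemory) summarizeMessages(ctx context.Context, msgs []schema.Message) error {
	var conversation string
	for _, msg := range msgs {
		text := ""
		for _, part := range msg.GetContent() {
			if tp, ok := part.(schema.TextPart); ok {
				text = tp.Text
				break
			}
		}
		conversation += fmt.Sprintf("[%s]: %s\n", msg.GetRole(), text)
	}
	prompt := fmt.Sprintf("Current summary:\n%s\nNew conversation:\n%s\nWrite an updated summary that captures all important information, including names, preferences, and key decisions.", m.summary, conversation)
	resp, err := m.summarizer.Generate(ctx, []schema.Message{
		schema.NewSystemMessage("You are a conversation summarizer. Produce a concise summary preserving key facts."),
		schema.NewHumanMessage(prompt),
	}, llm.WithMaxTokens(500))
	if err != nil {
		return fmt.Errorf("summarize: %w", err)
	}
	m.summary = resp.Text()
	return nil
}

func (m *HybridMemory) GetMessages() []schema.Message {
	result := []schema.Message{m.system}
	if m.summary != "" {
		result = append(result,
			schema.NewSystemMessage(fmt.Sprintf("Previous conversation summary:\n%s", m.summary)),
		)
	}
	result = append(result, m.recent...)
	return result
}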

This approach provides:

  • Full fidelity for recent context (the current topic)
  • Long-term memory via summarization (names, preferences, decisions)
  • Bounded token usage
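
| Strategy | Context Length | Detail        | Latency   | Use Case               |
|----------|----------------|---------------|-----------|------------------------|
| Window   | Fixed          | High (recent) | None      | Short tasks, Q&A       |
| Summary  | Unlimited      | Medium        | +LLM call | Long sessions, support |
| Hybrid   | Bounded        | High + Medium | +LLM call | Production agents      |
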
To verify the hybrid strategy end to end:

  1. Use HybridMemory with maxRecent set to 10 messages.
  2. Send 50 messages to the agent, including “My name is Alice” early in the conversation.
  3. After summarization occurs, ask “What is my name?”
  4. Verify the agent answers correctly from the summary.
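
The steps above can be rehearsed without an LLM. The sketch below is a simplified stand-in, not the Beluga AI implementation: messages are plain strings, and hybridMemory, fakeSummarize, and runExercise are hypothetical names. The fake summarizer keeps only messages that state a name, so it shows where the name ends up but says nothing about how a real model would answer.

```go
package main

import (
	"context"
	"strings"
)

// hybridMemory mirrors the shape of HybridMemory with plain strings.
type hybridMemory struct {
	summary   string
	recent    []string
	maxRecent int
	calls     int // number of summarizer invocations
}

// fakeSummarize is a hypothetical replacement for the LLM call. It carries
// the previous summary forward and appends any message that states a name.
func fakeSummarize(_ context.Context, prev string, batch []string) (string, error) {
	facts := []string{}
	if prev != "" {
		facts = append(facts, prev)
	}
	for _, msg := range batch {
		if strings.HasPrefix(msg, "My name is") {
			facts = append(facts, msg)
		}
	}
	return strings.Join(facts, "; "), nil
}

// add appends msg and, once the buffer overflows, folds the older half into
// the summary. The buffer is trimmed only after summarization succeeds.
func (m *hybridMemory) add(ctx context.Context, msg string) error {
	m.recent = append(m.recent, msg)
	if len(m.recent) <= m.maxRecent {
		return nil
	}
	half := len(m.recent) / 2
	next, err := fakeSummarize(ctx, m.summary, m.recent[:half])
	if err != nil {
		return err
	}
	m.summary = next
	m.recent = m.recent[half:]
	m.calls++
	return nil
}

// runExercise sends 50 messages with maxRecent set to 10; the name is in
// the first message.
func runExercise() (*hybridMemory, error) {
	m := &hybridMemory{maxRecent: 10}
	ctx := context.Background()
	if err := m.add(ctx, "My name is Alice"); err != nil {
		return nil, err
	}
	for i := 1; i < 50; i++ {
		if err := m.add(ctx, "filler message"); err != nil {
			return nil, err
		}
	}
	return m, nil
}
```

After runExercise, the name is held only in the summary, the recent buffer contains 10 filler messages, and the summarizer has run 8 times. Against a real model, step 4 still has to be checked by asking the question, because a real summarizer may drop or rephrase the name.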