grube.ai

Memory in AI Agents

Your agent forgets everything the moment the conversation ends. Next session, it has no idea who you are, what you asked for last time, or that you already told it you prefer TypeScript over Python.

If you read How AI Agents Actually Work, you know we didn't really get into memory. We mentioned it, but didn't explain how it works or how to build it. This article is that explanation.

Context window is not memory

The biggest misconception in the space: "We have a million tokens now, memory problem solved." No. A context window is a scratchpad. It exists for one conversation and gets wiped immediately after.

Context windows reset every session. They treat all tokens equally with no sense of priority. They scale in cost with input size. And recall (how accurately the model can find and use information from its input) degrades with distance: accuracy drops over 30% when the relevant information sits in the middle of a long context. This is the well-documented "lost in the middle" effect.

Memory works differently. It persists across sessions, prioritizes what matters, and retrieves by meaning, not position. A 200K context window means you can paste a lot of text into one conversation. It doesn't mean the agent learns anything. Close the tab, it's gone.

RAG is not memory either

RAG retrieves external documents and injects them into the prompt. Useful for grounding answers in facts. But RAG is stateless. No sense of time, no concept of who's asking, no ability to learn from past interactions. Each query starts fresh.

RAG says "here are the docs about Postgres migrations." Memory says "you asked about this last week and the solution was to run the diff first." You want both. RAG to ground the agent in facts, memory to shape how it behaves.

What memory actually is

A common way to think about agents is: Agent = LLM + Memory + Planning + Tool Use. Memory is one of four equal parts, not an afterthought.

Agent memory lets an agent store, retrieve, and use information across sessions. This is different from stuffing chat history into a prompt. It's a persistent internal state that evolves with every interaction, even weeks apart.

Three pillars define it:

  • State: knowing what's happening right now
  • Persistence: retaining knowledge across sessions
  • Selection: deciding what's worth remembering

That third one is the hardest. Storing everything leads to noise. Noise degrades retrieval quality. Every token spent on irrelevant memory is a token not available for reasoning.

Types of memory

Human memory is the best blueprint we have. Not because AI needs to perfectly replicate it, but because the categories map cleanly to what agents need.

Working memory (short-term)

The context window itself. What the agent is thinking about right now: the current conversation, system prompt, retrieved context, and tool results. When it fills up, something has to go. Oldest messages get dropped or summarized.

Current context window:

system: You are a helpful coding assistant.
user: How do I fix the N+1 query in my dashboard?
tool_result: Found 3 queries in /api/dashboard.ts that load relations without eager loading.
assistant: The issue is in your findMany call on line 42...
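When the window fills, the simplest eviction policy is to drop the oldest non-system messages until the conversation fits again. A minimal sketch, assuming a rough character-based token estimate (a real system would use the model's tokenizer, or summarize instead of dropping):

```typescript
type Message = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string }

// Rough token estimate: ~4 characters per token. Illustrative only.
const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4)

// Drop the oldest non-system messages until the total fits the budget.
function trimToBudget(messages: Message[], budget: number): Message[] {
  const system = messages.filter(m => m.role === 'system')
  const rest = messages.filter(m => m.role !== 'system')
  const total = (ms: Message[]) => ms.reduce((n, m) => n + estimateTokens(m), 0)
  while (rest.length > 0 && total([...system, ...rest]) > budget) {
    rest.shift() // oldest message goes first
  }
  return [...system, ...rest]
}
```

The system prompt is pinned so the agent never loses its instructions; everything else is fair game for eviction.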

Working memory, episodic, semantic, and procedural memory form the standard taxonomy, but some systems add their own: personas (stored agent identity and tone rules), toolbox memory (tool definitions retrieved per query instead of stuffed into every prompt), and conversation logs. What matters is picking the types that fit your use case.

The four core types aren't exclusive either. Episodic memories tell you what happened. Semantic memories tell you what you know. Procedural memories tell you what to do. Working memory is where they all come together for the current task.

The memory lifecycle

Memory isn't just "save and retrieve." There are five stages, each with design decisions that matter:

  • Generate
  • Store
  • Retrieve
  • Integrate
  • Forget

Generate

A memory is created from conversation. Can be explicit (agent calls a save_memory tool) or implicit (the system extracts facts automatically after each turn).

User: "I'm Alex, I work at Acme Corp"

Extracted memories:
  + User's name is Alex
  + User works at Acme Corp

How storage works

Where you store memories determines what you can do with them.

Three approaches to storing agent memories: flat (key-value pairs), hierarchical (tiered, with short-term memory for recent turns, mid-term memory for topic segments, and long-term persona memory for stable traits), and graph (nodes and edges). The format determines what retrieval strategies are available.

Flat storage

The simplest option. Memories go in as key-value pairs or plain text files. Easy to implement, easy to understand. But retrieval is limited to exact match or basic search. No hierarchy, no relationships between memories.

const memories = {
  "user_preference": "prefers TypeScript",
  "project_context": "building a SaaS dashboard",
  "last_session": "discussed auth implementation"
}

Hierarchical storage

Inspired by how operating systems manage memory. Multiple tiers with different capacities and access speeds. Recent memories live in a fast, small tier (like RAM). Older memories get pushed to larger, slower storage (like disk). The MemoryOS research paper uses three tiers:

  • Short-term memory (STM): a FIFO queue of recent dialogue turns
  • Mid-term memory (MTM): topic-coherent segments organized by subject
  • Long-term persona memory (LPM): stable user traits, preferences, and knowledge

Memories move between tiers based on a "heat" score that factors in how often a memory is accessed, how recently, and how long the interaction lasted. Hot memories stay close. Cold memories sink to deep storage. This approach improved F1 scores (a metric that combines precision and recall into a single number, where 1.0 is perfect) by 49% on the LoCoMo long-conversation benchmark.
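A sketch of that heat score, with the weights and half-life chosen for illustration (the paper's actual coefficients may differ):

```typescript
// Heat = α·visits + β·interaction_length + γ·recency, where recency
// decays exponentially with time since the memory was last accessed.
function heat(
  visits: number,
  interactionLength: number,
  hoursSinceAccess: number,
  { alpha = 1.0, beta = 0.5, gamma = 2.0, halfLifeHours = 24 } = {}, // illustrative weights
): number {
  const recency = Math.exp(-Math.LN2 * (hoursSinceAccess / halfLifeHours))
  return alpha * visits + beta * interactionLength + gamma * recency
}
```

A tier manager would periodically re-score every memory with something like this and demote entries that fall below a threshold to slower storage.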

Graph-based storage

Memories as nodes, relationships as edges. This lets you answer questions like "who does the user work with?" or "what topics are related to this project?" that flat storage can't handle. The tradeoff is complexity: you need more infrastructure to set up and maintain.

How retrieval works

Having the right memory stored is useless if you can't find it when you need it.

Lexical search: keyword matching. Fast, exact, but misses semantic similarity. "deploy pipeline" won't match a memory about "CI/CD workflow."

Vector search: converts memories and queries into embeddings, then finds the closest matches by semantic similarity. This is the backbone of most modern memory systems. It catches meaning, not just words. But it struggles when the answer requires connecting multiple facts together.

Graph traversal: follows relationships between memories. Walk the edges and multi-hop connections become obvious. The tradeoff is infrastructure: you need a graph database and entity extraction (automatically identifying people, projects, and concepts mentioned in text) to build the graph in the first place.

Consider the query "Was Alice affected by Tuesday's outage?" Exact keyword matching on Alice, Tuesday, and outage finds no single memory containing all three. Graph traversal connects them: Alice leads Project Atlas, Project Atlas runs on PostgreSQL, and PostgreSQL went down Tuesday.
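That multi-hop walk can be sketched as a breadth-first search over an adjacency list (a production system would use a graph database rather than an in-memory array; the edges here mirror the Alice example):

```typescript
type Edge = { from: string; rel: string; to: string }

const edges: Edge[] = [
  { from: 'Alice', rel: 'leads', to: 'Project Atlas' },
  { from: 'Project Atlas', rel: 'runs_on', to: 'PostgreSQL' },
  { from: 'PostgreSQL', rel: 'went_down', to: 'Tuesday outage' },
]

// Breadth-first walk along directed edges: is `target` reachable from `start`?
function connected(start: string, target: string): boolean {
  const queue = [start]
  const seen = new Set([start])
  while (queue.length > 0) {
    const node = queue.shift()!
    if (node === target) return true
    for (const e of edges) {
      if (e.from === node && !seen.has(e.to)) {
        seen.add(e.to)
        queue.push(e.to)
      }
    }
  }
  return false
}
```

No single memory links Alice to the outage; the connection only emerges by following edges.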

Hybrid search: lexical + vector together. You get both exact matches and semantic matches. Most production systems land here. Still can't connect facts across multiple hops though. You can also re-query multiple times, letting the agent search again based on what it found in the first search.
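A toy sketch of hybrid scoring: blend keyword overlap with cosine similarity over embeddings. The 2-dimensional vectors and the 0.3/0.7 split are stand-ins; real embeddings have hundreds of dimensions and the weights are worth tuning:

```typescript
type Memory = { content: string; embedding: number[] }

// Cosine similarity between two vectors of equal length.
const cosine = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0)
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0))
  return dot / (norm(a) * norm(b))
}

// Lexical score: fraction of query words that appear in the memory text.
const lexical = (query: string, text: string) => {
  const words = query.toLowerCase().split(/\s+/)
  return words.filter(w => text.toLowerCase().includes(w)).length / words.length
}

// Weighted blend of both signals.
function hybridScore(query: string, queryVec: number[], mem: Memory): number {
  return 0.3 * lexical(query, mem.content) + 0.7 * cosine(queryVec, mem.embedding)
}
```

Rank all memories by `hybridScore` and take the top k: exact keyword hits and semantic neighbors both surface.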

A fifth pattern is gaining traction: agentic retrieval. Instead of a fixed pipeline, the agent gets tools to search its own memory however it sees fit. It decides when to search, what to search for, and which method to use.

The hardest problem: what to remember and what to forget

Storage is straightforward. Retrieval is getting better every year. The harder problem is deciding what deserves to be a memory in the first place.

Store everything? You get noise. Every throwaway comment, every typo correction, every "never mind" clutters the memory store and degrades retrieval quality.

Store nothing? The agent never learns. Every session starts from zero. You're back to a stateless chatbot.

The answer is intelligent filtering. A few strategies that work:

  • Priority scoring: before storing, score how important something is. For example, an LLM can rate each piece of information on a 1-10 scale, and you only store things above a threshold.
  • Contextual tagging: attach metadata to each memory so retrieval can filter by it. For example, tag a memory with user:alex, topic:deployment, date:2026-04-15 so you can later search "what does Alex know about deployment?" and get precise results.
  • Decay functions: you store everything, but each memory gets a relevance score that drops over time unless the memory keeps getting accessed. The MemoryOS heat score uses: Heat = α × visits + β × interaction_length + γ × recency, where recency decays exponentially. Memories that nobody looks at gradually become invisible to retrieval, even though they're still in storage.
  • Consolidation: move short-term memories into long-term storage, summarizing along the way. What decides what's important? Usually a combination of frequency (how often something comes up), the decay score, and sometimes an LLM call that asks "is this worth keeping long-term?"
  • Conflict resolution: when a new memory contradicts an old one, update or archive the old one
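The first two strategies can be sketched together: a store-time gate that keeps only high-importance candidates, and a retrieval-time filter over tags. The scorer is passed in as a function, since in practice it would be an LLM call rating importance on a 1-10 scale:

```typescript
type Candidate = { content: string; tags: Record<string, string> }
type Scorer = (content: string) => number // 1-10; in practice an LLM rating

// Priority scoring: only candidates at or above the threshold get stored.
function filterMemories(candidates: Candidate[], score: Scorer, threshold = 6): Candidate[] {
  return candidates.filter(c => score(c.content) >= threshold)
}

// Contextual tagging: retrieve only memories matching every requested tag.
function byTags(memories: Candidate[], filter: Record<string, string>): Candidate[] {
  return memories.filter(m =>
    Object.entries(filter).every(([key, value]) => m.tags[key] === value),
  )
}
```

With tags like user:alex and topic:deployment attached at store time, a query such as "what does Alex know about deployment?" becomes a cheap metadata filter before any vector search runs.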

Forgetting in humans is a feature, not a bug. It prevents noise from overwhelming useful recall. A computational system can achieve the same effect without actually deleting anything: keep everything, but use relevance scoring to surface only what matters. The "forgetting" happens at retrieval time, not at storage time.

Memory architectures in practice

The OS approach (Letta)

Treat the context window like RAM and external storage like disk. The agent pages information in and out as needed. Letta (formerly MemGPT) pioneered this idea: the agent gets function calls to read from and write to its own memory, swapping relevant context into the limited window on demand. MemoryOS, an academic project, extends this with three formal tiers and heat-based eviction. The result is agents that maintain coherent conversations over hundreds of turns without losing track.

The managed layer approach

External services that handle memory for you. You connect them to your agent and they automatically extract memories from conversations, store them, and retrieve relevant ones for future prompts. Mem0 extracts facts and stores them. Zep builds a temporal knowledge graph (Graphiti) that tracks when facts were valid. Supermemory uses semantic search.

Building memory into your agent

Three approaches, ranked by effort.

1. Use a memory provider

Plug in a service like Mem0, Letta, or Supermemory. They handle storage, retrieval, and injection. You write almost no memory code. Good for getting started fast. The tradeoff is you depend on an external service and have less visibility into what gets stored.

import { createMem0 } from '@mem0/vercel-ai-provider'
import { ToolLoopAgent } from 'ai'

const mem0 = createMem0({
  provider: 'openai',
  mem0ApiKey: process.env.MEM0_API_KEY,
  apiKey: process.env.OPENAI_API_KEY,
})

const agent = new ToolLoopAgent({
  model: mem0('gpt-4.1', { user_id: 'user-123' }),
})

const { text } = await agent.generate({
  prompt: 'Remember that my favorite editor is Neovim',
})

2. Use a provider-defined memory tool

Some model providers ship memory as a built-in tool. Anthropic has one for Claude that gives the model a structured interface for managing a /memories directory. The model has been trained to use it, so it works well out of the box. You provide the storage backend.

import { anthropic } from '@ai-sdk/anthropic'
import { ToolLoopAgent } from 'ai'

const memory = anthropic.tools.memory_20250818({
  execute: async action => {
    // action contains: command, path, and other fields
    // commands: view, create, str_replace, insert, delete, rename
    // Map these to your storage backend (filesystem, database, etc.)
  },
})

const agent = new ToolLoopAgent({
  model: 'anthropic/claude-haiku-4.5',
  tools: { memory },
})

The tradeoff is provider lock-in. This specific tool only works with Claude.

3. Build your own

Define your own memory tools from scratch. You control what gets stored, how it's structured, how it's retrieved, and when things get forgotten. No lock-in, no external dependencies.

The simplest version gives the agent two tools: one to save a memory and one to search memories.

const tools = {
  save_memory: {
    description: 'Save important information for future reference',
    parameters: {
      content: { type: 'string', description: 'The information to remember' },
      category: { type: 'string', enum: ['preference', 'fact', 'event', 'procedure'] },
    },
    execute: async ({ content, category }) => {
      const embedding = await embed(content)
      await db.memories.insertOne({
        content,
        category,
        embedding,
        createdAt: new Date(),
        accessCount: 0,
      })
      return 'Memory saved.'
    },
  },

  search_memories: {
    description: 'Search past memories for relevant information',
    parameters: {
      query: { type: 'string', description: 'What to search for' },
    },
    execute: async ({ query }) => {
      const embedding = await embed(query)
      const results = await db.memories.aggregate([
        {
          $vectorSearch: {
            index: 'memory_index', // name of your Atlas Vector Search index (required)
            queryVector: embedding,
            path: 'embedding',
            numCandidates: 50,
            limit: 5,
          },
        },
      ]).toArray()
      return results.map(r => r.content).join('\n')
    },
  },
}

The missing piece: how do retrieved memories actually get into the LLM call? You build the system prompt at request time:

async function buildSystemPrompt(userId, currentQuery) {
  const memories = await searchMemories(userId, currentQuery)

  return `You are a helpful assistant.

=== MEMORIES ===
${memories.map(m => `- [${m.category}] ${m.content}`).join('\n')}

Use these memories to personalize your response. If a memory
contradicts the user's current message, trust the current message.`
}

This is a starting point. A production system would add decay scoring, deduplication, conflict resolution, and memory consolidation. But the core pattern is simple: give the agent tools to read and write its own memories, and inject relevant ones into the prompt at query time.

How the pieces fit together

Memory types feed into each other through a process called consolidation. Repeated episodic memories distill into semantic knowledge. "User asked about Postgres migrations three times this week" becomes "user is working on database migrations," which informs a procedural memory: "when the user asks about schema changes, suggest running a diff first." Without consolidation, the agent replays individual events instead of learning from them.
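Consolidation can be sketched as a counting pass over episodic entries, promoting any topic that recurs often enough into a single semantic fact. Topic labels are assumed to already exist on each episode; a real system would extract them with an LLM:

```typescript
type Episode = { text: string; topic: string }

// Promote topics seen at least `minCount` times into semantic facts,
// summarizing the repeated episodes instead of replaying them.
function consolidate(episodes: Episode[], minCount = 3): string[] {
  const counts = new Map<string, number>()
  for (const e of episodes) counts.set(e.topic, (counts.get(e.topic) ?? 0) + 1)
  return Array.from(counts.entries())
    .filter(([, n]) => n >= minCount)
    .map(([topic, n]) => `User is actively working on ${topic} (${n} related sessions)`)
}
```

Run this periodically (nightly, or when the episodic store crosses a size threshold) and the agent's long-term memory stays small while still reflecting what actually recurs.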

The lifecycle runs continuously: new interactions generate potential memories, the system decides what to store, retrieval surfaces relevant context, integration makes it available to the LLM, and decay keeps the store clean.

Memory is what turns an agent from a tool into an assistant.


crafted by bart stefanski