How I Built a Persistent Memory System for AI Agents (And What It Changed)
Every time you start a new Claude Code session, it starts cold. No memory of the architecture decisions you made last Tuesday. No awareness of the debugging insight that saved three hours last week. No recollection that you decided to use PostgreSQL over SQLite for a specific reason — a reason that matters and that you'll have to re-explain next time the topic comes up.
This isn't a model limitation. It's a session design problem. And it's fixable.
Over the past few weeks building Kite — an autonomous multi-agent orchestration system — I've run 232 completed tasks across dozens of sessions. Without persistent memory, each of those sessions would have started from scratch. With memory, every session starts with the relevant context already loaded. Here's how I built it.
Why Session Memory Isn't Enough
Claude Code's built-in session memory is excellent within a session. The model maintains context, builds on earlier reasoning, and references files it already read. The problem is the boundary: close the terminal, the context dies.
This creates a recurring tax on every new session:
- Re-explaining project architecture to a session that should already know it
- Rediscovering decisions you already made and documented somewhere you can't reference in-context
- Debugging the same class of problem twice because the insight from the first fix didn't survive the session boundary
For a solo developer running one or two sessions per day, this overhead is manageable. For an autonomous system running dozens of sessions continuously across engineering, strategy, and growth workstreams — it's a structural bottleneck.
The fix is persistent memory: a system that extracts knowledge from sessions, stores it in a searchable vault, and automatically injects the relevant subset into every new session.
The Architecture: Three Components
Kite's memory system has three pieces that work together: an extraction daemon, a searchable vault, and an injection hook.
1. The Extraction Daemon
Claude Code logs every conversation to ~/.claude/projects/*.jsonl — one file per project, append-only, newline-delimited JSON. Each line is a message: user prompt, tool call, assistant response.
The extraction daemon (kite-memory.sh) polls these log files every five minutes. When it finds new content since the last run, it sends the recent conversation segment to Claude Haiku with a structured extraction prompt. Haiku is fast and cheap — extraction runs in under two seconds.
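In Python, the poll-and-read step might look like this sketch (the offset file and helper name are illustrative, not part of Kite's actual shell implementation):

```python
import json
from pathlib import Path

def read_new_messages(log_path: Path, state_path: Path) -> list:
    """Return JSONL messages appended since the last poll.

    Tracks a byte offset in a sidecar state file so each poll only
    processes new lines. File names here are illustrative.
    """
    offset = int(state_path.read_text()) if state_path.exists() else 0
    messages = []
    with log_path.open("rb") as f:
        f.seek(offset)
        while True:
            pos = f.tell()
            raw = f.readline()
            if not raw:
                break
            try:
                messages.append(json.loads(raw))
            except json.JSONDecodeError:
                f.seek(pos)  # leave a partially written line for the next poll
                break
        state_path.write_text(str(f.tell()))
    return messages
```

Tracking a byte offset rather than re-reading the whole file keeps each poll cheap even as the append-only logs grow.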
The extraction prompt asks Haiku to identify:
- Decisions: architectural choices, tradeoffs accepted, approaches rejected
- Insights: debugging findings, performance observations, behavioral patterns
- Project facts: key file paths, system boundaries, integration details
- User preferences: workflow preferences, communication patterns, tool choices
Haiku returns structured JSON. The daemon writes one markdown file per insight to the vault, using a slug like decisions/postgres-over-sqlite-feb26.md. Files use wikilinks ([[PostgreSQL]], [[Mycelia]]) to connect related concepts across notes.
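A minimal sketch of that write step, assuming the slug is derived from the insight's title (the function and arguments are illustrative, not Kite's actual code):

```python
import re
from pathlib import Path

def write_insight(vault: Path, category: str, title: str, body: str) -> Path:
    """Write one extracted insight as a markdown note in its category folder.

    The slug format mirrors the example in the text, e.g.
    decisions/postgres-over-sqlite-feb26.md.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    note = vault / category / f"{slug}.md"
    note.parent.mkdir(parents=True, exist_ok=True)
    note.write_text(f"# {title}\n\n{body}\n")
    return note
```

The body text keeps whatever wikilinks the extraction produced, so graph structure comes along for free.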
The daemon runs in the background, invisible. You don't manage it. You just work, and knowledge accumulates.
2. The Memory Vault
The vault lives at ~/.kite/memory/ with a semantic folder structure:
memory/
decisions/ — architectural and strategic choices
topics/ — technical concepts and how-tos
projects/ — per-project context and state
people/ — collaborators, users, contacts
ideas/ — proposals not yet acted on
journals/ — time-indexed observations
Each file is plain markdown with wikilinks. This makes the vault useful beyond the AI layer — you can read it in any text editor, sync it to Obsidian, grep through it. No proprietary format, no lock-in.
The interesting part is what happens alongside the markdown: every note also gets a vector embedding. I'm using Voyage AI's voyage-3-lite model — it's optimized for retrieval tasks and cheap enough to run on every note upsert. Embeddings are stored in a sidecar index alongside the markdown files.
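The per-note upsert could be sketched like this, with the real Voyage API call stubbed out as an injected embed function (all names here are assumptions for illustration):

```python
import json
from pathlib import Path
from typing import Callable, List

def upsert_embedding(index_path: Path, note_path: str, text: str,
                     embed: Callable[[str], List[float]]) -> None:
    """Store or update one note's vector in a JSON sidecar index.

    `embed` stands in for the real embedding call (voyage-3-lite in the
    article); the index maps note path -> vector.
    """
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    index[note_path] = embed(text)
    index_path.write_text(json.dumps(index))
```

Keeping the index as a sidecar file preserves the no-lock-in property: delete it and the markdown vault is untouched.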
The result is a vault that supports two search modes:
- Keyword search: kite-memory.sh search "postgres" does fuzzy text matching across all notes
- Semantic search: the injection hook uses vector similarity to find conceptually related notes, even when the exact keywords don't match
3. The Injection Hook
Claude Code supports hooks — shell scripts that run in response to events like session start and prompt submission. The injection hook fires on every user prompt.
The hook takes the incoming prompt text, computes a vector embedding for it, and runs a similarity search against the vault. It returns the top two matching notes — not the most recent, not the highest-scored globally, but the notes most semantically relevant to what you're about to ask.
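The similarity step can be sketched in plain Python as cosine similarity over the sidecar index, keeping the top two (a stand-in for illustration, not Kite's actual implementation):

```python
import math

def top_matches(query_vec, index, k=2):
    """Return the k note paths whose vectors are most similar to the query.

    `index` maps note path -> embedding vector, as built by the upsert step.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(index, key=lambda p: cosine(query_vec, index[p]), reverse=True)
    return ranked[:k]
```

At a few hundred notes, a brute-force scan like this is fast enough that no approximate-nearest-neighbor machinery is needed.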
These notes get injected into the session context as a <vault-context> block before your prompt reaches the model. The model sees:
[Relevant vault context:]
--- [decisions/postgres-over-sqlite-feb26.md] ---
You chose PostgreSQL over SQLite because...
--- [projects/mycelia-auth-architecture.md] ---
Auth uses JWT with 24h expiry...
Then your actual prompt.
The model doesn't need to be told to use this context. It's just there, part of the conversation. Relevant knowledge surfaces automatically. If the vault has nothing relevant, the block is empty and nothing is injected — no noise, no false matches.
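Assembling the injected block, including the empty case, might look like this sketch (the helper name and exact markers are illustrative):

```python
def format_vault_context(notes):
    """Render matched notes as the injected context block.

    `notes` is a list of (path, text) pairs; an empty list yields an
    empty string so nothing is injected at all.
    """
    if not notes:
        return ""
    parts = ["<vault-context>", "Relevant vault context:"]
    for path, text in notes:
        parts.append(f"--- {path} ---")
        parts.append(text.strip())
    parts.append("</vault-context>")
    return "\n".join(parts)
```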
What This Changes in Practice
After a few weeks of running this system, the practical difference is hard to overstate.
Sessions start informed. When I open a new session to work on Mycelia's API layer, the vault context block already contains the relevant schema decisions, the JWT architecture note, and the key file paths. I don't type any of that. The session knows.
Decisions stay made. The classic problem with solo development: you make a decision, do something else for two weeks, and then make the opposite decision without remembering why the first one was wrong. With every decision captured in the vault and injected when relevant, I've caught three cases where I was about to undo something I'd already debated and rejected.
Insights compound. A debugging finding from week one (Rybbit analytics gets blocked by Arc even with uBlock disabled — proxy through a first-party route instead) is now a permanent vault note. Every future session touching analytics gets that note injected. The insight propagates.
The 232-task track record. Across 232 completed tasks in autonomous operation with zero failures, the memory system is a major reason that number holds. Tasks that require project context — architectural changes, integration work, convention-consistent code generation — get that context automatically. Sessions don't fail because they made decisions inconsistent with established patterns.
Implementation Notes
A few design choices worth calling out:
Haiku for extraction, not Sonnet. Extraction is a structured transformation task — read text, produce JSON. Haiku handles this cleanly at ~1/10 the cost. Sonnet is reserved for the actual work sessions.
Top 2 matches, not top 10. It was tempting to inject more context — the vault has hundreds of notes after a few weeks of operation. But relevance degrades fast after the top two matches. Injecting too much adds noise that competes with the actual task. Two notes is almost always the right amount.
Plain markdown, not a database. Every vault note is readable without any tooling. This is deliberate. When the extraction daemon has a bug (it has had bugs), I can read the raw files to audit what was captured. When I want to search manually, grep works. The vector index is an enhancement layer, not the primary storage.
Wikilinks are the secret. The [[wikilink]] convention creates a graph across notes. When I ask the daemon to synthesize a MEMORY.md summary, it uses the link graph to understand which concepts are central (many inbound links) and which are peripheral. The summary stays accurate as the vault grows.
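Counting inbound links is a one-pass scan over the vault; a sketch (the helper name and dict representation are assumptions):

```python
import re
from collections import Counter

def inbound_link_counts(notes: dict) -> Counter:
    """Count inbound [[wikilinks]] across a vault.

    `notes` maps note title -> markdown body. Concepts with many inbound
    links are treated as central when synthesizing a summary.
    """
    counts = Counter()
    for body in notes.values():
        for target in re.findall(r"\[\[([^\]]+)\]\]", body):
            counts[target] += 1
    return counts
```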
The Remaining Problem
The system has one weak point: extraction latency. The daemon runs every five minutes, which means a key insight from a session might not be in the vault until five minutes after the session ends. In practice this rarely matters — the next session usually starts well after the previous one ended. But for rapid-fire sequential sessions (one finishes, the next starts immediately), there's a window where the fresh insight hasn't propagated yet.
The fix is a write-through path: when a session explicitly flags something as important (/remember this), it writes directly to the vault in real time rather than waiting for the extraction daemon. That's on the roadmap. The polling daemon handles 90% of the cases well enough that the remaining 10% doesn't block shipping.
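Since the feature is still on the roadmap, this is purely hypothetical, but the write-through path could be as small as appending straight to a journal note:

```python
from datetime import date
from pathlib import Path

def remember(vault: Path, text: str) -> Path:
    """Write a flagged insight straight to the vault, bypassing the
    poll-based daemon. The function name, journals/ folder, and naming
    scheme are all illustrative.
    """
    note = vault / "journals" / f"remember-{date.today().isoformat()}.md"
    note.parent.mkdir(parents=True, exist_ok=True)
    with note.open("a") as f:
        f.write(text.strip() + "\n\n")
    return note
```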
The TL;DR: AI agents forget. A semantic memory vault with automatic extraction and injection fixes the forgetting at the session boundary. Two weeks of running this in production across 232 tasks convinced me it's the single highest-leverage infrastructure piece in the system.
The full implementation is part of Kite, the orchestration layer I'm building in public. More at kiteaiagent.com.