Two Agent-Memory Architectures — MemPalace's Structured Index vs Hermes Agent's Self-Curating Scratchpad

A side-by-side reading of MemPalace and Hermes Agent surfaces two competing primitives for agent memory — explicit graph-indexed recall versus emergent tool-mediated scratchpads

Overview

Two repos surfaced alongside each other on 2026-05-10 — MemPalace/mempalace and NousResearch/hermes-agent — and together they put two opposite primitives for agent memory in head-to-head contact. One is a structured index (wings/rooms/drawers plus a temporal knowledge graph); the other is an emergent scratchpad plus self-improving skills plus FTS5 recall. If the previous OS-layer post traced how the memory and workflow slots are forming, this post pulls on the memory slot itself and finds it splitting into two design philosophies.

1. MemPalace — push structured indexing to its limit

MemPalace/mempalace bills itself as “the best-benchmarked open-source AI memory system.” Created 2026-04-05, MIT, 51,879 stars at the 2026-05-11 push. Its bet collapses to one sentence — store the original text without summarizing, and let pre-existing structure narrow the semantic search.

The palace structure

  • wings — one per person or project; queries scope into a wing.
  • rooms — topic groups inside a wing.
  • drawers — the smallest unit, the verbatim text itself. No summarizing, no extraction, no paraphrase.
  • knowledge graph — local SQLite with entities, relationships, and validity windows. When a fact stops being true, the layer marks it explicitly instead of leaving the LLM to figure it out.
  • agent diaries — every specialist agent gets its own wing and journal, discoverable at runtime via mempalace_list_agents so the system prompt stays small.
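The hierarchy above can be sketched in a few lines. This is an illustrative model only — the class and function names are my assumptions, not MemPalace's actual data model, and a substring match stands in for real semantic search:

```python
from dataclasses import dataclass, field

@dataclass
class Drawer:
    text: str  # verbatim text, never summarized or paraphrased

@dataclass
class Room:
    topic: str
    drawers: list = field(default_factory=list)

@dataclass
class Wing:
    name: str  # one per person or project
    rooms: list = field(default_factory=list)

def scoped_search(wing: Wing, query: str) -> list:
    """Narrow candidates to one wing before matching.
    A real system would rank by embedding similarity;
    substring match stands in here."""
    hits = []
    for room in wing.rooms:
        for drawer in room.drawers:
            if query.lower() in drawer.text.lower():
                hits.append(drawer.text)
    return hits

wing = Wing("myapp", [Room("api", [Drawer("We switched to GraphQL for batching.")])])
print(scoped_search(wing, "graphql"))
```

The point of the structure is visible even in the toy version: the search loop never leaves the wing, so scope — not summarization — is what shrinks the candidate set.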

Benchmarks

LongMemEval, 500 questions:

| Mode | R@5 | LLM required |
| --- | --- | --- |
| Raw semantic search (no heuristics, no LLM) | 96.6% | None |
| Hybrid v4, 450q held-out | 98.4% | None |
| Hybrid v4 + LLM rerank, 500q | ≥99% | Any capable model |

Plus LoCoMo R@10 88.9% (hybrid v5, 1,986 questions), ConvoMem 92.9% recall across 250 items, MemBench (ACL 2025) R@5 80.3% across 8,500 items. Compared with agentmemory’s 95.2% on the same LongMemEval cut, MemPalace’s raw mode is +1.4pp ahead — the clearest signal that the marginal value of pre-baked structure shows up as retrieval recall.

Setup

uv tool install mempalace
mempalace init ~/projects/myapp

# Mine
mempalace mine ~/projects/myapp                   # project files
mempalace mine ~/.claude/projects/ --mode convos  # Claude Code sessions

# Search / load
mempalace search "why did we switch to GraphQL"
mempalace wake-up

No API key, no cloud call, ChromaDB as the default, with a pluggable interface at mempalace/backends/base.py. 29 MCP tools cover palace reads/writes, graph operations, cross-wing navigation, drawer management, and agent diaries.
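A pluggable backend interface of the kind mempalace/backends/base.py describes might look roughly like this. The method names and shapes here are assumptions for illustration, not the project's actual API; a trivial in-memory class stands in for the default ChromaDB backend:

```python
from abc import ABC, abstractmethod

class MemoryBackend(ABC):
    """Hypothetical sketch of a pluggable vector-store interface."""

    @abstractmethod
    def add(self, doc_id: str, text: str) -> None: ...

    @abstractmethod
    def search(self, query: str, k: int = 5) -> list: ...

class InMemoryBackend(MemoryBackend):
    """Stand-in for a real embedding store; matches by substring."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, query, k=5):
        return [d for d in self.docs.values() if query.lower() in d.lower()][:k]

backend = InMemoryBackend()
backend.add("d1", "Switched to GraphQL in 2025")
print(backend.search("graphql"))
```

Keeping storage behind an interface like this is what makes "no API key, no cloud call" possible as a default while leaving room for other stores.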

What it argues

MemPalace bets that memory quality is index quality. Compression and summarization lose information, so it keeps drawers verbatim and lets wing/room scope shrink what the LLM has to wade through. The knowledge graph’s validity windows are the more interesting move — they push fact decay over time out of LLM reasoning and into the index layer.
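The validity-window idea is easy to make concrete with SQLite. The schema below is my own minimal sketch, not MemPalace's actual tables: a superseded fact is never deleted — its window is closed and a new row is opened, so "what was true when" stays queryable without any LLM reasoning:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE facts (
    entity TEXT, attribute TEXT, value TEXT,
    valid_from TEXT, valid_to TEXT)""")  # valid_to IS NULL means "still true"

def assert_fact(entity, attribute, value, as_of):
    # Close any currently-open window for this (entity, attribute)...
    con.execute("""UPDATE facts SET valid_to = ?
                   WHERE entity = ? AND attribute = ? AND valid_to IS NULL""",
                (as_of, entity, attribute))
    # ...then open a new window for the replacement fact.
    con.execute("INSERT INTO facts VALUES (?, ?, ?, ?, NULL)",
                (entity, attribute, value, as_of))

def current(entity, attribute):
    row = con.execute("""SELECT value FROM facts
                         WHERE entity = ? AND attribute = ? AND valid_to IS NULL""",
                      (entity, attribute)).fetchone()
    return row[0] if row else None

assert_fact("myapp", "api_style", "REST", "2025-01-01")
assert_fact("myapp", "api_style", "GraphQL", "2025-06-01")
print(current("myapp", "api_style"))  # the REST row is kept, its window closed
```

This is the move the post calls "pushing fact decay out of LLM reasoning and into the index layer": staleness is a column, not an inference.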

2. Hermes Agent — push the emergent scratchpad to its limit

NousResearch/hermes-agent bills itself as “the agent that grows with you.” MIT, built by Nous Research, created 2025-07-22, 142,575 stars by 2026-05-11 — the larger crowd in this comparison set. Its bet is the opposite — memory is not a separate index, it is an emergent product of the agent operating itself.

Four streams that make up its memory

  1. agent-curated memory + periodic nudges — the agent decides what is worth keeping; nudges enforce persistence.
  2. self-authored skills — after a complex task, the agent can register a skill to the Skills Hub. Skills self-improve in use. Compatible with the agentskills.io open standard.
  3. FTS5 session search + LLM summarization — past conversations are searched via SQLite FTS5; the LLM summarizes hits for cross-session recall.
  4. user modeling — plastic-labs/honcho dialectic user modeling builds a deepening picture of who you are across sessions.
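Stream 3 is the most mechanical of the four and can be sketched directly with Python's built-in SQLite bindings. This is a minimal illustration of FTS5-based session recall, stopping at retrieval — in Hermes the hits would then be summarized by the LLM before entering context:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table indexing past conversation turns by session.
con.execute("CREATE VIRTUAL TABLE sessions USING fts5(session_id, turn)")
con.executemany("INSERT INTO sessions VALUES (?, ?)", [
    ("s1", "user asked about deploying the report cron job"),
    ("s2", "we decided to switch the API layer to GraphQL"),
    ("s3", "notes on Telegram gateway configuration"),
])

# Full-text query; FTS5's default tokenizer is case-insensitive,
# so "graphql" matches "GraphQL".
rows = con.execute(
    "SELECT session_id FROM sessions WHERE sessions MATCH ? ORDER BY rank",
    ("graphql",)).fetchall()
print([r[0] for r in rows])
```

Nothing here requires a model at all — which is exactly the point of the design: the LLM's job is deciding when to issue the query and what to do with the hits, not owning the index.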

Where it runs

Telegram · Discord · Slack · WhatsApp · Signal · Email · CLI, all from one gateway process. Seven terminal backends — local, Docker, SSH, Singularity, Modal, Daytona, Vercel Sandbox — with Modal and Daytona offering hibernation between sessions so idle cost is nearly zero. Not tied to a laptop.

Model freedom

A single hermes model swaps between Nous Portal, OpenRouter, NVIDIA NIM, Xiaomi MiMo, z.ai/GLM, Kimi/Moonshot, MiniMax, Hugging Face, OpenAI, or any custom endpoint. Because memory is an emergent operational byproduct rather than a model artifact, it follows the agent across model swaps.

What it argues

Hermes bets that memory has to be invoked — by the LLM itself. Retrieval correctness is not the index’s job; the LLM decides mid-turn what slice of the past it needs, calls the FTS5 search tool, builds a summary, and threads it into its own context. Skills are not written once but rewritten while being used — living procedural memory.

3. Head-to-head

| Field | MemPalace | Hermes Agent |
| --- | --- | --- |
| Maker | MemPalace | Nous Research |
| License | MIT | MIT |
| Created | 2026-04-05 | 2025-07-22 |
| Stars (5/11) | 51,879 | 142,575 |
| Memory model | structured index + KG | scratchpad + emergent skills + FTS |
| Storage | verbatim drawers | conversations, notes, skills; summarize on demand |
| Time handling | graph validity windows | LLM reconstructs by summarizing |
| Retrieval owner | the index (96.6% raw R@5) | the LLM via tools |
| Model coupling | model-agnostic (raw = 0 LLM calls) | model-agnostic (10+ providers) |
| Interface | 29 MCP tools + CLI | TUI + 6 messaging gateways |
| Atomic unit | a mempalace search | a hermes session |

4. Which scales for which task

  • When fact recall is the KPI — customer history, codebase decision logs, the “when and why did we switch X” class of questions — MemPalace is the better fit. 96.6% raw R@5 is a number nobody else has matched without an LLM in the loop.
  • When the agent has to live across days and modalities — start on Telegram, continue on Slack, run a cron job at 3am that ships a report — Hermes wins. You trade away some retrieval precision for operational continuity.
  • Single-session, single-task workloads — both are overkill. Today’s Claude and GPT context windows (hundreds of thousands to a million tokens) already absorb most of this. That is the load-bearing point — at one human, one session, neither is needed. The price tag only shows up at agent-team scale.

Where the design split pays off at team scale

  • N specialists must share the same fact pool → MemPalace’s wings + cross-wing navigation is the direct answer.
  • N channels must hold the same persona → Hermes’ Honcho dialectic modeling is the direct answer.
  • N days of evolving procedure → Hermes’ self-improving skills are the direct answer.
  • N years of fact decay → MemPalace’s temporal knowledge graph is the direct answer.

A one-line summary the community surfaced — MemPalace is “accuracy infrastructure,” Hermes is “operations infrastructure.” They share a word (“memory”) but their responsibilities barely overlap.

Insights

The thing worth taking from this digest is that two projects sitting at 51K and 142K stars at the same moment have defined “memory” in opposite directions. MemPalace sees memory as a searchable factual index and has spent its design budget on retrieval accuracy (96.6% raw R@5) plus a temporal graph with validity windows. Hermes sees memory as an operational flow the LLM invokes and has spent the same budget on scratchpads, self-improving skills, and continuity across messaging channels. Both deliberately decouple from the model — same direction as the prior OS-layer reading — but they draw the boundary between “what counts as the index” and “what counts as the agent” in opposite places. With current context windows nearly swallowing a single-user session whole, neither tool feels urgent today. The moment agents start operating as teams, the two designs convert directly into different cost, accuracy, and operational stability tradeoffs. The interesting question for the next quarter is whether the index camp absorbs emergent scratchpads into the index, or whether the scratchpad camp pulls explicit graphs in as just another tool. Convergence in one direction looks more likely than a stable equilibrium.

References

Core repos

Adjacent memory tools / comparison set

Protocols and runtimes

Benchmarks and papers
