MemoryAtlas

Benchmarks

Published results, grouped by benchmark. Each row keeps its backbone LLM, embedder, source, and a trust badge — because a memory score is only as meaningful as the pipeline and the party that measured it. The context-window baseline shows how far naive prompt-stuffing gets.

  • Independent: a neutral party ran it.
  • Self-reported: the framework's own vendor reported it.
  • Unverified: the source is neutral but the result has not yet been reproduced.
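
One way to picture what each row carries is a small record type. The sketch below is illustrative only: the names `Trust` and `ResultRow` are hypothetical and are not taken from MemoryAtlas itself. The fields mirror the columns used in the tables on this page, and the example values come from the Letta (MemGPT) LoCoMo row.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trust(Enum):
    """Trust badges as described in the legend (hypothetical type name)."""
    INDEPENDENT = "Independent"      # a neutral party ran it
    SELF_REPORTED = "Self-reported"  # the framework's own vendor reported it
    UNVERIFIED = "Unverified"        # neutral source, not yet reproduced


@dataclass(frozen=True)
class ResultRow:
    """One published result (hypothetical record layout)."""
    framework: str
    value: float
    metric: str
    trust: Trust
    source: str
    date: str                        # ISO 8601 date string, e.g. "2025-08-12"
    backbone: Optional[str] = None   # left as None when the source does not state it
    embedder: Optional[str] = None   # left as None when the source does not state it


row = ResultRow(
    framework="Letta (MemGPT)",
    value=74.0,
    metric="accuracy",
    trust=Trust.SELF_REPORTED,
    source="Letta",
    date="2025-08-12",
    backbone="gpt-4o-mini",
    embedder="text-embedding-3-large",
)
```

Keeping `backbone` and `embedder` optional reflects that many rows on this page leave one or both blank.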

LoCoMo

32K-context era (ACL 2024) · 1,982 questions

Long-term, multi-session conversational recall (single-hop, multi-hop, open-domain, temporal).

Caveats

  • Average context length is modest by 2026 standards; a 'dump everything into the prompt' baseline now scores competitively.
  • Does not explicitly score knowledge updates.

| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| ByteRover | 96.1 accuracy | Gemini 3 Flash (curation/query) + Gemini 3.1 Pro (justifier) | | Self-reported | ByteRover team (Nguyen et al.) | 2026-04-02 |
| Mem0 | 92.5 accuracy | | | Self-reported | Mem0 | 2026-04-01 |
| ByteRover | 92.2 accuracy | Gemini 3 Flash (curation/judge) + Gemini 3 Pro (answer/justifier, best run) | | Self-reported | ByteRover | 2026-02-27 |
| Honcho | 89.9 accuracy | | | Self-reported | Honcho (Plastic Labs) | 2026-05-26 |
| MIRIX | 85.38 accuracy | gpt-4.1-mini | | Self-reported | MIRIX (Wang & Chen) | 2025-07-10 |
| Memori | 81.95 accuracy | | | Self-reported | Memori (MemoriLabs) | 2026-05-28 |
| MemOS | 75.8 accuracy | GPT-4o-mini | | Self-reported | MemOS (MemTensor et al.) | 2025-07-04 |
| Letta (MemGPT) | 74 accuracy | gpt-4o-mini | text-embedding-3-large | Self-reported | Letta (MemGPT authors: Packer, Wooders et al.) | 2025-08-12 |
| LiCoMemory | 67.2 accuracy | gpt-4o-mini | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) | 2025-11-03 |
| Mem0 | 66.88 accuracy | | | Independent | Hindsight/Vectorize (competitor re-run) | 2026-04-02 |
| LiCoMemory | 62.99 accuracy | Llama-3.1-70B-Instruct-Turbo | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) | 2025-11-03 |
| Mem0 | 54.68 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.), competitor re-run | 2025-11-03 |
| A-MEM | 48.59 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.), competitor re-run | 2025-11-03 |
| A-MEM | 48.38 accuracy | gpt-4o-mini | | Independent | MIRIX (Wang & Chen), competitor re-run | 2025-07-10 |
| Zep (Graphiti) | 44.76 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.), competitor re-run | 2025-11-03 |
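
To see how much the trust badge can matter, the self-reported and independently measured scores for the same framework can be compared directly. The following is a minimal sketch: the helper name `trust_gaps` is hypothetical, and the tuples are a hand-copied subset of the LoCoMo rows above. The runs differ in backbone and harness, so the gap should be read as a rough indication, not as a controlled measurement.

```python
from collections import defaultdict

# (framework, score, trust badge), copied from a subset of the LoCoMo rows
rows = [
    ("Mem0", 92.5, "Self-reported"),
    ("Mem0", 66.88, "Independent"),
    ("Mem0", 54.68, "Independent"),
    ("LiCoMemory", 67.2, "Self-reported"),
    ("LiCoMemory", 62.99, "Self-reported"),
    ("A-MEM", 48.59, "Independent"),
]


def trust_gaps(rows):
    """Best self-reported score minus best independent score, per framework.

    Frameworks that lack either kind of result are skipped.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for framework, score, trust in rows:
        scores[framework][trust].append(score)

    gaps = {}
    for framework, by_trust in scores.items():
        self_reported = by_trust.get("Self-reported")
        independent = by_trust.get("Independent")
        if self_reported and independent:
            gaps[framework] = round(max(self_reported) - max(independent), 2)
    return gaps


gaps = trust_gaps(rows)
print(gaps)  # {'Mem0': 25.62}
```

Only Mem0 appears in the result because it is the only framework in this subset with both a self-reported and an independent row.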

LongMemEval

32K-context era (2024) · 500 questions

Multi-session recall including knowledge updates across ~500 questions.

Caveats

  • Like LoCoMo, large modern context windows weaken it as an isolation test of memory.
  • LongMemEval-S (~103k tokens) fits inside a 128k context window, so a full-context baseline can solve much of it without memory — 'borderline' saturation risk per Jiang et al., 'Anatomy of Agentic Memory' (arXiv:2602.19320, 2026).
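
The saturation concern in the caveats above comes down to simple arithmetic. The sketch below uses hypothetical helper names (`fits_in_window`, `headroom`) and takes the token counts quoted on this page as approximate inputs. The `reserved_tokens` parameter is an assumption added for illustration, to leave room for the system prompt and the answer.

```python
def headroom(benchmark_tokens: int, window_tokens: int, reserved_tokens: int = 0) -> int:
    """Tokens left in the window after loading the whole benchmark history."""
    return window_tokens - benchmark_tokens - reserved_tokens


def fits_in_window(benchmark_tokens: int, window_tokens: int, reserved_tokens: int = 0) -> bool:
    """True when a full-context baseline can hold the entire history at once."""
    return headroom(benchmark_tokens, window_tokens, reserved_tokens) >= 0


WINDOW = 128_000  # context window size named in the caveat

# LongMemEval-S at roughly 103k tokens fits, with about 25k tokens to spare.
print(fits_in_window(103_000, WINDOW))   # True
print(headroom(103_000, WINDOW))         # 25000

# A conversation of roughly 1M tokens does not fit.
print(fits_in_window(1_000_000, WINDOW))  # False
```

When the history fits, a high score says little about the memory system itself, since the model could read everything directly.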

| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| agentmemory | 95.2 recall | | all-MiniLM-L6-v2 | Self-reported | rohitg00 (agentmemory authors) | 2026-05-20 |
| Mem0 | 94.4 accuracy | | | Self-reported | Mem0 | 2026-04-01 |
| Hindsight | 91.4 accuracy | Gemini 3 Pro | | Self-reported | Hindsight (Vectorize) | 2026-04-02 |
| Honcho | 90.4 accuracy | | | Self-reported | Honcho (Plastic Labs) | 2026-05-26 |
| Zep (Graphiti) | 90.2 accuracy | gpt-5.4 (reasoning=medium) | | Self-reported | Zep | 2026-05-28 |
| RetainDB | 79 accuracy | gpt-5.4 | | Self-reported | RetainDB | 2026-03-01 |
| MemOS | 77.8 accuracy | GPT-4o-mini | | Self-reported | MemOS (MemTensor et al.) | 2025-07-04 |
| LiCoMemory | 73.8 accuracy | gpt-4o-mini | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) | 2025-11-03 |
| Zep (Graphiti) | 71.2 accuracy | GPT-4o | | Independent | Hindsight/Vectorize (competitor re-run) | 2026-04-02 |
| LiCoMemory | 69.2 accuracy | Llama-3.1-70B-Instruct-Turbo | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) | 2025-11-03 |
| Zep (Graphiti) | 63.8 accuracy | GPT-4o | | Self-reported | Zep | 2026-02-01 |
| Mem0 | 62.6 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.), competitor re-run | 2025-11-03 |
| Zep (Graphiti) | 58.6 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.), competitor re-run | 2025-11-03 |
| A-MEM | 55 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.), competitor re-run | 2025-11-03 |
| Mem0 | 49 accuracy | GPT-4o | | Independent | Zep (competitor harness) | 2026-02-01 |
| MIRIX | 43.49 accuracy | GPT-4o-mini | | Independent | MemOS (MemTensor), competitor re-run | 2025-07-04 |

BEAM (1M)

ICLR 2026

Long-term memory across ~1M-token conversations spanning multiple domains.

Caveats

  • Built specifically to escape the context-window rot that affects LoCoMo and LongMemEval.

| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| Mem0 | 64.1 accuracy | | | Self-reported | Mem0 | 2026-04-01 |
| Context-window baseline | 64.1 accuracy | | | Unverified | Mem0 (benchmark summary) | 2026-03-01 |

BEAM (10M)

ICLR 2026

Long-term memory stressed to ~10M-token scale.

Caveats

  • Hardest tier; scores drop sharply, exposing real retention limits.

| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| Hindsight | 64.1 accuracy | | | Self-reported | Hindsight (Vectorize) | 2026-04-02 |
| Mem0 | 48.6 accuracy | | | Self-reported | Mem0 | 2026-04-01 |
| Cognee | 0.67 accuracy | | | Self-reported | cognee maintainers (README Benchmarks section) | 2026-06-28 |