Making LLMs Truly Remember You | LightMem: Lightweight and Efficient Memory-Augmented Generation
This article is also available in Chinese.
Introduction
When we say to an AI assistant, "Can you continue with the travel plan you helped me design last time?" it politely replies: "I'm sorry, I don't have access to our previous conversation." This is not a minor inconvenience — it is a fundamental limitation of current large language models in real-world deployment: LLMs have no long-term memory.
Every conversation starts from a blank slate. Users are forced to repeatedly re-introduce their background, preferences, and context. As AI Agents are increasingly deployed in real-world scenarios, this problem becomes ever more critical: a truly useful personal assistant must remember who you are, what you like, and what you have said before.
LightMem is built to solve exactly this problem. It is a lightweight, efficient, and modular Memory-Augmented Generation (MAG) framework that provides large language models and AI Agents with structured long-term memory storage, retrieval, and update mechanisms. LightMem has been officially accepted at ICLR 2026 and is fully open-sourced on GitHub.
Code Repository: https://github.com/zjunlp/LightMem
Paper (ICLR 2026): https://arxiv.org/abs/2510.18866
1. Why Do LLMs Need Long-Term Memory?
Large language models have demonstrated remarkable capabilities in reasoning, dialogue, and code generation. However, they are fundamentally stateless — each inference is completed within an isolated context window, and any information beyond that window is completely invisible to the model.
The field currently offers two primary workarounds, each with significant limitations:
Approach 1: Expanding the Context Window. Model context lengths have grown from 4K to 128K tokens and beyond in recent years. But this comes with quadratic computational overhead, and the "lost-in-the-middle" problem — where models fail to attend to information buried in long contexts — remains severe. Simply stuffing all conversation history into the context is not a sustainable solution.
Approach 2: External Vector Databases (RAG). Storing conversation history in a vector database and retrieving relevant snippets at query time is the most common approach today. However, raw conversations are full of redundant information, making direct storage inefficient. More critically, a user's memories often contradict or supersede each other ("I used to love spicy food, but I've switched to a milder diet recently"). Simple vector retrieval cannot handle this kind of memory conflict and evolution.
This is precisely where LightMem starts: memory is not just storage — it requires understanding, compression, conflict resolution, and organization. LightMem transforms conversation history into structured, semantically rich memory units, and updates them at the right moment with minimal cost.
2. LightMem: A Systematic Solution to Memory Management
Core Design Principles
LightMem is built around three core principles:
Lightweight: Memory storage and retrieval must be efficient enough not to become a system bottleneck. LightMem minimizes token consumption through pre-compression, summarization, and offline batch updates.
Structured: Raw conversations are unstructured streams of text, whereas truly valuable memories are structured factual units. LightMem extracts key information from conversations into memory entries enriched with metadata (timestamps, topic labels, entity tags), enabling precise retrieval.
Evolvable: A user's state and preferences change over time. The memory system must be able to identify and resolve conflicting information, updating existing memories rather than simply appending new entries indefinitely.
Architecture Overview
LightMem adopts a modular pipeline architecture, decomposing the full memory lifecycle into clearly defined processing stages: pre-compression, topic segmentation, memory extraction, offline update, and retrieval.
This pipeline design makes every module independently replaceable, allowing users to flexibly configure the system based on their own resources and requirements.
3. Key Technical Modules
Pre-Compression and Topic Segmentation
Real conversations are full of redundancy: small talk, repeated confirmations, and filler expressions. Before storing conversations into the memory system, LightMem uses LLMLingua-2 or an entropy-based compression algorithm to distill the raw text, substantially reducing the token cost of downstream LLM calls while preserving core semantics.
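The entropy-based alternative mentioned above is not reproduced here, but its core idea — keep the tokens that carry the most information and drop predictable filler — can be sketched with a simple frequency-based surprisal score. All names below are illustrative, not LightMem's actual API:

```python
import math
from collections import Counter

def compress_by_surprisal(text: str, keep_ratio: float = 0.5) -> str:
    """Toy entropy-style compressor: rank words by their surprisal
    (-log p) under the empirical unigram distribution and keep only
    the most informative fraction, preserving original word order."""
    words = text.split()
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    surprisal = {w: -math.log(c / total) for w, c in counts.items()}
    k = max(1, int(len(words) * keep_ratio))
    # Indices of the k highest-surprisal words (stable sort keeps order on ties).
    ranked = sorted(range(len(words)),
                    key=lambda i: surprisal[words[i].lower()], reverse=True)
    keep = set(ranked[:k])
    return " ".join(w for i, w in enumerate(words) if i in keep)

text = "yes yes okay okay so the trip to Kyoto starts on Friday okay yes"
print(compress_by_surprisal(text, keep_ratio=0.4))  # → "so the trip to Kyoto"
```

A real compressor like LLMLingua-2 uses a trained token-classification model rather than unigram counts, but the filtering principle is the same: repeated filler ("yes", "okay") is predictable and cheap to drop; rare content words carry the facts worth storing.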
At the same time, a single conversation often spans multiple topics — drifting from travel planning to work issues to dietary preferences. The topic segmentation module identifies semantic boundaries in the conversation and splits long dialogues into independent topic segments. This not only refines the granularity of memory entries but also prevents information from different topics from interfering with each other. The two modules can share intermediate computation via the precomp_topic_shared configuration, further reducing overhead.
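One common way to find such semantic boundaries — and a plausible stand-in for what a segmentation module does — is to compare the similarity of consecutive turns and cut wherever it drops below a threshold. The sketch below uses a toy bag-of-words cosine instead of real sentence embeddings; the function names are illustrative, not LightMem's API:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment_by_topic(turns: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Cut the dialogue wherever lexical similarity between
    consecutive turns drops below the threshold."""
    segments, current = [], [turns[0]]
    for prev, turn in zip(turns, turns[1:]):
        bow_prev = Counter(prev.lower().split())
        bow_turn = Counter(turn.lower().split())
        if cosine(bow_prev, bow_turn) < threshold:
            segments.append(current)   # topic shift: close current segment
            current = []
        current.append(turn)
    segments.append(current)
    return segments

turns = [
    "Let's plan the trip to Kyoto in April.",
    "For the Kyoto trip, book the hotel near the station.",
    "By the way, my cat Mochi only eats salmon.",
]
print([len(s) for s in segment_by_topic(turns)])  # → [2, 1]
```

The two travel turns stay in one segment while the cat digression starts a new one, so a later memory entry about Mochi's diet never pollutes the travel-planning memory.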
Memory Extraction and Offline Update
Each topic segment is processed by an LLM and distilled into a structured Memory Entry containing: core facts, associated entities, timestamps, topic labels, and a compressed summary. This is the fundamental distinction between LightMem and naive RAG — what is stored is not the raw conversation, but semantically understood and organized knowledge units.
Offline update is LightMem's key mechanism for handling memory evolution. When a new memory entry has high semantic overlap with an existing one (exceeding a configurable score_threshold), the system triggers conflict detection and invokes an LLM to perform knowledge fusion, updating the old memory to reflect the latest state rather than appending a duplicate. This process runs as a batch job, minimizing the number of LLM calls and token consumption.
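The extract-then-merge flow can be sketched as follows. This is a minimal illustration, not LightMem's implementation: it scores overlap with Jaccard similarity over entity sets, and where the real system invokes an LLM to fuse conflicting facts, the toy simply lets the newer fact win.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    fact: str            # core fact distilled from a topic segment
    entities: set[str]   # associated entity tags
    timestamp: str
    topic: str

def overlap(a: MemoryEntry, b: MemoryEntry) -> float:
    """Toy semantic-overlap score: Jaccard similarity of entity sets."""
    union = a.entities | b.entities
    return len(a.entities & b.entities) / len(union) if union else 0.0

def offline_update(store: list[MemoryEntry], incoming: list[MemoryEntry],
                   score_threshold: float = 0.8) -> list[MemoryEntry]:
    """Batch update: merge each incoming entry into the most-overlapping
    stored entry above the threshold; otherwise append it as new."""
    for new in incoming:
        best = max(store, key=lambda old: overlap(old, new), default=None)
        if best is not None and overlap(best, new) >= score_threshold:
            # LightMem would call an LLM here to fuse the two facts;
            # in this sketch the newer fact simply replaces the old one.
            best.fact, best.timestamp = new.fact, new.timestamp
        else:
            store.append(new)
    return store

store = [MemoryEntry("User loves spicy food", {"user", "spicy food"},
                     "2025-01-10", "diet")]
incoming = [MemoryEntry("User switched to a milder diet", {"user", "spicy food"},
                        "2025-06-15", "diet")]
offline_update(store, incoming)
print(len(store), "|", store[0].fact)  # → 1 | User switched to a milder diet
```

Because the whole queue is processed in one batch pass, conflict detection and fusion cost a bounded number of LLM calls instead of one call per conversational turn.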
Hybrid Retrieval Strategy
LightMem supports three retrieval modes to accommodate different scenarios:
- embedding: Pure semantic vector retrieval, best suited for open-ended queries with fuzzy semantics
- context: BM25-based contextual retrieval, best suited for structured queries containing precise keywords or timestamps
- hybrid: A combined strategy that first filters candidates via context retrieval, then re-ranks with vector similarity, achieving a better balance between recall and precision
LightMem also supports hierarchical retrieval: first retrieving session-level summaries to identify relevant time periods, then drilling into fine-grained memory entries within those periods. This improves retrieval efficiency while reducing interference from irrelevant noise.
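The two-stage hybrid mode can be sketched as a cheap keyword prefilter followed by an embedding re-rank. In the real system the first stage is BM25 and the second a vector store such as Qdrant; this toy substitutes token overlap and a character-bigram hash embedding, so every name here is an illustrative assumption:

```python
import math

def keyword_score(query: str, doc: str) -> int:
    """Stage 1 (context): shared-token count as a crude BM25 stand-in."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def embed(text: str) -> list[float]:
    """Stub embedding: character-bigram hash counts (illustrative only)."""
    vec = [0.0] * 64
    for a, b in zip(text.lower(), text.lower()[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_retrieve(query: str, docs: list[str],
                    prefilter: int = 3, limit: int = 1) -> list[str]:
    # Stage 1: keyword prefilter narrows the candidate set cheaply.
    candidates = sorted(docs, key=lambda d: keyword_score(query, d),
                        reverse=True)[:prefilter]
    # Stage 2: re-rank the survivors by embedding similarity.
    qv = embed(query)
    return sorted(candidates, key=lambda d: cosine(qv, embed(d)),
                  reverse=True)[:limit]

docs = [
    "The cat's name is Mochi",
    "The trip to Kyoto starts Friday",
    "Favorite ice cream flavor is pistachio",
]
print(hybrid_retrieve("what is my cat's name", docs))
```

The design rationale carries over to the real system: the keyword stage is fast and high-recall, while the embedding stage is slower but precision-oriented, so running them in sequence gets most of the benefit of each at a fraction of the cost of embedding every document.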
4. Quick Start
Installation
git clone https://github.com/zjunlp/LightMem.git
cd LightMem
conda create -n lightmem python=3.11 -y
conda activate lightmem
pip install -e .
Initialize the Memory System
from lightmem.memory.lightmem import LightMemory

config_dict = {
    "pre_compress": True,
    "pre_compressor": {
        "model_name": "llmlingua-2",
        "configs": {
            "llmlingua_config": {
                "model_name": "/path/to/llmlingua-2-bert-base-multilingual-cased-meetingbank",
                "device_map": "cuda",
                "use_llmlingua2": True,
            }
        }
    },
    "topic_segment": True,
    "precomp_topic_shared": True,
    "memory_manager": {
        "model_name": "openai",
        "configs": {
            "model": "gpt-4o-mini",
            "api_key": "your_api_key",
            "max_tokens": 16000,
            "openai_base_url": "your_base_url",
        }
    },
    "index_strategy": "embedding",
    "text_embedder": {
        "model_name": "huggingface",
        "configs": {
            "model": "/path/to/all-MiniLM-L6-v2",
            "embedding_dims": 384,
            "model_kwargs": {"device": "cuda"},
        },
    },
    "retrieve_strategy": "embedding",
    "embedding_retriever": {
        "model_name": "qdrant",
        "configs": {
            "collection_name": "my_chat_memory",
            "embedding_model_dims": 384,
            "path": "./my_chat_memory",
        }
    },
    "update": "offline",
}

lightmem = LightMemory.from_config(config_dict)
Store Memories
session = {
    "timestamp": "2025-06-15",
    "turns": [
        [
            {"role": "user", "content": "My favorite ice cream flavor is pistachio, and my cat's name is Mochi."},
            {"role": "assistant", "content": "Got it! Pistachio ice cream and Mochi."}
        ],
    ]
}

for turn_messages in session["turns"]:
    for msg in turn_messages:
        msg["time_stamp"] = session["timestamp"]
    lightmem.add_memory(messages=turn_messages, force_extract=True)
Offline Update and Memory Retrieval
# Batch conflict resolution and knowledge fusion
lightmem.construct_update_queue_all_entries()
lightmem.offline_update_all_entries(score_threshold=0.8)
# Retrieve relevant memories
question = "What is my cat's name?"
memories = lightmem.retrieve(question, limit=5)
print(memories)
MCP Server Support
LightMem also provides a Model Context Protocol (MCP) server, enabling direct integration with MCP-compatible clients such as Claude Desktop and Cursor:
pip install '.[mcp]'
# Start the server via HTTP (port 8000)
fastmcp run mcp/server.py:mcp --transport http --port 8000
Example client MCP configuration:
{
  "mcpServers": {
    "LightMem": {
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}
5. Experimental Results
We conducted comprehensive evaluations on two long-term memory benchmarks — LongMemEval and LoCoMo — comparing LightMem against leading memory frameworks including Mem0, A-MEM, and MemoryOS.
Results on the LoCoMo dataset (backbone: gpt-4o-mini, judge: gpt-4o-mini):
| Method | Overall ACC ↑ | Multi-hop | Open-domain | Single-hop | Temporal | Total Tokens (k) ↓ | Runtime (s) ↓ |
|---|---|---|---|---|---|---|---|
| FullText | 73.83 | 68.79 | 56.25 | 86.56 | 50.16 | 54,884 | 6,971 |
| NaiveRAG | 63.64 | 55.32 | 47.92 | 70.99 | 56.39 | 3,870 | 1,884 |
| A-MEM | 64.16 | 56.03 | 31.25 | 72.06 | 60.44 | 21,665 | 67,084 |
| MemoryOS | 58.25 | 56.74 | 45.83 | 67.06 | 40.19 | 10,519 | 26,129 |
| Mem0 | 36.49 | 30.85 | 34.38 | 38.41 | 37.07 | 25,793 | 120,175 |
The results reveal a striking pattern: feeding the full conversation history into context (FullText) achieves the highest accuracy, but at the cost of enormous token consumption. Meanwhile, most existing memory frameworks incur substantial overhead in memory construction and updates, yet still fall significantly short of the FullText accuracy baseline. This highlights that preserving critical information through the compression and structuring process is the central challenge in memory system design.
Full experimental data — including results across different backbone models and judge models — is available on Google Drive, along with one-click reproduction scripts. We welcome the community to download and verify.
6. StructMem: Hierarchical Memory with Event-Level Structure
In February 2026, we released StructMem as an extension of LightMem, further enhancing the system's ability to handle complex narrative scenarios.
Standard flat memory extraction treats facts extracted from conversations as independent, isolated knowledge fragments. In reality, however, much important information has a natural event structure: something happens at a specific time, involves specific people, is caused by specific reasons, and produces specific outcomes. The relationships between these elements are what make a memory truly valuable.
StructMem introduces event-level memory binding and cross-event memory connections, organizing memories into a hierarchical structure that preserves temporal bindings and causal relationships. In complex question-answering scenarios involving temporal reasoning, causal tracing, and character relationship understanding, StructMem significantly reduces the loss of critical information compared to flat memory extraction.
Enabling StructMem in LightMem requires only a single configuration change:
config_dict["extraction_mode"] = "event"
For detailed documentation, see: StructMem.md
7. Looking Ahead: The Future of Memory Systems
LightMem represents an important step forward in memory-augmented generation, but we recognize there is still a long road ahead. The following directions are ones we are actively exploring:
KV Cache Pre-computation
For fixed long-term memory content, it is possible to pre-compute and cache the corresponding KV representations, reusing them directly at inference time without re-encoding. This enables lossless acceleration (offline pre-computation) or lossy but highly efficient online pre-computation, substantially reducing the latency of memory-augmented inference.
Coordinated Use of Context and Long-Term Memory
Current memory systems primarily follow a "retrieve-then-concatenate" paradigm. How to dynamically decide which memories should be placed in the context window versus retrieved on demand is a scheduling problem that deserves deeper investigation.
Multimodal Memory
LightMem currently focuses on textual memory. As multimodal Agents become increasingly prevalent, storing and retrieving key information from images and audio — "the floor plan you showed me last time" — will be an important frontier.
Multi-Agent Collaboration and Knowledge Sharing
In multi-agent collaborative settings, memory is no longer the private state of a single agent — it needs to be shared, synchronized, and negotiated across agents. Designing memory architectures that support knowledge sharing among multiple intelligent agents will be a critical infrastructure challenge for next-generation AI systems.
We sincerely hope LightMem can serve as a practical foundation for researchers building AI systems with genuine long-term memory. The project is fully open-source — stars, issues, and pull requests are all warmly welcome.
👉 Code & Docs: https://github.com/zjunlp/LightMem
👉 Paper (ICLR 2026): https://arxiv.org/abs/2510.18866
👉 Demo Video: YouTube | Bilibili