Making LLMs Truly Remember You | LightMem: Lightweight and Efficient Memory-Augmented Generation
This article is also available in Chinese.
Introduction
When we say to an AI assistant, "Can you continue with the travel plan you helped me design last time?" it politely replies: "I'm sorry, I don't have access to our previous conversation." This is not a minor inconvenience — it is a fundamental limitation of current large language models in real-world deployment: LLMs have no long-term memory.
Every conversation starts from a blank slate. Users are forced to repeatedly re-introduce their background, preferences, and context. As AI Agents are increasingly deployed in real-world scenarios, this problem becomes ever more critical: a truly useful personal assistant must remember who you are, what you like, and what you have said before.
LightMem is built to solve exactly this problem. It is a lightweight, efficient, and modular Memory-Augmented Generation (MAG) framework that provides large language models and AI Agents with structured long-term memory storage, retrieval, and update mechanisms. LightMem has been officially accepted at ICLR 2026 and is fully open-sourced on GitHub.
Code Repository: https://github.com/zjunlp/LightMem
Paper (ICLR 2026): https://arxiv.org/abs/2510.18866
1. Why Do LLMs Need Long-Term Memory?
Large language models have demonstrated remarkable capabilities in reasoning, dialogue, and code generation. However, they are fundamentally stateless — each inference is completed within an isolated context window, and any information beyond that window is completely invisible to the model.
The field currently offers two primary workarounds, each with significant limitations:
Approach 1: Expanding the Context Window. Model context lengths have grown from 4K to 128K tokens and beyond in recent years. But this comes with quadratic computational overhead, and the "lost-in-the-middle" problem — where models fail to attend to information buried in long contexts — remains severe. Simply stuffing all conversation history into the context is not a sustainable solution.
Approach 2: External Vector Databases (RAG). Storing conversation history in a vector database and retrieving relevant snippets at query time is the most common approach today. However, raw conversations are full of redundant information, making direct storage inefficient. More critically, a user's memories often contradict or supersede each other ("I used to love spicy food, but I've switched to a milder diet recently"). Simple vector retrieval cannot handle this kind of memory conflict and evolution.
This is precisely where LightMem starts: memory is not just storage — it requires understanding, compression, conflict resolution, and organization. LightMem transforms conversation history into structured, semantically rich memory units, and updates them at the right moment with minimal cost.
2. LightMem: A Systematic Solution to Memory Management
Core Design Principles
LightMem is built around three core principles:
Lightweight: Memory storage and retrieval must be efficient enough not to become a system bottleneck. LightMem minimizes token consumption through pre-compression, summarization, and offline batch updates.
Structured: Raw conversations are unstructured streams of text, whereas truly valuable memories are structured factual units. LightMem extracts key information from conversations into memory entries enriched with metadata (timestamps, topic labels, entity tags), enabling precise retrieval.
Evolvable: A user's state and preferences change over time. The memory system must be able to identify and resolve conflicting information, updating existing memories rather than simply appending new entries indefinitely.
Architecture Overview
LightMem adopts a modular pipeline architecture, decomposing the full memory lifecycle into clearly defined processing stages: pre-compression, topic segmentation, memory extraction, offline update, and retrieval.
This pipeline design makes every module independently replaceable, allowing users to flexibly configure the system based on their own resources and requirements.
3. Key Technical Modules
Pre-Compression and Topic Segmentation
Real conversations are full of redundancy: small talk, repeated confirmations, and filler expressions. Before storing conversations into the memory system, LightMem uses LLMLingua-2 or an entropy-based compression algorithm to distill the raw text, substantially reducing the token cost of downstream LLM calls while preserving core semantics.
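The entropy-based alternative mentioned above is not reproduced here, but its core idea — keep the tokens that carry the most information and drop predictable filler — can be sketched with a simple frequency-based surprisal score. All names below are illustrative, not LightMem's actual API:

```python
import math
from collections import Counter

def compress_by_surprisal(text: str, keep_ratio: float = 0.5) -> str:
    """Toy entropy-style compressor: rank words by their surprisal
    (-log p) under the empirical unigram distribution and keep only
    the most informative fraction, preserving original word order."""
    words = text.split()
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    surprisal = {w: -math.log(c / total) for w, c in counts.items()}
    k = max(1, int(len(words) * keep_ratio))
    # Indices of the k highest-surprisal words (stable sort keeps order on ties).
    ranked = sorted(range(len(words)),
                    key=lambda i: surprisal[words[i].lower()], reverse=True)
    keep = set(ranked[:k])
    return " ".join(w for i, w in enumerate(words) if i in keep)

text = "yes yes okay okay so the trip to Kyoto starts on Friday okay yes"
print(compress_by_surprisal(text, keep_ratio=0.4))  # → "so the trip to Kyoto"
```

A real compressor like LLMLingua-2 uses a trained token-classification model rather than unigram counts, but the filtering principle is the same: repeated filler ("yes", "okay") is predictable and cheap to drop; rare content words carry the facts worth storing.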
At the same time, a single conversation often spans multiple topics — drifting from travel planning to work issues to dietary preferences. The topic segmentation module identifies semantic boundaries in the conversation and splits long dialogues into independent topic segments. This not only refines the granularity of memory entries but also prevents information from different topics from interfering with each other. The two modules can share intermediate computation via the precomp_topic_shared configuration, further reducing overhead.
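One common way to find such semantic boundaries — and a plausible stand-in for what a segmentation module does — is to compare the similarity of consecutive turns and cut wherever it drops below a threshold. The sketch below uses a toy bag-of-words cosine instead of real sentence embeddings; the function names are illustrative, not LightMem's API:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment_by_topic(turns: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Cut the dialogue wherever lexical similarity between
    consecutive turns drops below the threshold."""
    segments, current = [], [turns[0]]
    for prev, turn in zip(turns, turns[1:]):
        bow_prev = Counter(prev.lower().split())
        bow_turn = Counter(turn.lower().split())
        if cosine(bow_prev, bow_turn) < threshold:
            segments.append(current)   # topic shift: close current segment
            current = []
        current.append(turn)
    segments.append(current)
    return segments

turns = [
    "Let's plan the trip to Kyoto in April.",
    "For the Kyoto trip, book the hotel near the station.",
    "By the way, my cat Mochi only eats salmon.",
]
print([len(s) for s in segment_by_topic(turns)])  # → [2, 1]
```

The two travel turns stay in one segment while the cat digression starts a new one, so a later memory entry about Mochi's diet never pollutes the travel-planning memory.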
Memory Extraction and Offline Update
Each topic segment is processed by an LLM and distilled into a structured Memory Entry containing: core facts, associated entities, timestamps, topic labels, and a compressed summary. This is the fundamental distinction between LightMem and naive RAG — what is stored is not the raw conversation, but semantically understood and organized knowledge units.
Offline update is LightMem's key mechanism for handling memory evolution. When a new memory entry has high semantic overlap with an existing one (exceeding a configurable score_threshold), the system triggers conflict detection and invokes an LLM to perform knowledge fusion, updating the old memory to reflect the latest state rather than appending a duplicate. This process runs as a batch job, minimizing the number of LLM calls and token consumption.
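The extract-then-merge flow can be sketched as follows. This is a minimal illustration, not LightMem's implementation: it scores overlap with Jaccard similarity over entity sets, and where the real system invokes an LLM to fuse conflicting facts, the toy simply lets the newer fact win.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    fact: str            # core fact distilled from a topic segment
    entities: set[str]   # associated entity tags
    timestamp: str
    topic: str

def overlap(a: MemoryEntry, b: MemoryEntry) -> float:
    """Toy semantic-overlap score: Jaccard similarity of entity sets."""
    union = a.entities | b.entities
    return len(a.entities & b.entities) / len(union) if union else 0.0

def offline_update(store: list[MemoryEntry], incoming: list[MemoryEntry],
                   score_threshold: float = 0.8) -> list[MemoryEntry]:
    """Batch update: merge each incoming entry into the most-overlapping
    stored entry above the threshold; otherwise append it as new."""
    for new in incoming:
        best = max(store, key=lambda old: overlap(old, new), default=None)
        if best is not None and overlap(best, new) >= score_threshold:
            # LightMem would call an LLM here to fuse the two facts;
            # in this sketch the newer fact simply replaces the old one.
            best.fact, best.timestamp = new.fact, new.timestamp
        else:
            store.append(new)
    return store

store = [MemoryEntry("User loves spicy food", {"user", "spicy food"},
                     "2025-01-10", "diet")]
incoming = [MemoryEntry("User switched to a milder diet", {"user", "spicy food"},
                        "2025-06-15", "diet")]
offline_update(store, incoming)
print(len(store), "|", store[0].fact)  # → 1 | User switched to a milder diet
```

Because the whole queue is processed in one batch pass, conflict detection and fusion cost a bounded number of LLM calls instead of one call per conversational turn.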
Hybrid Retrieval Strategy
LightMem supports three retrieval modes to accommodate different scenarios:
- embedding: Pure semantic vector retrieval, best suited for open-ended queries with fuzzy semantics
- context: BM25-based contextual retrieval, best suited for structured queries containing precise keywords or timestamps
- hybrid: A combined strategy that first filters candidates via context retrieval, then re-ranks with vector similarity, achieving a better balance between recall and precision
LightMem also supports hierarchical retrieval: first retrieving session-level summaries to identify relevant time periods, then drilling into fine-grained memory entries within those periods. This improves retrieval efficiency while reducing interference from irrelevant noise.
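The two-stage hybrid mode can be sketched as a cheap keyword prefilter followed by an embedding re-rank. In the real system the first stage is BM25 and the second a vector store such as Qdrant; this toy substitutes token overlap and a character-bigram hash embedding, so every name here is an illustrative assumption:

```python
import math

def keyword_score(query: str, doc: str) -> int:
    """Stage 1 (context): shared-token count as a crude BM25 stand-in."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def embed(text: str) -> list[float]:
    """Stub embedding: character-bigram hash counts (illustrative only)."""
    vec = [0.0] * 64
    for a, b in zip(text.lower(), text.lower()[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_retrieve(query: str, docs: list[str],
                    prefilter: int = 3, limit: int = 1) -> list[str]:
    # Stage 1: keyword prefilter narrows the candidate set cheaply.
    candidates = sorted(docs, key=lambda d: keyword_score(query, d),
                        reverse=True)[:prefilter]
    # Stage 2: re-rank the survivors by embedding similarity.
    qv = embed(query)
    return sorted(candidates, key=lambda d: cosine(qv, embed(d)),
                  reverse=True)[:limit]

docs = [
    "The cat's name is Mochi",
    "The trip to Kyoto starts Friday",
    "Favorite ice cream flavor is pistachio",
]
print(hybrid_retrieve("what is my cat's name", docs))
```

The design rationale carries over to the real system: the keyword stage is fast and high-recall, while the embedding stage is slower but precision-oriented, so running them in sequence gets most of the benefit of each at a fraction of the cost of embedding every document.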
4. Quick Start
Installation
git clone https://github.com/zjunlp/LightMem.git
cd LightMem
conda create -n lightmem python=3.11 -y
conda activate lightmem
pip install -e .
Initialize the Memory System
from lightmem.memory.lightmem import LightMemory

config_dict = {
    "pre_compress": True,
    "pre_compressor": {
        "model_name": "llmlingua-2",
        "configs": {
            "llmlingua_config": {
                "model_name": "/path/to/llmlingua-2-bert-base-multilingual-cased-meetingbank",
                "device_map": "cuda",
                "use_llmlingua2": True,
            }
        }
    },
    "topic_segment": True,
    "precomp_topic_shared": True,
    "memory_manager": {
        "model_name": "openai",
        "configs": {
            "model": "gpt-4o-mini",
            "api_key": "your_api_key",
            "max_tokens": 16000,
            "openai_base_url": "your_base_url",
        }
    },
    "index_strategy": "embedding",
    "text_embedder": {
        "model_name": "huggingface",
        "configs": {
            "model": "/path/to/all-MiniLM-L6-v2",
            "embedding_dims": 384,
            "model_kwargs": {"device": "cuda"},
        },
    },
    "retrieve_strategy": "embedding",
    "embedding_retriever": {
        "model_name": "qdrant",
        "configs": {
            "collection_name": "my_chat_memory",
            "embedding_model_dims": 384,
            "path": "./my_chat_memory",
        }
    },
    "update": "offline",
}

lightmem = LightMemory.from_config(config_dict)
Store Memories
session = {
    "timestamp": "2025-06-15",
    "turns": [
        [
            {"role": "user", "content": "My favorite ice cream flavor is pistachio, and my cat's name is Mochi."},
            {"role": "assistant", "content": "Got it! Pistachio ice cream and Mochi."}
        ],
    ]
}

for turn_messages in session["turns"]:
    for msg in turn_messages:
        msg["time_stamp"] = session["timestamp"]
    lightmem.add_memory(messages=turn_messages, force_extract=True)
Offline Update and Memory Retrieval
# Batch conflict resolution and knowledge fusion
lightmem.construct_update_queue_all_entries()
lightmem.offline_update_all_entries(score_threshold=0.8)
# Retrieve relevant memories
question = "What is my cat's name?"
memories = lightmem.retrieve(question, limit=5)
print(memories)
MCP Server Support
LightMem also provides a Model Context Protocol (MCP) server, enabling direct integration with MCP-compatible clients such as Claude Desktop and Cursor:
pip install '.[mcp]'
# Start the server via HTTP (port 8000)
fastmcp run mcp/server.py:mcp --transport http --port 8000
Example client MCP configuration:
{
  "mcpServers": {
    "LightMem": {
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}
5. Experimental Results
We conducted comprehensive evaluations on two long-term memory benchmarks — LongMemEval and LoCoMo — comparing LightMem against leading memory frameworks including Mem0, A-MEM, and MemoryOS.
Results on the LoCoMo dataset (backbone: gpt-4o-mini, judge: gpt-4o-mini):
| Method | Overall ACC ↑ | Multi-hop | Open-domain | Single-hop | Temporal | Total Tokens (k) ↓ | Runtime (s) ↓ |
|---|---|---|---|---|---|---|---|
| FullText | 73.83 | 68.79 | 56.25 | 86.56 | 50.16 | 54,884 | 6,971 |
| NaiveRAG | 63.64 | 55.32 | 47.92 | 70.99 | 56.39 | 3,870 | 1,884 |
| A-MEM | 64.16 | 56.03 | 31.25 | 72.06 | 60.44 | 21,665 | 67,084 |
| MemoryOS | 58.25 | 56.74 | 45.83 | 67.06 | 40.19 | 10,519 | 26,129 |
| Mem0 | 36.49 | 30.85 | 34.38 | 38.41 | 37.07 | 25,793 | 120,175 |
The results reveal a striking pattern: feeding the full conversation history into context (FullText) achieves the highest accuracy, but at the cost of enormous token consumption. Meanwhile, most existing memory frameworks incur substantial overhead in memory construction and updates, yet still fall significantly short of the FullText accuracy baseline. This highlights that preserving critical information through the compression and structuring process is the central challenge in memory system design.
Full experimental data — including results across different backbone models and judge models — is available on Google Drive, along with one-click reproduction scripts. We welcome the community to download and verify.
6. StructMem: Hierarchical Memory with Event-Level Structure
In February 2026, we released StructMem as an extension of LightMem, further enhancing the system's ability to handle complex narrative scenarios.
Standard flat memory extraction treats facts extracted from conversations as independent, isolated knowledge fragments. In reality, however, much important information has a natural event structure: something happens at a specific time, involves specific people, is caused by specific reasons, and produces specific outcomes. The relationships between these elements are what make a memory truly valuable.
StructMem introduces event-level memory binding and cross-event memory connections, organizing memories into a hierarchical structure that preserves temporal bindings and causal relationships. In complex question-answering scenarios involving temporal reasoning, causal tracing, and character relationship understanding, StructMem significantly reduces the loss of critical information compared to flat memory extraction.
Enabling StructMem in LightMem requires only a single configuration change:
config_dict["extraction_mode"] = "event"
For detailed documentation, see: StructMem.md
7. Looking Ahead: The Future of Memory Systems
LightMem represents an important step forward in memory-augmented generation, but we recognize there is still a long road ahead. The following directions are ones we are actively exploring:
KV Cache Pre-computation
For fixed long-term memory content, it is possible to pre-compute and cache the corresponding KV representations, reusing them directly at inference time without re-encoding. This enables lossless acceleration (offline pre-computation) or lossy but highly efficient online pre-computation, substantially reducing the latency of memory-augmented inference.
Coordinated Use of Context and Long-Term Memory
Current memory systems primarily follow a "retrieve-then-concatenate" paradigm. How to dynamically decide which memories should be placed in the context window versus retrieved on demand is a scheduling problem that deserves deeper investigation.
Multimodal Memory
LightMem currently focuses on textual memory. As multimodal Agents become increasingly prevalent, storing and retrieving key information from images and audio — "the floor plan you showed me last time" — will be an important frontier.
Multi-Agent Collaboration and Knowledge Sharing
In multi-agent collaborative settings, memory is no longer the private state of a single agent — it needs to be shared, synchronized, and negotiated across agents. Designing memory architectures that support knowledge sharing among multiple intelligent agents will be a critical infrastructure challenge for next-generation AI systems.
We sincerely hope LightMem can serve as a practical foundation for researchers building AI systems with genuine long-term memory. The project is fully open-source — stars, issues, and pull requests are all warmly welcome.
👉 Code & Docs: https://github.com/zjunlp/LightMem
👉 Paper (ICLR 2026): https://arxiv.org/abs/2510.18866
👉 Demo Video: YouTube | Bilibili