DeepSeek has unveiled Engram, a groundbreaking architectural innovation that fundamentally rethinks how large language models handle memory and computation. Named after the neuroscientific term for a "memory trace," Engram introduces conditional memory via scalable lookup, creating a new axis of sparsity that complements existing Mixture-of-Experts approaches. This research, led by Liang Wenfeng in collaboration with Peking University, directly addresses one of the Transformer architecture's most significant inefficiencies.

The Two Jobs Problem

Standard Transformer architectures suffer from a fundamental inefficiency that DeepSeek researchers call the "two jobs problem." Every parameter in a traditional model serves two conflicting purposes simultaneously:

  • Static Recall: Remembering fixed information like programming syntax, factual knowledge, and common phrases
  • Dynamic Reasoning: Performing logic, algorithms, and complex multi-step inference

Using the same expensive GPU-bound parameters for both tasks is inherently wasteful. Static pattern recall, like knowing that "New York" is a city or that Python uses indentation for blocks, doesn't require the same computational machinery as solving a novel logic puzzle or writing complex code.

While Mixture-of-Experts (MoE) architectures addressed the question of "how to compute less" through conditional computation, Transformers still lacked a native primitive for knowledge lookup. As the DeepSeek researchers put it: "MoE only solves the problem of how to calculate less, while Engram directly solves the problem of not calculating blindly."

The Engram Solution

Engram introduces a separate, scalable lookup module designed specifically for memory retrieval. Its core innovation uses Modernized Hashed N-gram Embeddings to offload static pattern reconstruction from the neural network.

Instead of processing every token through deep neural layers, Engram takes slices of the input text (N-grams) and uses hash functions to map them directly to a massive, learnable lookup table. This operation has a deterministic O(1) time complexity, meaning retrieval speed and cost remain constant regardless of how many facts are stored.
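
To make the mechanism concrete, here is a minimal Python sketch of a hashed N-gram lookup. The table size, embedding width, and hash function are illustrative stand-ins, not DeepSeek's actual choices:

```python
import numpy as np

TABLE_SLOTS = 65_536   # illustrative number of slots, not DeepSeek's figure
EMBED_DIM = 64         # illustrative width of each stored memory vector

# Learnable lookup table (trained jointly with the rest of the model).
rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(TABLE_SLOTS, EMBED_DIM))

def ngram_lookup(token_ids: tuple[int, ...]) -> np.ndarray:
    """Map an N-gram of token IDs to one table row in O(1) time."""
    # Python's built-in hash() stands in for a fixed multiplicative hash.
    slot = hash(token_ids) % TABLE_SLOTS
    return table[slot]

# Retrieve the memory vector for a bigram such as "New York",
# represented here by hypothetical token IDs.
print(ngram_lookup((8271, 3304)).shape)  # (64,)
```

However many N-grams the model has seen, a lookup is always one hash and one row read, which is why the cost stays flat as the knowledge store grows.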

The architecture enables a clean separation of concerns:

  • Engram Module: Handles static pattern retrieval using cheap CPU RAM
  • Transformer Backbone: Focuses purely on dynamic reasoning using GPU computation

Technical Architecture

The Engram module is placed in the early layers of the Transformer backbone, where its role is "pattern reconstruction," quickly retrieving relevant factual patterns and historical context from its vast memory table. The system employs a two-stage processing approach:

Stage 1: Retrieval

Maps local context to static memory through tokenizer compression and hashed retrieval. The module maintains separate branches for different N-gram orders (bigrams, trigrams, etc.) to capture patterns at multiple linguistic scales.

Stage 2: Fusion

Integrates retrieved information back into the model using context-aware gating. The model's current hidden state acts as a query that determines how much weight to give the retrieved memory versus the standard neural computation.
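
A compressed, hypothetical sketch of this two-stage flow for a single token is shown below; the function names, shapes, and the simple softmax gate are our illustrative stand-ins rather than the paper's exact formulation:

```python
import numpy as np

DIM, TABLE_SLOTS = 64, 65_536
rng = np.random.default_rng(1)
table = rng.normal(scale=0.02, size=(TABLE_SLOTS, DIM))   # learnable memory
w_gate = rng.normal(scale=0.02, size=(DIM, DIM))          # learnable gate projection
hidden = rng.normal(size=DIM)                             # current hidden state

def retrieve(ngrams):
    """Stage 1: hash each local N-gram to a row of the memory table."""
    slots = [hash(g) % TABLE_SLOTS for g in ngrams]
    return table[slots]                       # (num_ngrams, DIM)

def fuse(hidden, memories):
    """Stage 2: weight retrieved memories by relevance to the hidden state."""
    scores = memories @ (w_gate @ hidden)     # one relevance score per memory row
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over retrieved rows
    return hidden + weights @ memories        # residual-style update

# Bigram and trigram context around the current token (hypothetical token IDs).
out = fuse(hidden, retrieve([(11, 42), (7, 11, 42)]))
print(out.shape)  # (64,)
```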

Four Key Design Elements

1. Tokenizer Compression

Engram compresses equivalent tokens (such as the same word with different capitalization forms) to the same canonical concept. This technique reduced the vocabulary size for the conditional memory module by 23%, making the lookup tables more efficient.
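
As an illustration of the idea (the actual compression rules are DeepSeek's; this sketch only folds capitalization variants onto one canonical token ID):

```python
def build_canonical_map(vocab: dict[str, int]) -> dict[int, int]:
    """Map each token ID to the ID of its lowercase canonical form."""
    canonical: dict[int, int] = {}
    lower_to_id: dict[str, int] = {}
    for tok, tok_id in vocab.items():
        key = tok.lower()
        lower_to_id.setdefault(key, tok_id)   # first variant seen becomes canonical
        canonical[tok_id] = lower_to_id[key]
    return canonical

# Toy vocabulary: "The", "the", and "THE" collapse to one concept,
# shrinking the effective vocabulary the memory module must index.
vocab = {"The": 0, "the": 1, "THE": 2, "York": 3, "york": 4}
print(build_canonical_map(vocab))  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```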

2. Multi-Head Hashing

Storing a table for every possible N-gram is intractable. Engram uses multi-head hashing with K hash heads per N-gram order to map compressed contexts to embedding tables via deterministic functions. This approach avoids memory explosion while mitigating hash collisions.
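
A simplified sketch of multi-head hashing follows; the head count, table size, and polynomial hash are illustrative assumptions rather than the paper's parameters:

```python
import numpy as np

NUM_HEADS = 4          # K hash heads per N-gram order (illustrative)
TABLE_SLOTS = 65_536   # slots per head, far fewer than all possible N-grams
DIM = 64

rng = np.random.default_rng(2)
tables = rng.normal(scale=0.02, size=(NUM_HEADS, TABLE_SLOTS, DIM))

# A distinct multiplier per head gives K independent hash functions, so two
# N-grams that collide under one head are unlikely to collide under all of them.
PRIMES = [1_000_003, 1_000_033, 1_000_037, 1_000_039]

def multi_head_lookup(ngram: tuple[int, ...]) -> np.ndarray:
    rows = []
    for head in range(NUM_HEADS):
        h = 0
        for tok in ngram:                      # simple polynomial rolling hash
            h = (h * PRIMES[head] + tok) % TABLE_SLOTS
        rows.append(tables[head, h])
    return np.stack(rows)                      # (NUM_HEADS, DIM)

print(multi_head_lookup((8271, 3304)).shape)   # (4, 64)
```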

3. Context-Aware Gating

To resolve ambiguity from hash collisions, Engram employs attention-inspired mechanisms with learnable projections (W_K, W_V). The hidden state serves as a dynamic query that weighs the relevance of retrieved memories.
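
A minimal sketch of this gating step, assuming attention-style projections and a softmax over the retrieved rows (the shapes and scaling are our assumptions):

```python
import numpy as np

DIM = 64
rng = np.random.default_rng(3)

# Learnable projections for keys and values (shapes are illustrative).
W_K = rng.normal(scale=0.02, size=(DIM, DIM))
W_V = rng.normal(scale=0.02, size=(DIM, DIM))

def gate(hidden: np.ndarray, retrieved: np.ndarray) -> np.ndarray:
    """Attention-style gating: the hidden state queries the retrieved rows."""
    keys = retrieved @ W_K.T               # (num_rows, DIM)
    values = retrieved @ W_V.T             # (num_rows, DIM)
    scores = keys @ hidden / np.sqrt(DIM)  # relevance of each memory row
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax down-weights colliding junk
    return weights @ values                # weighted memory contribution

hidden = rng.normal(size=DIM)
retrieved = rng.normal(size=(4, DIM))      # e.g. rows gathered by 4 hash heads
print(gate(hidden, retrieved).shape)       # (64,)
```

Rows that happen to share a hash slot with the true pattern receive low scores when they do not match the current context, so collisions degrade gracefully instead of corrupting the output.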

4. Multi-Scale N-grams

Natural language contains patterns at multiple scales. Bigrams like "New York" coexist with trigrams like "in order to" and longer phrases. Engram handles this by maintaining separate branches for different N-gram orders, then combining them through learned convolutions.
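
The paper's exact convolutional combiner is not reproduced here; this PyTorch sketch just illustrates one way separate per-order branches could be mixed with a learned convolution:

```python
import torch
import torch.nn as nn

DIM, SEQ_LEN = 64, 16
ORDERS = (2, 3, 4)            # bigram, trigram, 4-gram branches (illustrative)

class MultiScaleMixer(nn.Module):
    """Combine per-order memory streams with a learned 1D convolution."""
    def __init__(self):
        super().__init__()
        # Each branch contributes DIM channels; the conv mixes them back to DIM.
        self.mix = nn.Conv1d(len(ORDERS) * DIM, DIM, kernel_size=3, padding=1)

    def forward(self, branches: list[torch.Tensor]) -> torch.Tensor:
        # branches: one (batch, DIM, seq) tensor per N-gram order
        stacked = torch.cat(branches, dim=1)   # (batch, ORDERS*DIM, seq)
        return self.mix(stacked)               # (batch, DIM, seq)

mixer = MultiScaleMixer()
branches = [torch.randn(1, DIM, SEQ_LEN) for _ in ORDERS]
print(mixer(branches).shape)  # torch.Size([1, 64, 16])
```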

The U-Shaped Scaling Law

Through systematic experiments, DeepSeek discovered a "U-shaped scaling law" governing the optimal allocation between MoE capacity and Engram memory. This mathematical relationship guides resource distribution across different model scales.

The research found that the optimal balance allocates approximately 75% of sparse model capacity to dynamic reasoning (MoE experts) and 25% to static lookups (Engram memory). This ratio remained consistent across the model sizes tested.
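
As a back-of-the-envelope illustration of that split (the 20B sparse budget below is an arbitrary example, not a reported configuration):

```python
def split_sparse_budget(total_sparse_params: float,
                        engram_fraction: float = 0.25) -> tuple[float, float]:
    """Split a sparse-parameter budget between MoE experts and Engram memory
    using the roughly 75/25 ratio reported by the researchers."""
    engram = total_sparse_params * engram_fraction
    moe = total_sparse_params - engram
    return moe, engram

moe, engram = split_sparse_budget(20e9)  # hypothetical 20B sparse budget
print(f"MoE experts: {moe / 1e9:.1f}B, Engram memory: {engram / 1e9:.1f}B")
# MoE experts: 15.0B, Engram memory: 5.0B
```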

Model Configurations and Results

The researchers validated Engram at the 27B parameter scale. The Engram-27B model shares the same Transformer backbone as the MoE-27B baseline: a 30-block Transformer with a hidden dimension of 2560 and 32-head Multi-head Latent Attention.

While MoE-27B uses 72 routed experts and 2 shared experts, Engram-27B cuts the routed experts to 55 and reallocates the freed parameters into a 5.7B Engram memory, keeping total parameters at 26.7B with 3.8B activated.

Benchmark Performance

Under equivalent parameter and FLOP budgets, Engram-27B demonstrated consistent improvements over MoE baselines:

| Benchmark | MoE-27B | Engram-27B | Improvement |
| --- | --- | --- | --- |
| MMLU (Knowledge) | 57.4 | 60.4 | +3.0 |
| CMMLU (Chinese) | 57.9 | 61.9 | +4.0 |
| C-Eval | 58.0 | 62.7 | +4.7 |
| HumanEval (Code) | 37.8 | 40.8 | +3.0 |
| GSM8K (Math) | 58.4 | 60.6 | +2.2 |

Long Context Performance

At 32K context length using YaRN position encoding, Engram-27B matched or exceeded baseline performance on RULER benchmarks. Particularly notable was the improvement in Multi-Query Needle-in-Haystack accuracy: 99.6% versus 73.0% for the MoE baseline. This suggests Engram's memory architecture excels at retrieval tasks within extended contexts.

Mechanistic Insights

Analysis of the trained models reveals that Engram "relieves early layers from static pattern reconstruction," potentially preserving model depth for complex reasoning tasks. This distribution of labor between memory lookup and dynamic computation represents a fundamental architectural shift.

The implications extend beyond raw performance numbers. By decoupling memory from computation, Engram enables new deployment strategies:

  • Memory Offloading: The deterministic addressing allows embedding tables to be stored in host RAM rather than expensive GPU HBM, with minimal inference overhead (see the sketch after this list)
  • Scalable Knowledge: Adding more factual knowledge doesn't require retraining the entire model, just expanding the lookup tables
  • Efficient Updates: Static knowledge can potentially be updated independently of the reasoning components
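
A minimal sketch of the offloading pattern, assuming a PyTorch-style host-resident table; the sizes and the `fetch_rows` helper are illustrative, not DeepSeek's implementation:

```python
import torch

DIM, TABLE_SLOTS = 64, 1_000_000

# The Engram-style table lives in host RAM, not GPU HBM.
cpu_table = torch.empty(TABLE_SLOTS, DIM).normal_(std=0.02)
if torch.cuda.is_available():
    cpu_table = cpu_table.pin_memory()   # pinned pages speed host-to-GPU copies

def fetch_rows(slots: list[int]) -> torch.Tensor:
    """Gather only the addressed rows and ship them to the accelerator."""
    rows = cpu_table[torch.tensor(slots)]          # cheap CPU gather
    if torch.cuda.is_available():
        # non_blocking copies can overlap with GPU compute on pinned memory
        return rows.to("cuda", non_blocking=True)
    return rows                                    # CPU fallback for this sketch

print(fetch_rows([12, 9_876, 514_229]).shape)      # torch.Size([3, 64])
```

Because each token touches only a handful of rows, the per-step traffic between host memory and the GPU stays small even when the table itself is enormous.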

Implications for DeepSeek V4

Industry observers widely believe Engram will form the architectural foundation for DeepSeek V4, rumored for release around mid-February 2026. If the leaked specifications prove accurate, V4 could represent a significant leap in efficient AI architecture.

Expected V4 characteristics based on the Engram research:

  • Potentially 1 trillion parameters with Engram-based conditional memory
  • Enhanced long-context handling through improved memory utilization
  • Optimized for consumer hardware deployment (dual RTX 4090s or 5090s)
  • Strong coding performance, building on DeepSeek's existing strength in software engineering tasks

Enterprise Implications

For enterprises evaluating LLM strategies, Engram represents several important considerations:

Cost Efficiency: By offloading static knowledge to cheaper memory, Engram-style architectures could significantly reduce inference costs for knowledge-intensive applications.

Deployment Flexibility: The ability to separate memory from computation enables more flexible deployment architectures, potentially allowing enterprises to scale knowledge independently of reasoning capacity.

Fine-tuning Strategies: If Engram's memory tables can be updated independently, enterprises might gain new options for incorporating domain-specific knowledge without full model retraining.

Hardware Planning: The shift toward CPU RAM for memory storage could influence infrastructure decisions, potentially reducing dependence on expensive GPU memory.

Key Takeaways

  • Engram introduces conditional memory as a new sparsity axis, separating static recall from dynamic reasoning
  • O(1) lookup complexity enables constant-time retrieval regardless of knowledge base size
  • The U-shaped scaling law suggests optimal allocation of ~75% computation, ~25% memory
  • Benchmark results show consistent improvements across knowledge, reasoning, code, and math domains
  • Long-context performance improves dramatically, especially for retrieval-heavy tasks
  • Memory offloading to CPU RAM could reduce inference costs and enable new deployment strategies

"MoE only solves the problem of how to calculate less, while Engram directly solves the problem of don't calculate blindly."

DeepSeek Research Team
