Large Language Models (LLMs) and Natural Language Processing (NLP)

Google Nested Learning

In their recent seminal work, Google Research introduces the “Nested Learning” paradigm—an evolution in machine-learning architecture that reframes the model-training problem as a hierarchy of interconnected optimisation loops rather than a monolithic process. 

Large language models (LLMs) excel through scale and ever-longer context windows, yet they face a foundational limitation: they cannot retain and build upon new knowledge while preserving prior capabilities, a failure known as catastrophic forgetting. Unlike neural networks frozen after pre-training, the human brain adapts continuously across time-scales: short-term working memory, mid-term skill acquisition, long-term memory consolidation. Nested Learning borrows this multi-time-scale structure.

At its core, the paradigm posits that architecture and optimisation are not separate—they are nested levels of the same learning system.
Each level (“component”) possesses its own context flow and update frequency.
The proposed architecture, named HOPE, exemplifies this: a self-modifying recurrent system with “continuum memory systems” (CMS) that update at diverse frequencies, enabling long-context retention, incremental learning without catastrophic forgetting, and superior language-modelling performance. 

Indeed, the current way we train and run inference on LLMs is far from optimal and extremely costly in time and resources. The future I foresee resolves this with a more dynamic, less rigid infrastructure that learns in real time, or at least far more easily than today's transformers.

Advantages of HOPE architecture

  • Enhanced capacity for long-context reasoning: HOPE demonstrates lower perplexity and higher accuracy than standard transformers on long-sequence tasks. 

  • Continual learning without degradation: The CMS framework means learning new tasks doesn’t necessarily overwrite older knowledge. 

  • Architectural unification: By treating the optimiser as an associative memory module nested within the model, the paradigm opens new routes for architectural innovation. 

  • Richer memory hierarchy: Unlike traditional models with short-term (context window) and long-term (pre-trained) memory only, Nested Learning introduces a spectrum of update-frequencies enabling “continuum memory”. 

Limitations

  • The architecture of nested optimisation loops and self-modification adds significant design and engineering overhead.

  • Although HOPE shows promising results in the paper, full benchmarking across domains remains limited; generalisation to all agent-systems remains to be proven.

  • Maintaining multiple update-frequencies and nested modules may increase compute and memory footprint compared to conventional architectures.

  • Interpretability and safety concerns: Self-modifying models (e.g., HOPE) may raise issues of control, transparency, and unintended behaviours.

Continual learning in modern ML has been constrained by catastrophic forgetting; Nested Learning tackles this by aligning memory update rates. Treating architecture and optimiser as a unified, nested system opens a new dimension in model design.
The notion of a “continuum memory system” (CMS) realises memory as a spectrum of modules rather than a binary short/long-term divide. Agents built with Nested Learning principles could shift from “responding” to “experiencing” — true self-improving systems. While scale has dominated ML headlines, the research indicates the next frontier is how models learn, not just how large they are.

Let's dive into the research paper

The paper “Nested Learning: The Illusion of Deep Learning Architectures” provides a substantially richer picture than the popular summaries around HOPE and catastrophic forgetting. It reframes much of modern deep learning—architectures and optimisers—as a hierarchy of associative memory systems operating at different time-scales. 

1. Nested Learning as a unifying paradigm

The central claim is that a neural network plus its optimiser is not a single learning loop, but a set of nested optimisation problems, each with its own “context flow” and update frequency. Components are ordered by how often they update:

  • Fast components: e.g. attention states or internal memories updated at every token or step.

  • Medium components: e.g. optimiser states (momentum, Adam’s first and second moments).

  • Slow components: e.g. model weights updated only during pre-training or occasionally in continual learning. 

This hierarchy mirrors the brain’s multi-time-scale processing (working memory, synaptic consolidation, systems consolidation), and the authors argue that most current LLMs are effectively “anterograde amnesiacs”: they can process the immediate context and rely on a frozen long-term memory, but cannot reliably integrate new experience into that long-term store once pre-training ends. 

Nested Learning (NL) therefore adds a new design axis besides depth/width: the number and structure of learning levels. Deeper in this sense means more nested optimisation processes, not just more layers.
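To make these time-scales concrete, here is a minimal Python sketch of my own (not code from the paper): a fast per-token state, a medium momentum-like state updated every step, and slow weights consolidated only at a coarse interval. The names, sizes, and the period K are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration of three nested time-scales (not code from the paper).
fast_state = np.zeros(8)            # fast: e.g. an attention/recurrent state, updated every token
medium_state = np.zeros(8)          # medium: e.g. optimiser momentum, updated every gradient step
slow_weights = rng.normal(size=8)   # slow: e.g. model weights, consolidated only every K steps

K = 100                             # assumed consolidation period for the slow level
beta, lr = 0.9, 1e-2

for step, token in enumerate(rng.normal(size=(1000, 8))):
    # Fast level: reacts to every single token (toy recurrent update).
    fast_state = 0.5 * fast_state + 0.5 * token

    # Medium level: compresses the gradient stream over time (momentum-like memory).
    grad = slow_weights - fast_state            # placeholder "gradient" signal
    medium_state = beta * medium_state + (1 - beta) * grad

    # Slow level: written to only occasionally, like weights in continual learning.
    if (step + 1) % K == 0:
        slow_weights -= lr * medium_state
```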

Optimisers reinterpreted as associative memory

A key insight is that classical optimisers (SGD, momentum, Adam) can be viewed as associative memories that compress gradients over time. 

When you add momentum, you introduce a second memory:

  • The main weights W_t: a slow memory of task structure.

  • The momentum term M_t: a fast, key-less memory storing recent gradients.

This is already a two-level nested optimiser. Adam extends this with separate memories for the first and second moments of gradients, creating parallel memories with the same frequency but different roles. 
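Read this way, momentum is a one-step recurrence that memorises a compressed gradient history, and Adam simply keeps two such memories in parallel. Below is a minimal sketch of that reading in my own notation (the function and variable names are assumptions, not from the paper; Adam's bias correction is omitted for brevity).

```python
import numpy as np

def momentum_step(W, M, grad, lr=1e-2, beta=0.9):
    """Two-level reading of momentum SGD: M is a fast, key-less memory that
    compresses the recent gradient stream; W is the slow memory of task structure."""
    M = beta * M + grad        # write the new gradient into the fast memory
    W = W - lr * M             # consolidate into the slow memory (the weights)
    return W, M

def adam_step(W, M, V, grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam keeps two parallel memories updated at the same frequency:
    M stores first moments, V stores second moments of the gradients."""
    M = b1 * M + (1 - b1) * grad
    V = b2 * V + (1 - b2) * grad**2
    W = W - lr * M / (np.sqrt(V) + eps)
    return W, M, V
```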

From this standpoint, they propose three directions for “deeper” optimisers:

  1. More expressive associations – treat momentum as a proper key–value memory, mapping gradient features (or preconditioned variants) to update directions, instead of a single aggregated vector.

  2. More expressive objectives – replace simple dot-product style internal losses with ℓ² regression objectives that better manage capacity and avoid pure Hebbian behaviour, giving more stable memorisation of gradient histories.

  3. More expressive memory modules – replace the linear momentum accumulator with an MLP or other non-linear module (their “Deep Momentum GD”), turning the optimiser state into a small neural network that learns how to integrate gradients. 

This connects directly to current interest in learned optimisers and test-time training: optimisation itself is an in-context learning process over gradients.
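As a rough illustration of the third direction, one could replace the linear momentum accumulator with a small non-linear module trained online on the gradient stream. The sketch below is my own simplified reading of the "Deep Momentum" idea, not the paper's algorithm: a tiny network phi is trained with an ℓ² objective to reconstruct incoming gradients, and its output is used as the update direction. The architecture, learning rates, and toy objective are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 16, 32                      # gradient dimension and hidden width (assumed)

# A tiny MLP phi acting as a learned, non-linear "momentum" memory.
A = rng.normal(scale=0.1, size=(h, d))
B = rng.normal(scale=0.1, size=(d, h))

def phi(g):
    return B @ np.tanh(A @ g)      # non-linear compression of a gradient

def inner_update(g, eta=1e-2):
    """One inner step on the l2 objective ||phi(g) - g||^2, i.e. the memory
    module learns to reconstruct (and thereby retain) the gradient stream."""
    global A, B
    z = np.tanh(A @ g)
    r = B @ z - g                               # reconstruction residual
    gB = np.outer(r, z)                         # gradient of 0.5*||r||^2 w.r.t. B
    gA = np.outer((B.T @ r) * (1 - z**2), g)    # gradient w.r.t. A via tanh'
    B -= eta * gB
    A -= eta * gA

W = rng.normal(size=d)             # the slow weights being optimised
for t in range(200):
    g = W - 1.0                    # toy gradient of 0.5*||W - 1||^2
    inner_update(g)                # update the memory module on this gradient
    W -= 1e-1 * phi(g)             # use the memory's output as the update direction
```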

Continuum Memory System and HOPE

The Continuum Memory System (CMS) generalises the classic “short-term vs long-term” distinction. Instead of two buckets, it proposes a chain of memory modules MLP_1 … MLP_n, each updated at its own frequency f_i. 

Given a stream of inputs, each module:

  • Processes tokens at every step for inference.

  • Updates its parameters only every X steps, where X is chosen so that slower modules integrate more stable patterns over longer horizons.

Transformers, under this lens, are just the trivial case with a single level: one persistent MLP plus a working-memory attention block. Nested Learning instead suggests multiple persistent MLPs, each responsible for compressing its own slice of temporal context into its parameters.
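A minimal sketch of this multi-frequency idea follows (my own simplification, not the paper's CMS; the module sizes, the Hebbian-style update, and the periods are made up): every module participates in the forward pass at every token, but level i only consolidates its parameters once every periods[i] steps, so later modules integrate slower, more stable statistics.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
n_levels = 3
periods = [1, 16, 256]            # assumed update periods, fast -> slow

# One linear "MLP" per level (kept linear for brevity); real CMS modules are MLPs.
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_levels)]
buffers = [np.zeros((d, d)) for _ in range(n_levels)]   # pending updates per level

def forward(x):
    # Every level processes every token at inference time.
    for P in params:
        x = np.tanh(P @ x)
    return x

for step in range(1024):
    x = rng.normal(size=d)
    y = forward(x)
    for i in range(n_levels):
        # Accumulate a toy Hebbian-style signal between consolidations...
        buffers[i] += np.outer(y, x) / periods[i]
        # ...but only write it into the parameters at this level's own frequency.
        if (step + 1) % periods[i] == 0:
            params[i] += 1e-3 * buffers[i]
            buffers[i][:] = 0.0
```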

HOPE (“self-referential learning module with continuum memory”) combines:

  • A Titans-style self-modifying sequence model (weights that learn to update themselves based on context).

  • The Continuum Memory System, so that different parts of the model adapt at different speeds. 
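The "self-modifying" ingredient can be pictured as a fast-weight layer whose weight matrix is rewritten by the tokens it processes. The sketch below is a generic delta-rule fast-weight update for illustration only; it is not the Titans or HOPE implementation, and the projections and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
F = np.zeros((d, d))              # fast weights, rewritten as the sequence unfolds

def step(x, F, lr=0.5):
    """Generic delta-rule fast-weight update (illustrative, not HOPE itself):
    the layer reads with the current F, then edits F toward storing (k -> v)."""
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    y = F @ q                              # read from the in-context memory
    F = F + lr * np.outer(v - F @ k, k)    # delta rule: overwrite what k currently retrieves
    return y, F

for x in rng.normal(size=(64, d)):
    y, F = step(x, F)
```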

Empirically, HOPE is evaluated against strong baselines (Transformer++, RetNet, DeltaNet, Titans, Samba, etc.) at 340M, 760M, and 1.3B parameters on language modelling (WikiText, LAMBADA) and common-sense reasoning benchmarks (PIQA, HellaSwag, WinoGrande, ARC-E/C, Social IQa, BoolQ). HOPE consistently:

  • Matches or slightly trails the very best recurrent hybrids in pure perplexity at small scale.

  • Becomes competitive or superior at larger scale, slightly outperforming Titans on the aggregate accuracy metric at 1.3B parameters (57.23 vs 56.82 average). 

Additional experiments (described mainly in the appendix) indicate:

  • Stronger long-context reasoning, thanks to the dynamic projection layers and deep memory.

  • Improved continual learning, with reduced catastrophic forgetting when trained incrementally.

  • Clear evidence of in-context learning emergence shaped by the nested optimisation design, rather than appearing as a mysteriously emergent property of scale alone. 

Implications for AI agents and system design

From this more detailed reading, several research-level implications emerge:

  1. Architectural depth vs learning depth

    The paper argues that adding more layers does not necessarily increase computational depth or algorithmic expressivity; what matters is how many learning levels exist and how they interact. For agent systems, this suggests new architectures where planner, memory, and optimiser are explicitly distinct nested learners.

  2. Unifying RNNs, attention, and optimisers

    Linear attention, state-space models, and fast-weight programmes can all be written as different associative memories with different internal objectives. This gives a common language for comparing new recurrent architectures (e.g. RWKV, Mamba, Titans, HOPE) and may guide principled hybrids rather than ad-hoc designs. 

  3. Designing learning-aware agents

    If HOPE-style modules are plugged into agents, one can endow different subsystems with different “plasticities” (a toy sketch follows at the end of this section):

    • Fast components for session-level adaptation (tasks, users, tools).

    • Medium components for skill-level refinement across tasks.

    • Slow components that encode robust, safety-critical knowledge with strict update controls.

  4. Safety and interpretability considerations

    Self-modifying weights plus continuum memories are powerful but harder to reason about. The paper frames them as mathematically “white-box” in terms of optimisation structure, yet real-world deployment will require additional monitoring tools to track how each memory level changes over time.

  5. Shift in research questions

    Instead of only asking, “How big should the model be?” or “How long a context can it handle?”, NL suggests questions such as:

    • How many learning levels should this system have?

    • Which signals feed each level’s context flow?

    • How do we constrain or regularise slower memories to remain stable yet adaptable?

These directions position Nested Learning not as a single architecture, but as a design framework for future AI systems that must be continuously learning, yet controllable—exactly the regime where agentic workflows, test-time training, and long-running tools are heading.
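As a toy illustration of the "plasticities" idea in point 3 above, one could assign each agent subsystem its own update period and gate writes accordingly. The class, thresholds, and level names below are hypothetical; they only show how strict update controls on slow, safety-critical memories might look in code.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryLevel:
    """A hypothetical agent memory level with its own plasticity (update period)."""
    name: str
    period: int                    # accept writes only once every `period` interactions
    frozen: bool = False           # slow, safety-critical levels can be locked entirely
    store: dict = field(default_factory=dict)

    def maybe_update(self, step: int, key: str, value) -> bool:
        if self.frozen or (step % self.period) != 0:
            return False           # write refused: wrong time-scale or locked level
        self.store[key] = value
        return True

# Assumed plasticity profile: fast session memory, medium skill memory,
# and slow safety-critical knowledge that only changes under explicit review.
levels = [
    MemoryLevel("session", period=1),
    MemoryLevel("skills", period=50),
    MemoryLevel("safety", period=10_000, frozen=True),
]

for step in range(1, 101):
    for level in levels:
        level.maybe_update(step, f"obs_{step}", {"note": "example observation"})
```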

Google Nested Learning – AI memorizes like our brain

Google Research’s Nested Learning paradigm reframes the age-old dichotomy of architecture vs optimiser into a unified, hierarchical system of nested learning loops. By deploying multiple modules updating at varied frequencies, the continuum memory system enables long-context retention and mitigates catastrophic forgetting. Their HOPE architecture exemplifies this, outperforming standard models in continual-learning tasks. For AI agents, this suggests a transition from static tools to evolving systems. The real frontier isn’t larger models — it’s learning better models.

Read more

SpikingBrain: a revolutionary brain-inspired ChatGPT made in China

The Chinese SpikingBrain is a new family of brain-inspired large language models that reimagines how AI can process information more efficiently. SpikingBrain models adopt a biological principle: neurons remain idle until an event triggers them to fire. This event-driven design reduces unnecessary computation, cuts energy use, and enables faster responses. SpikingBrain achieves over 100× speedup in “time to first token” for sequences up to 4 million tokens. Energy consumption drops by 97% compared to traditional LLMs.

Read more

Markov Chains, MDPs, and Memory-Augmented MDPs: The Mathematical Core of Agentic AI

Markov Chains, Markov Decision Processes (MDP), and Memory-augmented MDPs (M-MDP) form the mathematical backbone of decision-making under uncertainty. While Markov Chains capture stochastic dynamics, MDPs extend them with actions and rewards. Yet, real-world tasks demand memory—this is where M-MDPs shine. By embedding structured memory into the agent’s state, M-MDPs enable agentic AI systems to reason, plan, and adapt across long horizons. This blog post explores the mathematics, technicalities, and the disruptive role of M-MDPs in modern AI architectures.

Read more

Why 90% of Generative AI Projects Fail — and How to Avoid Becoming a Statistic

MIT’s 2025 report finds 95% of enterprise GenAI pilots fail, blocked by a “learning gap.” Tools that don’t adapt, remember, or integrate into workflows stall, while adaptive, embedded systems cross the GenAI Divide. The winners are startups rather than big companies: they focus on narrow but high-value use cases, embed in workflows, and scale through learning, while generic SaaS tools and in-house builds fail. Leaders must focus on strategic partnerships with startups, adaptive systems, back-office ROI, and agentic readiness to ensure AI delivers measurable impact, not hype.

Read more

Inside the AceReason-Nemotron LLM of NVIDIA

AceReason-Nemotron is a groundbreaking AI model developed by NVIDIA that redefines how we train large language models (LLMs) for math and coding tasks. Unlike traditional models trained through distillation, AceReason uses reinforcement learning (RL) guided by strict verification and binary rewards to push reasoning capabilities further—particularly for small and mid-sized models. Starting with math-focused RL and later fine-tuning on code, the model shows impressive cross-domain generalization: math-only training significantly boosts code performance before even seeing code-related tasks. The new strategies help AceReason-14B outperform strong baselines like DeepSeek-R1-Distill, OpenMath-14B, and OpenCodeReasoning-14B on benchmarks like AIME and LiveCodeBench. It even approaches the capabilities of frontier models like GPT-4 and Qwen-32B in specific reasoning domains. For AI researchers and recruiters, AceReason is a compelling case study in how reinforcement learning—when combined with rigorous training design—can unlock reasoning in smaller models that once seemed exclusive to ultra-large systems.

Read more

S1: The Open-Source AI Model Challenging Industry Giants

The landscape of AI language models has been dominated by proprietary systems requiring massive computational resources. However, a new contender, S1, is redefining what’s possible with efficient training techniques and open-source transparency. Developed by researchers from Stanford University, the University of Washington, and the Allen Institute for AI, S1 showcases a novel approach to improving reasoning capabilities without exponential increases in computational cost. It seems the next breakthrough will come from optimising reasoning methodologies. I envision two engineering paths we should follow for better LLM inference: prompt engineering and reasoning engineering (I wrote a post about this). Technical Overview: S1 employs a test-time scaling approach, allowing the model to enhance its reasoning capabilities dynamically during inference rather…

Read more

The Rise of Reasoning Engineering: optimizing reasoning beyond prompting

Reasoning Engineering is the next frontier in AI, optimizing how AI agents collaborate to enhance structured reasoning rather than relying solely on prompt engineering. This approach designs reasoning models, where multiple agents interact to refine inference depth, self-awareness, and response modulation.

For instance, to simulate shyness, an AI system combines emotional perception, self-consciousness modeling, uncertainty processing, and inhibition mechanisms. A RoBERTa model detects emotional triggers, a Bayesian agent estimates social scrutiny, and a GPT-4-based processor introduces hesitation. Finally, a Transformer inhibition model restricts emotional output, ensuring reserved, self-conscious responses, replicating human-like shyness in AI-driven interactions.

Read more

A New Frontier in AI: Introspection and the Changing Dynamics of Learning

Extract knowledge from LLMs for training: introspection might change the dynamics of learning. The landscape of training large language models (LLMs) is on the brink of a dramatic transformation. Insights into how LLMs can introspect (access and utilise their own internal knowledge) promise to reshape the costs and strategies of AI development. The implications are profound: the cost of training could collapse in the coming months, accelerating innovation and democratising access to cutting-edge AI technologies. A Past Vision Revisited: Rethinking How LLMs Learn. Years ago, I delved into the challenge of optimizing how LLMs acquire and refine knowledge. The central question was whether we could fundamentally alter the training phase itself, bypassing traditional methods that rely on ever-larger datasets and increasingly computationally expensive…

Read more

AI and the Future of Work: Job Apocalypse – new report predicts 8 million jobs cancelled because of Generative AI. Innovation and Employment Crisis in the UK

AI’s rapid evolution marks a pivotal shift in human civilization, presenting dual potentials to either aid or exacerbate our ecological crisis. Beyond mere technological convenience, AI redefines our existence, challenging the core of societal norms through mastery of language and manipulation. This transformative force could influence every aspect of life, from culture and politics to personal identity, demanding a critical examination of its role in shaping future societies.

Read more

Use Artificial Intelligence to implement the Prospect Theory of Daniel Kahneman: Shaping the Understanding of Economic Decision-Making with Large Language Models

Kahneman’s groundbreaking contributions to “prospect theory” highlighted the limitations of the expected utility theory, underscoring the significance of psychological biases in economic decision-making. This theory marked a significant departure from the assumption that individuals act purely on rational calculations, acknowledging instead the influence of various biases and heuristics. The human brain is a multifaceted mosaic, with each piece affecting our decision-making processes. Consequently, the images I have generated, inspired by Picasso’s style, reflect this complexity. Picasso was a trailblazer in depicting the multifaceted nature of the human mind. How do we make decisions? The recent passing of Daniel Kahneman at age 90 marks the end of an era for behavioural science but also solidifies a legacy that will persist through the…

Read more

AI and the Future of Work: Navigating the Crossroads of Innovation and Employment Crisis in the UK

AI’s rapid evolution marks a pivotal shift in human civilization, presenting dual potentials to either aid or exacerbate our ecological crisis. Beyond mere technological convenience, AI redefines our existence, challenging the core of societal norms through mastery of language and manipulation. This transformative force could influence every aspect of life, from culture and politics to personal identity, demanding a critical examination of its role in shaping future societies.

Read more

AI and the Future of Society: redefining the Foundation of Human Civilization.

AI’s rapid evolution marks a pivotal shift in human civilization, presenting dual potentials to either aid or exacerbate our ecological crisis. Beyond mere technological convenience, AI redefines our existence, challenging the core of societal norms through mastery of language and manipulation. This transformative force could influence every aspect of life, from culture and politics to personal identity, demanding a critical examination of its role in shaping future societies.

Read more
