Papers of the Month

January Papers: Conditional Memories for LMs, Audio-Visual FMs, and Batch Size Schedulers

Welcome to the first edition of our Paper of the Month newsletter for 2026!

This month, our team went through 21 papers to find the most insightful new work that we think has the potential to leave a mark. From this selection, three papers stood out in particular:

  • Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. Cheng et al. introduce a simple, scalable memory augmentation for large language models that offloads the cost of knowledge-based retrieval to embedding lookups (a minimal sketch follows this list).

  • LTX-2: Efficient Joint Audio-Visual Foundation Model. HaCohen et al. propose a joint text-conditioned audio-visual generation framework built using modality-specific VAEs, a refined text-conditioning module, and an asymmetric dual-stream diffusion transformer.

  • How to Set the Batch Size for Large-Scale Pre-training? Zhou et al. discuss how to identify the optimal batch size for large-scale pretraining, and find that dynamically increasing the batch size over the course of training can improve performance (a toy schedule also follows this list).
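
To make the conditional-memory idea concrete, here is a minimal sketch of a lookup-style memory layer in PyTorch. It is our illustration of the general idea (a top-k lookup into a large table of key/value embeddings), not the paper's exact architecture; all names and sizes below are our own assumptions.

```python
import torch
import torch.nn.functional as F

class LookupMemory(torch.nn.Module):
    """Toy memory layer: route a hidden state to the top-k slots of a
    large key/value table, so knowledge retrieval costs a sparse lookup
    rather than dense FFN compute. Illustrative, not the paper's design."""

    def __init__(self, d_model: int, n_slots: int = 4096, k: int = 4):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = torch.nn.Embedding(n_slots, d_model)  # lookup table
        self.k = k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        scores = h @ self.keys.T                 # (batch, n_slots)
        top, idx = scores.topk(self.k, dim=-1)   # keep only k slots per token
        w = F.softmax(top, dim=-1)               # weights over the k slots
        v = self.values(idx)                     # (batch, k, d_model) lookup
        return (w.unsqueeze(-1) * v).sum(dim=1)  # weighted readout

h = torch.randn(2, 64)
print(LookupMemory(64)(h).shape)  # only k of 4096 value rows are touched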
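
And for the batch-size paper, a toy step-based schedule shows what "dynamically increasing the batch size" can look like in practice. The doubling points and sizes are illustrative assumptions, not the schedule from Zhou et al.

```python
def batch_size_at(step: int, total_steps: int,
                  base: int = 256, max_bs: int = 2048) -> int:
    """Toy batch-size schedule: start small for fast early progress,
    then double at fixed fractions of training. The boundaries and
    sizes here are illustrative, not taken from the paper."""
    bs, frac = base, step / total_steps
    for boundary in (0.25, 0.5, 0.75):  # double at each boundary crossed
        if frac >= boundary:
            bs = min(bs * 2, max_bs)
    return bs

# 256 -> 512 -> 1024 -> 2048 over the course of training:
print([batch_size_at(s, 1000) for s in (0, 250, 500, 750)])
```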

December Papers: MoE, Fact-storing and Byteifying Language Models

Despite the holiday season and the busy NeurIPS period, December closed the year with a set of insightful papers. Our team reviewed the following three papers:

  • First up, SonicMoE tackles the issues of fine-grained and sparse MoEs, using hardware-aware optimizations to restore efficiency (a minimal routing sketch follows this list).
  • Finally, Bolmo presents a method for "byteifying" existing subword-level language models that improves character-level understanding while achieving comparable performance to subword-level models.
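
SonicMoE's contribution lies in hardware-aware kernels, which we cannot reproduce in a few lines, but a naive top-k router makes clear what "fine-grained and sparse" means and where the fused-kernel opportunity sits. This is a generic MoE sketch of our own, not SonicMoE's implementation.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate_w, experts, k=2):
    """Minimal top-k MoE layer (naive loops, no fused kernels).
    Fine-grained MoEs use many small experts and activate only k per
    token; that sparsity is exactly what SonicMoE optimises for."""
    logits = x @ gate_w                       # (tokens, n_experts)
    weights, idx = logits.topk(k, dim=-1)     # route each token to k experts
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for j in range(k):                        # naive scatter; real systems
        for e, expert in enumerate(experts):  # fuse this into one kernel
            mask = idx[:, j] == e
            if mask.any():
                out[mask] += weights[mask, j:j+1] * expert(x[mask])
    return out

d, n_exp = 32, 8
experts = [torch.nn.Linear(d, d) for _ in range(n_exp)]
x = torch.randn(16, d)
print(moe_forward(x, torch.randn(d, n_exp), experts).shape)
```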

November Papers: Perspectives on efficiency

November brings us back to a favourite topic of ours: efficiency. We reviewed three papers looking at LLM efficiency from different angles:

  • First up, How to Scale Second-Order Optimization looks at the optimal tuning of second-order optimizers such as Muon (a sketch of the core Muon update follows this list).
  • Intelligence per Watt discusses our favourite metric for large language models, energy efficiency, and how to take advantage of edge AI inference.
  • Finally, Int vs FP contributes to a long-standing topic in quantization: integer versus (block) floating-point formats (a block-quantization sketch also follows this list).
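
For readers new to Muon, here is a minimal sketch of its orthogonalised momentum update. The quintic Newton-Schulz coefficients follow the publicly available Muon implementation as we understand it; everything else is simplified, and the paper itself is about how to tune and scale such optimizers, which the sketch does not cover.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to the nearest (semi-)orthogonal matrix, as
    done in Muon. Coefficients follow the public Muon implementation;
    treat the details here as a sketch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # normalise before iterating
    if G.shape[0] > G.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic iteration
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight matrix: accumulate
    momentum, then orthogonalise the update direction."""
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)

W = torch.randn(128, 64)
m = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), m)
```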
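
And to ground the int-vs-FP debate, a few lines of NumPy show the integer side: block-wise INT8 quantization with one floating-point scale per block. Block floating-point formats instead share an exponent per block; the block size and clipping below are illustrative choices of ours.

```python
import numpy as np

def block_quantize_int8(w: np.ndarray, block: int = 32):
    """Block-wise INT8 quantization: integers plus one FP scale per
    block of weights. Block size and symmetric clipping to [-127, 127]
    are illustrative choices, not taken from the paper."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def block_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = block_quantize_int8(w)
print("mean abs error:", np.abs(block_dequantize(q, s) - w).mean())
```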

October Papers: Fast and Smart Language Models

October was packed with insights into making language models faster and smarter. We reviewed four of our favorite papers for you in detail:

  • First up, Grouped Lattice Vector Quantisation introduces a novel technique for fine-grained post-training quantisation of LLMs, retaining good performance even at low bit widths (the lattice idea is sketched after this list).
  • Planned Diffusion combines autoregressive planning with text diffusion, achieving low-latency text generation.
  • Rethinking Thinking addresses the problem of long reasoning chains by distilling intermediate results into a bounded workspace for faster answers.
  • Finally, When Structure Doesn’t Help compares techniques for encoding graphs for consumption by LLMs with surprising results.
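
To unpack the lattice idea from the first paper: instead of rounding each weight independently, small groups of weights are quantised jointly to points of a lattice. The sketch below uses simple Babai rounding on a toy lattice; the paper's actual lattices and search procedure are more sophisticated, so read this only as the core idea.

```python
import numpy as np

def lattice_quantize(x: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Quantise vector x to a nearby point of the lattice {B z : z integer}
    using simple Babai rounding. The paper's lattice construction and
    nearest-point search differ; this shows only the core operation."""
    z = np.round(np.linalg.solve(B, x))   # integer lattice coordinates
    return B @ z                          # back to the lattice point

def grouped_lattice_quantize(w: np.ndarray, B: np.ndarray, group: int = 4):
    """Split a weight vector into small groups and quantise each group
    jointly, instead of rounding scalars independently."""
    w = w.reshape(-1, group)
    return np.stack([lattice_quantize(g, B) for g in w]).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
B = 0.5 * np.eye(4)                       # toy lattice: a scaled integer grid
print(np.abs(grouped_lattice_quantize(w, B) - w).mean())
```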

September Papers: The L in ML Stands for LLMs

For September, the research team reviewed a whopping 22 papers! Needless to say, competition was fierce, and only four made the final cut for this month’s edition, which is LLM-themed:

  • FlowRL uses GFlowNets to train LLMs on full reward distributions, promoting diverse reasoning paths instead of just reward maximization.
  • Soft Tokens, Hard Truths proposes using continuous “soft” tokens with injected noise to enable reinforcement learning fine-tuning of LLM reasoning.
  • Set Block Decoding accelerates LLM inference by generating multiple tokens in parallel using non-causal attention and iterative entropy-based sampling (a toy decoding loop follows this list).
  • Metacognitive Reuse enables LLMs to extract and reuse concise reasoning “behaviors” to improve efficiency and reduce repeated computation.
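
To illustrate the flavour of Set Block Decoding, here is a toy iterative parallel-decoding loop: predict a whole block at once, commit only the low-entropy positions, and re-predict the rest. The entropy threshold, masking scheme, and dummy model are our own assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def entropy_parallel_decode(model, prefix, block_len=8, max_entropy=1.5,
                            mask_id=0):
    """Toy parallel decoding in the spirit of Set Block Decoding:
    predict a block at once, commit low-entropy positions, repeat.
    `model` maps token ids -> per-position logits; all details here
    are illustrative."""
    block = torch.full((block_len,), mask_id)
    undecided = torch.ones(block_len, dtype=torch.bool)
    while undecided.any():
        logits = model(torch.cat([prefix, block]))[-block_len:]  # (L, V)
        probs = F.softmax(logits, dim=-1)
        ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)     # (L,)
        confident = undecided & (ent < max_entropy)
        if not confident.any():              # always commit at least one
            confident = undecided & (ent == ent[undecided].min())
        block[confident] = probs[confident].argmax(-1)
        undecided &= ~confident
    return block

# Dummy "model" with random logits, just to exercise the loop.
dummy = lambda ids: torch.randn(ids.shape[0], 32000)
print(entropy_parallel_decode(dummy, prefix=torch.tensor([1, 2, 3])))
```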

August Papers: Optimal Dataset Mixtures, Stable Molecule Generation, and Agentic Hypergraph RAG

August, even with its heat waves and holidays, left no shortage of exciting research. Our top papers for this month are the following:

  • ADMIRE-BayesOpt investigates how to weight different data sources when they are mixed into a single training dataset, and shows that the search for the optimal mixture can be automated using multi-fidelity Bayesian optimization (a toy search loop follows this list).
  • Stable Molecule Generation uses a force-field-based reward function to fine-tune pre-trained 3D molecule generation diffusion models, with the goal of sampling physically stable and valid molecules.
  • Graph-R1 takes an agentic RAG approach with a knowledge hypergraph to effectively represent and retrieve information from a corpus of documents.
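
To give a feel for the multi-fidelity idea behind ADMIRE-BayesOpt, the toy loop below searches over mixture weights with successive halving: cheap, noisy evaluations prune most candidates before expensive ones are spent on the survivors. We substitute a synthetic objective for real training runs, and successive halving for the paper's Bayesian optimization, so this is only a sketch of the fidelity trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate_mixture(weights: np.ndarray, budget: int) -> float:
    """Stand-in for training on a dataset mixed with `weights` for
    `budget` steps and returning a validation score. A noisy synthetic
    objective replaces the real, expensive training run."""
    target = np.array([0.5, 0.3, 0.2])            # pretend optimum
    noise = rng.normal(0, 1.0 / np.sqrt(budget))  # more budget, less noise
    return -np.abs(weights - target).sum() + noise

candidates = rng.dirichlet(np.ones(3), size=32)   # mixtures of 3 sources
for budget in (10, 100, 1000):                    # increasing fidelity
    scores = [evaluate_mixture(w, budget) for w in candidates]
    keep = np.argsort(scores)[len(candidates) // 2:]  # keep the best half
    candidates = candidates[keep]
print("best mixture found:", candidates[-1].round(3))
```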

July Papers: Subliminal Learning, Mixture of Recursions and Dataset Curation

As July brought tennis at Wimbledon, so too did the ML world serve up a volley of research. This month, we took an eagle-eyed (or, perhaps, Hawk-Eye) approach to three papers.

In our first paper, Subliminal Learning addresses the question, "Can we control or filter the distillation training data so that a student learns desirable properties but avoids picking up undesirable traits?" The authors conclude that the student learns all the teacher's traits, whether they're desirable or not!

Next, Mixture of Recursions brings a twist to token-level computation: instead of fixed-depth processing, the model learns to recurse adaptively, allocating compute per token dynamically and efficiently—like a rally whose length depends on the importance of the point.
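
A minimal sketch of the recursion mechanic, under our own simplifying assumptions (a single shared block reapplied up to a maximum depth, with a linear router deciding per token whether to recurse again):

```python
import torch

class MixtureOfRecursions(torch.nn.Module):
    """Toy per-token adaptive recursion: one shared block is reapplied,
    and a tiny router decides after each pass which tokens recurse
    again. Routing and training details in the paper differ."""

    def __init__(self, d: int, max_depth: int = 4):
        super().__init__()
        self.block = torch.nn.TransformerEncoderLayer(
            d_model=d, nhead=4, batch_first=True)
        self.router = torch.nn.Linear(d, 1)   # "recurse again?" score
        self.max_depth = max_depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        active = torch.ones(x.shape[:2], dtype=torch.bool)  # (B, T)
        for _ in range(self.max_depth):
            if not active.any():
                break
            y = self.block(x)
            # Update only still-active tokens (a real implementation
            # would gather them to save compute rather than mask).
            x = torch.where(active.unsqueeze(-1), y, x)
            active = active & (self.router(x).squeeze(-1) > 0)
        return x

x = torch.randn(2, 10, 64)
print(MixtureOfRecursions(64)(x).shape)
```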

Last up is DataRater, which addresses the problem of dataset quality: a 'rater' is meta-learned to curate training data without manual filtering, an ace for data-centric AI.

June Papers: Gradient Norms, LLM Reasoning and Video Generation

This June not only brought us very hot and sunny days (at least here in the UK), but also an excellent selection of new and exciting ML research! Out of the many good candidates, this month we selected three papers, covering quite a lot of different ground.

In the first paper, Why Gradients Rapidly Increase Near the End of Training, a researcher from FAIR explores the puzzling phenomenon of increasing gradient magnitudes during training, offering an elegant mathematical explanation and a simple remedy.

Next, in ProRL, NVIDIA researchers dive into the evolving topic of large language model reasoning, showing how prolonged reinforcement learning can indeed introduce novel reasoning abilities.

Finally, we look at AAPT, a fresh approach from the ByteDance Seed team that turns pre-trained offline diffusion models into real-time video generators via adversarial post-training.

May Papers: Parallel scaling, Evolving code, Understanding LLM reasoning

Hurtling past the NeurIPS submission deadline into the summer months, we switch from huddling around server rooms to keep warm to babysitting experiments whilst basking in the sun. We've had a bumper month of papers to sift through and once again we offer summaries of a few of our favourites.

First, Parallel Scaling Laws for Language Models proposes a novel method of scaling language-model compute: inspired by classifier-free guidance, it fine-tunes a model to run multiple forward passes with different learned vector prefixes (sketched below). We also looked into AlphaEvolve, an evolutionary algorithm from Google DeepMind that generates and refines prompts for Gemini and can advance the state-of-the-art in algorithm design.
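
A rough sketch of the parallel-scaling forward pass, assuming P learned prefixes and a learned softmax mix over the P output streams; the paper batches the passes together for efficiency and its exact aggregation may differ, so treat all names and shapes here as ours.

```python
import torch
import torch.nn.functional as F

def parallel_scaled_forward(model, embed, ids, prefixes, agg_logits):
    """Sketch of parallel scaling: run the same model P times, each
    pass prepended with a different learned prefix, then aggregate the
    P outputs with a learned softmax mix. Illustrative only."""
    e = embed(ids)                          # (T, d) token embeddings
    outs = []
    for p in prefixes:                      # P forward passes
        h = model(torch.cat([p, e]))        # (len(p)+T, d)
        outs.append(h[-1])                  # next-token features
    outs = torch.stack(outs)                # (P, d)
    w = F.softmax(agg_logits, dim=0)        # learned mix over passes
    return (w.unsqueeze(-1) * outs).sum(0)  # aggregated features

d, P = 64, 4
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d, nhead=4), num_layers=2)
embed = torch.nn.Embedding(1000, d)
prefixes = [torch.randn(2, d, requires_grad=True) for _ in range(P)]
agg = torch.zeros(P, requires_grad=True)
ids = torch.randint(0, 1000, (10,))
print(parallel_scaled_forward(model, embed, ids, prefixes, agg).shape)
```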

Since it has been a particularly exciting month for contributions on LLM reasoning, we picked two papers to dive deeper into. In Soft Thinking the authors attempt to improve on prior work by sampling continuous token embeddings rather than discrete tokens during the reasoning phases of text generation (the core mixing operation is sketched below). Finally, in Spurious Rewards the authors find that even rewarding random answers can improve reasoning ability, potentially forcing us to reconsider how we understand post-training techniques to improve the use of test-time compute.
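
The core "soft token" operation is compact enough to show directly: rather than sampling a discrete token and embedding it, feed back the probability-weighted mixture of all token embeddings. A minimal sketch, with the temperature and shapes as our own choices; details such as noise injection and when to switch back to hard tokens are left to the papers.

```python
import torch
import torch.nn.functional as F

def soft_thinking_step(logits, embedding, temperature=1.0):
    """One 'soft token' step: feed the probability-weighted mixture of
    all token embeddings back into the model, instead of sampling one
    discrete token and looking up its embedding."""
    probs = F.softmax(logits / temperature, dim=-1)  # (V,)
    return probs @ embedding.weight                  # (d,) soft embedding

embedding = torch.nn.Embedding(32000, 64)
logits = torch.randn(32000)
soft = soft_thinking_step(logits, embedding)
hard = embedding(logits.argmax())                    # usual discrete path
print(soft.shape, torch.allclose(soft, hard))        # same shape, different vector
```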

April Papers: Motion Prompting, Mamba Reasoning and Modeling Rewards

April has been a busy month for the AI research community, with ICLR (the first of the "big three" AI conferences of the year) taking place in Singapore. We're pleased to share summaries of a few of our favourite papers we've seen this month.

First up, Motion Prompting introduces flexible spatio-temporal trajectories, or "motion prompts", as a powerful new way to control nuanced dynamic actions and motion in video generation, overcoming the limitations of text prompts. This is followed by Inference-Time Scaling for Generalist Reward Modeling, which presents Self-Principled Critique Tuning (SPCT), a method that powers DeepSeek-GRM, a generalist reward model capable of generating adaptive, high-quality rewards and achieving strong performance gains through scalable inference-time compute. Finally, M1 looks at using a Mamba-based architecture to tackle reasoning problems, as a more computationally efficient approach compared to transformers with chains-of-thought.