Luke Hudlass-Galley


Research Team Lead


September Papers: The L in ML Stands for LLMs

For September, the research team reviewed a whopping 22 papers! Needless to say, competition was fierce, and only four made the final cut for this month’s edition, which is LLM-themed:

  • FlowRL uses GFlowNets to train LLMs on full reward distributions, promoting diverse reasoning paths instead of just reward maximization.
  • Soft Tokens, Hard Truths proposes using continuous “soft” tokens with injected noise to enable reinforcement learning fine-tuning of LLM reasoning.
  • Set Block Decoding accelerates LLM inference by generating multiple tokens in parallel using non-causal attention and iterative entropy-based sampling.
  • Metacognitive Reuse enables LLMs to extract and reuse concise reasoning “behaviors” to improve efficiency and reduce repeated computation.

May Papers: Parallel scaling, Evolving code, Understanding LLM reasoning

Hurtling past the NeurIPS submission deadline into the summer months, we switch from huddling around server rooms to keep warm to babysitting experiments whilst basking in the sun. We've had a bumper month of papers to sift through and once again we offer summaries of a few of our favourites.

First, Parallel Scaling Laws for Language Models proposes a novel method of scaling language-model compute, inspired by classifier-free guidance, that finetunes a model to run multiple forward passes with different learned vector prefixes. We also looked into AlphaEvolve, an evolutionary algorithm from Google DeepMind that generates and refines prompts for Gemini, advancing the state of the art in algorithm design.
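To make the parallel-scaling idea concrete, here is a minimal NumPy sketch of the core mechanism, with a toy stand-in for the transformer: several learned prefixes drive parallel forward passes whose outputs are combined with learned aggregation weights (in the paper these are trained while the base model is largely frozen; here they are random placeholders).

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(prefix, x):
    """Stand-in for a transformer forward pass: the learned prefix
    perturbs the input before a fixed projection."""
    W = np.eye(4)  # frozen "model" weights (toy)
    return (x + prefix) @ W

# P learned prefix vectors and learned aggregation weights
P = 4
prefixes = rng.normal(size=(P, 4))
agg_logits = rng.normal(size=P)
agg_weights = np.exp(agg_logits) / np.exp(agg_logits).sum()

x = rng.normal(size=(4,))
# P parallel forward passes, one per prefix
outputs = np.stack([toy_model(p, x) for p in prefixes])
# weighted aggregation of the parallel streams into one prediction
y = (agg_weights[:, None] * outputs).sum(axis=0)
```

The appeal is that the P streams can run in parallel, so this trades extra inference FLOPs, rather than extra parameters, for quality.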

Since it has been a particularly exciting month for contributions on LLM reasoning, we picked two papers to dive deeper into. In Soft Thinking, the authors attempt to improve on prior work by sampling continuous token embeddings rather than discrete tokens during the reasoning phases of text generation. Finally, in Spurious Rewards, the authors find that even rewarding random answers can improve reasoning ability, potentially forcing us to reconsider how we understand post-training techniques for improving the use of test-time compute.

February Papers: Learning to Scale

Welcome to Papers of the Month! This time around, our monthly selection of ML papers revolves around the central theme of scale – and learning how to scale efficiently. Scaling laws for LLMs, multi-scale quantisation training and scaling test-time compute: it's a rich buffet!

The first paper, Distillation Scaling Laws, presents a thorough study of distillation for language models, with the aim of estimating how student performance scales as a function of model size and the amount of distillation data used. It offers very useful insights in an era where distillation pre-training of LLMs is becoming ever more widespread as a way to improve "capability per watt".

The problem of computational efficiency and cost reduction is also at the heart of Matryoshka Quantisation, DeepMind's solution for training a quantised model that can then be easily served at different lower numerical precisions, by leveraging the nested structure of integer data types. And if you are a quantisation geek like we are, make sure to also read our summary of ParetoQ, a new unified framework to investigate the scaling laws that govern the trade-off between quantised model size and accuracy in extremely low-bit regimes.
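To illustrate the nested-integer idea behind Matryoshka Quantisation, here is a toy sketch (not the paper's training procedure, which co-optimises all precisions jointly): an 8-bit quantised weight can be served at a lower precision simply by keeping its most-significant bits.

```python
def slice_to_lower_precision(q8, target_bits):
    """Keep the `target_bits` most-significant bits of an unsigned
    8-bit quantised weight, exploiting the nested structure of ints."""
    assert 1 <= target_bits <= 8
    return q8 >> (8 - target_bits)

# An 8-bit quantised weight ...
q8 = 0b10110110  # 182
# ... can be served at 4-bit or 2-bit precision by keeping the top bits
q4 = slice_to_lower_precision(q8, 4)  # 0b1011 = 11
q2 = slice_to_lower_precision(q8, 2)  # 0b10 = 2
```

Because every lower-precision representation is literally a prefix of the higher-precision one, a single stored model can serve all precisions at once.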

Finally, we jump from training scaling laws to scaling up test-time compute, with a paper that introduces a recurrent block into LLMs at test time, allowing the model to improve its performance by performing iterative reasoning in latent space, without verbalising its intermediate thoughts.
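As a loose sketch of that idea (toy weights, not the paper's architecture): extra test-time compute corresponds to unrolling a recurrent block more times over a latent state before any token is emitted.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.normal(size=(d, d)) * 0.1  # recurrent block weights (toy)

def recurrent_block(state, x):
    """One latent 'thought' step: mix the current state with the input."""
    return np.tanh(state @ W + x)

def iterate_in_latent_space(x, n_steps):
    """Spend more test-time compute by unrolling the block more times,
    without emitting any intermediate tokens."""
    state = np.zeros(d)
    for _ in range(n_steps):
        state = recurrent_block(state, x)
    return state

x = rng.normal(size=d)
shallow = iterate_in_latent_space(x, 1)   # little test-time compute
deep = iterate_in_latent_space(x, 16)     # more latent reasoning steps
```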

We hope you enjoy this month's papers as much as we did! If you have thoughts or questions, please reach out to us at @GCResearchTeam.

July Papers: All About Scaling

Scaling continues to be a super hot topic of research and our selection of papers for this month all tackle different angles of how to scale models efficiently.

The first paper we cover builds upon muP to provide a guide to transferring hyperparameters optimised on small models to the large models we care about, especially as transformer width increases.
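As a rough illustration of this kind of hyperparameter transfer (the 1/width rule below is the commonly cited muP scaling for hidden-weight learning rates under Adam, not necessarily the paper's exact prescription; the numbers are placeholders):

```python
def transfer_lr(base_lr, base_width, target_width):
    """Scale a hidden-layer learning rate tuned at `base_width`
    so it remains (approximately) optimal at `target_width`,
    following the muP-style 1/width heuristic."""
    return base_lr * base_width / target_width

# Tuned cheaply on a small proxy model ...
base_lr, base_width = 3e-3, 256
# ... then transferred to the large model we actually care about
large_lr = transfer_lr(base_lr, base_width, 4096)  # 3e-3 / 16
```

The point is that the expensive sweep happens once, at small scale, and the result transfers as width grows.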

Our second chosen paper looks at scaling mixture-of-experts transformers along the expert dimension. The authors design an efficient routing strategy that lets them push the number of experts to the extreme for a more compute-optimal configuration.

The third paper we discuss addresses the lack of scaling laws for vocabulary parameters in LLMs. The authors first validate that there is an optimal vocabulary size for a given compute budget, then empirically fit power laws showing that vocabulary parameters should be scaled differently from the model's other parameters.

Finally, our fourth paper asks whether long context lengths or retrieval-augmented generation is better for scaling in-context learning, and whether a combination of the two could lead to more efficient inference.

I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.

April Papers: TriForce, QuaRot & Mixture-of-Depths

For our April selection of AI research papers, there is a clear common thread: efficient LLM inference. But as it happens, ML researchers are showing there are many creative ways to make our LLMs run faster.

The first paper, TriForce, looks at efficient LLM inference from the angle of combining speculative decoding with sparse KV-cache techniques (such as our recent Graphcore SparQ method), showing that a combined hierarchical approach speeds up inference compared to standard LLM speculative sampling.
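For readers new to speculative decoding, the standard accept/reject step it builds on can be sketched in a few lines (toy target and draft distributions; TriForce's hierarchical scheme layers more machinery on top of this primitive):

```python
import numpy as np

rng = np.random.default_rng(3)

def speculative_step(p, q):
    """One accept/reject step of standard speculative sampling:
    the cheap draft model proposes from q; accepting with probability
    min(1, p/q) and resampling rejections from the residual makes the
    overall sample exactly distributed according to the target p."""
    x = rng.choice(len(q), p=q)                # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                               # accepted
    residual = np.maximum(p - q, 0.0)          # rejected: resample
    return rng.choice(len(p), p=residual / residual.sum())

p = np.array([0.6, 0.3, 0.1])  # target model distribution (toy)
q = np.array([0.4, 0.4, 0.2])  # cheap draft distribution (toy)
samples = [speculative_step(p, q) for _ in range(5000)]
```

Empirically, the sample frequencies match `p`, even though most of the sampling work was done by the cheap draft distribution `q`.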

The second highlighted work, QuaRot, takes a more classic quantisation route, one much loved by the Graphcore research team. It elegantly demonstrates how Hadamard transforms can solve the outlier problem in the distribution of LLM activations, opening the door to full 4-bit LLM quantisation (i.e. weights, activations and KV cache) with minimal accuracy loss.
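The core trick can be demonstrated in a few lines of NumPy: rotating an activation vector by an orthonormal Hadamard matrix spreads an outlier's energy across all channels, and the rotation is exactly invertible, so it can be folded into adjacent weights. This is a simplified sketch of the idea, not QuaRot's full pipeline.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T = I

# Activation vector with one large outlier channel -- the enemy of 4-bit quantisation
x = np.array([0.1, -0.2, 0.15, 8.0, 0.05, -0.1, 0.2, -0.05])
H = hadamard(8)
x_rot = H @ x      # rotate: the outlier's energy spreads across all channels
x_back = H.T @ x_rot  # exactly invertible, so no information is lost
```

After rotation the largest magnitude drops from 8.0 to roughly 2.9, which makes uniform low-bit quantisation far less lossy.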

Finally, the last paper, Mixture-of-Depths, shows how LLMs can learn to dynamically and independently allocate FLOPs to tokens, achieving better accuracy for the same compute budget. This work adapts the routing idea from Mixture-of-Experts (MoE) transformers, allowing the model to decide, for each layer, which tokens take the standard route (paying the layer's FLOP cost) and which take a zero-FLOP skip connection.
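A minimal sketch of such a layer (toy, untrained router and block, not the paper's exact formulation): a per-layer router scores the tokens, the top-k within a fixed capacity take the expensive path, and the rest pass through unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_tokens, capacity = 8, 6, 3  # only 3 of 6 tokens get this layer's FLOPs

router_w = rng.normal(size=d)            # per-layer router weights (toy)
W_block = rng.normal(size=(d, d)) * 0.1  # stands in for attention + MLP

def mixture_of_depths_layer(X):
    """Route the top-`capacity` tokens through the full block;
    all other tokens take a zero-FLOP skip connection."""
    scores = X @ router_w
    chosen = np.argsort(scores)[-capacity:]   # tokens worth spending FLOPs on
    out = X.copy()                            # everyone else: identity
    out[chosen] = X[chosen] + np.tanh(X[chosen] @ W_block)
    return out

X = rng.normal(size=(n_tokens, d))
Y = mixture_of_depths_layer(X)
```

Because the capacity is fixed per layer, the compute cost is static and predictable even though the routing decisions are dynamic per token.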