Skip to content
Douglas Orr

Douglas Orr

Research Team Lead

Posts

1-bit Wonderful Weights for LLMs

Would you rather use 1 million \(\times\) 16-bit weights, 4 million \(\times\) 4-bit weights, or even 16 million \(\times\) 1-bit weights?

In joint work between Aleph Alpha Research and Graphcore, we asked this question of LLMs — the answer encouraged us to embrace the wonder ✨ of 1-bit weights, which can outperform 4-bit and 16-bit weights on a fixed weight memory budget.

1-bit weights rule!

February Papers: Thinking Depth, Latent Actions, Quantization and Riemannian Flows

The stream of papers never ends, even so, in February our team found 4 we'd like to share

  • Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens investiagtes how many layers are actually needed for each token during autoregressive LM rollouts.

  • Factored Latent Action World Models takes videos that contain multiple objects, and instead of encoding them into one latent state for the whole scene, tries to one latent state per object.

  • LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs generalises Hadamard transforms to better handle outliers when block-quantizing LLMs.

  • Riemannian Mean Flow extends MeanFlow for generating proteins within the coresponding structured spaces, e.g. the space of all residue positions and orientations.

October Papers: Fast and Smart Language Models

October was packed with insights into making language models faster and smarter. We reviewed four of our favorite papers for you in detail:

  • First up, Grouped Lattice Vector Quantisation introduces a novel technique for a fine-grained post-training quantisation of LLMs, retaining good performance even at low bit widths.
  • Planned Diffusion combines autoregressive planning with text diffusion, achieving low-latency text generation.
  • Rethinking Thinking addresses the problem of long reasoning chains by distilling intermediate results into a bounded workspace for faster answers.
  • Finally, When Structure Doesn’t Help compares techniques for encoding graphs for consumption by LLMs with surprising results.

September Papers: The L in ML Stands for LLMs

For September, the research team reviewed a whopping 22 papers! Needless to say, competition was fierce, and only four made the final cut for this month’s edition, which is LLM-themed:

  • FlowRL uses GFlowNets to train LLMs on full reward distributions, promoting diverse reasoning paths instead of just reward maximization.
  • Soft Tokens, Hard Truths proposes using continuous “soft” tokens with injected noise to enable reinforcement learning fine-tuning of LLM reasoning.
  • Set Block Decoding accelerates LLM inference by generating multiple tokens in parallel using non-causal attention and iterative entropy-based sampling.
  • Metacognitive Reuse enables LLMs to extract and reuse concise reasoning “behaviors” to improve efficiency and reduce repeated computation.

Optimal Formats and the Cube Root of the PDF

Your boss emails you a point in 128-billion-dimensional space. "Llama 3.1 8B," the message reads. "A not-so-large language model in bfloat16. But it's too big. Trim the fat (ASAP)." You open up your toolbox: quantisation, sparsity, distillation.

Quantisation comes first, with two problems. First, you must choose a space smaller than a 128-billion-dimensional binary number for the model to sit in. Second, you need to find a good point in that space. In our recent work on optimal formats for weight quantisation, we've had a crack at the first question.

In this post, we'll learn how to construct optimal formats for known scalar distributions via the "cube root rule". We'll start with a recap of an existing format that claims optimality for the normal distribution. Then we'll explore the cube root rule — a non-intuitive result from the 1950s — and use it to build our own quantisation formats for scaled normal, Laplace and Student's t distributions.

March Papers: De-Norming, Skill-Scaling, Over-Training and Drug-Generating

We've enjoyed March, bringing improving weather and many excellent ML papers to keep us busy. As usual, we're here to share summaries of four of our favourites.

First, Meta share their work that successfully removes the need for LayerNorm in transformers, replacing them with a reduction-free \(\tanh\) (de-norming). This is followed by two papers on scaling - studying the different scaling laws for skill-based vs knowledge-based downstream tasks (skill-scaling), and whether pretraining can go on too long, making downstream performance worse (over-training). Finally, EPFL share a flow-matching GNN model for generating small molecules for drug design (drug-generating).

Llama 3.2 Vision — A Deep Dive

Vision-Language Models (VLMs) allow LLMs to "see", but how do they work? In this post, we'll walk through the model changes needed to turn an LLM into a VLM for inference. To understand the LLM starting point, please see A transformer walk-through with Gemma, as we shall assume that content here.

Problem — Text generation, conditioned on an image: take an RGB image (below) and a short string prompt "What colour shirt is the person to the left of the laptop wearing?", then use an already-trained VLM (Llama-3.2-11B-Vision-Instruct by Meta) to generate an answer to the prompt.

Image of four people looking at a laptop

December Papers: Spend Your FLOPs Wisely

Welcome to Papers of the Month — Graphcore Research's effort to bring you our pick of the most interesting ML papers. In December we noted a collection of papers which took innovative approaches to allocating compute (FLOPs) to input data.

We start with the Byte Latent Transformer. This modifies the standard transformer to operate on patches, which comprise a variable number of input bytes, as determined by an entropy metric. The consequence of this is that compute is dynamically allocated towards "harder input data". This has some similarities with the Concept Model architecture, which also uses a flexible intermediate representation. The model performs autoregressive sentence generation in this modality-agnostic space, rather than token space.

The Memory Layers architecture allows extra parameters to be added to a model without increasing FLOPs. Decoupling these resources gives model designers more control (e.g. for co-design, to fit their hardware resources) and potentially facilitates more effective models in general.

Finally, the Phi-4 paper presents a rather different FLOPs angle: spending compute in the data-generation process to create higher quality data, leading to "student" models that (in some domains) out-perform their "teachers".

We hope you enjoy these month's papers as much as we did! If you have thoughts or questions, please reach out to us at @GCResearchTeam.

September Papers: Proper Conditioning

We're pleased to share four papers from different domains: LLM self-correction, FP8 training, generative crystals and optimisation. They are united, somewhat tenuously, by the importance of proper conditioning:

  1. DeepMind researchers explain how conditioning on the wrong distribution during supervised fine-tuning for self-correction is harmful but can be overcome using RL.
  2. A novel Smooth-SwiGLU activation "conditions" the numerics by inserting a scaling factor in just the right place, preventing late-training instability in FP8.
  3. The GenMS architecture that generates crystal structures for materials conditions on high-level textual and low-level structural information for high-quality generation.
  4. SOAP is an evolution of Shampoo, with conditioners in the name and preconditioners forming the eigenbasis for optimisation.

You can be the judge of how tenuous the connection is, but we'd encourage you to check out the summaries first or despite this.

I hope you enjoy these as much as we did. Tell us we're wrong; tell us we're right @GCResearchTeam.

Scale-preserving nonlinearities for u-μP

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does just this, and in a reasonably systematic way, since we need to work out how to compensate for changes in scale (standard deviation) through deep learning ops. In this post and the accompanying notebook, we explore this problem.