Blog

June 28, 2024
in Papers of the Month
12 min read

June Papers: Mamba-2 & Matmul-free Models

Improving transformers is no longer "just one area" of machine learning research. This is illustrated by the breadth of papers we got excited about this month, all of which claim to improve upon some aspect of the transformer, but in very different ways.

First, the Mamba-2 work explores the connection between structured state space models and attention, resulting in a new architecture of the same name. (The paper isn't short, so you get value-for-money with this summary!)

SµPar builds upon the maximal update parameterisation to transfer hyperparameters across different sparsity levels, promising predictable training of sparse models.

CoPE identifies deficiencies in current relative positional encodings, which are critical for turning transformers from set models into sequence models, and introduces a new & richer form of encoding.

Finally, "matmul-free LMs" follow the trajectory of BitNet and BitNet b1.58, removing all matrix multiplies from a transformer LM forward pass (in doing so, they make it an RNN), promising compression & compute efficiency.
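To see why ternary weights make multiplies unnecessary, here is a minimal sketch of a matrix-vector product whose weights are restricted to -1, 0 or 1. It illustrates the general idea only and is not the paper's implementation; the helper name `ternary_matvec` is our own.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for ternary W (entries in {-1, 0, 1}) using only adds and subtracts."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the input
            elif w == -1:
                acc -= xi   # -1 weight: subtract the input
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1], [-1, 1, 1]]   # toy ternary weight matrix
x = [0.5, 2.0, -1.0]
y = ternary_matvec(W, x)
print(y)
```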

I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.

May 31, 2024
in Papers of the Month
9 min read

May Papers: xLSTM, Schedule-Free Optimizers, and Multi-token prediction

May is always an eventful time of year for ML researchers, with final ICML paper decisions and ICLR taking place in early May, and NeurIPS submission deadlines closing the month. As ever, arXiv submissions continue to grow!

This month we take a look at three papers exploring new techniques to challenge the mainstream large-scale pretraining setup: transformers trained with next-token prediction and optimized with Adam/AdamW.

The first paper, xLSTM, is a long-awaited deep dive into Sepp Hochreiter's new, improved RNN architecture, nearly 30 years after the original LSTM was published. Drawing inspiration from linear attention, the authors demonstrate scaling comparable to transformers up to 1.3B parameters.

We then take a look at Schedule-Free optimizers from a team at FAIR. The authors propose a new class of optimizers that require no finicky learning rate scheduling. By replacing gradient momentum terms in standard optimizers with parameter averages, the authors show faster convergence than scheduled optimizers on a wide battery of small-scale deep learning tasks.
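The "parameter averages instead of momentum" idea can be sketched for plain SGD on a one-dimensional quadratic. This is a simplified illustration under our own assumptions (uniform averaging, no warmup, a hypothetical function name and toy hyperparameters), not the authors' released optimizer.

```python
def schedule_free_sgd(grad, w0, lr=0.25, beta=0.9, steps=400):
    """Simplified schedule-free SGD sketch with a constant learning rate."""
    z = w0  # base SGD iterate
    x = w0  # running average of the z iterates (the weights you would evaluate)
    for t in range(steps):
        # The gradient is taken at an interpolation between z and x,
        # which plays the role that momentum plays in standard SGD.
        y = (1.0 - beta) * z + beta * x
        z = z - lr * grad(y)
        # Uniform running average: x becomes the mean of all z iterates so far.
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z
    return x


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = schedule_free_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w)
```

Note that no learning rate schedule appears anywhere: the averaging weight `c` depends only on the step count.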

A further paper from FAIR extends the standard pretraining setup for large language models from next-token to multi-token prediction. This particularly seems to improve performance for larger models and offers a natural choice of model to use for speculative sampling to accelerate inference.

April 30, 2024
in Papers of the Month
10 min read

April Papers: TriForce, QuaRot & Mixture-of-Depths

For our April selection of AI research papers, there is a clear common thread: efficient LLM inference. But as it happens, ML researchers are showing there are many creative ways to make our LLMs run faster.

The first paper, TriForce, approaches efficient LLM inference by combining speculative decoding with sparse KV techniques (such as our recent Graphcore SparQ method), showing that a combined hierarchical approach speeds up inference compared to standard LLM speculative sampling.

The second highlighted work, QuaRot, takes a more classic route, and one much loved by the Graphcore Research team: quantisation. It elegantly demonstrates how to use Hadamard transforms to solve the outlier problem in the distribution of LLM activations, opening the door to full (i.e. weights, activations and KV cache) LLM 4-bit quantisation with minimal accuracy loss.
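A small sketch of why a Hadamard rotation helps with outliers: rotating a vector by a normalised Hadamard matrix preserves its norm but spreads one large entry across all dimensions, and rotating the weights the same way leaves the dot product unchanged. The helper names and toy values below are our own assumptions, and this illustrates the general principle rather than QuaRot's actual pipeline.

```python
import math


def hadamard(n):
    """Normalised n x n Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


# Toy activation vector with a single large outlier.
x = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
w = [0.3, -0.1, 0.2, 0.05, -0.4, 0.1, 0.25, -0.2]  # toy weight row

H = hadamard(8)
x_rot = matvec(H, x)
w_rot = matvec(H, w)

print(max(abs(v) for v in x), max(abs(v) for v in x_rot))  # the peak value shrinks
print(dot(w, x), dot(w_rot, x_rot))                        # the output is unchanged
```

With the peak value reduced, a low-bit quantisation grid wastes less of its range on a single entry.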

The last paper, Mixture-of-Depths, shows how LLMs can learn to dynamically and independently allocate FLOPs to tokens, achieving better accuracy for the same compute budget. This research work leverages the routing idea from Mixture-of-Experts (MoE) transformers by allowing the model to decide for each layer which tokens should take a standard route (with the FLOPs cost associated with the layer) or a zero FLOPs skip connection.
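The per-layer routing decision described above can be sketched as follows. We assume top-k selection on a scalar router score per token and scale the block output by that score; the function name, the `capacity` parameter and the toy block are illustrative assumptions rather than the paper's code.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity):
    """Route only the `capacity` highest-scoring tokens through `block`; the rest skip it."""
    order = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(order[:capacity])
    out = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Standard route: residual plus the block output, weighted by the router score.
            update = block(x)
            out.append([xi + router_scores[i] * ui for xi, ui in zip(x, update)])
        else:
            # Zero-FLOPs route: the token passes through unchanged.
            out.append(list(x))
    return out


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [0.1, 0.9, 0.5]                      # toy router outputs, one per token
toy_block = lambda x: [2.0 * v for v in x]    # stand-in for attention + MLP
result = mixture_of_depths_layer(tokens, scores, toy_block, capacity=1)
print(result)
```

Because `capacity` is fixed ahead of time, the compute cost of the layer is known in advance regardless of which tokens are chosen.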

April 24, 2024
in Articles
25 min read

A transformer walk-through, with Gemma

Transformer-based LLMs seem mysterious, but they don't need to. In this post, we'll walk through a modern transformer LLM, Google's Gemma, providing bare-bones PyTorch code and some intuition for why each step is there. If you're a programmer and casual ML enthusiast, this is written for you.

March 30, 2024
in Papers of the Month
13 min read

March Papers: Low-Rank Galore & 1.58-Bit Weights

March was a fruitful month for AI research, with plenty of papers for us to choose from. A trend in the work we've selected is the pushing of previously published methods to their limits, in new creative ways.

We start with GaLore, which resembles the popular LoRA method for cheap fine-tuning but applies a low-rank approximation to the gradients instead of the weights. It turns out this is particularly effective for pre-training.

Our second paper declares "The Era of 1-bit LLMs", showing that the previously published BitNet model can be tweaked for LLM training, such that weights can be rounded to either -1, 0 or 1. This is much stronger quantisation than most people thought possible. We also cover the DiPaCo paper, which demonstrates a method for scaling distributed MoE training, potentially to systems of such scale that they have to be distributed across datacentres.
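The ternary rounding mentioned above can be sketched as scaling by the mean absolute weight, then rounding and clipping to -1, 0 or 1. This is our understanding of the "absmean" scheme, shown on a flat list of toy weights; the function name and the `eps` value are illustrative assumptions.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantise a list of weights to {-1, 0, 1} using a mean-absolute-value scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    quantised = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantised, scale


q, scale = absmean_ternary([0.9, -0.05, 0.4, -1.2])
print(q, scale)
```

Three possible values per weight carry log2(3), roughly 1.58, bits of information, which is where the "1.58-bit" name comes from.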

Investigating a phenomenon that occurs as LLMs get larger, the Massive Activations paper brings valuable insight into why the numerics of LLMs tend to explode for certain tokens/hidden dimensions. We conclude with the G-Retriever paper, which provides a method for applying retrieval augmented generation (RAG) to textual graphs — something valuable in real-world applications where graph structures are commonplace.

February 29, 2024
in Papers of the Month
13 min read

February Papers: Longer RoPEs & Better Quantisation

Improving LLM inference is a key research topic at the moment, and something we're particularly interested in at Graphcore because of its hardware implications. February saw several developments in this area, focussing on both the efficiency and capabilities of LLM inference.

Microsoft contributed two of this month's papers, with the first showing a method of extrapolating to long sequences, and the second an approach to storing 6-bit weights. Researchers from Cornell University have gone further and pushed the limits of quantisation to as few as 3 bits for inference. Apple also introduced their new speculative streaming method, which makes efficiency gains by asking the model to predict multiple future tokens, improving over the popular speculative decoding technique.

January 29, 2024
in Papers of the Month
12 min read

January Papers: Great Teachers & Beyond Chinchilla

For the research community, 2023 was dominated by large transformers and the associated challenges with training, tuning and deploying them. This trend has continued into 2024, with January seeing some particularly useful developments in the area of efficient training.

Google DeepMind's work on active learning and MosaicML's work on updated scaling laws stood out to us as particularly noteworthy. The latter paper updates the influential Chinchilla scaling laws to account for the additional cost of inference, a key practical consideration that has influenced models like Llama & Mistral.

While scaling laws assume a fixed architecture, there are also benefits to be gained by tweaking model design. Nvidia demonstrate this in their paper on diffusion model training dynamics, where they make various stability-inducing changes (we did something similar in our unit scaling paper). Finally, we note a remarkable application of LLMs to the problem of geometry solving, which had previously appeared too data-constrained and reasoning-dependent for current AI to solve.

December 1, 2023
in Papers of the Month
8 min read

December Papers: FP8 Training & Simpler Transformers

The last month saw impressive developments in the space of efficient transformers and applied ML, from materials discovery to chip design.

Researchers at Microsoft showed that FP8 could be used in parts of the LLM training process that until now had been kept in higher precision, and work from ETH Zurich suggested a simplified way of designing transformer-like models.

In terms of applications, DeepMind have impressive results showing that GNNs can be used in the discovery of new inorganic crystals — a key building block of many modern technologies. Nvidia have also trained up a model to assist their engineers on chip design. This is a neat feedback loop: their chip design has facilitated better LLMs, and now their LLMs could facilitate better chip design. How useful this will be in practice remains to be seen.

October 18, 2023
in Articles
4 min read

Almost-scaled dot-product attention

TL;DR: Scaled dot-product attention isn't properly scaled, and that's a good thing!

Notebook: almost-scaled dot-product attention