Our role within Graphcore is to help define what the next generation of AI compute systems should look like.
Specialised hardware has been the key driver of the progress of AI over the last decade, and we believe that hardware-aware
AI algorithms and AI-aware hardware developments will continue to be critical to the advancement of this exciting field.
We're pleased to share four papers from different domains: LLM self-correction, FP8 training, crystal structure generation and optimisation. They are united, somewhat tenuously, by the importance of proper conditioning:
DeepMind researchers explain how conditioning on the wrong distribution during supervised fine-tuning for self-correction is harmful but can be overcome using RL.
A novel Smooth-SwiGLU activation "conditions" the numerics by inserting a scaling factor in just the right place, preventing late-training instability in FP8 (a rough sketch of the idea follows these highlights).
GenMS, an architecture that generates crystal structures for materials, conditions on both high-level textual and low-level structural information for high-quality generation.
SOAP is an evolution of Shampoo, with conditioners in the name and preconditioners forming the eigenbasis for optimisation.
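To give a flavour of the FP8 idea, here is our own minimal sketch (not the paper's code, and the exact placement Smooth-SwiGLU uses may differ) of a function-preserving way to insert a scaling factor into a SwiGLU block: one branch is scaled down before the intermediate would be cast to FP8, and the inverse scale is folded into the down projection, leaving the output unchanged.

```python
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Standard SwiGLU MLP: down-project silu(x @ w_gate) * (x @ w_up)."""
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

def scaled_swiglu_mlp(x, w_gate, w_up, w_down, s):
    """Function-preserving rescale: the intermediate product is divided by `s`
    (keeping it in range before a hypothetical FP8 cast, omitted here),
    and `s` is folded back into the down projection."""
    hidden = F.silu(x @ w_gate) * (x @ (w_up / s))  # smaller intermediate
    return hidden @ (w_down * s)                    # scale restored

torch.manual_seed(0)
x = torch.randn(4, 16)
w_gate, w_up, w_down = torch.randn(16, 32), torch.randn(16, 32), torch.randn(32, 16)
assert torch.allclose(
    swiglu_mlp(x, w_gate, w_up, w_down),
    scaled_swiglu_mlp(x, w_gate, w_up, w_down, s=8.0),
    atol=1e-4,
)
```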
You can be the judge of how tenuous the connection is; either way, we'd encourage you to check out the summaries.
I hope you enjoy these as much as we did. Tell us we're wrong; tell us we're right @GCResearchTeam.
With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that enable fast and cheap LLM integration, especially following the release of powerful openly available models such as Meta's Llama 3.1, Google's Gemma 2, and Mistral 7B.
If there's one thing you can count on from Graphcore Research, it's tireless enthusiasm for effective compute utilisation! Our favourite papers from August include:
Spectra, an open suite of 54 LLMs and 500+ intermediate checkpoints from 0.1B to 3.9B parameters, spanning FP16 training, ternary training, and post-training quantisation to 3, 4, 6, and 8 bits. The proposed ternary architecture - TriLM - outperforms BitNet b1.58 models of similar size (a short sketch of ternary quantisation follows these highlights).
An investigation into two methods for allowing LLMs to improve task performance on challenging prompts by expending more test-time compute. The authors demonstrate compute-optimal scaling strategies that allocate compute on a per-prompt basis, and show that thoughtful increases in the test-time compute budget for a small model can be more effective than training larger models.
A training dataset derived from a Knowledge Graph where correct answers can always be known, enabling accurate measurement of hallucinations in LLMs. This facilitates an analysis of hallucination rates and hallucination detectability as training compute is scaled. So you see, we don't only think about compute!
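Since TriLM and BitNet b1.58 are both built on ternary weights, here is a minimal sketch of absmean ternary quantisation in the BitNet b1.58 style (our own illustration, not the Spectra code): weights are divided by their mean absolute value, rounded to {-1, 0, +1}, and the scale is kept for dequantisation.

```python
import numpy as np

def ternary_quantise(w, eps=1e-8):
    """Absmean ternary quantisation: round w / mean(|w|) to {-1, 0, +1}
    and keep the scale so the weights can be dequantised later."""
    scale = np.abs(w).mean() + eps
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8))      # a toy weight matrix
w_q, scale = ternary_quantise(w)
w_deq = w_q * scale                          # dequantised approximation of w
print(w_q)                                   # entries in {-1., 0., 1.}
print(np.abs(w - w_deq).max())               # worst-case quantisation error
```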
I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.
My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does just this, and in a reasonably systematic way, since we need to work out how to compensate for changes in scale (standard deviation) through deep learning ops. In this post and the accompanying notebook, we explore this problem.
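As a taster, here is a minimal sketch of the kind of calculation involved (our own illustration, not the u-μP code): empirically estimate how an op changes the standard deviation of a unit-Gaussian input, and hence the multiplier that would restore unit scale. The choice of ops and sample size is arbitrary.

```python
import numpy as np

def compensation_factor(op, n_samples=1_000_000, seed=0):
    """Estimate how `op` changes the standard deviation of a unit-Gaussian
    input, and return the multiplier that restores unit scale."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    return 1.0 / op(x).std()

relu = lambda x: np.maximum(x, 0.0)
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

print(f"ReLU compensation factor ~ {compensation_factor(relu):.3f}")  # about 1.71
print(f"GELU compensation factor ~ {compensation_factor(gelu):.3f}")  # close to the ReLU value
```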
The 2024 International Conference on Machine Learning (ICML) was held last month in Vienna, Austria. As one of the "big three" AI conferences, alongside ICLR and NeurIPS, it attracted thousands of AI researchers and practitioners from around the globe. In this post, we highlight some of the topics and papers that piqued our interest.
Scaling continues to be a super hot topic of research, and the papers we've selected this month all tackle different angles of how to scale models efficiently.
The first paper we cover builds upon the work on muP to give a guide for transferring hyperparameters optimised on small models to the large models we care about, especially as transformer width increases.
Our second chosen paper looks at scaling mixture-of-experts transformers along the expert dimension. The authors design an efficient routing strategy that allows them to push the number of experts to the extreme for a more compute-optimal configuration.
The third paper we discuss addresses the lack of scaling laws for vocabulary parameters in LLMs. The authors first validate that an optimal vocabulary size exists for a given compute budget, and then empirically fit power laws to show that vocabulary parameters should be scaled differently from the other parameters of the model (a toy power-law fit is sketched below).
Finally, our fourth paper answers the question of whether long context lengths or retrieval-augmented generation is better for scaling in-context learning, and whether a combination of the two could lead to more efficient inference.
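To give a flavour of what fitting such a power law involves (entirely synthetic numbers of our own choosing, not the paper's results): a relationship of the form N_vocab ∝ C^α becomes a straight line in log-log space, so a least-squares fit on the logs recovers the exponent.

```python
import numpy as np

# Synthetic (compute budget, optimal vocabulary parameters) pairs obeying
# N = a * C**alpha with a little noise; all constants here are arbitrary.
rng = np.random.default_rng(0)
true_alpha, true_a = 0.4, 3.0
C = np.logspace(18, 23, num=12)                    # compute budgets (FLOPs)
N = true_a * C**true_alpha * np.exp(rng.normal(0.0, 0.05, C.size))

# A power law is linear in log-log space: log N = alpha * log C + log a
alpha, log_a = np.polyfit(np.log(C), np.log(N), deg=1)
print(f"fitted exponent alpha ~ {alpha:.3f}")      # recovers ~0.4
```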
I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.
When ChatGPT launched in 2022, it became evident how powerful the Transformer architecture is for natural language processing tasks when trained on large corpora of text. The performance of these Large Language Models (LLMs) has been attributed to the in-context learning capabilities that emerge with large-scale training.
Improving transformers is no longer "just one area" of machine learning research. This is illustrated by the breadth of papers we got excited about this month, all of which claim to improve upon some aspect of the transformer, but in very different ways.
First, the Mamba-2 paper explores the connection between structured state space models and attention, resulting in a new architecture, Mamba-2. (The paper isn't short, so you get value for money with this summary!)
SµPar builds upon the maximal update parameterisation to transfer hyperparameters across different sparsity levels, promising predictable training of sparse models.
CoPE identifies deficiencies in current relative positional encodings, which are critical for turning transformers from set models into sequence models, and introduces a new & richer form of encoding.
Finally, "matmul-free LMs" follow the trajectory of BitNet and BitNet b1.58, removing all matrix multiplies from a transformer LM forward pass (in doing so, they make it an RNN), promising compression & compute efficiency.
I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.
May is always an eventful time of year for ML researchers, with final ICML paper decisions and ICLR taking place in early May, and NeurIPS submission deadlines closing the month. As ever, arXiv submissions continue to grow!
This month we take a look at three papers exploring new techniques that challenge the mainstream large-scale pretraining setup: transformers trained with next-token prediction and optimized with Adam/AdamW.
The first paper, xLSTM, is a long-awaited deep dive into Sepp Hochreiter's new, improved RNN architecture, nearly 30 years after the original LSTM was published. Drawing inspiration from linear attention, the authors demonstrate scaling comparable to transformers up to 1.3B parameters.
We then take a look at Schedule-Free optimizers from a team at FAIR. The authors propose a new class of optimizers that require no finicky learning rate scheduling. By replacing gradient momentum terms in standard optimizers with parameter averages, they show faster convergence than scheduled optimizers on a wide battery of small-scale deep learning tasks (we sketch the basic recurrence below).
A further paper from FAIR extends the standard pretraining setup for large language models from next-token to multi-token prediction. This particularly seems to improve performance for larger models and offers a natural choice of model to use for speculative sampling to accelerate inference.
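To make the "parameter averages instead of momentum" idea concrete, here is a minimal sketch of the schedule-free SGD recurrence as we read it, applied to a toy quadratic (the problem, hyperparameters and names are our own; see the paper for the Adam variant and full details): gradients are evaluated at an interpolation y of the fast iterate z and a running average x of the iterates, and x is the point used for evaluation.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Schedule-free SGD as we read it: no learning-rate schedule; gradients
    are taken at y, an interpolation of the fast iterate z and the running
    average x, and x is returned for evaluation."""
    z, x = x0.copy(), x0.copy()
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # gradient evaluation point
        z = z - lr * grad(y)               # plain SGD step on z
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z            # running average of the iterates
    return x

# Toy quadratic: f(w) = 0.5 * ||A w - b||^2, with gradient A^T (A w - b)
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad = lambda w: A.T @ (A @ w - b)

w = schedule_free_sgd(grad, x0=np.zeros(5))
print(np.linalg.norm(grad(np.zeros(5))), np.linalg.norm(grad(w)))  # gradient norm shrinks
```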