Our role within Graphcore is to help define what the next generation of AI compute systems should look like.
Specialised hardware has been the key driver of the progress of AI over the last decade, and we believe that hardware-aware
AI algorithms and AI-aware hardware developments will continue to be critical to the advancement of this exciting field.
We're pleased to share four papers from different domains: LLM self-correction, FP8 training, crystal structure generation and optimisation. They are united, somewhat tenuously, by the importance of proper conditioning:
DeepMind researchers explain how conditioning on the wrong distribution during supervised fine-tuning for self-correction is harmful but can be overcome using RL.
A novel Smooth-SwiGLU activation "conditions" the numerics by inserting a scaling factor in just the right place, preventing late-training instability in FP8 (a rough sketch of the idea follows these highlights).
GenMS, an architecture that generates crystal structures for materials, conditions on both high-level textual and low-level structural information for high-quality generation.
SOAP is an evolution of Shampoo, with conditioners in the name and preconditioners forming the eigenbasis for optimisation.
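To give a flavour of the FP8 idea, here is our own minimal sketch (not the paper's code, and the exact placement Smooth-SwiGLU uses may differ) of a function-preserving way to insert a scaling factor into a SwiGLU block: one branch is scaled down before the intermediate would be cast to FP8, and the inverse scale is folded into the down projection, leaving the output unchanged.

```python
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Standard SwiGLU MLP: down-project silu(x @ w_gate) * (x @ w_up)."""
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

def scaled_swiglu_mlp(x, w_gate, w_up, w_down, s):
    """Function-preserving rescale: the intermediate product is divided by `s`
    (keeping it in range before a hypothetical FP8 cast, omitted here),
    and `s` is folded back into the down projection."""
    hidden = F.silu(x @ w_gate) * (x @ (w_up / s))  # smaller intermediate
    return hidden @ (w_down * s)                    # scale restored

torch.manual_seed(0)
x = torch.randn(4, 16)
w_gate, w_up, w_down = torch.randn(16, 32), torch.randn(16, 32), torch.randn(32, 16)
assert torch.allclose(
    swiglu_mlp(x, w_gate, w_up, w_down),
    scaled_swiglu_mlp(x, w_gate, w_up, w_down, s=8.0),
    atol=1e-4,
)
```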
You can be the judge of how tenuous the connection is; either way, we'd encourage you to check out the summaries.
I hope you enjoy these as much as we did. Tell us we're wrong; tell us we're right @GCResearchTeam.
With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that enable fast and cheap LLM integration, especially following the release of powerful openly available models such as Meta's Llama 3.1, Google's Gemma 2, and Mistral 7B.
If there's one thing you can count on from Graphcore Research, it's tireless enthusiasm for effective compute utilisation! Our favourite papers from August include:
Spectra, an open suite of 54 LLMs and 500+ intermediate checkpoints from 0.1B to 3.9B parameters, spanning FP16 training, ternary training, and post-training quantisation to 3, 4, 6, and 8 bits. The proposed ternary architecture - TriLM - outperforms BitNet b1.58 models of similar size (a short sketch of ternary quantisation follows these highlights).
An investigation into two methods for allowing LLMs to improve task performance on challenging prompts by expending more test-time compute. The authors demonstrate compute-optimal scaling strategies that allocate compute on a per-prompt basis, and show that thoughtful increases in the test-time compute budget for a small model can be more effective than training larger models.
A training dataset derived from a Knowledge Graph where correct answers can always be known, enabling accurate measurement of hallucinations in LLMs. This facilitates an analysis of hallucination rates and hallucination detectability as training compute is scaled. So you see, we don't only think about compute!
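Since TriLM and BitNet b1.58 are both built on ternary weights, here is a minimal sketch of absmean ternary quantisation in the BitNet b1.58 style (our own illustration, not the Spectra code): weights are divided by their mean absolute value, rounded to {-1, 0, +1}, and the scale is kept for dequantisation.

```python
import numpy as np

def ternary_quantise(w, eps=1e-8):
    """Absmean ternary quantisation: round w / mean(|w|) to {-1, 0, +1}
    and keep the scale so the weights can be dequantised later."""
    scale = np.abs(w).mean() + eps
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8))      # a toy weight matrix
w_q, scale = ternary_quantise(w)
w_deq = w_q * scale                          # dequantised approximation of w
print(w_q)                                   # entries in {-1., 0., 1.}
print(np.abs(w - w_deq).max())               # worst-case quantisation error
```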
I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.
My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does just this, and in a reasonably systematic way, since we need to work out how to compensate for changes in scale (standard deviation) through deep learning ops. In this post and the accompanying notebook, we explore this problem.
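As a taster, here is a minimal sketch of the kind of calculation involved (our own illustration, not the u-μP code): empirically estimate how an op changes the standard deviation of a unit-Gaussian input, and hence the multiplier that would restore unit scale. The choice of ops and sample size is arbitrary.

```python
import numpy as np

def compensation_factor(op, n_samples=1_000_000, seed=0):
    """Estimate how `op` changes the standard deviation of a unit-Gaussian
    input, and return the multiplier that restores unit scale."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    return 1.0 / op(x).std()

relu = lambda x: np.maximum(x, 0.0)
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

print(f"ReLU compensation factor ~ {compensation_factor(relu):.3f}")  # about 1.71
print(f"GELU compensation factor ~ {compensation_factor(gelu):.3f}")  # close to the ReLU value
```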
The 2024 International Conference on Machine Learning (ICML) was held last month in Vienna, Austria. As one of the "big three" AI conferences, alongside ICLR and NeurIPS, it attracted thousands of AI researchers and practitioners from around the globe. In this post, we highlight some of the topics and papers that piqued our interest.
Scaling continues to be a super hot topic of research, and the papers we've selected this month all tackle different angles of how to scale models efficiently.
The first paper we cover builds upon the work on muP to give a guide for transferring hyperparameters optimised on small models to the large models we care about, especially as transformer width increases.
Our second chosen paper looks at scaling mixture-of-experts transformers along the expert dimension. The authors design an efficient routing strategy that allows them to push the number of experts to the extreme for a more compute-optimal configuration.
The third paper we discuss addresses the lack of scaling laws for vocabulary parameters in LLMs. The authors first validate that an optimal vocabulary size exists for a given compute budget, and then empirically fit power laws to show that vocabulary parameters should be scaled differently from the other parameters of the model (a toy power-law fit is sketched below).
Finally, our fourth paper answers the question of whether long context lengths or retrieval-augmented generation is better for scaling in-context learning, and whether a combination of the two could lead to more efficient inference.
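To give a flavour of what fitting such a power law involves (entirely synthetic numbers of our own choosing, not the paper's results): a relationship of the form N_vocab ∝ C^α becomes a straight line in log-log space, so a least-squares fit on the logs recovers the exponent.

```python
import numpy as np

# Synthetic (compute budget, optimal vocabulary parameters) pairs obeying
# N = a * C**alpha with a little noise; all constants here are arbitrary.
rng = np.random.default_rng(0)
true_alpha, true_a = 0.4, 3.0
C = np.logspace(18, 23, num=12)                    # compute budgets (FLOPs)
N = true_a * C**true_alpha * np.exp(rng.normal(0.0, 0.05, C.size))

# A power law is linear in log-log space: log N = alpha * log C + log a
alpha, log_a = np.polyfit(np.log(C), np.log(N), deg=1)
print(f"fitted exponent alpha ~ {alpha:.3f}")      # recovers ~0.4
```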
I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.
When ChatGPT launched in 2022, it became evident how powerful the Transformer architecture is for natural language processing tasks when trained on large corpora of text. The performance of these Large Language Models (LLMs) has been attributed to the in-context learning capabilities that emerge with large-scale training.
Improving transformers is no longer "just one area" of machine learning research. This is illustrated by the breadth of papers we got excited about this month, all of which claim to improve upon some aspect of the transformer, but in very different ways.
First, the Mamba-2 paper explores the connection between structured state space models and attention, resulting in a new architecture, Mamba-2. (The paper isn't short, so you get value for money with this summary!)
SµPar builds upon the maximal update parameterisation to transfer hyperparameters across different sparsity levels, promising predictable training of sparse models.
CoPE identifies deficiencies in current relative positional encodings, which are critical for turning transformers from set models into sequence models, and introduces a new & richer form of encoding.
Finally, "matmul-free LMs" follow the trajectory of BitNet and BitNet b1.58, removing all matrix multiplies from a transformer LM forward pass (in doing so, they make it an RNN), promising compression & compute efficiency.
I hope you enjoy these as much as we did. If you have thoughts or questions, keep the conversation going @GCResearchTeam.
May is always an eventful time of year for ML researchers, with final ICML paper decisions and ICLR taking place in early May, and NeurIPS submission deadlines closing the month. As ever, arXiv submissions continue to grow!
This month we take a look at three papers exploring new techniques that challenge the mainstream large-scale pretraining setup: transformers trained with next-token prediction and optimized with Adam/AdamW.
The first paper, xLSTM, is a long-awaited deep dive into Sepp Hochreiter's new, improved RNN architecture, nearly 30 years after the original LSTM was published. Drawing inspiration from linear attention, the authors demonstrate scaling comparable to transformers up to 1.3B parameters.
We then take a look at Schedule-Free optimizers from a team at FAIR. The authors propose a new class of optimizers that require no finicky learning rate scheduling. By replacing gradient momentum terms in standard optimizers with parameter averages, they show faster convergence than scheduled optimizers on a wide battery of small-scale deep learning tasks (we sketch the basic recurrence below).
A further paper from FAIR extends the standard pretraining setup for large language models from next-token to multi-token prediction. This particularly seems to improve performance for larger models and offers a natural choice of model to use for speculative sampling to accelerate inference.
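To make the "parameter averages instead of momentum" idea concrete, here is a minimal sketch of the schedule-free SGD recurrence as we read it, applied to a toy quadratic (the problem, hyperparameters and names are our own; see the paper for the Adam variant and full details): gradients are evaluated at an interpolation y of the fast iterate z and a running average x of the iterates, and x is the point used for evaluation.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Schedule-free SGD as we read it: no learning-rate schedule; gradients
    are taken at y, an interpolation of the fast iterate z and the running
    average x, and x is returned for evaluation."""
    z, x = x0.copy(), x0.copy()
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # gradient evaluation point
        z = z - lr * grad(y)               # plain SGD step on z
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z            # running average of the iterates
    return x

# Toy quadratic: f(w) = 0.5 * ||A w - b||^2, with gradient A^T (A w - b)
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad = lambda w: A.T @ (A @ w - b)

w = schedule_free_sgd(grad, x0=np.zeros(5))
print(np.linalg.norm(grad(np.zeros(5))), np.linalg.norm(grad(w)))  # gradient norm shrinks
```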