Posts by Tag

LLMs

Soft Tokens, Hard Truths

5 minute read

The key idea

Set Block Decoding is a Language Model Inference Accelerator

4 minute read

The key idea

Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors

10 minute read

FlowRL: Matching Reward Distributions for LLM Reasoning

4 minute read

The key idea

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

5 minute read

The key idea: The authors present an agentic approach for RAG where, in each step, an LLM-based agent is given the choice to either (1) retrieve more informat...

DataRater: Meta-Learned Dataset Curation

2 minute read

The key idea

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

4 minute read

The key idea

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

6 minute read

The key idea

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

3 minute read

The key idea

Spurious Rewards: Rethinking Training Signals in RLVR

3 minute read

The key idea

Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

4 minute read

The key idea

Parallel Scaling Laws for Language Models

2 minute read

The key idea

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

3 minute read

The key idea: Language models applied to reasoning have recently been shown to benefit from longer chain-of-thought sequences, which require the model to proc...

Inference-Time Scaling for Generalist Reward Modeling

4 minute read

The key idea

Compute Optimal Scaling of Skills: Knowledge vs Reasoning

3 minute read

The key idea

Overtrained Language Models Are Harder to Fine-Tune

3 minute read

The key idea

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

3 minute read

The key idea

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantisation

2 minute read

The key idea

Distillation Scaling Laws

3 minute read

The key idea

Transformer-Squared: Self-Adaptive LLMs

2 minute read

The key idea

Titans: Learning to Memorize at Test Time

5 minute read

The key idea

Evolving Deeper LLM Thinking

4 minute read

The key idea

DeepSeek-V3 & DeepSeek-R1 Technical Reports

6 minute read

With their V3 and R1 models, DeepSeek sets a new state-of-the-art in open-weight models and trades benchmark to benchmark with the best models from Anthropic...

Llama 3.2 Vision — A Deep Dive

13 minute read

Vision-Language Models (VLMs) allow LLMs to “see”, but how do they work? In this post, we’ll walk through the model changes needed to turn an LLM into a VLM ...

Phi-4 Technical Report

6 minute read

The key idea

Byte Latent Transformer: Patches Scale Better Than Tokens

6 minute read

The key idea

The Super Weight in Large Language Models

3 minute read

The key idea

How Does Critical Batch Size Scale in Pre-training?

5 minute read

The key idea

Thinking LLMs: General Instruction Following with Thought Generation

3 minute read

The key idea

Training Language Models to Self-Correct via Reinforcement Learning

5 minute read

The key idea

Generative Hierarchical Materials Search

2 minute read

The key idea

Speeding up LLM inference using SparQ Attention & llama.cpp

19 minute read

With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that would enable ...

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

3 minute read

The key idea

Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

2 minute read

The key idea

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

5 minute read

The key idea

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

2 minute read

The key idea

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

3 minute read

The key idea

Scalable MatMul-free Language Modeling

2 minute read

The key idea

Contextual Position Encoding: Learning to Count What’s Important

2 minute read

The key idea

xLSTM: Extended Long Short-Term Memory

2 minute read

The key idea

Better & Faster Large Language Models via Multi-token Prediction

2 minute read

The key idea

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

3 minute read

The key idea

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

3 minute read

The key idea

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

5 minute read

The key idea

A transformer walk-through, with Gemma

36 minute read

Transformer-based LLMs seem mysterious, but they don’t need to be. In this post, we’ll walk through a modern transformer LLM, Google’s Gemma, providing bare-bon...

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

2 minute read

The key idea

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

2 minute read

The key idea

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

3 minute read

The key idea

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

2 minute read

The key idea

Massive Activations in Large Language Models

3 minute read

The key idea

Speculative Streaming: Fast LLM Inference without Auxiliary Models

4 minute read

The key idea

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

4 minute read

The key idea

Solving olympiad geometry without human demonstrations

4 minute read

The key idea

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

3 minute read

The key idea

Simplifying Transformer Blocks

1 minute read

The key idea

ChipNeMo: Domain-Adapted LLMs for Chip Design

1 minute read

The key idea

efficient-inference

Set Block Decoding is a Language Model Inference Accelerator

4 minute read

The key idea

Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors

10 minute read

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

4 minute read

The key idea

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

6 minute read

The key idea

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

4 minute read

Real-time interactive video generation requires (1) low latency (ideally frame generation requires a single model evaluation) and (2) the model can only use ...

Optimal Formats and the Cube Root of the PDF

9 minute read

Your boss emails you a point in 128-billion-dimensional space. “Llama 3.1 8B,” the message reads. “A not-so-large language model in bfloat16. But it’s too bi...

Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

4 minute read

The key idea

Parallel Scaling Laws for Language Models

2 minute read

The key idea

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

3 minute read

The key idea

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantisation

2 minute read

The key idea

Matryoshka Quantization

4 minute read

The key idea

Distillation Scaling Laws

3 minute read

The key idea

DeepSeek-V3 & DeepSeek-R1 Technical Reports

6 minute read

With their V3 and R1 models, DeepSeek sets a new state-of-the-art in open-weight models and trades benchmark to benchmark with the best models from Anthropic...

Scaling Laws for Precision

2 minute read

The key idea

Context Parallelism for Scalable Million-Token Inference

4 minute read

The key idea

Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models

4 minute read

The key idea

Scaling FP8 training to trillion-token LLMs

3 minute read

The key idea

Speeding up LLM inference using SparQ Attention & llama.cpp

19 minute read

With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that would enable ...

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

5 minute read

The key idea

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

3 minute read

The key idea

Scalable MatMul-free Language Modeling

2 minute read

The key idea

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

4 minute read

The key idea

Better & Faster Large Language Models via Multi-token Prediction

2 minute read

The key idea

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

3 minute read

The key idea

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

3 minute read

The key idea

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

5 minute read

The key idea

QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

4 minute read

The key idea

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

2 minute read

The key idea

Speculative Streaming: Fast LLM Inference without Auxiliary Models

4 minute read

The key idea

quantisation

Optimal Formats and the Cube Root of the PDF

9 minute read

Your boss emails you a point in 128-billion-dimensional space. “Llama 3.1 8B,” the message reads. “A not-so-large language model in bfloat16. But it’s too bi...

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantisation

2 minute read

The key idea

Matryoshka Quantization

4 minute read

The key idea

DeepSeek-V3 & DeepSeek-R1 Technical Reports

6 minute read

With their V3 and R1 models, DeepSeek sets a new state-of-the-art in open-weight models and trades benchmark to benchmark with the best models from Anthropic...

The Super Weight in Large Language Models

3 minute read

The key idea

Scaling Laws for Precision

2 minute read

The key idea

Scaling FP8 training to trillion-token LLMs

3 minute read

The key idea

Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

2 minute read

The key idea

Scalable MatMul-free Language Modeling

2 minute read

The key idea

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

3 minute read

The key idea

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

2 minute read

The key idea

QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

4 minute read

The key idea

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

2 minute read

The key idea

training-dynamics

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

3 minute read

The key idea

Why Gradients Rapidly Increase Near the End of Training

4 minute read

The key idea

Scaling Laws for Precision

2 minute read

The key idea

How Does Critical Batch Size Scale in Pre-training?

5 minute read

The key idea

Scale-preserving nonlinearities for u-μP

5 minute read

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does...

Scaling Exponents Across Parameterizations and Optimizers

5 minute read

The key idea

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

2 minute read

The key idea

The Road Less Scheduled

4 minute read

The key idea

Massive Activations in Large Language Models

3 minute read

The key idea

Analyzing and Improving the Training Dynamics of Diffusion Models

3 minute read

The key idea

Simplifying Transformer Blocks

1 minute read

The key idea

Almost-scaled dot-product attention

3 minute read

TL;DR: Scaled dot product attention isn’t properly scaled, and that’s a good thing!

efficient-training

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

7 minute read

This is a Graphcore co-authored paper.

Transformer-Squared: Self-Adaptive LLMs

2 minute read

The key idea

Memory Layers at Scale

4 minute read

The key idea

Scaling Laws for Precision

2 minute read

The key idea

How Does Critical Batch Size Scale in Pre-training?

5 minute read

The key idea

Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

2 minute read

The key idea

Mixture of a Million Experts

3 minute read

The key idea

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

4 minute read

The key idea

Better & Faster Large Language Models via Multi-token Prediction

2 minute read

The key idea

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

2 minute read

The key idea

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

2 minute read

The key idea

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

3 minute read

The key idea

reasoning

Soft Tokens, Hard Truths

5 minute read

The key idea

Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors

10 minute read

FlowRL: Matching Reward Distributions for LLM Reasoning

4 minute read

The key idea

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

3 minute read

The key idea

Spurious Rewards: Rethinking Training Signals in RLVR

3 minute read

The key idea

Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

4 minute read

The key idea

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

3 minute read

The key idea: Language models applied to reasoning have recently been shown to benefit from longer chain-of-thought sequences, which require the model to proc...

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

3 minute read

The key idea

Evolving Deeper LLM Thinking

4 minute read

The key idea

DeepSeek-V3 & DeepSeek-R1 Technical Reports

6 minute read

With their V3 and R1 models, DeepSeek sets a new state-of-the-art in open-weight models and trades benchmark to benchmark with the best models from Anthropic...

Phi-4 Technical Report

6 minute read

The key idea

Thinking LLMs: General Instruction Following with Thought Generation

3 minute read

The key idea

fine-tuning

Soft Tokens, Hard Truths

5 minute read

The key idea

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

7 minute read

This is a Graphcore co-authored paper.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

3 minute read

The key idea

Spurious Rewards: Rethinking Training Signals in RLVR

3 minute read

The key idea

Parallel Scaling Laws for Language Models

2 minute read

The key idea

Overtrained Language Models Are Harder to Fine-Tune

3 minute read

The key idea

Transformer-Squared: Self-Adaptive LLMs

2 minute read

The key idea

Training Language Models to Self-Correct via Reinforcement Learning

5 minute read

The key idea

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

2 minute read

The key idea

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

3 minute read

The key idea

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

4 minute read

The key idea

transformers

Transformers without Normalisation

2 minute read

The key idea

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

3 minute read

The key idea

Llama 3.2 Vision — A Deep Dive

13 minute read

Vision-Language Models (VLMs) allow LLMs to “see”, but how do they work? In this post, we’ll walk through the model changes needed to turn an LLM into a VLM ...

Memory Layers at Scale

4 minute read

The key idea

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

3 minute read

The key idea

Speeding up LLM inference using SparQ Attention & llama.cpp

19 minute read

With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that would enable ...

Contextual Position Encoding: Learning to Count What’s Important

2 minute read

The key idea

A transformer walk-through, with Gemma

36 minute read

Transformer-based LLMs seem mysterious, but they don’t need to be. In this post, we’ll walk through a modern transformer LLM, Google’s Gemma, providing bare-bon...

Massive Activations in Large Language Models

3 minute read

The key idea

Simplifying Transformer Blocks

1 minute read

The key idea

reinforcement-learning

Soft Tokens, Hard Truths

5 minute read

The key idea

FlowRL: Matching Reward Distributions for LLM Reasoning

4 minute read

The key idea

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

5 minute read

The key idea: The authors present an agentic approach for RAG where, in each step, an LLM-based agent is given the choice to either (1) retrieve more informat...

Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation

5 minute read

The key idea

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

3 minute read

The key idea

Spurious Rewards: Rethinking Training Signals in RLVR

3 minute read

The key idea

Inference-Time Scaling for Generalist Reward Modeling

4 minute read

The key idea

Training Language Models to Self-Correct via Reinforcement Learning

5 minute read

The key idea

scaling-laws

Parallel Scaling Laws for Language Models

2 minute read

The key idea

Compute Optimal Scaling of Skills: Knowledge vs Reasoning

3 minute read

The key idea

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantisation

2 minute read

The key idea

Distillation Scaling Laws

3 minute read

The key idea

How Does Critical Batch Size Scale in Pre-training?

5 minute read

The key idea

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

2 minute read

The key idea

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

3 minute read

The key idea

diffusion

Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation

5 minute read

The key idea

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

4 minute read

Real-time interactive video generation requires (1) low latency (ideally frame generation requires a single model evaluation) and (2) the model can only use ...

Motion Prompting: Controlling Video Generation with Motion Trajectories

3 minute read

The key idea: The central concept of Motion Prompting is to gain fine-grained control over video generation by conditioning a video diffusion model on spatio-...

Large Concept Models: Language Modeling in a Sentence Representation Space

4 minute read

The key idea

Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models

4 minute read

The key idea

Generative Hierarchical Materials Search

2 minute read

The key idea

Analyzing and Improving the Training Dynamics of Diffusion Models

3 minute read

The key idea

GNNs

Multi-Domain Distribution Learning for De Novo Drug Design

9 minute read

The key idea

Generative Hierarchical Materials Search

2 minute read

The key idea

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

3 minute read

The key idea

Scaling Deep Learning for Materials Discovery

2 minute read

The key idea

computer-vision

Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models

4 minute read

The key idea

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

3 minute read

The key idea

Analyzing and Improving the Training Dynamics of Diffusion Models

3 minute read

The key idea

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

3 minute read

The key idea

long-context

Titans: Learning to Memorize at Test Time

5 minute read

The key idea

Context Parallelism for Scalable Million-Token Inference

4 minute read

The key idea

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

3 minute read

The key idea

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

4 minute read

The key idea

mixture-of-experts

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

6 minute read

The key idea

Mixture of a Million Experts

3 minute read

The key idea

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

5 minute read

The key idea

DiPaCo: Distributed Path Composition

2 minute read

The key idea

sparsity

Memory Layers at Scale

4 minute read

The key idea

Speeding up LLM inference using SparQ Attention & llama.cpp

19 minute read

With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that would enable ...

Mixture of a Million Experts

3 minute read

The key idea

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

2 minute read

The key idea

number-formats

Optimal Formats and the Cube Root of the PDF

9 minute read

Your boss emails you a point in 128-billion-dimensional space. “Llama 3.1 8B,” the message reads. “A not-so-large language model in bfloat16. But it’s too bi...

Scaling Laws for Precision

2 minute read

The key idea

FP8-LM: Training FP8 Large Language Models

3 minute read

The key idea

inference

Transformer-Squared: Self-Adaptive LLMs

2 minute read

The key idea

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

4 minute read

The key idea

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

3 minute read

The key idea

speculative-decoding

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

3 minute read

The key idea

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

3 minute read

The key idea

Speculative Streaming: Fast LLM Inference without Auxiliary Models

4 minute read

The key idea

retrieval-augmented-generation

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

5 minute read

The key idea: The authors present an agentic approach for RAG where, in each step, an LLM-based agent is given the choice to either (1) retrieve more informat...

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

3 minute read

The key idea

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

3 minute read

The key idea

mup

Scale-preserving nonlinearities for u-μP

5 minute read

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does...

Scaling Exponents Across Parameterizations and Optimizers

5 minute read

The key idea

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

2 minute read

The key idea

image-generation

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

4 minute read

Real-time interactive video generation requires (1) low latency (ideally frame generation requires a single model evaluation) and (2) the model can only use ...

Simplifying, Stabilizing & Scaling Continuous-Time Consistency Models

4 minute read

The key idea

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

3 minute read

The key idea

unit-scaling

Scale-preserving nonlinearities for u-μP

5 minute read

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does...

Almost-scaled dot-product attention

3 minute read

TL;DR: Scaled dot product attention isn’t properly scaled, and that’s a good thing!

materials

Generative Hierarchical Materials Search

2 minute read

The key idea

Scaling Deep Learning for Materials Discovery

2 minute read

The key idea

active-learning

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

7 minute read

This is a Graphcore co-authored paper.

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

3 minute read

The key idea

position-embeddings

Contextual Position Encoding: Learning to Count What’s Important

2 minute read

The key idea

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

4 minute read

The key idea

sparse-attention

Speeding up LLM inference using SparQ Attention & llama.cpp

19 minute read

With the rapid advances in the capabilities of large language models (LLMs), there is an increasing need for efficient inference platforms that would enable ...

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

3 minute read

The key idea

not-transformers

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

4 minute read

The key idea

xLSTM: Extended Long Short-Term Memory

2 minute read

The key idea

self-correction

Thinking LLMs: General Instruction Following with Thought Generation

3 minute read

The key idea

Training Language Models to Self-Correct via Reinforcement Learning

5 minute read

The key idea

optimisation

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

7 minute read

This is a Graphcore co-authored paper.

SOAP: Improving and Stabilizing Shampoo using Adam

2 minute read

The key idea

generative-models

Motion Prompting: Controlling Video Generation with Motion Trajectories

3 minute read

The key idea: The central concept of Motion Prompting is to gain fine-grained control over video generation by conditioning a video diffusion model on spatio-...

Large Concept Models: Language Modeling in a Sentence Representation Space

4 minute read

The key idea

test-time-compute

Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

4 minute read

The key idea

Inference-Time Scaling for Generalist Reward Modeling

4 minute read

The key idea

RAG

Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors

10 minute read

AlphaEvolve: A coding agent for scientific and algorithmic discovery

3 minute read

AlphaEvolve evolves (no pun intended) the seminal method FunSearch, introduced in late 2023. Powered by a frontier model rather than a smaller LLM, it levera...

dataset

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

7 minute read

This is a Graphcore co-authored paper.

DataRater: Meta-Learned Dataset Curation

2 minute read

The key idea

chip-design

ChipNeMo: Domain-Adapted LLMs for Chip Design

1 minute read

The key idea

fp8

FP8-LM: Training FP8 Large Language Models

3 minute read

The key idea

DFT

Scaling Deep Learning for Materials Discovery

2 minute read

The key idea

multi-modality

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

3 minute read

The key idea

automated-theorem-proving

Solving olympiad geometry without human demonstrations

4 minute read

The key idea

synthetic-data

Solving olympiad geometry without human demonstrations

4 minute read

The key idea

distributed-training

DiPaCo: Distributed Path Composition

2 minute read

The key idea

local-updates

DiPaCo: Distributed Path Composition

2 minute read

The key idea

optimization

The Road Less Scheduled

4 minute read

The key idea

learning-rate-schedules

The Road Less Scheduled

4 minute read

The key idea

RNNs

xLSTM: Extended Long Short-Term Memory

2 minute read

The key idea

state-space-models

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

4 minute read

The key idea

hallucinations

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

3 minute read

The key idea

batch-size

How Does Critical Batch Size Scale in Pre-training?

5 minute read

The key idea

byte-level

Byte Latent Transformer: Patches Scale Better Than Tokens

6 minute read

The key idea

language-models

Large Concept Models: Language Modeling in a Sentence Representation Space

4 minute read

The key idea

embedding-models

Large Concept Models: Language Modeling in a Sentence Representation Space

4 minute read

The key idea

training

Phi-4 Technical Report

6 minute read

The key idea

synthetic data

Phi-4 Technical Report

6 minute read

The key idea

VLMs

Llama 3.2 Vision — A Deep Dive

13 minute read

Vision-Language Models (VLMs) allow LLMs to “see”, but how do they work? In this post, we’ll walk through the model changes needed to turn an LLM into a VLM ...

reinforcement learning

DeepSeek-V3 & DeepSeek-R1 Technical Reports

6 minute read

With their V3 and R1 models, DeepSeek sets a new state-of-the-art in open-weight models and trades benchmark to benchmark with the best models from Anthropic...

memory

Titans: Learning to Memorize at Test Time

5 minute read

The key idea

flow-matching

Multi-Domain Distribution Learning for De Novo Drug Design

9 minute read

The key idea

drug-design

Multi-Domain Distribution Learning for De Novo Drug Design

9 minute read

The key idea

normalisation

Transformers without Normalisation

2 minute read

The key idea

activation-functions

Transformers without Normalisation

2 minute read

The key idea

reward-modeling

Inference-Time Scaling for Generalist Reward Modeling

4 minute read

The key idea

inference-time-compute

Inference-Time Scaling for Generalist Reward Modeling

4 minute read

The key idea

mamba

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

3 minute read

The key idea: Language models applied to reasoning have recently been shown to benefit from longer chain-of-thought sequences, which require the model to proc...

AGI

AlphaEvolve: A coding agent for scientific and algorithmic discovery

3 minute read

AlphaEvolve evolves (no pun intended) the seminal method FunSearch, introduced in late 2023. Powered by a frontier model rather than a smaller LLM, it levera...

Evolutionary Algorithms

AlphaEvolve: A coding agent for scientific and algorithmic discovery

3 minute read

AlphaEvolve evolves (no pun intended) the seminal method FunSearch, introduced in late 2023. Powered by a frontier model rather than a smaller LLM, it levera...

video-generation

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

4 minute read

Real-time interactive video generation requires (1) low latency (ideally frame generation requires a single model evaluation) and (2) the model can only use ...
