March Papers: De-Norming, Skill-Scaling, Over-Training and Drug-Generating
We've enjoyed March, which brought improving weather and many excellent ML papers to keep us busy. As usual, we're here to share summaries of four of our favourites.
First, Meta share their work that successfully removes the need for LayerNorm in transformers, replacing it with a reduction-free \(\tanh\) (de-norming). This is followed by two papers on scaling - one studying how scaling laws differ for skill-based versus knowledge-based downstream tasks (skill-scaling), and one asking whether pretraining can go on for too long, making downstream performance worse (over-training). Finally, EPFL share a flow-matching GNN model that generates small molecules for drug design (drug-generating).
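For readers who want a concrete picture of the de-norming idea before the full summary: the appeal of an elementwise \(\tanh\) is that, unlike LayerNorm, it needs no mean or variance reduction over the hidden dimension. Below is a minimal PyTorch sketch of such a replacement; the class name, the affine parameters and the initial scale value are our own illustrative choices, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Sketch of a drop-in LayerNorm replacement: an elementwise tanh with a
    learnable input scale alpha and the usual affine weight/bias.
    No reduction (mean/variance) over the hidden dimension is required."""

    def __init__(self, dim: int, alpha_init: float = 0.5):  # alpha_init is an assumed default
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable input scale
        self.weight = nn.Parameter(torch.ones(dim))              # affine scale, as in LayerNorm
        self.bias = nn.Parameter(torch.zeros(dim))               # affine shift, as in LayerNorm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely elementwise: squash the scaled input, then apply the affine transform
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap nn.LayerNorm(dim) for DynamicTanh(dim) inside a transformer block
x = torch.randn(2, 16, 768)
print(DynamicTanh(768)(x).shape)  # torch.Size([2, 16, 768])
```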