3.1.21.1.4. unit_scaling.core.functional.transformer_residual_scaling_rule

unit_scaling.core.functional.transformer_residual_scaling_rule(residual_mult: float = 1.0, residual_attn_ratio: float = 1.0) → Callable[[int, int], float]

Compute the residual tau ratios for the default transformer rule.

For a transformer stack that starts with an embedding layer, then alternates between attention and MLP layers, this rule ensures:

  • Every attention layer contributes the same scale.

  • Every MLP layer contributes the same scale.

  • The ratio of the average (variance) contribution of the attention and MLP layers to that of the embedding layer is residual_mult.

  • The ratio of attention to MLP contributions is residual_attn_ratio.

If both hyperparameters are set to 1.0, the total contributions of the embedding, attention and MLP layers are all equal.
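As a minimal usage sketch: the rule is constructed once, then queried per residual layer following the fn(index, layers) -> tau convention documented under Returns below. The layer count and hyperparameter values here are illustrative, not prescribed by the library.

    from unit_scaling.core.functional import transformer_residual_scaling_rule

    # Build the default rule (both hyperparameters at 1.0).
    rule = transformer_residual_scaling_rule(
        residual_mult=1.0, residual_attn_ratio=1.0
    )

    # Query tau for each residual layer in an illustrative 8-layer stack
    # (attention and MLP layers alternating after the embedding).
    layers = 8
    taus = [rule(index, layers) for index in range(layers)]
    print(taus)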

This scheme is described in Appendix G of the u-μP paper.

Parameters:
  • residual_mult (float, optional) – contribution of residual layers (relative to an initial/embedding layer).

  • residual_attn_ratio (float, optional) – contribution of attention layers relative to MLP layers.

Returns:

a function for calculating tau at a given depth, called as fn(index, layers) -> tau.

Return type:

Callable[[int, int], float]
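To connect the returned tau values to the contribution guarantees listed above, the following sketch computes each branch's share of the output variance. It assumes a variance-preserving residual update out_i = (out_{i-1} + tau_i * b_i) / sqrt(1 + tau_i**2) with independent unit-variance branches, and that attention layers sit at even indices; both are illustrative assumptions, not documented internals of the library.

    from unit_scaling.core.functional import transformer_residual_scaling_rule

    def variance_contributions(taus):
        # Fraction of final-output variance contributed by the embedding
        # (index 0 of the result) and by each residual branch, ASSUMING the
        # update out_i = (out_{i-1} + tau_i * b_i) / sqrt(1 + tau_i**2) with
        # independent unit-variance branches b_i.
        contributions = [1.0]  # the embedding's share, rescaled by each layer
        for tau in taus:
            scale = 1.0 / (1.0 + tau**2)
            contributions = [c * scale for c in contributions]
            contributions.append(tau**2 * scale)  # this branch's share
        return contributions

    layers = 8  # assumed: attention at even indices, MLP at odd indices
    rule = transformer_residual_scaling_rule(
        residual_mult=1.0, residual_attn_ratio=1.0
    )
    taus = [rule(index, layers) for index in range(layers)]

    embedding, *branches = variance_contributions(taus)
    attn_total = sum(branches[0::2])
    mlp_total = sum(branches[1::2])
    print(f"embedding={embedding:.3f} attn={attn_total:.3f} mlp={mlp_total:.3f}")
    # With both hyperparameters at 1.0, the docstring above implies these
    # three totals should come out (approximately) equal under this model.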