3.1.19. unit_scaling.TransformerDecoder

class unit_scaling.TransformerDecoder(hidden_size: int, vocab_size: int, layers: int, heads: int, dropout_p: float = 0.0, residual_scaling: Callable[[int, int], float] = transformer_residual_scaling_rule())[source]

A unit-scaled implementation of a decoder-type transformer.

Note: this class is currently intended only to demonstrate scaling and lacks key functionality (for example, masking, positional embeddings, and use for inference).

Parameters:
  • hidden_size (int) – the hidden dimension size of the input.

  • vocab_size (int) – the number of tokens in the vocabulary.

  • layers (int) – the number of transformer layers.

  • heads (int) – the number of attention heads.

  • dropout_p (float, optional) – the probability used for embedding, residual, and post-softmax dropout.

  • residual_scaling (Callable[[int, int], float], optional) – scheme for controlling residual weights in the transformer trunk; see unit_scaling.core.functional.transformer_residual_scaling_rule() (default).
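
Example (a minimal usage sketch, not taken from the reference: the hyperparameter values are arbitrary, and the assumption that the model consumes a batch of token IDs of shape (batch_size, seq_len) should be checked against the source):

>>> import torch
>>> import unit_scaling as uu
>>> model = uu.TransformerDecoder(
...     hidden_size=256, vocab_size=1024, layers=4, heads=8, dropout_p=0.1
... )
>>> input_ids = torch.randint(0, 1024, (2, 128))  # assumed input: token IDs, shape (batch, seq)
>>> out = model(input_ids)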

append(module: Module) → Sequential

Append a given module to the end.

Parameters:

module (nn.Module) – module to append
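
Example (a sketch of the inherited nn.Sequential behaviour; uu.Dropout is chosen here purely as an illustrative module to append):

>>> import unit_scaling as uu
>>> model = uu.TransformerDecoder(hidden_size=256, vocab_size=1024, layers=4, heads=8)
>>> model = model.append(uu.Dropout(0.1))  # append returns the Sequential, so calls can be chained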