3.1.19. unit_scaling.TransformerDecoder

class unit_scaling.TransformerDecoder(hidden_size: int, vocab_size: int, layers: int, heads: int, dropout_p: float = 0.0, residual_scaling: Callable[[int, int], float] = transformer_residual_scaling_rule())[source]

A unit-scaled implementation of a decoder-type transformer.

Note: this class is currently intended only to demonstrate scaling and lacks key functionality (for example, masking, positional embeddings, and use for inference).

Parameters:
  • hidden_size (int) – the hidden dimension size of the input.

  • vocab_size (int) – the number of tokens in the vocabulary.

  • layers (int) – the number of transformer layers.

  • heads (int) – the number of attention heads.

  • dropout_p (float, optional) – the probability used for embedding, residual, and post-softmax dropout.

  • residual_scaling (Callable[[int, int], float], optional) – scheme for controlling residual weights in the transformer trunk; see unit_scaling.core.functional.transformer_residual_scaling_rule() (default).
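
Example (a minimal usage sketch, not taken from the reference: the hyperparameter values are arbitrary, and the assumption that the model consumes a batch of token IDs of shape (batch_size, seq_len) should be checked against the source):

>>> import torch
>>> import unit_scaling as uu
>>> model = uu.TransformerDecoder(
...     hidden_size=256, vocab_size=1024, layers=4, heads=8, dropout_p=0.1
... )
>>> input_ids = torch.randint(0, 1024, (2, 128))  # assumed input: token IDs, shape (batch, seq)
>>> out = model(input_ids)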

append(module: Module) → Sequential

Append a given module to the end.

Parameters:

module (nn.Module) – module to append
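
Example (a sketch of the inherited nn.Sequential behaviour; uu.Dropout is chosen here purely as an illustrative module to append):

>>> import unit_scaling as uu
>>> model = uu.TransformerDecoder(hidden_size=256, vocab_size=1024, layers=4, heads=8)
>>> model = model.append(uu.Dropout(0.1))  # append returns the Sequential, so calls can be chained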