3.2.1. unit_scaling.analysis.example_batch

unit_scaling.analysis.example_batch(tokenizer: PreTrainedTokenizerBase, batch_size: int, seq_len: int, dataset_path: str = 'wikitext', dataset_name: str = 'wikitext-103-v1', shuffle_buffer_size: int = 10000, seed: int = 1472) → Tuple[Tensor, Tensor, Tensor]

Generates a batch of token IDs from a given dataset, along with an attention mask and labels (the token IDs shifted by one position, to serve as next-token targets).

Parameters:
  • tokenizer (PreTrainedTokenizerBase) – the tokenizer applied to the text data.

  • batch_size (int) – the batch size of the returned tensor.

  • seq_len (int) – the sequence length (number of IDs) of the returned tensor.

  • dataset_path (str, optional) – Hugging Face path of the dataset to use for visualisation. Defaults to “wikitext”.

  • dataset_name (str, optional) – Hugging Face name of the dataset to use for visualisation. Defaults to “wikitext-103-v1”.

  • shuffle_buffer_size (int, optional) – the tokenized batch is drawn at random from a buffer holding a chunk of the full dataset; this sets the size of that buffer. Defaults to 10_000.

  • seed (int, optional) – shuffle seed. Defaults to 1472.

Returns:

a tuple of (input_idxs, attn_mask, labels)

Return type:

Tuple[Tensor, Tensor, Tensor]
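
Example:

A minimal usage sketch. The GPT-2 tokenizer and the argument values below are illustrative assumptions, not requirements; the first call may download the tokenizer and dataset.

>>> from transformers import AutoTokenizer
>>> from unit_scaling.analysis import example_batch
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice of tokenizer
>>> input_idxs, attn_mask, labels = example_batch(tokenizer, batch_size=4, seq_len=256)
>>> input_idxs.shape  # (batch_size, seq_len)
torch.Size([4, 256])
>>> attn_mask.shape
torch.Size([4, 256])
>>> labels.shape  # the input token IDs shifted by one position
torch.Size([4, 256])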