fairseq documentation

Fairseq is a sequence modeling toolkit written in PyTorch that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

Evaluating Pre-trained Models

First, download a pre-trained model along with its vocabularies:

> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

This model uses a Byte Pair Encoding (BPE) vocabulary, so we’ll have to apply the encoding to the source text before it can be translated. This can be done with the apply_bpe.py script using the wmt14.en-fr.fconv-py/bpecodes file. @@ is used as a continuation marker and the original text can be easily recovered with e.g. sed 's/@@ //g' or by passing the --remove-bpe flag to fairseq-generate. Prior to BPE, the input text needs to be tokenized using tokenizer.perl from mosesdecoder.
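For reference, applying this preprocessing by hand might look like the following sketch, assuming the mosesdecoder and subword-nmt repositories have been cloned into the current directory (those paths are assumptions and depend on your setup):

> echo "Why is it rare to discover new marine mammal species?" \
    | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    | python subword-nmt/subword_nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes
Why is it rare to discover new marine mam@@ mal species ?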

Let’s use fairseq-interactive to generate translations interactively. Here, we use a beam size of 5 and preprocess the input with the Moses tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically remove the BPE continuation markers and detokenize the output.

> MODEL_DIR=wmt14.en-fr.fconv-py
> fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --beam 5 --source-lang en --target-lang fr \
    --tokenizer moses \
    --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
Why is it rare to discover new marine mammal species?
S-0     Why is it rare to discover new marine mam@@ mal species ?
H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015

This generation script produces three types of output lines: a line prefixed with S is the source sentence after preprocessing (tokenization and BPE); H is the hypothesis along with its average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker, which is omitted from the text.

Other output line types you might see are D, the detokenized hypothesis; T, the reference target; A, alignment info; and E, the history of generation steps.

See the README for a full list of pre-trained models available.

Training a New Model

The following tutorial is for machine translation. For an example of how to use Fairseq for other tasks, such as Language Modeling, please see the examples/ directory.

Data Pre-processing

Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German). To pre-process and binarize the IWSLT dataset:

> cd examples/translation/
> bash prepare-iwslt14.sh
> cd ../..
> TEXT=examples/translation/iwslt14.tokenized.de-en
> fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en

This will write binarized data that can be used for model training to data-bin/iwslt14.tokenized.de-en.
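If you prefer the source and target languages to share a single vocabulary, fairseq-preprocess can also build a joined dictionary and binarize with multiple workers (see the flag reference below). A possible variant of the command above (the destination directory name is just an example):

> fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en.joined \
    --joined-dictionary --workers 4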

Training

Use fairseq-train to train a new model. Here are a few example settings that work well for the IWSLT 2014 dataset:

> mkdir -p checkpoints/fconv
> CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
    --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
    --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

By default, fairseq-train will use all available GPUs on your machine. Use the CUDA_VISIBLE_DEVICES environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum number of tokens per batch (--max-tokens). You may need to use a smaller value depending on the available GPU memory on your system.
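For example, a run on two GPUs with a smaller per-GPU token budget might look like the following sketch (the --max-tokens value is only illustrative; tune it to your GPU memory):

> CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/iwslt14.tokenized.de-en \
    --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
    --arch fconv_iwslt_de_en --save-dir checkpoints/fconv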

Generation

Once your model is trained, you can generate translations using fairseq-generate (for binarized data) or fairseq-interactive (for raw text):

> fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/fconv/checkpoint_best.pt \
    --batch-size 128 --beam 5
| [de] dictionary: 35475 types
| [en] dictionary: 24739 types
| data-bin/iwslt14.tokenized.de-en test 6750 examples
| model fconv
| loaded checkpoint checkpoints/fconv/checkpoint_best.pt
S-721   danke .
T-721   thank you .
...

To generate translations with only a CPU, use the --cpu flag. BPE continuation markers can be removed with the --remove-bpe flag.
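For instance, a CPU-only generation run that also strips the BPE markers might look like:

> fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/fconv/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe --cpu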

Advanced Training Options

Large mini-batch training with delayed updates

The --update-freq option can be used to accumulate gradients from multiple mini-batches and delay updating, creating a larger effective batch size. Delayed updates can also improve training speed by reducing inter-GPU communication costs and by saving idle time caused by variance in workload across GPUs. See Ott et al. (2018) for more details.

To train on a single GPU with an effective batch size that is equivalent to training on 8 GPUs:

> CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
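Applied to the IWSLT recipe from above, this amounts to adding --update-freq to the earlier training command, for example:

> CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
    --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
    --update-freq 8 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv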

Training with half precision floating point (FP16)

Note

FP16 training requires a Volta GPU and CUDA 9.1 or greater

Recent GPUs enable efficient half precision floating point computation, e.g., using Nvidia Tensor Cores. Fairseq supports FP16 training with the --fp16 flag:

> fairseq-train --fp16 (...)

Distributed training

Distributed training in fairseq is implemented on top of torch.distributed. The easiest way to launch jobs is with the torch.distributed.launch tool.

For example, to train a large English-German Transformer model on 2 nodes each with 8 GPUs (in total 16 GPUs), run the following command on each node, replacing node_rank=0 with node_rank=1 on the second node and making sure to update --master_addr to the IP address of the first node:

> python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=12345 \
    $(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 \
    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --max-epoch 70 \
    --fp16

On SLURM clusters, fairseq will automatically detect the number of nodes and GPUs, but a port number must be provided:

> salloc --gpus=16 --nodes 2 (...)
> srun fairseq-train --distributed-port 12345 (...)

Sharding very large datasets

It can be challenging to train over very large datasets, particularly if your machine does not have much system RAM. Most tasks in fairseq support training over “sharded” datasets, in which the original dataset has been preprocessed into non-overlapping chunks (or “shards”).

For example, instead of preprocessing all your data into a single “data-bin” directory, you can split the data and create “data-bin1”, “data-bin2”, etc. Then you can adapt your training command like so:

> fairseq-train data-bin1:data-bin2:data-bin3 (...)

Training will now iterate over each shard, one by one, with each shard corresponding to an “epoch”, thus reducing system memory usage.
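One way to produce such shards is to run fairseq-preprocess once per chunk of the raw training data, reusing the dictionaries written by the first run (dict.de.txt and dict.en.txt) so that all shards share the same vocabulary. A sketch, assuming the training corpus has been split into files with the prefixes train.part1 and train.part2 (those file names are assumptions):

> fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train.part1 --validpref $TEXT/valid \
    --destdir data-bin1 --workers 4
> fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train.part2 \
    --srcdict data-bin1/dict.de.txt --tgtdict data-bin1/dict.en.txt \
    --destdir data-bin2 --workers 4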

Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:

fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar]
                          [--log-interval LOG_INTERVAL]
                          [--log-format {json,none,simple,tqdm}]
                          [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                          [--aim-run-hash AIM_RUN_HASH]
                          [--tensorboard-logdir TENSORBOARD_LOGDIR]
                          [--wandb-project WANDB_PROJECT] [--azureml-logging]
                          [--seed SEED] [--cpu] [--tpu] [--bf16]
                          [--memory-efficient-bf16] [--fp16]
                          [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--on-cpu-convert-precision]
                          [--min-loss-scale MIN_LOSS_SCALE]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
                          [--amp-init-scale AMP_INIT_SCALE]
                          [--amp-scale-window AMP_SCALE_WINDOW]
                          [--user-dir USER_DIR]
                          [--empty-cache-freq EMPTY_CACHE_FREQ]
                          [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                          [--model-parallel-size MODEL_PARALLEL_SIZE]
                          [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                          [--profile] [--reset-logging] [--suppress-crashes]
                          [--use-plasma-view] [--plasma-path PLASMA_PATH]
                          [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                          [--tokenizer {moses,nltk,space}]
                          [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                          [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                          [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                          [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                          [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--align-suffix FP]
                          [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--dataset-impl FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]
                          [--dict-only]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

Default: “mmap”

Preprocessing

-s, --source-lang source language
-t, --target-lang target language
--trainpref train file prefix (also used to build dictionaries)
--validpref comma-separated valid file prefixes (words missing from the train set are replaced with <unk>)
--testpref comma-separated test file prefixes (words missing from the train set are replaced with <unk>)
--align-suffix alignment file suffix
--destdir

destination dir

Default: “data-bin”

--thresholdtgt

map words appearing less than threshold times to unknown

Default: 0

--thresholdsrc

map words appearing less than threshold times to unknown

Default: 0

--tgtdict reuse given target dictionary
--srcdict reuse given source dictionary
--nwordstgt

number of target words to retain

Default: -1

--nwordssrc

number of source words to retain

Default: -1

--alignfile an alignment file (optional)
--joined-dictionary

Generate joined dictionary

Default: False

--only-source

Only process the source language

Default: False

--padding-factor

Pad dictionary size to be multiple of N

Default: 8

--workers

number of parallel workers

Default: 1

--dict-only

if true, only builds a dictionary and then exits

Default: False

fairseq-train

Train a new model on one or across multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                     [--log-format {json,none,simple,tqdm}]
                     [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                     [--aim-run-hash AIM_RUN_HASH]
                     [--tensorboard-logdir TENSORBOARD_LOGDIR]
                     [--wandb-project WANDB_PROJECT] [--azureml-logging]
                     [--seed SEED] [--cpu] [--tpu] [--bf16]
                     [--memory-efficient-bf16] [--fp16]
                     [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--on-cpu-convert-precision]
                     [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                     [--amp-batch-retries AMP_BATCH_RETRIES]
                     [--amp-init-scale AMP_INIT_SCALE]
                     [--amp-scale-window AMP_SCALE_WINDOW]
                     [--user-dir USER_DIR]
                     [--empty-cache-freq EMPTY_CACHE_FREQ]
                     [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                     [--profile] [--reset-logging] [--suppress-crashes]
                     [--use-plasma-view] [--plasma-path PLASMA_PATH]
                     [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}]
                     [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                     [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                     [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                     [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                     [--task TASK] [--num-workers NUM_WORKERS]
                     [--skip-invalid-size-inputs-valid-test]
                     [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                     [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                     [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                     [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                     [--data-buffer-size DATA_BUFFER_SIZE]
                     [--train-subset TRAIN_SUBSET]
                     [--valid-subset VALID_SUBSET] [--combine-valid-subsets]
                     [--ignore-unused-valid-subsets]
                     [--validate-interval VALIDATE_INTERVAL]
                     [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                     [--validate-after-updates VALIDATE_AFTER_UPDATES]
                     [--fixed-validation-seed FIXED_VALIDATION_SEED]
                     [--disable-validation]
                     [--max-tokens-valid MAX_TOKENS_VALID]
                     [--batch-size-valid BATCH_SIZE_VALID]
                     [--max-valid-steps MAX_VALID_STEPS]
                     [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                     [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                     [--grouped-shuffling]
                     [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                     [--update-ordered-indices-seed]
                     [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                     [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--distributed-no-spawn]
                     [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                     [--ddp-comm-hook {none,fp16}]
                     [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                     [--find-unused-parameters] [--gradient-as-bucket-view]
                     [--fast-stat-sync]
                     [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                     [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM]
                     [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                     [--localsgd-frequency LOCALSGD_FREQUENCY]
                     [--nprocs-per-node NPROCS_PER_NODE]
                     [--pipeline-model-parallel]
                     [--pipeline-balance PIPELINE_BALANCE]
                     [--pipeline-devices PIPELINE_DEVICES]
                     [--pipeline-chunks PIPELINE_CHUNKS]
                     [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                     [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                     [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                     [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                     [--pipeline-checkpoint {always,never,except_last}]
                     [--zero-sharding {none,os}] [--no-reshard-after-forward]
                     [--fp32-reduce-scatter] [--cpu-offload]
                     [--use-sharded-state] [--not-fsdp-flatten-parameters]
                     [--arch ARCH] [--max-epoch MAX_EPOCH]
                     [--max-update MAX_UPDATE]
                     [--stop-time-hours STOP_TIME_HOURS]
                     [--clip-norm CLIP_NORM] [--sentence-avg]
                     [--update-freq UPDATE_FREQ] [--lr LR]
                     [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                     [--skip-remainder-batch] [--save-dir SAVE_DIR]
                     [--restore-file RESTORE_FILE]
                     [--continue-once CONTINUE_ONCE]
                     [--finetune-from-model FINETUNE_FROM_MODEL]
                     [--reset-dataloader] [--reset-lr-scheduler]
                     [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides OPTIMIZER_OVERRIDES]
                     [--save-interval SAVE_INTERVAL]
                     [--save-interval-updates SAVE_INTERVAL_UPDATES]
                     [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                     [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                     [--keep-last-epochs KEEP_LAST_EPOCHS]
                     [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                     [--no-save] [--no-epoch-checkpoints]
                     [--no-last-checkpoints] [--no-save-optimizer-state]
                     [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                     [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                     [--checkpoint-suffix CHECKPOINT_SUFFIX]
                     [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--load-checkpoint-on-all-dp-ranks]
                     [--write-checkpoints-asynchronously] [--store-ema]
                     [--ema-decay EMA_DECAY]
                     [--ema-start-update EMA_START_UPDATE]
                     [--ema-seed-model EMA_SEED_MODEL]
                     [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling
--update-ordered-indices-seed

if true, increment the seed with the epoch when getting batch iterators; defaults to False.

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

Model configuration

--arch, -a

Possible choices: transformer_tiny, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, transformer_align, transformer_wmt_en_de_big_align, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, roberta, roberta_prenorm, roberta_base, roberta_large, xlm, roberta_enc_dec, xmod_base_13, xmod_base_30, xmod_base_60, xmod_base_75, xmod_base, xmod_large_prenorm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, convtransformer, convtransformer_espnet, s2t_transformer, s2t_transformer_s, s2t_transformer_xs, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, xm_transformer, s2t_conformer, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, masked_lm, bert_base, bert_large, xlm_base, tacotron_2, tts_transformer, fastspeech2, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, lightconv_lm, lightconv_lm_gbw, lstm_lm, s2ut_transformer, s2ut_transformer_fisher, s2spect_transformer, s2spect_transformer_fisher, s2ut_conformer, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_tiny, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, transformer_lm_gpt2_big_wide, transformer_lm_gpt2_bigger, transformer_lm_gpt3_small, transformer_lm_gpt3_medium, transformer_lm_gpt3_large, transformer_lm_gpt3_xl, transformer_lm_gpt3_2_7, transformer_lm_gpt3_6_7, transformer_lm_gpt3_13, transformer_lm_gpt3_175, multilingual_transformer, multilingual_transformer_iwslt_de_en, bart_large, bart_base, mbart_large, mbart_base, mbart_base_wmt20, transformer_ulm, transformer_ulm_big, transformer_ulm_tiny, hubert, hubert_ctc, hubert_seq2seq, fconv_self_att, fconv_self_att_wp, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, dummy_model, model_parallel_roberta, model_parallel_roberta_v1, model_parallel_roberta_postnorm, model_parallel_roberta_base, model_parallel_roberta_large, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel, transformer_lm_megatron, transformer_lm_megatron_11b

model architecture

optimization

--max-epoch

force stop training at specified epoch

Default: 0

--max-update

force stop training at specified update

Default: 0

--stop-time-hours

force stop training after specified cumulative time (if >0)

Default: 0

--clip-norm

clip threshold of gradients

Default: 0.0

--sentence-avg

normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq

update parameters every N_i batches, when in epoch i

Default: 1

--lr

learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler)

Default: 0.25

--stop-min-lr

stop training when the learning rate reaches this minimum

Default: -1.0

--use-bmuf

specify global optimizer for syncing models on different GPUs/shards

Default: False

--skip-remainder-batch

if set, include the last (partial) batch of each epoch in training (default is to skip it).

Default: False

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-interval-updates-pattern

when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--load-checkpoint-on-all-dp-ranks

load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False

EMA configuration

--store-ema Default: False
--ema-decay

decay for exponential moving average model

Default: 0.9999

--ema-start-update

start EMA update after this many model updates

Default: 0

--ema-seed-model Seed to load EMA model from. Used to load EMA model separately from the actual model.
--ema-update-freq

Do EMA update every this many model updates

Default: 1

--ema-fp32

If true, store EMA model in fp32 even if model is in fp16

Default: False

fairseq-generate

Translate pre-processed data with a trained model.

usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                        [--log-format {json,none,simple,tqdm}]
                        [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                        [--aim-run-hash AIM_RUN_HASH]
                        [--tensorboard-logdir TENSORBOARD_LOGDIR]
                        [--wandb-project WANDB_PROJECT] [--azureml-logging]
                        [--seed SEED] [--cpu] [--tpu] [--bf16]
                        [--memory-efficient-bf16] [--fp16]
                        [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                        [--fp16-init-scale FP16_INIT_SCALE]
                        [--fp16-scale-window FP16_SCALE_WINDOW]
                        [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                        [--on-cpu-convert-precision]
                        [--min-loss-scale MIN_LOSS_SCALE]
                        [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                        [--amp-batch-retries AMP_BATCH_RETRIES]
                        [--amp-init-scale AMP_INIT_SCALE]
                        [--amp-scale-window AMP_SCALE_WINDOW]
                        [--user-dir USER_DIR]
                        [--empty-cache-freq EMPTY_CACHE_FREQ]
                        [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                        [--model-parallel-size MODEL_PARALLEL_SIZE]
                        [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                        [--profile] [--reset-logging] [--suppress-crashes]
                        [--use-plasma-view] [--plasma-path PLASMA_PATH]
                        [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                        [--tokenizer {moses,nltk,space}]
                        [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                        [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                        [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                        [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                        [--task TASK] [--num-workers NUM_WORKERS]
                        [--skip-invalid-size-inputs-valid-test]
                        [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                        [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                        [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                        [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                        [--data-buffer-size DATA_BUFFER_SIZE]
                        [--train-subset TRAIN_SUBSET]
                        [--valid-subset VALID_SUBSET]
                        [--combine-valid-subsets]
                        [--ignore-unused-valid-subsets]
                        [--validate-interval VALIDATE_INTERVAL]
                        [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                        [--validate-after-updates VALIDATE_AFTER_UPDATES]
                        [--fixed-validation-seed FIXED_VALIDATION_SEED]
                        [--disable-validation]
                        [--max-tokens-valid MAX_TOKENS_VALID]
                        [--batch-size-valid BATCH_SIZE_VALID]
                        [--max-valid-steps MAX_VALID_STEPS]
                        [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                        [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                        [--grouped-shuffling]
                        [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                        [--update-ordered-indices-seed]
                        [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                        [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                        [--distributed-rank DISTRIBUTED_RANK]
                        [--distributed-backend DISTRIBUTED_BACKEND]
                        [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                        [--distributed-port DISTRIBUTED_PORT]
                        [--device-id DEVICE_ID] [--distributed-no-spawn]
                        [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                        [--ddp-comm-hook {none,fp16}]
                        [--bucket-cap-mb BUCKET_CAP_MB]
                        [--fix-batches-to-gpus] [--find-unused-parameters]
                        [--gradient-as-bucket-view] [--fast-stat-sync]
                        [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                        [--broadcast-buffers]
                        [--slowmo-momentum SLOWMO_MOMENTUM]
                        [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                        [--localsgd-frequency LOCALSGD_FREQUENCY]
                        [--nprocs-per-node NPROCS_PER_NODE]
                        [--pipeline-model-parallel]
                        [--pipeline-balance PIPELINE_BALANCE]
                        [--pipeline-devices PIPELINE_DEVICES]
                        [--pipeline-chunks PIPELINE_CHUNKS]
                        [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                        [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                        [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                        [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                        [--pipeline-checkpoint {always,never,except_last}]
                        [--zero-sharding {none,os}]
                        [--no-reshard-after-forward] [--fp32-reduce-scatter]
                        [--cpu-offload] [--use-sharded-state]
                        [--not-fsdp-flatten-parameters] [--path PATH]
                        [--post-process [POST_PROCESS]] [--quiet]
                        [--model-overrides MODEL_OVERRIDES]
                        [--results-path RESULTS_PATH] [--beam BEAM]
                        [--nbest NBEST] [--max-len-a MAX_LEN_A]
                        [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                        [--match-source-len] [--unnormalized]
                        [--no-early-stop] [--no-beamable-mm] [--lenpen LENPEN]
                        [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                        [--sacrebleu] [--score-reference]
                        [--prefix-size PREFIX_SIZE]
                        [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                        [--sampling] [--sampling-topk SAMPLING_TOPK]
                        [--sampling-topp SAMPLING_TOPP]
                        [--constraints [{ordered,unordered}]]
                        [--temperature TEMPERATURE]
                        [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                        [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                        [--diversity-rate DIVERSITY_RATE]
                        [--print-alignment [{hard,soft}]] [--print-step]
                        [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                        [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                        [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                        [--iter-decode-force-max-iter]
                        [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                        [--iter-decode-with-external-reranker]
                        [--retain-iter-history] [--retain-dropout]
                        [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                        [--decoding-format {unigram,ensemble,vote,dp,bs}]
                        [--no-seed-provided] [--eos-token EOS_TOKEN]
                        [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                        [--continue-once CONTINUE_ONCE]
                        [--finetune-from-model FINETUNE_FROM_MODEL]
                        [--reset-dataloader] [--reset-lr-scheduler]
                        [--reset-meters] [--reset-optimizer]
                        [--optimizer-overrides OPTIMIZER_OVERRIDES]
                        [--save-interval SAVE_INTERVAL]
                        [--save-interval-updates SAVE_INTERVAL_UPDATES]
                        [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                        [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                        [--keep-last-epochs KEEP_LAST_EPOCHS]
                        [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                        [--no-save] [--no-epoch-checkpoints]
                        [--no-last-checkpoints] [--no-save-optimizer-state]
                        [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                        [--maximize-best-checkpoint-metric]
                        [--patience PATIENCE]
                        [--checkpoint-suffix CHECKPOINT_SUFFIX]
                        [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                        [--load-checkpoint-on-all-dp-ranks]
                        [--write-checkpoints-asynchronously]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when building batch iterators (defaults to False)

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing to different offsets of the allreduce communication buckets. This can reduce peak memory usage; the memory saved is equal to the total size of the gradients.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs and 0.6 for more than 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

Generation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process.
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-early-stop

deprecated

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

Possible choices: hard, soft

if set, uses attention feedback to compute and print alignment to source tokens (valid options are: hard, soft, otherwise treated as hard alignment)

--print-step

print steps

Default: False

--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate translations of varying lengths

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format

Possible choices: unigram, ensemble, vote, dp, bs

special decoding format for advanced decoding.

--no-seed-provided

if set, don't use a seed for initializing random generators

Default: False

--eos-token EOS token

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-interval-updates-pattern

when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn't improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--load-checkpoint-on-all-dp-ranks

load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False
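
For example, the generation and checkpoint options above can be combined into a single fairseq-generate run; the dataset directory and checkpoint path below are illustrative and should be adjusted to your own setup:

> fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/fconv/checkpoint_best.pt \
    --gen-subset test --beam 5 --lenpen 1.1 \
    --remove-bpe --sacrebleu --results-path results/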

fairseq-interactive

Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar]
                           [--log-interval LOG_INTERVAL]
                           [--log-format {json,none,simple,tqdm}]
                           [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                           [--aim-run-hash AIM_RUN_HASH]
                           [--tensorboard-logdir TENSORBOARD_LOGDIR]
                           [--wandb-project WANDB_PROJECT] [--azureml-logging]
                           [--seed SEED] [--cpu] [--tpu] [--bf16]
                           [--memory-efficient-bf16] [--fp16]
                           [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--on-cpu-convert-precision]
                           [--min-loss-scale MIN_LOSS_SCALE]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
                           [--amp-init-scale AMP_INIT_SCALE]
                           [--amp-scale-window AMP_SCALE_WINDOW]
                           [--user-dir USER_DIR]
                           [--empty-cache-freq EMPTY_CACHE_FREQ]
                           [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                           [--model-parallel-size MODEL_PARALLEL_SIZE]
                           [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                           [--profile] [--reset-logging] [--suppress-crashes]
                           [--use-plasma-view] [--plasma-path PLASMA_PATH]
                           [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                           [--tokenizer {moses,nltk,space}]
                           [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                           [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                           [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                           [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                           [--task TASK] [--num-workers NUM_WORKERS]
                           [--skip-invalid-size-inputs-valid-test]
                           [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                           [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                           [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                           [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                           [--data-buffer-size DATA_BUFFER_SIZE]
                           [--train-subset TRAIN_SUBSET]
                           [--valid-subset VALID_SUBSET]
                           [--combine-valid-subsets]
                           [--ignore-unused-valid-subsets]
                           [--validate-interval VALIDATE_INTERVAL]
                           [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                           [--validate-after-updates VALIDATE_AFTER_UPDATES]
                           [--fixed-validation-seed FIXED_VALIDATION_SEED]
                           [--disable-validation]
                           [--max-tokens-valid MAX_TOKENS_VALID]
                           [--batch-size-valid BATCH_SIZE_VALID]
                           [--max-valid-steps MAX_VALID_STEPS]
                           [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                           [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                           [--grouped-shuffling]
                           [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                           [--update-ordered-indices-seed]
                           [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                           [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                           [--distributed-rank DISTRIBUTED_RANK]
                           [--distributed-backend DISTRIBUTED_BACKEND]
                           [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                           [--distributed-port DISTRIBUTED_PORT]
                           [--device-id DEVICE_ID] [--distributed-no-spawn]
                           [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                           [--ddp-comm-hook {none,fp16}]
                           [--bucket-cap-mb BUCKET_CAP_MB]
                           [--fix-batches-to-gpus] [--find-unused-parameters]
                           [--gradient-as-bucket-view] [--fast-stat-sync]
                           [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                           [--broadcast-buffers]
                           [--slowmo-momentum SLOWMO_MOMENTUM]
                           [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                           [--localsgd-frequency LOCALSGD_FREQUENCY]
                           [--nprocs-per-node NPROCS_PER_NODE]
                           [--pipeline-model-parallel]
                           [--pipeline-balance PIPELINE_BALANCE]
                           [--pipeline-devices PIPELINE_DEVICES]
                           [--pipeline-chunks PIPELINE_CHUNKS]
                           [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                           [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                           [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                           [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                           [--pipeline-checkpoint {always,never,except_last}]
                           [--zero-sharding {none,os}]
                           [--no-reshard-after-forward]
                           [--fp32-reduce-scatter] [--cpu-offload]
                           [--use-sharded-state]
                           [--not-fsdp-flatten-parameters] [--path PATH]
                           [--post-process [POST_PROCESS]] [--quiet]
                           [--model-overrides MODEL_OVERRIDES]
                           [--results-path RESULTS_PATH] [--beam BEAM]
                           [--nbest NBEST] [--max-len-a MAX_LEN_A]
                           [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                           [--match-source-len] [--unnormalized]
                           [--no-early-stop] [--no-beamable-mm]
                           [--lenpen LENPEN] [--unkpen UNKPEN]
                           [--replace-unk [REPLACE_UNK]] [--sacrebleu]
                           [--score-reference] [--prefix-size PREFIX_SIZE]
                           [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                           [--sampling] [--sampling-topk SAMPLING_TOPK]
                           [--sampling-topp SAMPLING_TOPP]
                           [--constraints [{ordered,unordered}]]
                           [--temperature TEMPERATURE]
                           [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                           [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                           [--diversity-rate DIVERSITY_RATE]
                           [--print-alignment [{hard,soft}]] [--print-step]
                           [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                           [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                           [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                           [--iter-decode-force-max-iter]
                           [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                           [--iter-decode-with-external-reranker]
                           [--retain-iter-history] [--retain-dropout]
                           [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                           [--decoding-format {unigram,ensemble,vote,dp,bs}]
                           [--no-seed-provided] [--eos-token EOS_TOKEN]
                           [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                           [--continue-once CONTINUE_ONCE]
                           [--finetune-from-model FINETUNE_FROM_MODEL]
                           [--reset-dataloader] [--reset-lr-scheduler]
                           [--reset-meters] [--reset-optimizer]
                           [--optimizer-overrides OPTIMIZER_OVERRIDES]
                           [--save-interval SAVE_INTERVAL]
                           [--save-interval-updates SAVE_INTERVAL_UPDATES]
                           [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                           [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                           [--keep-last-epochs KEEP_LAST_EPOCHS]
                           [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                           [--no-save] [--no-epoch-checkpoints]
                           [--no-last-checkpoints] [--no-save-optimizer-state]
                           [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                           [--maximize-best-checkpoint-metric]
                           [--patience PATIENCE]
                           [--checkpoint-suffix CHECKPOINT_SUFFIX]
                           [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                           [--load-checkpoint-on-all-dp-ranks]
                           [--write-checkpoints-asynchronously]
                           [--buffer-size BUFFER_SIZE] [--input INPUT]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when building batch iterators (defaults to False)

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing to different offsets of the allreduce communication buckets. This can reduce peak memory usage; the memory saved is equal to the total size of the gradients.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs and 0.6 for more than 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

Generation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process.
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-early-stop

deprecated

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

Possible choices: hard, soft

if set, uses attention feedback to compute and print alignment to source tokens (valid options are: hard, soft, otherwise treated as hard alignment)

--print-step

print steps

Default: False

--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate translations of varying lengths

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format

Possible choices: unigram, ensemble, vote, dp, bs

special decoding format for advanced decoding.

--no-seed-provided

if set, don't use a seed for initializing random generators

Default: False

--eos-token EOS token

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-interval-updates-pattern

when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn't improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--load-checkpoint-on-all-dp-ranks

load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False

Interactive

--buffer-size

read this many sentences into a buffer before processing them

Default: 0

--input

file to read from; use - for stdin

Default: “-”
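
For example, instead of typing sentences interactively, a tokenized and BPE-encoded file can be translated in batches using the Interactive options above; the data directory, checkpoint and file names below are illustrative. Translations are written to standard output:

> fairseq-interactive data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/fconv/checkpoint_best.pt \
    --beam 5 --remove-bpe \
    --buffer-size 64 --batch-size 16 \
    --input test.de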

fairseq-score

BLEU scoring of generated translations against reference translations.

Command-line script for BLEU scoring.

usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]
                     [--sentence-bleu]

Named Arguments

-s, --sys

system output

Default: “-”

-r, --ref references
-o, --order

consider ngrams up to this order

Default: 4

--ignore-case

case-insensitive scoring

Default: False

--sacrebleu

score with sacrebleu

Default: False

--sentence-bleu

report sentence-level BLEUs (i.e., with +1 smoothing)

Default: False
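
For example, given a file of system hypotheses and a file of reference translations (one sentence per line; the file names are illustrative):

> fairseq-score --sys test.sys --ref test.ref --order 4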

fairseq-eval-lm

Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                       [--log-format {json,none,simple,tqdm}]
                       [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                       [--aim-run-hash AIM_RUN_HASH]
                       [--tensorboard-logdir TENSORBOARD_LOGDIR]
                       [--wandb-project WANDB_PROJECT] [--azureml-logging]
                       [--seed SEED] [--cpu] [--tpu] [--bf16]
                       [--memory-efficient-bf16] [--fp16]
                       [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--on-cpu-convert-precision]
                       [--min-loss-scale MIN_LOSS_SCALE]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                       [--amp-batch-retries AMP_BATCH_RETRIES]
                       [--amp-init-scale AMP_INIT_SCALE]
                       [--amp-scale-window AMP_SCALE_WINDOW]
                       [--user-dir USER_DIR]
                       [--empty-cache-freq EMPTY_CACHE_FREQ]
                       [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                       [--model-parallel-size MODEL_PARALLEL_SIZE]
                       [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                       [--profile] [--reset-logging] [--suppress-crashes]
                       [--use-plasma-view] [--plasma-path PLASMA_PATH]
                       [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                       [--tokenizer {moses,nltk,space}]
                       [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                       [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                       [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                       [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                       [--task TASK] [--num-workers NUM_WORKERS]
                       [--skip-invalid-size-inputs-valid-test]
                       [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                       [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                       [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                       [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                       [--data-buffer-size DATA_BUFFER_SIZE]
                       [--train-subset TRAIN_SUBSET]
                       [--valid-subset VALID_SUBSET] [--combine-valid-subsets]
                       [--ignore-unused-valid-subsets]
                       [--validate-interval VALIDATE_INTERVAL]
                       [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                       [--validate-after-updates VALIDATE_AFTER_UPDATES]
                       [--fixed-validation-seed FIXED_VALIDATION_SEED]
                       [--disable-validation]
                       [--max-tokens-valid MAX_TOKENS_VALID]
                       [--batch-size-valid BATCH_SIZE_VALID]
                       [--max-valid-steps MAX_VALID_STEPS]
                       [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                       [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                       [--grouped-shuffling]
                       [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                       [--update-ordered-indices-seed]
                       [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                       [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                       [--distributed-rank DISTRIBUTED_RANK]
                       [--distributed-backend DISTRIBUTED_BACKEND]
                       [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                       [--distributed-port DISTRIBUTED_PORT]
                       [--device-id DEVICE_ID] [--distributed-no-spawn]
                       [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                       [--ddp-comm-hook {none,fp16}]
                       [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                       [--find-unused-parameters] [--gradient-as-bucket-view]
                       [--fast-stat-sync]
                       [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                       [--broadcast-buffers]
                       [--slowmo-momentum SLOWMO_MOMENTUM]
                       [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                       [--localsgd-frequency LOCALSGD_FREQUENCY]
                       [--nprocs-per-node NPROCS_PER_NODE]
                       [--pipeline-model-parallel]
                       [--pipeline-balance PIPELINE_BALANCE]
                       [--pipeline-devices PIPELINE_DEVICES]
                       [--pipeline-chunks PIPELINE_CHUNKS]
                       [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                       [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                       [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                       [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                       [--pipeline-checkpoint {always,never,except_last}]
                       [--zero-sharding {none,os}]
                       [--no-reshard-after-forward] [--fp32-reduce-scatter]
                       [--cpu-offload] [--use-sharded-state]
                       [--not-fsdp-flatten-parameters] [--path PATH]
                       [--post-process [POST_PROCESS]] [--quiet]
                       [--model-overrides MODEL_OVERRIDES]
                       [--results-path RESULTS_PATH] [--output-word-probs]
                       [--output-word-stats] [--context-window CONTEXT_WINDOW]
                       [--softmax-batch SOFTMAX_BATCH]
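
For example, a held-out test set can be scored with a trained language model as follows; the data directory and checkpoint path are illustrative:

> fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/lm/checkpoint_best.pt \
    --gen-subset test --max-tokens 2048 \
    --context-window 400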

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “language_modeling”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when building batch iterators

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing into different offsets of the allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradient size.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs and 0.6 for more than 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

LM Evaluation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process.
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--output-word-probs

if set, outputs words and their predicted log probabilities to standard output

Default: False

--output-word-stats

if set, outputs word statistics such as word count, average probability, etc

Default: False

--context-window

ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0

--softmax-batch

if B x T is larger than this value, batch the softmax over the vocabulary into chunks of this many tokens in order to fit into GPU memory

Default: 9223372036854775807

Overview

Fairseq can be extended through user-supplied plug-ins. We support five kinds of plug-ins:

  • Models define the neural network architecture and encapsulate all of the learnable parameters.
  • Criterions compute the loss function given the model outputs and targets.
  • Tasks store dictionaries and provide helpers for loading/iterating over Datasets, initializing the Model/Criterion and calculating the loss.
  • Optimizers update the Model parameters based on the gradients.
  • Learning Rate Schedulers update the learning rate over the course of training.

Training Flow

Given a model, criterion, task, optimizer and lr_scheduler, fairseq implements the following high-level training flow:

for epoch in range(num_epochs):
    itr = task.get_batch_iterator(task.dataset('train'))
    for num_updates, batch in enumerate(itr):
        task.train_step(batch, model, criterion, optimizer)
        average_and_clip_gradients()
        optimizer.step()
        lr_scheduler.step_update(num_updates)
    lr_scheduler.step(epoch)

where the default implementation for task.train_step is roughly:

def train_step(self, batch, model, criterion, optimizer, **unused):
    loss = criterion(model, batch)
    optimizer.backward(loss)
    return loss
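
A Task can override train_step to customize this behavior, for example to skip the backward pass for dummy batches. The following is an illustrative sketch only (the class is hypothetical, not fairseq's actual implementation); it follows the fuller train_step signature documented in the Tasks section below:

from fairseq.tasks import FairseqTask

class MyTask(FairseqTask):

    def train_step(self, sample, model, criterion, optimizer, update_num,
                   ignore_grad=False):
        model.train()
        # Criterions return (loss, sample_size, logging_output); sample_size
        # is used as the denominator when averaging gradients across workers.
        loss, sample_size, logging_output = criterion(model, sample)
        if ignore_grad:
            loss *= 0  # e.g. for padding-only "dummy" batches
        optimizer.backward(loss)
        return loss, sample_size, logging_output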

Registering new plug-ins

New plug-ins are registered through a set of @register function decorators, for example:

@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
    (...)

Once registered, new plug-ins can be used with the existing Command-line Tools. See the Tutorial sections for more detailed walkthroughs of how to add new plug-ins.
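
The other plug-in kinds follow the same pattern. For example, a custom criterion can be registered with @register_criterion. The sketch below is a simplified variant of the built-in cross_entropy criterion; the name 'my_cross_entropy' is illustrative, and it assumes a standard translation-style sample that provides an 'ntokens' field:

import torch.nn.functional as F
from fairseq.criterions import FairseqCriterion, register_criterion

@register_criterion('my_cross_entropy')
class MyCrossEntropyCriterion(FairseqCriterion):

    def forward(self, model, sample, reduce=True):
        # Criterions return (loss, sample_size, logging_output).
        net_output = model(**sample['net_input'])
        lprobs = model.get_normalized_probs(net_output, log_probs=True)
        lprobs = lprobs.view(-1, lprobs.size(-1))
        target = model.get_targets(sample, net_output).view(-1)
        # self.padding_idx is set by FairseqCriterion from the task's
        # target dictionary.
        loss = F.nll_loss(
            lprobs, target,
            ignore_index=self.padding_idx,
            reduction='sum' if reduce else 'none',
        )
        sample_size = sample['ntokens']
        logging_output = {
            'loss': loss.data,
            'ntokens': sample['ntokens'],
            'sample_size': sample_size,
        }
        return loss, sample_size, logging_output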

Loading plug-ins from another directory

New plug-ins can be defined in a custom module stored on the user's system. To import the module and make the plug-in available to fairseq, pass the --user-dir command-line flag, which specifies a custom location for additional modules to load into fairseq.

For example, assuming this directory tree:

/home/user/my-module/
└── __init__.py

with __init__.py:

from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big

@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
    transformer_vaswani_wmt_en_de_big(args)

it is possible to invoke the fairseq-train script with the new architecture with:

fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation

Tutorial: Simple LSTM

In this tutorial we will extend fairseq by adding a new FairseqEncoderDecoderModel that encodes a source sentence with an LSTM and then passes the final hidden state to a second LSTM that decodes the target sentence (without attention).

This tutorial covers:

  1. Writing an Encoder and Decoder to encode/decode the source/target sentence, respectively.
  2. Registering a new Model so that it can be used with the existing Command-line Tools.
  3. Training the Model using the existing command-line tools.
  4. Making generation faster by modifying the Decoder to use Incremental decoding.

1. Building an Encoder and Decoder

In this section we’ll define a simple LSTM Encoder and Decoder. All Encoders should implement the FairseqEncoder interface and Decoders should implement the FairseqDecoder interface. These interfaces themselves extend torch.nn.Module, so FairseqEncoders and FairseqDecoders can be written and used in the same ways as ordinary PyTorch Modules.

Encoder

Our Encoder will embed the tokens in the source sentence, feed them to a torch.nn.LSTM and return the final hidden state. To create our encoder save the following in a new file named fairseq/models/simple_lstm.py:

import torch.nn as nn
from fairseq import utils
from fairseq.models import FairseqEncoder

class SimpleLSTMEncoder(FairseqEncoder):

    def __init__(
        self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
    ):
        super().__init__(dictionary)
        self.args = args

        # Our encoder will embed the inputs before feeding them to the LSTM.
        self.embed_tokens = nn.Embedding(
            num_embeddings=len(dictionary),
            embedding_dim=embed_dim,
            padding_idx=dictionary.pad(),
        )
        self.dropout = nn.Dropout(p=dropout)

        # We'll use a single-layer, unidirectional LSTM for simplicity.
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            bidirectional=False,
            batch_first=True,
        )

    def forward(self, src_tokens, src_lengths):
        # The inputs to the ``forward()`` function are determined by the
        # Task, and in particular the ``'net_input'`` key in each
        # mini-batch. We discuss Tasks in the next tutorial, but for now just
        # know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
        # has shape `(batch)`.

        # Note that the source is typically padded on the left. This can be
        # configured by adding the `--left-pad-source "False"` command-line
        # argument, but here we'll make the Encoder handle either kind of
        # padding by converting everything to be right-padded.
        if self.args.left_pad_source:
            # Convert left-padding to right-padding.
            src_tokens = utils.convert_padding_direction(
                src_tokens,
                padding_idx=self.dictionary.pad(),
                left_to_right=True
            )

        # Embed the source.
        x = self.embed_tokens(src_tokens)

        # Apply dropout.
        x = self.dropout(x)

        # Pack the sequence into a PackedSequence object to feed to the LSTM.
        x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)

        # Get the output from the LSTM.
        _outputs, (final_hidden, _final_cell) = self.lstm(x)

        # Return the Encoder's output. This can be any object and will be
        # passed directly to the Decoder.
        return {
            # this will have shape `(bsz, hidden_dim)`
            'final_hidden': final_hidden.squeeze(0),
        }

    # Encoders are required to implement this method so that we can rearrange
    # the order of the batch elements during inference (e.g., beam search).
    def reorder_encoder_out(self, encoder_out, new_order):
        """
        Reorder encoder output according to `new_order`.

        Args:
            encoder_out: output from the ``forward()`` method
            new_order (LongTensor): desired order

        Returns:
            `encoder_out` rearranged according to `new_order`
        """
        final_hidden = encoder_out['final_hidden']
        return {
            'final_hidden': final_hidden.index_select(0, new_order),
        }

Decoder

Our Decoder will predict the next word, conditioned on the Encoder’s final hidden state and an embedded representation of the previous target word – which is sometimes called teacher forcing. More specifically, we’ll use a torch.nn.LSTM to produce a sequence of hidden states that we’ll project to the size of the output vocabulary to predict each target word.

import torch
import torch.nn as nn
from fairseq.models import FairseqDecoder

class SimpleLSTMDecoder(FairseqDecoder):

    def __init__(
        self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
        dropout=0.1,
    ):
        super().__init__(dictionary)

        # Our decoder will embed the inputs before feeding them to the LSTM.
        self.embed_tokens = nn.Embedding(
            num_embeddings=len(dictionary),
            embedding_dim=embed_dim,
            padding_idx=dictionary.pad(),
        )
        self.dropout = nn.Dropout(p=dropout)

        # We'll use a single-layer, unidirectional LSTM for simplicity.
        self.lstm = nn.LSTM(
            # For the first layer we'll concatenate the Encoder's final hidden
            # state with the embedded target tokens.
            input_size=encoder_hidden_dim + embed_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            bidirectional=False,
        )

        # Define the output projection.
        self.output_projection = nn.Linear(hidden_dim, len(dictionary))

    # During training Decoders are expected to take the entire target sequence
    # (shifted right by one position) and produce logits over the vocabulary.
    # The *prev_output_tokens* tensor begins with the end-of-sentence symbol,
    # ``dictionary.eos()``, followed by the target sequence.
    def forward(self, prev_output_tokens, encoder_out):
        """
        Args:
            prev_output_tokens (LongTensor): previous decoder outputs of shape
                `(batch, tgt_len)`, for teacher forcing
            encoder_out (Tensor, optional): output from the encoder, used for
                encoder-side attention

        Returns:
            tuple:
                - the last decoder layer's output of shape
                  `(batch, tgt_len, vocab)`
                - the last decoder layer's attention weights of shape
                  `(batch, tgt_len, src_len)`
        """
        bsz, tgt_len = prev_output_tokens.size()

        # Extract the final hidden state from the Encoder.
        final_encoder_hidden = encoder_out['final_hidden']

        # Embed the target sequence, which has been shifted right by one
        # position and now starts with the end-of-sentence symbol.
        x = self.embed_tokens(prev_output_tokens)

        # Apply dropout.
        x = self.dropout(x)

        # Concatenate the Encoder's final hidden state to *every* embedded
        # target token.
        x = torch.cat(
            [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
            dim=2,
        )

        # Using PackedSequence objects in the Decoder is harder than in the
        # Encoder, since the targets are not sorted in descending length order,
        # which is a requirement of ``pack_padded_sequence()``. Instead we'll
        # feed nn.LSTM directly.
        initial_state = (
            final_encoder_hidden.unsqueeze(0),  # hidden
            torch.zeros_like(final_encoder_hidden).unsqueeze(0),  # cell
        )
        output, _ = self.lstm(
            x.transpose(0, 1),  # convert to shape `(tgt_len, bsz, dim)`
            initial_state,
        )
        x = output.transpose(0, 1)  # convert to shape `(bsz, tgt_len, hidden)`

        # Project the outputs to the size of the vocabulary.
        x = self.output_projection(x)

        # Return the logits and ``None`` for the attention weights
        return x, None
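
As a quick sanity check (not part of the original tutorial), the two modules defined above can be wired together on a toy dictionary. The symbols, dimensions and argument namespace below are purely illustrative:

from argparse import Namespace
import torch
from fairseq.data import Dictionary

# Build a tiny dictionary with two symbols plus the special tokens.
d = Dictionary()
for sym in ['hello', 'world']:
    d.add_symbol(sym)

encoder = SimpleLSTMEncoder(args=Namespace(left_pad_source=False), dictionary=d)
decoder = SimpleLSTMDecoder(dictionary=d)

# One source sentence of length 3: "hello world </s>".
src_tokens = torch.LongTensor([[d.index('hello'), d.index('world'), d.eos()]])
encoder_out = encoder(src_tokens, src_lengths=torch.LongTensor([3]))

# Teacher forcing: the target shifted right, beginning with </s>.
prev_output_tokens = torch.LongTensor([[d.eos(), d.index('hello'), d.index('world')]])
logits, _ = decoder(prev_output_tokens, encoder_out)
print(logits.shape)  # torch.Size([1, 3, len(d)])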

2. Registering the Model

Now that we’ve defined our Encoder and Decoder we must register our model with fairseq using the register_model() function decorator. Once the model is registered we’ll be able to use it with the existing Command-line Tools.

All registered models must implement the BaseFairseqModel interface. For sequence-to-sequence models (i.e., any model with a single Encoder and Decoder), we can instead implement the FairseqEncoderDecoderModel interface.

Create a small wrapper class in the same file and register it in fairseq with the name 'simple_lstm':

from fairseq.models import FairseqEncoderDecoderModel, register_model

# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.

@register_model('simple_lstm')
class SimpleLSTMModel(FairseqEncoderDecoderModel):

    @staticmethod
    def add_args(parser):
        # Models can override this method to add new command-line arguments.
        # Here we'll add some new command-line arguments to configure dropout
        # and the dimensionality of the embeddings and hidden states.
        parser.add_argument(
            '--encoder-embed-dim', type=int, metavar='N',
            help='dimensionality of the encoder embeddings',
        )
        parser.add_argument(
            '--encoder-hidden-dim', type=int, metavar='N',
            help='dimensionality of the encoder hidden state',
        )
        parser.add_argument(
            '--encoder-dropout', type=float, default=0.1,
            help='encoder dropout probability',
        )
        parser.add_argument(
            '--decoder-embed-dim', type=int, metavar='N',
            help='dimensionality of the decoder embeddings',
        )
        parser.add_argument(
            '--decoder-hidden-dim', type=int, metavar='N',
            help='dimensionality of the decoder hidden state',
        )
        parser.add_argument(
            '--decoder-dropout', type=float, default=0.1,
            help='decoder dropout probability',
        )

    @classmethod
    def build_model(cls, args, task):
        # Fairseq initializes models by calling the ``build_model()``
        # function. This provides more flexibility, since the returned model
        # instance can be of a different type than the one that was called.
        # In this case we'll just return a SimpleLSTMModel instance.

        # Initialize our Encoder and Decoder.
        encoder = SimpleLSTMEncoder(
            args=args,
            dictionary=task.source_dictionary,
            embed_dim=args.encoder_embed_dim,
            hidden_dim=args.encoder_hidden_dim,
            dropout=args.encoder_dropout,
        )
        decoder = SimpleLSTMDecoder(
            dictionary=task.target_dictionary,
            encoder_hidden_dim=args.encoder_hidden_dim,
            embed_dim=args.decoder_embed_dim,
            hidden_dim=args.decoder_hidden_dim,
            dropout=args.decoder_dropout,
        )
        model = SimpleLSTMModel(encoder, decoder)

        # Print the model architecture.
        print(model)

        return model

    # We could override the ``forward()`` if we wanted more control over how
    # the encoder and decoder interact, but it's not necessary for this
    # tutorial since we can inherit the default implementation provided by
    # the FairseqEncoderDecoderModel base class, which looks like:
    #
    # def forward(self, src_tokens, src_lengths, prev_output_tokens):
    #     encoder_out = self.encoder(src_tokens, src_lengths)
    #     decoder_out = self.decoder(prev_output_tokens, encoder_out)
    #     return decoder_out

Finally let’s define a named architecture with the configuration for our model. This is done with the register_model_architecture() function decorator. Thereafter this named architecture can be used with the --arch command-line argument, e.g., --arch tutorial_simple_lstm:

from fairseq.models import register_model_architecture

# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'simple_lstm'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.

@register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
def tutorial_simple_lstm(args):
    # We use ``getattr()`` to prioritize arguments that are explicitly given
    # on the command-line, so that the defaults defined below are only used
    # when no other value has been specified.
    args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
    args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
    args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
    args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)

3. Training the Model

Now we’re ready to train the model. We can use the existing fairseq-train command-line tool for this, making sure to specify our new Model architecture (--arch tutorial_simple_lstm).

Note

Make sure you’ve already preprocessed the data from the IWSLT example in the examples/translation/ directory.

> fairseq-train data-bin/iwslt14.tokenized.de-en \
  --arch tutorial_simple_lstm \
  --encoder-dropout 0.2 --decoder-dropout 0.2 \
  --optimizer adam --lr 0.005 --lr-shrink 0.5 \
  --max-tokens 12000
(...)
| epoch 052 | loss 4.027 | ppl 16.30 | wps 420805 | ups 39.7 | wpb 9841 | bsz 400 | num_updates 20852 | lr 1.95313e-05 | gnorm 0.218 | clip 0% | oom 0 | wall 529 | train_wall 396
| epoch 052 | valid on 'valid' subset | valid_loss 4.74989 | valid_ppl 26.91 | num_updates 20852 | best 4.74954

The model files should appear in the checkpoints/ directory. While this model architecture is not very good, we can use the fairseq-generate script to generate translations and compute our BLEU score over the test set:

> fairseq-generate data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/checkpoint_best.pt \
  --beam 5 \
  --remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)

4. Making generation faster

While autoregressive generation from sequence-to-sequence models is inherently slow, our implementation above is especially slow because it recomputes the entire sequence of Decoder hidden states for every output token (i.e., it is O(n^2)). We can make this significantly faster by instead caching the previous hidden states.

In fairseq this is called Incremental decoding. Incremental decoding is a special mode at inference time where the Model only receives a single timestep of input corresponding to the immediately previous output token (for teacher forcing) and must produce the next output incrementally. Thus the model must cache any long-term state that is needed about the sequence, e.g., hidden states, convolutional states, etc.

To implement incremental decoding we will modify our model to implement the FairseqIncrementalDecoder interface. Compared to the standard FairseqDecoder interface, the incremental decoder interface allows forward() methods to take an extra keyword argument (incremental_state) that can be used to cache state across time-steps.

Let’s replace our SimpleLSTMDecoder with an incremental one:

import torch
import torch.nn as nn
from fairseq import utils
from fairseq.models import FairseqIncrementalDecoder

class SimpleLSTMDecoder(FairseqIncrementalDecoder):

    def __init__(
        self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
        dropout=0.1,
    ):
        # This remains the same as before.
        super().__init__(dictionary)
        self.embed_tokens = nn.Embedding(
            num_embeddings=len(dictionary),
            embedding_dim=embed_dim,
            padding_idx=dictionary.pad(),
        )
        self.dropout = nn.Dropout(p=dropout)
        self.lstm = nn.LSTM(
            input_size=encoder_hidden_dim + embed_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            bidirectional=False,
        )
        self.output_projection = nn.Linear(hidden_dim, len(dictionary))

    # We now take an additional kwarg (*incremental_state*) for caching the
    # previous hidden and cell states.
    def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
        if incremental_state is not None:
            # If the *incremental_state* argument is not ``None`` then we are
            # in incremental inference mode. While *prev_output_tokens* will
            # still contain the entire decoded prefix, we will only use the
            # last step and assume that the rest of the state is cached.
            prev_output_tokens = prev_output_tokens[:, -1:]

        # This remains the same as before.
        bsz, tgt_len = prev_output_tokens.size()
        final_encoder_hidden = encoder_out['final_hidden']
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout(x)
        x = torch.cat(
            [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
            dim=2,
        )

        # We will now check the cache and load the cached previous hidden and
        # cell states, if they exist, otherwise we will initialize them to
        # zeros (as before). We will use the ``utils.get_incremental_state()``
        # and ``utils.set_incremental_state()`` helpers.
        initial_state = utils.get_incremental_state(
            self, incremental_state, 'prev_state',
        )
        if initial_state is None:
            # first time initialization, same as the original version
            initial_state = (
                final_encoder_hidden.unsqueeze(0),  # hidden
                torch.zeros_like(final_encoder_hidden).unsqueeze(0),  # cell
            )

        # Run one step of our LSTM.
        output, latest_state = self.lstm(x.transpose(0, 1), initial_state)

        # Update the cache with the latest hidden and cell states.
        utils.set_incremental_state(
            self, incremental_state, 'prev_state', latest_state,
        )

        # This remains the same as before
        x = output.transpose(0, 1)
        x = self.output_projection(x)
        return x, None

    # The ``FairseqIncrementalDecoder`` interface also requires implementing a
    # ``reorder_incremental_state()`` method, which is used during beam search
    # to select and reorder the incremental state.
    def reorder_incremental_state(self, incremental_state, new_order):
        # Load the cached state.
        prev_state = utils.get_incremental_state(
            self, incremental_state, 'prev_state',
        )

        # Reorder batches according to *new_order*.
        reordered_state = (
            prev_state[0].index_select(1, new_order),  # hidden
            prev_state[1].index_select(1, new_order),  # cell
        )

        # Update the cached state.
        utils.set_incremental_state(
            self, incremental_state, 'prev_state', reordered_state,
        )

Finally, we can rerun generation and observe the speedup:

# Before

> fairseq-generate data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/checkpoint_best.pt \
  --beam 5 \
  --remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)

# After

> fairseq-generate data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/checkpoint_best.pt \
  --beam 5 \
  --remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
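
To see the cache in action outside of fairseq-generate, the incremental decoder can also be driven one token at a time by hand. This is a rough sketch only; in practice fairseq's SequenceGenerator manages incremental_state, beam search and stopping criteria. The toy dictionary and argument namespace are illustrative, and an untrained model will of course produce gibberish:

from argparse import Namespace
import torch
from fairseq.data import Dictionary

d = Dictionary()
for sym in ['hello', 'world']:
    d.add_symbol(sym)

encoder = SimpleLSTMEncoder(args=Namespace(left_pad_source=False), dictionary=d)
decoder = SimpleLSTMDecoder(dictionary=d)   # the incremental version above

src_tokens = torch.LongTensor([[d.index('hello'), d.index('world'), d.eos()]])
encoder_out = encoder(src_tokens, src_lengths=torch.LongTensor([3]))

incremental_state = {}                      # cache shared across time-steps
tokens = torch.LongTensor([[d.eos()]])      # decoding starts from </s>
for _ in range(5):
    logits, _ = decoder(tokens, encoder_out, incremental_state=incremental_state)
    next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
    tokens = torch.cat([tokens, next_tok], dim=1)
print(d.string(tokens[0, 1:]))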

Tutorial: Classifying Names with a Character-Level RNN

In this tutorial we will extend fairseq to support classification tasks. In particular we will re-implement the PyTorch tutorial for Classifying Names with a Character-Level RNN in fairseq. It is recommended to quickly skim that tutorial before beginning this one.

This tutorial covers:

  1. Preprocessing the data to create dictionaries.
  2. Registering a new Model that encodes an input sentence with a simple RNN and predicts the output label.
  3. Registering a new Task that loads our dictionaries and dataset.
  4. Training the Model using the existing command-line tools.
  5. Writing an evaluation script that imports fairseq and allows us to interactively evaluate our model on new inputs.

1. Preprocessing the data

The original tutorial provides raw data, but we’ll work with a modified version of the data that is already tokenized into characters and split into separate train, valid and test sets.

Download and extract the data from here: tutorial_names.tar.gz

Once extracted, let’s preprocess the data using the fairseq-preprocess command-line tool to create the dictionaries. While this tool is primarily intended for sequence-to-sequence problems, we’re able to reuse it here by treating the label as a “target” sequence of length 1. We’ll also output the preprocessed files in “raw” format using the --dataset-impl option to enhance readability:

> fairseq-preprocess \
  --trainpref names/train --validpref names/valid --testpref names/test \
  --source-lang input --target-lang label \
  --destdir names-bin --dataset-impl raw

After running the above command you should see a new directory, names-bin/, containing the dictionaries for inputs and labels.

2. Registering a new Model

Next we’ll register a new model in fairseq that will encode an input sentence with a simple RNN and predict the output label. Compared to the original PyTorch tutorial, our version will also work with batches of data and GPU Tensors.

First let’s copy the simple RNN module implemented in the PyTorch tutorial. Create a new file named fairseq/models/rnn_classifier.py with the following contents:

import torch
import torch.nn as nn

class RNN(nn.Module):

    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

We must also register this model with fairseq using the register_model() function decorator. Once the model is registered we’ll be able to use it with the existing Command-line Tools.

All registered models must implement the BaseFairseqModel interface, so we’ll create a small wrapper class in the same file and register it in fairseq with the name 'rnn_classifier':

from fairseq.models import BaseFairseqModel, register_model

# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.

@register_model('rnn_classifier')
class FairseqRNNClassifier(BaseFairseqModel):

    @staticmethod
    def add_args(parser):
        # Models can override this method to add new command-line arguments.
        # Here we'll add a new command-line argument to configure the
        # dimensionality of the hidden state.
        parser.add_argument(
            '--hidden-dim', type=int, metavar='N',
            help='dimensionality of the hidden state',
        )

    @classmethod
    def build_model(cls, args, task):
        # Fairseq initializes models by calling the ``build_model()``
        # function. This provides more flexibility, since the returned model
        # instance can be of a different type than the one that was called.
        # In this case we'll just return a FairseqRNNClassifier instance.

        # Initialize our RNN module
        rnn = RNN(
            # We'll define the Task in the next section, but for now just
            # notice that the task holds the dictionaries for the "source"
            # (i.e., the input sentence) and "target" (i.e., the label).
            input_size=len(task.source_dictionary),
            hidden_size=args.hidden_dim,
            output_size=len(task.target_dictionary),
        )

        # Return the wrapped version of the module
        return FairseqRNNClassifier(
            rnn=rnn,
            input_vocab=task.source_dictionary,
        )

    def __init__(self, rnn, input_vocab):
        super(FairseqRNNClassifier, self).__init__()

        self.rnn = rnn
        self.input_vocab = input_vocab

        # The RNN module in the tutorial expects one-hot inputs, so we can
        # precompute the identity matrix to help convert from indices to
        # one-hot vectors. We register it as a buffer so that it is moved to
        # the GPU when ``cuda()`` is called.
        self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))

    def forward(self, src_tokens, src_lengths):
        # The inputs to the ``forward()`` function are determined by the
        # Task, and in particular the ``'net_input'`` key in each
        # mini-batch. We'll define the Task in the next section, but for
        # now just know that *src_tokens* has shape `(batch, src_len)` and
        # *src_lengths* has shape `(batch)`.
        bsz, max_src_len = src_tokens.size()

        # Initialize the RNN hidden state. Compared to the original PyTorch
        # tutorial we'll also handle batched inputs and work on the GPU.
        hidden = self.rnn.initHidden()
        hidden = hidden.repeat(bsz, 1)  # expand for batched inputs
        hidden = hidden.to(src_tokens.device)  # move to GPU

        for i in range(max_src_len):
            # WARNING: The inputs have padding, so we should mask those
            # elements here so that padding doesn't affect the results.
            # This is left as an exercise for the reader. The padding symbol
            # is given by ``self.input_vocab.pad()`` and the unpadded length
            # of each input is given by *src_lengths*.

            # One-hot encode a batch of input characters.
            input = self.one_hot_inputs[src_tokens[:, i].long()]

            # Feed the input to our RNN.
            output, hidden = self.rnn(input, hidden)

        # Return the final output state for making a prediction
        return output

Finally let’s define a named architecture with the configuration for our model. This is done with the register_model_architecture() function decorator. Thereafter this named architecture can be used with the --arch command-line argument, e.g., --arch pytorch_tutorial_rnn:

from fairseq.models import register_model_architecture

# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'rnn_classifier'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.

@register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
def pytorch_tutorial_rnn(args):
    # We use ``getattr()`` to prioritize arguments that are explicitly given
    # on the command-line, so that the defaults defined below are only used
    # when no other value has been specified.
    args.hidden_dim = getattr(args, 'hidden_dim', 128)

3. Registering a new Task

Now we’ll register a new FairseqTask that will load our dictionaries and dataset. Tasks can also control how the data is batched into mini-batches, but in this tutorial we’ll reuse the batching provided by fairseq.data.LanguagePairDataset.

Create a new file named fairseq/tasks/simple_classification.py with the following contents:

import os
import torch

from fairseq.data import Dictionary, LanguagePairDataset
from fairseq.tasks import LegacyFairseqTask, register_task


@register_task('simple_classification')
class SimpleClassificationTask(LegacyFairseqTask):

    @staticmethod
    def add_args(parser):
        # Add some command-line arguments for specifying where the data is
        # located and the maximum supported input length.
        parser.add_argument('data', metavar='FILE',
                            help='file prefix for data')
        parser.add_argument('--max-positions', default=1024, type=int,
                            help='max input length')

    @classmethod
    def setup_task(cls, args, **kwargs):
        # Here we can perform any setup required for the task. This may include
        # loading Dictionaries, initializing shared Embedding layers, etc.
        # In this case we'll just load the Dictionaries.
        input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
        label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
        print('| [input] dictionary: {} types'.format(len(input_vocab)))
        print('| [label] dictionary: {} types'.format(len(label_vocab)))

        return SimpleClassificationTask(args, input_vocab, label_vocab)

    def __init__(self, args, input_vocab, label_vocab):
        super().__init__(args)
        self.input_vocab = input_vocab
        self.label_vocab = label_vocab

    def load_dataset(self, split, **kwargs):
        """Load a given dataset split (e.g., train, valid, test)."""

        prefix = os.path.join(self.args.data, '{}.input-label'.format(split))

        # Read input sentences.
        sentences, lengths = [], []
        with open(prefix + '.input', encoding='utf-8') as file:
            for line in file:
                sentence = line.strip()

                # Tokenize the sentence, splitting on spaces
                tokens = self.input_vocab.encode_line(
                    sentence, add_if_not_exist=False,
                )

                sentences.append(tokens)
                lengths.append(tokens.numel())

        # Read labels.
        labels = []
        with open(prefix + '.label', encoding='utf-8') as file:
            for line in file:
                label = line.strip()
                labels.append(
                    # Convert label to a numeric ID.
                    torch.LongTensor([self.label_vocab.add_symbol(label)])
                )

        assert len(sentences) == len(labels)
        print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))

        # We reuse LanguagePairDataset since classification can be modeled as a
        # sequence-to-sequence task where the target sequence has length 1.
        self.datasets[split] = LanguagePairDataset(
            src=sentences,
            src_sizes=lengths,
            src_dict=self.input_vocab,
            tgt=labels,
            tgt_sizes=torch.ones(len(labels)),  # targets have length 1
            tgt_dict=self.label_vocab,
            left_pad_source=False,
            # Since our target is a single class label, there's no need for
            # teacher forcing. If we set this to ``True`` then our Model's
            # ``forward()`` method would receive an additional argument called
            # *prev_output_tokens* that would contain a shifted version of the
            # target sequence.
            input_feeding=False,
        )

    def max_positions(self):
        """Return the max input length allowed by the task."""
        # The source should be less than *args.max_positions* and the "target"
        # has max length 1.
        return (self.args.max_positions, 1)

    @property
    def source_dictionary(self):
        """Return the source :class:`~fairseq.data.Dictionary`."""
        return self.input_vocab

    @property
    def target_dictionary(self):
        """Return the target :class:`~fairseq.data.Dictionary`."""
        return self.label_vocab

    # We could override this method if we wanted more control over how batches
    # are constructed, but it's not necessary for this tutorial since we can
    # reuse the batching provided by LanguagePairDataset.
    #
    # def get_batch_iterator(
    #     self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
    #     ignore_invalid_inputs=False, required_batch_size_multiple=1,
    #     seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
    #     data_buffer_size=0, disable_iterator_cache=False,
    # ):
    #     (...)

4. Training the Model

Now we’re ready to train the model. We can use the existing fairseq-train command-line tool for this, making sure to specify our new Task (--task simple_classification) and Model architecture (--arch pytorch_tutorial_rnn):

Note

You can also configure the dimensionality of the hidden state by passing the --hidden-dim argument to fairseq-train.

> fairseq-train names-bin \
  --task simple_classification \
  --arch pytorch_tutorial_rnn \
  --optimizer adam --lr 0.001 --lr-shrink 0.5 \
  --max-tokens 1000
(...)
| epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
| epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
| done training in 31.6 seconds

The model files should appear in the checkpoints/ directory.

5. Writing an evaluation script

Finally we can write a short script to evaluate our model on new inputs. Create a new file named eval_classifier.py with the following contents:

from fairseq import checkpoint_utils, data, options, tasks

# Parse command-line arguments for generation
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)

# Setup task
task = tasks.setup_task(args)

# Load model
print('| loading model from {}'.format(args.path))
models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
model = models[0]

while True:
    sentence = input('\nInput: ')

    # Tokenize into characters
    chars = ' '.join(list(sentence.strip()))
    tokens = task.source_dictionary.encode_line(
        chars, add_if_not_exist=False,
    )

    # Build mini-batch to feed to the model
    batch = data.language_pair_dataset.collate(
        samples=[{'id': -1, 'source': tokens}],  # bsz = 1
        pad_idx=task.source_dictionary.pad(),
        eos_idx=task.source_dictionary.eos(),
        left_pad_source=False,
        input_feeding=False,
    )

    # Feed batch to the model and get predictions
    preds = model(**batch['net_input'])

    # Print top 3 predictions and their log-probabilities
    top_scores, top_labels = preds[0].topk(k=3)
    for score, label_idx in zip(top_scores, top_labels):
        label_name = task.target_dictionary.string([label_idx])
        print('({:.2f})\t{}'.format(score, label_name))

Now we can evaluate our model interactively. Note that we have included the original data path (names-bin/) so that the dictionaries can be loaded:

> python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
| [input] dictionary: 64 types
| [label] dictionary: 24 types
| loading model from checkpoints/checkpoint_best.pt

Input: Satoshi
(-0.61) Japanese
(-1.20) Arabic
(-2.86) Italian

Input: Sinbad
(-0.30) Arabic
(-1.76) English
(-4.08) Russian

Tasks

Tasks store dictionaries and provide helpers for loading/iterating over Datasets, initializing the Model/Criterion and calculating the loss.

Tasks can be selected via the --task command-line argument. Once selected, a task may expose additional command-line arguments for further configuration.

Example usage:

# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)

# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)

# load datasets
task.load_dataset('train')
task.load_dataset('valid')

# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
    task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
    # compute the loss
    loss, sample_size, logging_output = task.get_loss(
        model, criterion, batch,
    )
    loss.backward()

Translation

class fairseq.tasks.translation.TranslationTask(cfg: fairseq.tasks.translation.TranslationConfig, src_dict, tgt_dict)[source]

Translate from one (source) language to another (target) language.

Parameters:
  • src_dict (Dictionary) – dictionary for the source language
  • tgt_dict (Dictionary) – dictionary for the target language

Note

The translation task is compatible with fairseq-train, fairseq-generate and fairseq-interactive.

Language Modeling

class fairseq.tasks.language_modeling.LanguageModelingTask(args, dictionary, output_dictionary=None, targets=None)[source]

Train a language model.

Parameters:
  • dictionary (Dictionary) – the dictionary for the input of the language model
  • output_dictionary (Dictionary) – the dictionary for the output of the language model. In most cases it will be the same as dictionary, but could possibly be a more limited version of the dictionary (if --output-dictionary-size is used).
  • targets (List[str]) – list of the target types that the language model should predict. Can be one of “self”, “future”, and “past”. Defaults to “future”.

Note

The language modeling task is compatible with fairseq-train, fairseq-generate, fairseq-interactive and fairseq-eval-lm.

The language modeling task provides the following additional command-line arguments:

usage:  [--task language_modeling]
        [--sample-break-mode {none,complete,complete_doc,eos}]
        [--tokens-per-sample TOKENS_PER_SAMPLE]
        [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target]
        [--future-target] [--past-target] [--add-bos-token]
        [--max-target-positions MAX_TARGET_POSITIONS]
        [--shorten-method {none,truncate,random_crop}]
        [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST]
        [--pad-to-fixed-length] [--pad-to-fixed-bsz]
        data

Task name

--task Enable this task with: --task=language_modeling

Additional command-line arguments

data path to data directory
--sample-break-mode

Possible choices: none, complete, complete_doc, eos

If omitted or “none”, fills each sample with tokens-per-sample tokens. If set to “complete”, splits samples only at the end of sentence, but may include multiple sentences per sample. “complete_doc” is similar but respects doc boundaries. If set to “eos”, includes only one sentence per sample.

Default: “none”

--tokens-per-sample

max number of tokens per sample for LM dataset

Default: 1024

--output-dictionary-size

limit the size of output dictionary

Default: -1

--self-target

include self target

Default: False

--future-target

include future target

Default: False

--past-target

include past target

Default: False

--add-bos-token

prepend beginning of sentence token (<s>)

Default: False

--max-target-positions max number of tokens in the target sequence
--shorten-method

Possible choices: none, truncate, random_crop

if not none, shorten sequences that exceed --tokens-per-sample

Default: “none”

--shorten-data-split-list

comma-separated list of dataset splits to apply shortening to, e.g., “train,valid” (default: all dataset splits)

Default: “”

--pad-to-fixed-length

pad to fixed length

Default: False

--pad-to-fixed-bsz

boolean to pad to fixed batch size

Default: False

Adding new tasks

fairseq.tasks.register_task(name, dataclass=None)[source]

New tasks can be added to fairseq with the register_task() function decorator.

For example:

@register_task('classification')
class ClassificationTask(FairseqTask):
    (...)

Note

All Tasks must implement the FairseqTask interface.

Parameters:name (str) – the name of the task
class fairseq.tasks.FairseqTask(cfg: fairseq.dataclass.configs.FairseqDataclass, **kwargs)[source]

Tasks store dictionaries and provide helpers for loading/iterating over Datasets, initializing the Model/Criterion and calculating the loss.

Tasks have limited statefulness. In particular, state that needs to be saved to/loaded from checkpoints needs to be stored in the self.state StatefulContainer object. For example:

self.state.add_factory("dictionary", self.load_dictionary)
print(self.state.dictionary)  # calls self.load_dictionary()

This is necessary so that when loading checkpoints, we can properly recreate the task state after initializing the task instance.

classmethod add_args(parser)[source]

Add task-specific arguments to the parser.

aggregate_logging_outputs(logging_outputs, criterion)[source]

[deprecated] Aggregate logging outputs from data parallel training.

begin_epoch(epoch, model)[source]

Hook function called before the start of each epoch.

begin_valid_epoch(epoch, model)[source]

Hook function called before the start of each validation epoch.

build_bpe(args)[source]

Build the BPE tokenizer for this task.

build_criterion(cfg: omegaconf.dictconfig.DictConfig)[source]

Build the FairseqCriterion instance for this task.

Parameters:cfg (omegaconf.DictConfig) – configuration object
Returns:a FairseqCriterion instance
build_dataset_for_inference(src_tokens: List[torch.Tensor], src_lengths: List[int], **kwargs) → torch.utils.data.dataset.Dataset[source]
classmethod build_dictionary(filenames, workers=1, threshold=-1, nwords=-1, padding_factor=8)[source]

Build the dictionary

Parameters:
  • filenames (list) – list of filenames
  • workers (int) – number of concurrent workers
  • threshold (int) – defines the minimum word count
  • nwords (int) – defines the total number of words in the final dictionary, including special symbols
  • padding_factor (int) – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).
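
For example, a task could build a joint dictionary over raw tokenized text files like this (the file names and thresholds are illustrative):

from fairseq.tasks import FairseqTask

# Sketch: build one dictionary from two (hypothetical) tokenized text files,
# dropping words seen fewer than 5 times and padding the vocabulary size to a
# multiple of 8.
dictionary = FairseqTask.build_dictionary(
    ['train.de-en.de', 'train.de-en.en'],
    workers=4,
    threshold=5,
    padding_factor=8,
)
print(len(dictionary), 'types')
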
build_generator(models, args, seq_gen_cls=None, extra_gen_cls_kwargs=None, prefix_allowed_tokens_fn=None)[source]

Build a SequenceGenerator instance for this task.

Parameters:
  • models (List[FairseqModel]) – ensemble of models
  • args (fairseq.dataclass.configs.GenerationConfig) – configuration object (dataclass) for generation
  • extra_gen_cls_kwargs (Dict[str, Any]) – extra options to pass through to SequenceGenerator
  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]]) – If provided, this function constrains the beam search to allowed tokens only at each step. The provided function should take 2 arguments: the batch ID (batch_id: int) and a unidimensional tensor of token ids (inputs_ids: torch.Tensor). It has to return a List[int] with the allowed tokens for the next generation step conditioned on the previously generated tokens (inputs_ids) and the batch ID (batch_id). This argument is useful for constrained generation conditioned on the prefix, as described in “Autoregressive Entity Retrieval” (https://arxiv.org/abs/2010.00904) and https://github.com/facebookresearch/GENRE.
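
For example, generation could be constrained so that the first generated token comes from a fixed set while later steps are unconstrained. The sketch below assumes that task and models have already been set up (e.g. via checkpoint_utils.load_model_ensemble_and_task) and that cfg.generation holds the generation config; the constraint itself is purely illustrative:

tgt_dict = task.target_dictionary
allowed_first = [tgt_dict.index('the'), tgt_dict.index('a')]  # illustrative tokens

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # ``input_ids`` is the 1-D tensor of tokens generated so far for this beam.
    if input_ids.numel() <= 1:               # only the initial symbol so far
        return allowed_first
    return list(range(len(tgt_dict)))        # no constraint afterwards

generator = task.build_generator(
    models, cfg.generation,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)
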
build_model(cfg: fairseq.dataclass.configs.FairseqDataclass, from_checkpoint=False)[source]

Build the BaseFairseqModel instance for this task.

Parameters:cfg (FairseqDataclass) – configuration object
Returns:a BaseFairseqModel instance
build_tokenizer(args)[source]

Build the pre-tokenizer for this task.

can_reuse_epoch_itr(dataset)[source]
dataset(split)[source]

Return a loaded dataset split.

Parameters:split (str) – name of the split (e.g., train, valid, test)
Returns:a FairseqDataset corresponding to split
filter_indices_by_size(indices, dataset, max_positions=None, ignore_invalid_inputs=False)[source]

Filter examples that are too large

Parameters:
  • indices (np.array) – original array of sample indices
  • dataset (FairseqDataset) – dataset to batch
  • max_positions (optional) – max sentence length supported by the model (default: None).
  • ignore_invalid_inputs (bool, optional) – don’t raise Exception for sentences that are too long (default: False).
Returns: array of filtered sample indices
Return type: np.array

get_batch_iterator(dataset, max_tokens=None, max_sentences=None, max_positions=None, ignore_invalid_inputs=False, required_batch_size_multiple=1, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1, data_buffer_size=0, disable_iterator_cache=False, skip_remainder_batch=False, grouped_shuffling=False, update_epoch_batch_itr=False)[source]

Get an iterator that yields batches of data from the given dataset.

Parameters:
  • dataset (FairseqDataset) – dataset to batch
  • max_tokens (int, optional) – max number of tokens in each batch (default: None).
  • max_sentences (int, optional) – max number of sentences in each batch (default: None).
  • max_positions (optional) – max sentence length supported by the model (default: None).
  • ignore_invalid_inputs (bool, optional) – don’t raise Exception for sentences that are too long (default: False).
  • required_batch_size_multiple (int, optional) – require batch size to be a multiple of N (default: 1).
  • seed (int, optional) – seed for random number generator for reproducibility (default: 1).
  • num_shards (int, optional) – shard the data iterator into N shards (default: 1).
  • shard_id (int, optional) – which shard of the data iterator to return (default: 0).
  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
  • epoch (int, optional) – the epoch to start the iterator from (default: 1).
  • data_buffer_size (int, optional) – number of batches to preload (default: 0).
  • disable_iterator_cache (bool, optional) – don’t cache the EpochBatchIterator (ignores FairseqTask::can_reuse_epoch_itr) (default: False).
  • skip_remainder_batch (bool, optional) – if set, discard the last batch in each training epoch, as the last batch is often smaller than local_batch_size * distributed_world_size (default: False).
  • grouped_shuffling (bool, optional) – group batches into groups of num_shards batches and shuffle the groups; reduces the difference in sequence lengths among workers when batches are sorted by length.
  • update_epoch_batch_itr (bool, optional) – if true, do not reuse the cached batch iterator for the epoch.
Returns: a batched iterator over the given dataset split
Return type: EpochBatchIterator
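
For example, batches for the training split can be obtained as follows (the batch limits are illustrative; the returned EpochBatchIterator exposes next_epoch_itr() for iterating over a single epoch):

task.load_dataset('train')
epoch_itr = task.get_batch_iterator(
    task.dataset('train'),
    max_tokens=4096,                    # cap the number of tokens per batch
    required_batch_size_multiple=8,
    seed=1,
    num_workers=2,
)
for batch in epoch_itr.next_epoch_itr(shuffle=True):
    ...                                 # e.g. pass the batch to task.train_step()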

get_interactive_tokens_and_lengths(lines, encode_fn)[source]
has_sharded_data(split)[source]
inference_step(generator, models, sample, prefix_tokens=None, constraints=None)[source]
load_dataset(split: str, combine: bool = False, task_cfg: fairseq.dataclass.configs.FairseqDataclass = None, **kwargs)[source]

Load a given dataset split.

Parameters:
  • split (str) – name of the split (e.g., train, valid, test)
  • combine (bool) – combines a split segmented into pieces into one dataset
  • task_cfg (FairseqDataclass) – optional task configuration stored in the checkpoint that can be used to load datasets
classmethod load_dictionary(filename)[source]

Load the dictionary from the filename

Parameters:filename (str) – the filename
load_state_dict(state_dict: Dict[str, Any])[source]
static logging_outputs_can_be_summed(criterion) → bool[source]

Whether the logging outputs returned by train_step and valid_step can be summed across workers prior to calling aggregate_logging_outputs. Setting this to True will improve distributed training speed.

max_positions()[source]

Return the max input length allowed by the task.

optimizer_step(optimizer, model, update_num)[source]
reduce_metrics(logging_outputs, criterion)[source]

Aggregate logging outputs from data parallel training.

classmethod setup_task(cfg: omegaconf.dictconfig.DictConfig, **kwargs)[source]

Setup the task (e.g., load dictionaries).

Parameters:cfg (omegaconf.DictConfig) – parsed command-line arguments
source_dictionary

Return the source Dictionary (if applicable for this task).

state_dict()[source]
target_dictionary

Return the target Dictionary (if applicable for this task).

train_step(sample, model, criterion, optimizer, update_num, ignore_grad=False)[source]

Do forward and backward, and return the loss as computed by criterion for the given model and sample.

Parameters:
  • sample (dict) – the mini-batch
  • model (BaseFairseqModel) – the model
  • criterion (FairseqCriterion) – the criterion
  • optimizer (FairseqOptimizer) – the optimizer
  • update_num (int) – the current update number
  • ignore_grad (bool) – if set, multiply the loss by 0 (default: False)
Returns:

  • the loss
  • the sample size, which is used as the denominator for the gradient
  • logging outputs to display while training

Return type:

tuple

valid_step(sample, model, criterion)[source]

Models

A Model defines the neural network’s forward() method and encapsulates all of the learnable parameters in the network. Each model also provides a set of named architectures that define the precise network configuration (e.g., embedding dimension, number of layers, etc.).

Both the model type and architecture are selected via the --arch command-line argument. Once selected, a model may expose additional command-line arguments for further configuration.

Note

All fairseq Models extend BaseFairseqModel, which in turn extends torch.nn.Module. Thus any fairseq Model can be used as a stand-alone Module in other PyTorch code.
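
As a hedged illustration of the note above, a trained fairseq model can be loaded and treated like any other torch.nn.Module; the checkpoint path below is an assumption, not a file shipped with fairseq:

from fairseq import checkpoint_utils

# load_model_ensemble_and_task returns ([models], cfg, task)
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["checkpoints/checkpoint_best.pt"]  # hypothetical checkpoint path
)
model = models[0]
model.eval()                                       # plain nn.Module methods apply
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters")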

Convolutional Neural Networks (CNN)

class fairseq.models.fconv.FConvModel(encoder, decoder)[source]

A fully convolutional model, i.e. a convolutional encoder and a convolutional decoder, as described in “Convolutional Sequence to Sequence Learning” (Gehring et al., 2017).

Parameters:
  • encoder (FConvEncoder) – the encoder
  • decoder (FConvDecoder) – the decoder

The Convolutional model provides the following named architectures and command-line arguments:

usage: 
        [--arch {fconv,fconv_iwslt_de_en,fconv_wmt_en_ro,fconv_wmt_en_de,fconv_wmt_en_fr}]
        [--dropout D] [--encoder-embed-dim N] [--encoder-embed-path STR]
        [--encoder-layers EXPR] [--decoder-embed-dim N]
        [--decoder-embed-path STR] [--decoder-layers EXPR]
        [--decoder-out-embed-dim N] [--decoder-attention EXPR]
        [--share-input-output-embed]

Named architectures

--arch Possible choices: fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr

Additional command-line arguments

--dropout dropout probability
--encoder-embed-dim encoder embedding dimension
--encoder-embed-path path to pre-trained encoder embedding
--encoder-layers encoder layers [(dim, kernel_size), …]
--decoder-embed-dim decoder embedding dimension
--decoder-embed-path path to pre-trained decoder embedding
--decoder-layers decoder layers [(dim, kernel_size), …]
--decoder-out-embed-dim decoder output embedding dimension
--decoder-attention decoder attention [True, …]
--share-input-output-embed

share input and output embeddings (requires --decoder-out-embed-dim and --decoder-embed-dim to be equal)

Default: False

static add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

class fairseq.models.fconv.FConvEncoder(dictionary, embed_dim=512, embed_dict=None, max_positions=1024, convolutions=((512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3)), dropout=0.1)[source]

Convolutional encoder consisting of len(convolutions) layers.

Parameters:
  • dictionary (Dictionary) – encoding dictionary
  • embed_dim (int, optional) – embedding dimension
  • embed_dict (str, optional) – filename from which to load pre-trained embeddings
  • max_positions (int, optional) – maximum supported input sequence length
  • convolutions (list, optional) – the convolutional layer structure. Each list item i corresponds to convolutional layer i. Layers are given as (out_channels, kernel_width, [residual]). Residual connections are added between layers when residual=1 (which is the default behavior).
  • dropout (float, optional) – dropout to be applied before each conv layer
forward(src_tokens, src_lengths)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
Returns:

  • encoder_out (tuple): a tuple with two elements, where the first element is the last encoder layer’s output and the second element is the same quantity summed with the input embedding (used for attention). The shape of both tensors is (batch, src_len, embed_dim).
  • encoder_padding_mask (ByteTensor): the positions of padding elements of shape (batch, src_len)

Return type:

dict

max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order

class fairseq.models.fconv.FConvDecoder(dictionary, embed_dim=512, embed_dict=None, out_embed_dim=256, max_positions=1024, convolutions=((512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3)), attention=True, dropout=0.1, share_embed=False, positional_embeddings=True, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0.0)[source]

Convolutional decoder

forward(prev_output_tokens, encoder_out=None, incremental_state=None, **unused)[source]
Parameters:
  • prev_output_tokens (LongTensor) – shifted output tokens of shape (batch, tgt_len), for teacher forcing
  • encoder_out (dict, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict, optional) – dictionary used for storing state during Incremental decoding
Returns:

  • the decoder’s output of shape (batch, tgt_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

max_positions()[source]

Maximum output length supported by the decoder.

reorder_incremental_state(incremental_state, new_order)[source]

Reorder incremental state.

This will be called when the order of the input has changed from the previous time step. A typical use case is beam search, where the input order changes between time steps based on the selection of beams.

Long Short-Term Memory (LSTM) networks

class fairseq.models.lstm.LSTMModel(encoder, decoder)[source]
static add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

forward(src_tokens, src_lengths, prev_output_tokens, incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]] = None)[source]

Run the forward pass for an encoder-decoder model.

First feed a batch of source tokens through the encoder. Then, feed the encoder output and previous decoder outputs (i.e., teacher forcing) to the decoder to produce the next outputs:

encoder_out = self.encoder(src_tokens, src_lengths)
return self.decoder(prev_output_tokens, encoder_out)
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – source sentence lengths of shape (batch)
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for teacher forcing
Returns:

  • the decoder’s output of shape (batch, tgt_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

class fairseq.models.lstm.LSTMEncoder(dictionary, embed_dim=512, hidden_size=512, num_layers=1, dropout_in=0.1, dropout_out=0.1, bidirectional=False, left_pad=True, pretrained_embed=None, padding_idx=None, max_source_positions=100000.0)[source]

LSTM encoder.

forward(src_tokens: torch.Tensor, src_lengths: torch.Tensor, enforce_sorted: bool = True)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
  • enforce_sorted (bool, optional) – if True, src_tokens is expected to contain sequences sorted by length in decreasing order. If False, this condition is not required. Default: True.
max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor], new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order

class fairseq.models.lstm.LSTMDecoder(dictionary, embed_dim=512, hidden_size=512, out_embed_dim=512, num_layers=1, dropout_in=0.1, dropout_out=0.1, attention=True, encoder_output_units=512, pretrained_embed=None, share_input_output_embed=False, adaptive_softmax_cutoff=None, max_target_positions=100000.0, residuals=False)[source]

LSTM decoder.

extract_features(prev_output_tokens, encoder_out: Optional[Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]] = None, incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]] = None)[source]

Similar to forward but only return features.

forward(prev_output_tokens, encoder_out: Optional[Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]] = None, incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]] = None, src_lengths: Optional[torch.Tensor] = None)[source]
Parameters:
  • prev_output_tokens (LongTensor) – shifted output tokens of shape (batch, tgt_len), for teacher forcing
  • encoder_out (dict, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict, optional) – dictionary used for storing state during Incremental decoding
Returns:

  • the decoder’s output of shape (batch, tgt_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

max_positions()[source]

Maximum output length supported by the decoder.

output_layer(x)[source]

Project features to the vocabulary size.

reorder_incremental_state(incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]], new_order: torch.Tensor)[source]

Reorder incremental state.

This will be called when the order of the input has changed from the previous time step. A typical use case is beam search, where the input order changes between time steps based on the selection of beams.

Transformer (self-attention) networks

class fairseq.models.transformer.TransformerModel(args, encoder, decoder)[source]

This is the legacy implementation of the transformer model that uses argparse for configuration.

classmethod add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

class fairseq.models.transformer.TransformerEncoder(args, dictionary, embed_tokens, return_fc=False)[source]
class fairseq.models.transformer.TransformerDecoder(args, dictionary, embed_tokens, no_encoder_attn=False, output_projection=None)[source]

Adding new models

fairseq.models.register_model(name, dataclass=None)[source]

New model types can be added to fairseq with the register_model() function decorator.

For example:

@register_model('lstm')
class LSTM(FairseqEncoderDecoderModel):
    (...)

Note

All models must implement the BaseFairseqModel interface. Typically you will extend FairseqEncoderDecoderModel for sequence-to-sequence tasks or FairseqLanguageModel for language modeling tasks.

Parameters:name (str) – the name of the model
fairseq.models.register_model_architecture(model_name, arch_name)[source]

New model architectures can be added to fairseq with the register_model_architecture() function decorator. After registration, model architectures can be selected with the --arch command-line argument.

For example:

@register_model_architecture('lstm', 'lstm_luong_wmt_en_de')
def lstm_luong_wmt_en_de(cfg):
    cfg.encoder_embed_dim = getattr(cfg, 'encoder_embed_dim', 1000)
    (...)

The decorated function should take a single argument cfg, which is an omegaconf.DictConfig. It should modify this configuration in place to match the desired architecture.

Parameters:
  • model_name (str) – the name of the Model (Model must already be registered)
  • arch_name (str) – the name of the model architecture (--arch)
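
Putting both decorators together, a hedged sketch of registering a toy model and a named architecture (the names 'toy_lstm' and 'toy_lstm_tiny', and the embedding sizes, are illustrative only):

from fairseq.models import (
    FairseqEncoderDecoderModel,
    register_model,
    register_model_architecture,
)
from fairseq.models.lstm import LSTMEncoder, LSTMDecoder

@register_model('toy_lstm')
class ToyLSTMModel(FairseqEncoderDecoderModel):
    @staticmethod
    def add_args(parser):
        parser.add_argument('--encoder-embed-dim', type=int, metavar='N')

    @classmethod
    def build_model(cls, args, task):
        embed_dim = getattr(args, 'encoder_embed_dim', 256)
        encoder = LSTMEncoder(task.source_dictionary, embed_dim=embed_dim)
        decoder = LSTMDecoder(task.target_dictionary, embed_dim=embed_dim)
        return cls(encoder, decoder)

@register_model_architecture('toy_lstm', 'toy_lstm_tiny')
def toy_lstm_tiny(cfg):
    cfg.encoder_embed_dim = getattr(cfg, 'encoder_embed_dim', 128)
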
class fairseq.models.BaseFairseqModel[source]

Base class for fairseq models.

classmethod add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

extract_features(*args, **kwargs)[source]

Similar to forward but only return features.

classmethod from_pretrained(model_name_or_path, checkpoint_file='model.pt', data_name_or_path='.', **kwargs)[source]

Load a FairseqModel from a pre-trained model file. Downloads and caches the pre-trained model file if needed.

The base implementation returns a GeneratorHubInterface, which can be used to generate translations or sample from language models. The underlying FairseqModel can be accessed via the generator.models attribute.

Other models may override this to implement custom hub interfaces.

Parameters:
  • model_name_or_path (str) – either the name of a pre-trained model to load or a path/URL to a pre-trained model state dict
  • checkpoint_file (str, optional) – colon-separated list of checkpoint files in the model archive to ensemble (default: ‘model.pt’)
  • data_name_or_path (str, optional) – point args.data to the archive at the given path/URL. Can start with ‘.’ or ‘./’ to reuse the model archive path.
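
For example, a hedged sketch of loading a local checkpoint through from_pretrained (the directory layout below is an assumption; any model archive containing model.pt and its binarized data would do):

from fairseq.models.transformer import TransformerModel

# returns a GeneratorHubInterface wrapping the underlying model(s)
hub = TransformerModel.from_pretrained(
    'checkpoints/',                                  # hypothetical archive directory
    checkpoint_file='model.pt',
    data_name_or_path='data-bin/iwslt14.tokenized.de-en',
)
print(hub.translate('maschinelles Lernen'))          # generate a translation
print(hub.models[0])                                 # the underlying FairseqModel
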
get_normalized_probs(net_output: Tuple[torch.Tensor, Optional[Dict[str, List[Optional[torch.Tensor]]]]], log_probs: bool, sample: Optional[Dict[str, torch.Tensor]] = None)[source]

Get normalized probabilities (or log probs) from a net’s output.

get_normalized_probs_scriptable(net_output: Tuple[torch.Tensor, Optional[Dict[str, List[Optional[torch.Tensor]]]]], log_probs: bool, sample: Optional[Dict[str, torch.Tensor]] = None)[source]

Scriptable helper function for get_normalized_probs in BaseFairseqModel.

get_targets(sample, net_output)[source]

Get targets from either the sample or the net’s output.

classmethod hub_models()[source]
load_state_dict(state_dict, strict=True, model_cfg: Optional[omegaconf.dictconfig.DictConfig] = None, args: Optional[argparse.Namespace] = None)[source]

Copies parameters and buffers from state_dict into this module and its descendants.

Overrides the method in nn.Module. Compared with that method this additionally “upgrades” state_dicts from old checkpoints.

make_generation_fast_(**kwargs)[source]

Legacy entry point to optimize model for faster generation. Prefer prepare_for_inference_.

max_positions()[source]

Maximum length supported by the model.

prepare_for_inference_(cfg: omegaconf.dictconfig.DictConfig)[source]

Prepare model for inference.

prepare_for_onnx_export_(**kwargs)[source]

Make model exportable via ONNX trace.

set_num_updates(num_updates)[source]

State from trainer to pass along to model at every update.

upgrade_state_dict(state_dict)[source]

Upgrade old state dicts to work with newer code.

upgrade_state_dict_named(state_dict, name)[source]

Upgrade old state dicts to work with newer code.

Parameters:
  • state_dict (dict) – state dictionary to upgrade, in place
  • name (str) – the state dict key corresponding to the current module
class fairseq.models.FairseqEncoderDecoderModel(encoder, decoder)[source]

Base class for encoder-decoder models.

Parameters:
  • encoder (FairseqEncoder) – the encoder
  • decoder (FairseqDecoder) – the decoder
extract_features(src_tokens, src_lengths, prev_output_tokens, **kwargs)[source]

Similar to forward but only return features.

Returns:
  • the decoder’s features of shape (batch, tgt_len, embed_dim)
  • a dictionary with any model-specific outputs
Return type:tuple
forward(src_tokens, src_lengths, prev_output_tokens, **kwargs)[source]

Run the forward pass for an encoder-decoder model.

First feed a batch of source tokens through the encoder. Then, feed the encoder output and previous decoder outputs (i.e., teacher forcing) to the decoder to produce the next outputs:

encoder_out = self.encoder(src_tokens, src_lengths)
return self.decoder(prev_output_tokens, encoder_out)
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – source sentence lengths of shape (batch)
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for teacher forcing
Returns:

  • the decoder’s output of shape (batch, tgt_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

forward_decoder(prev_output_tokens, **kwargs)[source]
max_decoder_positions()[source]

Maximum length supported by the decoder.

max_positions()[source]

Maximum length supported by the model.

output_layer(features, **kwargs)[source]

Project features to the default output size (typically vocabulary size).

class fairseq.models.FairseqEncoderModel(encoder)[source]

Base class for encoder-only models.

Parameters:encoder (FairseqEncoder) – the encoder
forward(src_tokens, src_lengths, **kwargs)[source]

Run the forward pass for an encoder-only model.

Feeds a batch of tokens through the encoder to generate features.

Parameters:
  • src_tokens (LongTensor) – input tokens of shape (batch, src_len)
  • src_lengths (LongTensor) – source sentence lengths of shape (batch)
Returns:

the encoder’s output, typically of shape (batch, src_len, features)

get_normalized_probs(net_output, log_probs, sample=None)[source]

Get normalized probabilities (or log probs) from a net’s output.

max_positions()[source]

Maximum length supported by the model.

class fairseq.models.FairseqLanguageModel(decoder)[source]

Base class for decoder-only models.

Parameters:decoder (FairseqDecoder) – the decoder
extract_features(src_tokens, **kwargs)[source]

Similar to forward but only return features.

Returns:
  • the decoder’s features of shape (batch, seq_len, embed_dim)
  • a dictionary with any model-specific outputs
Return type:tuple
forward(src_tokens, **kwargs)[source]

Run the forward pass for a decoder-only model.

Feeds a batch of tokens through the decoder to predict the next tokens.

Parameters:
  • src_tokens (LongTensor) – tokens on which to condition the decoder, of shape (batch, tgt_len)
  • src_lengths (LongTensor) – source sentence lengths of shape (batch)
Returns:

  • the decoder’s output of shape (batch, seq_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

forward_decoder(prev_output_tokens, **kwargs)[source]
max_decoder_positions()[source]

Maximum length supported by the decoder.

max_positions()[source]

Maximum length supported by the model.

output_layer(features, **kwargs)[source]

Project features to the default output size (typically vocabulary size).

supported_targets
class fairseq.models.FairseqMultiModel(encoders, decoders)[source]

Base class for combining multiple encoder-decoder models.

static build_shared_embeddings(dicts: Dict[str, fairseq.data.dictionary.Dictionary], langs: List[str], embed_dim: int, build_embedding: callable, pretrained_embed_path: Optional[str] = None)[source]

Helper function to build shared embeddings for a set of languages after checking that all dicts corresponding to those languages are equivalent.

Parameters:
  • dicts – Dict of lang_id to its corresponding Dictionary
  • langs – languages that we want to share embeddings for
  • embed_dim – embedding dimension
  • build_embedding – callable function to actually build the embedding
  • pretrained_embed_path – Optional path to load pretrained embeddings
decoder
encoder
forward(src_tokens, src_lengths, prev_output_tokens, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_decoder(prev_output_tokens, **kwargs)[source]
load_state_dict(state_dict, strict=True, model_cfg=None, args: Optional[argparse.Namespace] = None)[source]

Copies parameters and buffers from state_dict into this module and its descendants.

Overrides the method in nn.Module. Compared with that method this additionally “upgrades” state_dicts from old checkpoints.

max_decoder_positions()[source]

Maximum length supported by the decoder.

max_positions()[source]

Maximum length supported by the model.

class fairseq.models.FairseqEncoder(dictionary)[source]

Base class for encoders.

forward(src_tokens, src_lengths=None, **kwargs)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
forward_torchscript(net_input: Dict[str, torch.Tensor])[source]

A TorchScript-compatible version of forward.

Encoders which use additional arguments may want to override this method for TorchScript compatibility.

max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order

set_num_updates(num_updates)[source]

State from trainer to pass along to model at every update.

upgrade_state_dict_named(state_dict, name)[source]

Upgrade old state dicts to work with newer code.

class fairseq.models.CompositeEncoder(encoders)[source]

A wrapper around a dictionary of FairseqEncoder objects.

We run forward on each encoder and return a dictionary of outputs. The first encoder’s dictionary is used for initialization.

Parameters:encoders (dict) – a dictionary of FairseqEncoder objects.
forward(src_tokens, src_lengths)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
Returns:

the outputs from each Encoder

Return type:

dict

max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

class fairseq.models.FairseqDecoder(dictionary)[source]

Base class for decoders.

extract_features(prev_output_tokens, encoder_out=None, **kwargs)[source]
Returns:
  • the decoder’s features of shape (batch, tgt_len, embed_dim)
  • a dictionary with any model-specific outputs
Return type:tuple
forward(prev_output_tokens, encoder_out=None, **kwargs)[source]
Parameters:
  • prev_output_tokens (LongTensor) – shifted output tokens of shape (batch, tgt_len), for teacher forcing
  • encoder_out (dict, optional) – output from the encoder, used for encoder-side attention
Returns:

  • the decoder’s output of shape (batch, tgt_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

get_normalized_probs(net_output: Tuple[torch.Tensor, Optional[Dict[str, List[Optional[torch.Tensor]]]]], log_probs: bool, sample: Optional[Dict[str, torch.Tensor]] = None)[source]

Get normalized probabilities (or log probs) from a net’s output.

get_normalized_probs_scriptable(net_output: Tuple[torch.Tensor, Optional[Dict[str, List[Optional[torch.Tensor]]]]], log_probs: bool, sample: Optional[Dict[str, torch.Tensor]] = None)[source]

Get normalized probabilities (or log probs) from a net’s output.

max_positions()[source]

Maximum input length supported by the decoder.

output_layer(features, **kwargs)[source]

Project features to the default output size, e.g., vocabulary size.

Parameters:features (Tensor) – features returned by extract_features.
upgrade_state_dict_named(state_dict, name)[source]

Upgrade old state dicts to work with newer code.

Incremental decoding

class fairseq.models.FairseqIncrementalDecoder(dictionary)[source]

Base class for incremental decoders.

Incremental decoding is a special mode at inference time where the Model only receives a single timestep of input corresponding to the previous output token (for teacher forcing) and must produce the next output incrementally. Thus the model must cache any long-term state that is needed about the sequence, e.g., hidden states, convolutional states, etc.

Compared to the standard FairseqDecoder interface, the incremental decoder interface allows forward() functions to take an extra keyword argument (incremental_state) that can be used to cache state across time-steps.

The FairseqIncrementalDecoder interface also defines the reorder_incremental_state() method, which is used during beam search to select and reorder the incremental state based on the selection of beams.

To learn more about how incremental decoding works, refer to this blog.
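
A minimal sketch (illustrative, not the library implementation) of the incremental-state contract described above: consume only the newest token when incremental_state is provided, cache any running state in that dictionary, and reorder the cache when the beam order changes:

import torch.nn as nn
from fairseq.models import FairseqIncrementalDecoder

class ToyIncrementalDecoder(FairseqIncrementalDecoder):
    def __init__(self, dictionary, embed_dim=32):
        super().__init__(dictionary)
        self.embed = nn.Embedding(len(dictionary), embed_dim)
        self.out = nn.Linear(embed_dim, len(dictionary))

    def forward(self, prev_output_tokens, encoder_out=None, incremental_state=None, **kwargs):
        if incremental_state is not None:
            # incremental mode: only the most recent output token is fed in
            prev_output_tokens = prev_output_tokens[:, -1:]
        x = self.embed(prev_output_tokens)              # (batch, steps, embed_dim)
        if incremental_state is not None:
            prev = incremental_state.get('toy', {}).get('state')
            if prev is not None:
                x = x + prev                            # reuse the cached running state
            incremental_state['toy'] = {'state': x}     # cache it for the next step
        return self.out(x), None                        # (batch, steps, vocab)

    def reorder_incremental_state(self, incremental_state, new_order):
        cached = incremental_state.get('toy', {})
        if cached.get('state') is not None:
            # keep only the rows for the surviving beams, in their new order
            cached['state'] = cached['state'].index_select(0, new_order)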

extract_features(prev_output_tokens, encoder_out=None, incremental_state=None, **kwargs)[source]
Returns:
  • the decoder’s features of shape (batch, tgt_len, embed_dim)
  • a dictionary with any model-specific outputs
Return type:tuple
forward(prev_output_tokens, encoder_out=None, incremental_state=None, **kwargs)[source]
Parameters:
  • prev_output_tokens (LongTensor) – shifted output tokens of shape (batch, tgt_len), for teacher forcing
  • encoder_out (dict, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict, optional) – dictionary used for storing state during Incremental decoding
Returns:

  • the decoder’s output of shape (batch, tgt_len, vocab)
  • a dictionary with any model-specific outputs

Return type:

tuple

reorder_incremental_state(incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]], new_order: torch.Tensor)[source]

Reorder incremental state.

This will be called when the order of the input has changed from the previous time step. A typical use case is beam search, where the input order changes between time steps based on the selection of beams.

reorder_incremental_state_scripting(incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]], new_order: torch.Tensor)[source]

Main entry point for reordering the incremental state.

Due to limitations in TorchScript, we call this function in fairseq.sequence_generator.SequenceGenerator instead of calling reorder_incremental_state() directly.

set_beam_size(beam_size)[source]

Sets the beam size in the decoder and all children.

Criterions

Criterions compute the loss function given the model and batch, roughly:

loss = criterion(model, batch)
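
As a hedged sketch of that contract (the criterion name 'toy_cross_entropy' is illustrative, not a built-in), a criterion's forward() returns the (loss, sample size, logging outputs) triple described below:

import torch.nn.functional as F
from fairseq.criterions import FairseqCriterion, register_criterion

@register_criterion('toy_cross_entropy')
class ToyCrossEntropyCriterion(FairseqCriterion):
    def forward(self, model, sample, reduce=True):
        net_output = model(**sample['net_input'])
        lprobs = model.get_normalized_probs(net_output, log_probs=True)
        target = model.get_targets(sample, net_output)
        loss = F.nll_loss(
            lprobs.view(-1, lprobs.size(-1)),
            target.view(-1),
            ignore_index=self.padding_idx,
            reduction='sum' if reduce else 'none',
        )
        sample_size = sample['ntokens']
        logging_output = {'loss': loss.data, 'ntokens': sample['ntokens'], 'sample_size': sample_size}
        return loss, sample_size, logging_output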

class fairseq.criterions.FairseqCriterion(task)[source]
classmethod add_args(parser)[source]

Add criterion-specific arguments to the parser.

static aggregate_logging_outputs(logging_outputs: List[Dict[str, Any]]) → Dict[str, Any][source]

Aggregate logging outputs from data parallel training.

classmethod build_criterion(cfg: fairseq.dataclass.configs.FairseqDataclass, task)[source]

Construct a criterion from command-line args.

forward(model, sample, reduce=True)[source]

Compute the loss for the given sample.

Returns a tuple with three elements: 1) the loss 2) the sample size, which is used as the denominator for the gradient 3) logging outputs to display while training

static logging_outputs_can_be_summed() → bool[source]

Whether the logging outputs returned by forward can be summed across workers prior to calling reduce_metrics. Setting this to True will improve distributed training speed.

classmethod reduce_metrics(logging_outputs: List[Dict[str, Any]]) → None[source]

Aggregate logging outputs from data parallel training.

class fairseq.criterions.adaptive_loss.AdaptiveLoss(task, sentence_avg)[source]

This is an implementation of the loss function accompanying the adaptive softmax approximation for graphics processing units (GPUs), described in the paper “Efficient softmax approximation for GPUs” (http://arxiv.org/abs/1609.04309).

classmethod build_criterion(cfg: fairseq.criterions.adaptive_loss.AdaptiveLossConfig, task)[source]

Construct a criterion from command-line args.

forward(model, sample, reduce=True)[source]

Compute the loss for the given sample.

Returns a tuple with three elements: 1) the loss 2) the sample size, which is used as the denominator for the gradient 3) logging outputs to display while training

static logging_outputs_can_be_summed() → bool[source]

Whether the logging outputs returned by forward can be summed across workers prior to calling reduce_metrics. Setting this to True will improve distributed training speed.

static reduce_metrics(logging_outputs) → None[source]

Aggregate logging outputs from data parallel training.

class fairseq.criterions.composite_loss.CompositeLoss(args, task)[source]

This is a composite loss that, given a list of model outputs and a list of targets, computes an average of losses for each output-target pair

static add_args(parser)[source]

Add criterion-specific arguments to the parser.

classmethod build_criterion(args, task)[source]

Construct a criterion from command-line args.

static build_underlying_criterion(args, task)[source]
class fairseq.criterions.cross_entropy.CrossEntropyCriterion(task, sentence_avg)[source]
compute_loss(model, net_output, sample, reduce=True)[source]
forward(model, sample, reduce=True)[source]

Compute the loss for the given sample.

Returns a tuple with three elements: 1) the loss 2) the sample size, which is used as the denominator for the gradient 3) logging outputs to display while training

static logging_outputs_can_be_summed() → bool[source]

Whether the logging outputs returned by forward can be summed across workers prior to calling reduce_metrics. Setting this to True will improve distributed training speed.

static reduce_metrics(logging_outputs) → None[source]

Aggregate logging outputs from data parallel training.

class fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion(task, sentence_avg, label_smoothing, ignore_prefix_size=0, report_accuracy=False)[source]
compute_accuracy(model, net_output, sample)[source]
compute_loss(model, net_output, sample, reduce=True)[source]
forward(model, sample, reduce=True)[source]

Compute the loss for the given sample.

Returns a tuple with three elements: 1) the loss 2) the sample size, which is used as the denominator for the gradient 3) logging outputs to display while training

get_lprobs_and_target(model, net_output, sample)[source]
static logging_outputs_can_be_summed() → bool[source]

Whether the logging outputs returned by forward can be summed across workers prior to calling reduce_metrics. Setting this to True will improve distributed training speed.

classmethod reduce_metrics(logging_outputs) → None[source]

Aggregate logging outputs from data parallel training.

Optimizers

Optimizers update the Model parameters based on the gradients.
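
For orientation, a hedged sketch of how a wrapped fairseq optimizer is typically driven inside a training step (the objects passed in are assumed to be built elsewhere, e.g. by a task and trainer):

def sketch_train_step(model, criterion, optimizer, sample, max_norm=0.1):
    """One illustrative update using the FairseqOptimizer interface."""
    loss, sample_size, logging_output = criterion(model, sample)
    optimizer.backward(loss)             # FP16/AMP wrappers also scale the loss here
    optimizer.clip_grad_norm(max_norm)   # optional gradient clipping
    optimizer.step()                     # apply the parameter update
    optimizer.zero_grad()                # clear gradients for the next step
    return logging_output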

class fairseq.optim.AMPOptimizer(cfg: omegaconf.dictconfig.DictConfig, params, fp32_optimizer, **kwargs)[source]

Wrap an optimizer to support AMP (automatic mixed precision) training.

all_reduce_grads(module)[source]

Manually all-reduce gradients (if required).

backward(loss)[source]

Computes the sum of gradients of the given tensor w.r.t. graph leaves.

Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]
Parameters:
  • cfg (omegaconf.DictConfig) – fairseq args
  • params (iterable) – iterable of parameters to optimize
clip_grad_norm(max_norm, aggregate_norm_fn=None)[source]

Clips gradient norm.

get_lr()[source]

Return the current learning rate.

optimizer

Return a torch.optim.optimizer.Optimizer instance.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]

Set the learning rate.

step()[source]

Performs a single optimization step.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.FP16Optimizer(cfg: omegaconf.dictconfig.DictConfig, params, fp32_optimizer, fp32_params, **kwargs)[source]

Wrap an optimizer to support FP16 (mixed precision) training.

all_reduce_grads(module)[source]

Manually all-reduce gradients (if required).

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]
Parameters:
  • cfg (omegaconf.DictConfig) – fairseq args
  • params (iterable) – iterable of parameters to optimize
get_lr()[source]

Return the current learning rate.

optimizer

Return a torch.optim.optimizer.Optimizer instance.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]

Set the learning rate.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.MemoryEfficientFP16Optimizer(cfg: omegaconf.dictconfig.DictConfig, params, optimizer, allow_unsupported=False, **kwargs)[source]

Wrap an optimizer to support FP16 (mixed precision) training.

Compared to fairseq.optim.FP16Optimizer, this version does not maintain an FP32 copy of the model. We instead expect the optimizer to convert the gradients to FP32 internally and sync the results back to the FP16 model params. This significantly reduces memory usage but slightly increases the time spent in the optimizer.

Since this wrapper depends on specific functionality in the wrapped optimizer (i.e., on-the-fly conversion of grads to FP32), only certain optimizers can be wrapped. This is determined by the supports_memory_efficient_fp16 property.

all_reduce_grads(module)[source]

Manually all-reduce gradients (if required).

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]
Parameters:
  • cfg (omegaconf.DictConfig) – fairseq args
  • params (iterable) – iterable of parameters to optimize
get_lr()[source]

Return the current learning rate.

optimizer

Return a torch.optim.optimizer.Optimizer instance.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]

Set the learning rate.

class fairseq.optim.FairseqOptimizer(cfg)[source]
classmethod add_args(parser)[source]

Add optimizer-specific arguments to the parser.

all_reduce_grads(module)[source]

Manually all-reduce gradients (if required).

average_params()[source]
backward(loss)[source]

Computes the sum of gradients of the given tensor w.r.t. graph leaves.

broadcast_global_state_dict(state_dict)[source]

Broadcasts a global state dict to all ranks. Useful for optimizers that shard state between ranks.

clip_grad_norm(max_norm, aggregate_norm_fn=None)[source]

Clips gradient norm.

get_lr()[source]

Return the current learning rate.

load_state_dict(state_dict, optimizer_overrides=None)[source]

Load an optimizer state dict.

In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.

multiply_grads(c)[source]

Multiplies grads by a constant c.

optimizer

Return a torch.optim.optimizer.Optimizer instance.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

param_groups
params

Return an iterable of the parameters held by the optimizer.

set_lr(lr)[source]

Set the learning rate.

state_dict()[source]

Return the optimizer’s state dict.

step(closure=None, scale=1.0, groups=None)[source]

Performs a single optimization step.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

supports_groups
supports_memory_efficient_fp16
supports_step_with_scale
zero_grad()[source]

Clears the gradients of all optimized parameters.

class fairseq.optim.adadelta.Adadelta(args, params)[source]
static add_args(parser)[source]

Add optimizer-specific arguments to the parser.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.adagrad.Adagrad(args, params)[source]
static add_args(parser)[source]

Add optimizer-specific arguments to the parser.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.adafactor.FairseqAdafactor(args, params)[source]
static add_args(parser)[source]

Add optimizer-specific arguments to the parser.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate. Note: convergence issues have been observed empirically with fp16 enabled; finding an appropriate configuration may require some search.

class fairseq.optim.adam.FairseqAdam(cfg: fairseq.optim.adam.FairseqAdamConfig, params)[source]

Adam optimizer for fairseq.

Important note: this optimizer corresponds to the “AdamW” variant of Adam in its weight decay behavior. As such, it is most closely analogous to torch.optim.AdamW from PyTorch.

average_params()[source]

Parameter averaging is only used during BMUF distributed training.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

class fairseq.optim.fp16_optimizer.FP16Optimizer(cfg: omegaconf.dictconfig.DictConfig, params, fp32_optimizer, fp32_params, **kwargs)[source]

Wrap an optimizer to support FP16 (mixed precision) training.

all_reduce_grads(module)[source]

Manually all-reduce gradients (if required).

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]
Parameters:
  • cfg (omegaconf.DictConfig) – fairseq args
  • params (iterable) – iterable of parameters to optimize
get_lr()[source]

Return the current learning rate.

lr_scheduler
optimizer

Return a torch.optim.optimizer.Optimizer instance.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]

Set the learning rate.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.nag.FairseqNAG(cfg: omegaconf.dictconfig.DictConfig, params)[source]
optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

class fairseq.optim.sgd.SGD(args, params)[source]
static add_args(parser)[source]

Add optimizer-specific arguments to the parser.

optimizer_config

Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

supports_flat_params

Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

Learning Rate Schedulers

Learning Rate Schedulers update the learning rate over the course of training. Learning rates can be updated after each update via step_update() or at epoch boundaries via step().

class fairseq.optim.lr_scheduler.FairseqLRScheduler(cfg, optimizer)[source]
classmethod add_args(parser)[source]

Add arguments to the parser for this LR scheduler.

load_state_dict(state_dict)[source]

Load an LR scheduler state dict.

state_dict()[source]

Return the LR scheduler state dict.

step(epoch, val_loss=None)[source]

Update the learning rate at the end of the given epoch.

step_begin_epoch(epoch)[source]

Update the learning rate at the beginning of the given epoch.

step_update(num_updates)[source]

Update the learning rate after each update.

class fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootSchedule(cfg: fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootLRScheduleConfig, optimizer)[source]

Decay the LR based on the inverse square root of the update number.

We also support a warmup phase where we linearly increase the learning rate from some initial learning rate (--warmup-init-lr) until the configured learning rate (--lr). Thereafter we decay proportional to the number of updates, with a decay factor set to align with the configured learning rate.

During warmup:

lrs = torch.linspace(cfg.warmup_init_lr, cfg.lr, cfg.warmup_updates)
lr = lrs[update_num]

After warmup:

decay_factor = cfg.lr * sqrt(cfg.warmup_updates)
lr = decay_factor / sqrt(update_num)
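
The same schedule in plain Python, as a hedged standalone sketch (the default hyper-parameter values are illustrative, not fairseq defaults):

import math

def inverse_sqrt_lr(update_num, lr=5e-4, warmup_init_lr=1e-7, warmup_updates=4000):
    if update_num < warmup_updates:
        # linear warmup from warmup_init_lr up to lr
        return warmup_init_lr + (lr - warmup_init_lr) * update_num / warmup_updates
    decay_factor = lr * math.sqrt(warmup_updates)
    return decay_factor / math.sqrt(update_num)
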
step(epoch, val_loss=None)[source]

Update the learning rate at the end of the given epoch.

step_update(num_updates)[source]

Update the learning rate after each update.

Data Loading and Utilities

Datasets

Datasets define the data format and provide helpers for creating mini-batches.

class fairseq.data.FairseqDataset[source]

A dataset that provides helpers for batching.

batch_by_size(indices, max_tokens=None, max_sentences=None, required_batch_size_multiple=1)[source]

Given an ordered set of indices, return batches according to max_tokens, max_sentences and required_batch_size_multiple.

collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch suitable for forwarding with a Model
Return type:dict
filter_indices_by_size(indices, max_sizes)[source]

Filter a list of sample indices. Remove those that are longer than specified in max_sizes.

WARNING: do not modify this method; override it in child classes instead.

Parameters:
  • indices (np.array) – original array of sample indices
  • max_sizes (int or list[int] or tuple[int]) – max sample size, can be defined separately for src and tgt (then list or tuple)
Returns:

  • filtered sample array (np.array)
  • list of removed indices (list)

get_batch_shapes()[source]

Return a list of valid batch shapes, for example:

[(8, 512), (16, 256), (32, 128)]

The first dimension of each tuple is the batch size and can be None to automatically infer the max batch size based on --max-tokens. The second dimension of each tuple is the max supported length as given by fairseq.data.FairseqDataset.num_tokens().

This will be used by fairseq.data.FairseqDataset.batch_by_size() to restrict batch shapes. This is useful on TPUs to avoid too many dynamic shapes (and recompilations).

num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

num_tokens_vec(indices)[source]

Return the number of tokens for a set of positions defined by indices. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_fetch_outside_dataloader

Whether this dataset supports fetching outside the workers of the dataloader.

supports_prefetch

Whether this dataset supports prefetching.
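
As a hedged sketch, a toy FairseqDataset that wraps a list of 1-D LongTensors and pads them in collater(); the key names in the returned batch are illustrative (real datasets usually rely on the collation helpers in fairseq.data):

import torch
from fairseq.data import FairseqDataset

class TensorListDataset(FairseqDataset):
    def __init__(self, tensors, pad_idx=1):
        self.tensors = tensors          # list of 1-D LongTensors
        self.pad_idx = pad_idx

    def __getitem__(self, index):
        return self.tensors[index]

    def __len__(self):
        return len(self.tensors)

    def num_tokens(self, index):
        return self.tensors[index].numel()

    def size(self, index):
        return self.tensors[index].numel()

    def collater(self, samples):
        max_len = max(s.numel() for s in samples)
        batch = torch.full((len(samples), max_len), self.pad_idx, dtype=torch.long)
        for i, s in enumerate(samples):
            batch[i, : s.numel()] = s   # right-pad each sample
        return {'net_input': {'src_tokens': batch},
                'ntokens': int(sum(s.numel() for s in samples))}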

class fairseq.data.LanguagePairDataset(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False, align_dataset=None, constraints=None, append_bos=False, eos=None, num_buckets=0, src_lang_id=None, tgt_lang_id=None, pad_to_multiple=1)[source]

A pair of torch.utils.data.Datasets.

Parameters:
  • src (torch.utils.data.Dataset) – source dataset to wrap
  • src_sizes (List[int]) – source sentence lengths
  • src_dict (Dictionary) – source vocabulary
  • tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
  • tgt_sizes (List[int], optional) – target sentence lengths
  • tgt_dict (Dictionary, optional) – target vocabulary
  • left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
  • left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
  • shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
  • input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for teacher forcing (default: True).
  • remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
  • append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).
  • align_dataset (torch.utils.data.Dataset, optional) – dataset containing alignments.
  • constraints (Tensor, optional) – 2d tensor with a concatenated, zero- delimited list of constraints for each sentence.
  • append_bos (bool, optional) – if set, appends bos to the beginning of source/target sentence.
  • num_buckets (int, optional) – if set to a value greater than 0, then batches will be bucketed into the given number of batch shapes.
  • src_lang_id (int, optional) – source language ID, if set, the collated batch will contain a field ‘src_lang_id’ in ‘net_input’ which indicates the source language of the samples.
  • tgt_lang_id (int, optional) – target language ID; if set, the collated batch will contain a field ‘tgt_lang_id’ which indicates the target language of the samples.
collater(samples, pad_to_length=None)[source]

Merge a list of samples to form a mini-batch.

Parameters:
  • samples (List[dict]) – samples to collate
  • pad_to_length (dict, optional) – a dictionary of {‘source’: source_pad_to_length, ‘target’: target_pad_to_length} to indicate the max length to pad to in source and target respectively.
Returns:

a mini-batch with the following keys:

  • id (LongTensor): example IDs in the original input order
  • ntokens (int): total number of tokens in the batch
  • net_input (dict): the input to the Model, containing keys:
    • src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is True.
    • src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz)
    • prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is False. Padding will appear on the left if left_pad_target is True.
    • src_lang_id (LongTensor): a long Tensor which contains source language IDs of each sample in the batch
  • target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is True.
  • tgt_lang_id (LongTensor): a long Tensor which contains target language IDs of each sample in the batch

Return type:

dict

filter_indices_by_size(indices, max_sizes)[source]

Filter a list of sample indices. Remove those that are longer than specified in max_sizes.

Parameters:
  • indices (np.array) – original array of sample indices
  • max_sizes (int or list[int] or tuple[int]) – max sample size, can be defined separately for src and tgt (then list or tuple)
Returns:

  • filtered sample array (np.array)
  • list of removed indices (list)

get_batch_shapes()[source]

Return a list of valid batch shapes, for example:

[(8, 512), (16, 256), (32, 128)]

The first dimension of each tuple is the batch size and can be None to automatically infer the max batch size based on --max-tokens. The second dimension of each tuple is the max supported length as given by fairseq.data.FairseqDataset.num_tokens().

This will be used by fairseq.data.FairseqDataset.batch_by_size() to restrict batch shapes. This is useful on TPUs to avoid too many dynamic shapes (and recompilations).

num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

num_tokens_vec(indices)[source]

Return the number of tokens for a set of positions defined by indices. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.

class fairseq.data.MonolingualDataset(dataset, sizes, src_vocab, tgt_vocab=None, add_eos_for_other_targets=False, shuffle=False, targets=None, add_bos_token=False, fixed_pad_length=None, pad_to_bsz=None, src_lang_idx=None, tgt_lang_idx=None)[source]

A wrapper around torch.utils.data.Dataset for monolingual data.

Parameters:
  • dataset (torch.utils.data.Dataset) – dataset to wrap
  • sizes (List[int]) – sentence lengths
  • src_vocab (Dictionary) – source vocabulary
  • shuffle (bool, optional) – shuffle the elements before batching (default: False).
collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch with the following keys:
  • id (LongTensor): example IDs in the original input order
  • ntokens (int): total number of tokens in the batch
  • net_input (dict): the input to the Model, containing keys:
    • src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right.
  • target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type:dict
num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

num_tokens_vec(indices)[source]

Return the number of tokens for a set of positions defined by indices. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.

Helper Datasets

These datasets wrap other fairseq.data.FairseqDataset instances and provide additional functionality:

class fairseq.data.BacktranslationDataset(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]

Sets up a backtranslation dataset which takes a tgt batch, generates a src using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.

Parameters:
  • tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
  • src_dict (Dictionary) – the dictionary of backtranslated sentences.
  • tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
  • backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use set_backtranslation_fn function to set it when available.
  • output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
  • cuda (bool, optional) – use GPU for generation (default: True)
collater(samples)[source]

Merge and backtranslate a list of samples to form a mini-batch.

Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.

Note: we expect tgt_dataset to provide a function collater() that will collate samples into the format expected by backtranslation_fn. After backtranslation, we will feed the new list of samples (i.e., the (backtranslated source, original source) pairs) to output_collater and return the result.

Parameters:samples (List[dict]) – samples to backtranslate and collate
Returns:a mini-batch with keys coming from output_collater
Return type:dict
num_tokens(index)[source]

Just use the tgt dataset num_tokens

ordered_indices()[source]

Just use the tgt dataset ordered_indices

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.

supports_prefetch

Whether this dataset supports prefetching.

class fairseq.data.ConcatDataset(datasets, sample_ratios=1)[source]
can_reuse_epoch_itr_across_epochs

Whether we can reuse the fairseq.data.EpochBatchIterator for this dataset across epochs.

This needs to return False if the sample sizes can change across epochs, in which case we may need to regenerate batches at each epoch. If your dataset relies on set_epoch, then you should consider setting this to False.

collater(samples, **extra_args)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch suitable for forwarding with a Model
Return type:dict
num_tokens(index: int)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Returns indices sorted by length, so less padding is needed.

prefetch(indices)[source]

Prefetch the data required for this epoch.

set_epoch(epoch)[source]

Will receive the updated epoch number at the beginning of the epoch.

size(idx: int)[source]

Return an example’s size as a float or tuple.

supports_prefetch

Whether this dataset supports prefetching.

class fairseq.data.ResamplingDataset(dataset, weights=None, replace=True, size_ratio=1.0, batch_by_size=True, seed=0, epoch=1)[source]

Randomly samples from a given dataset at each epoch.

Sampling is done with or without replacement, depending on the “replace” parameter.

Optionally, the epoch size can be rescaled. This is potentially desirable to increase per-epoch coverage of the base dataset (since sampling with replacement means that many items in the dataset will be left out). In the case of sampling without replacement, size_ratio should be strictly less than 1.

Parameters:
  • dataset (Dataset) – dataset on which to sample.
  • weights (List[float]) – list of probability weights (default: None, which corresponds to uniform sampling).
  • replace (bool) – sampling mode; True for “with replacement”, or False for “without replacement” (default: True)
  • size_ratio (float) – the ratio to subsample to; must be positive (default: 1.0).
  • batch_by_size (bool) – whether or not to batch by sequence length (default: True).
  • seed (int) – RNG seed to use (default: 0).
  • epoch (int) – starting epoch number (default: 1).
can_reuse_epoch_itr_across_epochs

Whether we can reuse the fairseq.data.EpochBatchIterator for this dataset across epochs.

This needs to return False if the sample sizes can change across epochs, in which case we may need to regenerate batches at each epoch. If your dataset relies on set_epoch then you should consider setting this to False.

num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

set_epoch(epoch)[source]

Will receive the updated epoch number at the beginning of the epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
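The per-epoch resampling described above can be sketched with plain numpy; this illustrates only the sampling semantics (weights, replacement, size_ratio), not the class internals:

import numpy as np

base_size = 5
weights = [0.1, 0.1, 0.1, 0.1, 0.6]   # optional probability weights over the base dataset
size_ratio = 2.0                      # rescale the epoch to 2x the base dataset size

rng = np.random.RandomState(0)        # ResamplingDataset mixes its seed with the epoch number
epoch_indices = rng.choice(
    base_size,
    size=int(round(size_ratio * base_size)),
    replace=True,                     # sampling "with replacement"
    p=weights,
)
print(epoch_indices)                  # indices into the base dataset used for this epoch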

class fairseq.data.RoundRobinZipDatasets(datasets, eval_key=None)[source]

Zip multiple FairseqDataset instances together.

Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.

Parameters:
  • datasets (Dict[FairseqDataset]) – a dictionary of FairseqDataset instances.
  • eval_key (str, optional) – a key used at evaluation time that causes this instance to pass-through batches from datasets[eval_key].
collater(samples)[source]

Merge a list of samples to form a mini-batch.

filter_indices_by_size(indices, max_positions=None)[source]

Filter each sub-dataset independently, then update the round robin to work on the filtered sub-datasets.

num_tokens(index)[source]

Return an example’s length (number of tokens), used for batching.

ordered_indices()[source]

Ordered indices for batching.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.

class fairseq.data.TransformEosDataset(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]

A FairseqDataset wrapper that appends/prepends/strips EOS.

Note that the transformation is applied in collater().

Parameters:
  • dataset (FairseqDataset) – dataset to wrap
  • eos (int) – index of the end-of-sentence symbol
  • append_eos_to_src (bool, optional) – append EOS to the end of src
  • remove_eos_from_src (bool, optional) – remove EOS from the end of src
  • append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
  • remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt
collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch suitable for forwarding with a Model
Return type:dict
num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.

Dictionary

class fairseq.data.Dictionary(*, bos='<s>', pad='<pad>', eos='</s>', unk='<unk>', extra_special_symbols=None)[source]

A mapping from symbols to consecutive integers

add_from_file(f)[source]

Loads a pre-existing dictionary from a text file and adds its symbols to this instance.

add_symbol(word, n=1, overwrite=False)[source]

Adds a word to the dictionary

bos()[source]

Helper to get index of beginning-of-sentence symbol

eos()[source]

Helper to get index of end-of-sentence symbol

finalize(threshold=-1, nwords=-1, padding_factor=8)[source]

Sort symbols by frequency in descending order, ignoring special ones.

Parameters:
  • threshold – defines the minimum word count
  • nwords – defines the total number of words in the final dictionary, including special symbols
  • padding_factor – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).
index(sym)[source]

Returns the index of the specified symbol

classmethod load(f)[source]

Loads the dictionary from a text file with the format:

<symbol0> <count0>
<symbol1> <count1>
...

pad()[source]

Helper to get index of pad symbol

pad_to_multiple_(padding_factor)[source]

Pad Dictionary size to be a multiple of padding_factor.

save(f)[source]

Stores dictionary into a text file

string(tensor, bpe_symbol=None, escape_unk=False, extra_symbols_to_ignore=None, unk_string=None, include_eos=False, separator=' ')[source]

Helper for converting a tensor of token indices to a string.

Can optionally remove BPE symbols or escape <unk> words.

unk()[source]

Helper to get index of unk symbol

unk_string(escape=False)[source]

Return unknown string, optionally escaped as: <<unk>>

update(new_dict)[source]

Updates counts from new dictionary.
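A minimal usage sketch (assuming fairseq is installed); it builds a dictionary from a few tokens and converts a tensor of indices back to a string:

import torch
from fairseq.data import Dictionary

d = Dictionary()                        # starts with the <s>, <pad>, </s>, <unk> specials
for tok in "hello world hello".split():
    d.add_symbol(tok)                   # counts accumulate per symbol

print(len(d), d.index("hello"), d.unk())
ids = torch.LongTensor([d.index(t) for t in "hello world".split()])
print(d.string(ids))                    # "hello world"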

Iterators

class fairseq.data.CountingIterator(iterable, start=None, total=None)[source]

Wrapper around an iterable that maintains the iteration count.

Parameters:
  • iterable (iterable) – iterable to wrap
  • start (int) – starting iteration count. Note that this doesn’t actually advance the iterator.
  • total (int) – override the iterator length returned by __len__. This can be used to truncate the iterator.
n

number of elements consumed from this iterator

Type:int
has_next()[source]

Whether the iterator has more elements (i.e., has not been exhausted).

skip(n)[source]

Fast-forward the iterator by skipping n elements.

take(n)[source]

Truncate the iterator to n elements at most.
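A small sketch of the CountingIterator bookkeeping (exact internals may differ slightly across fairseq versions):

from fairseq.data import CountingIterator

it = CountingIterator(list(range(5)))
print(next(it), it.n)    # first element; one element consumed so far
it.skip(2)               # fast-forward past the next two elements
print(it.has_next())     # True while elements remain
print(next(it))          # the next remaining element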

class fairseq.data.EpochBatchIterator(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1, buffer_size=0, timeout=0, disable_shuffling=False, skip_remainder_batch=False, grouped_shuffling=False, reuse_dataloader=False, persistent_workers=False)[source]

A multi-epoch iterator over a torch.utils.data.Dataset.

Compared to torch.utils.data.DataLoader, this iterator:

  • can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
  • can be serialized/deserialized with the state_dict() and load_state_dict() methods
  • supports sharding with the num_shards and shard_id arguments
Parameters:
  • dataset (Dataset) – dataset from which to load the data
  • collate_fn (callable) – merges a list of samples to form a mini-batch
  • batch_sampler (Sampler or a callable) – an iterator over batches of indices, or a callable to create such an iterator (~torch.utils.data.Sampler). A callable batch_sampler will be called for each epoch to enable per epoch dynamic batch iterators defined by this callable batch_sampler.
  • seed (int, optional) – seed for random number generator for reproducibility (default: 1).
  • num_shards (int, optional) – shard the data iterator into N shards (default: 1).
  • shard_id (int, optional) – which shard of the data iterator to return (default: 0).
  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
  • epoch (int, optional) – the epoch to start the iterator from (default: 1).
  • buffer_size (int, optional) – the number of batches to keep ready in the queue. Helps speed up data loading. When buffer_size is zero, the default torch.utils.data.DataLoader preloading is used.
  • timeout (int, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative (default: 0).
  • disable_shuffling (bool, optional) – force disable shuffling (default: False).
  • skip_remainder_batch (bool, optional) – if set, discard the last batch in an epoch for the sake of training stability, as the last batch is usually smaller than local_batch_size * distributed_world_size (default: False).
  • grouped_shuffling (bool, optional) – enable shuffling batches in groups of num_shards. Ensures that each GPU receives similar-length sequences when batches are sorted by length.
end_of_epoch() → bool[source]

Returns whether the most recent epoch iterator has been exhausted

iterations_in_epoch

The number of consumed batches in the current epoch.

load_state_dict(state_dict)[source]

Copies the state of the iterator from the given state_dict.

next_epoch_idx

Return the epoch index after next_epoch_itr is called.

next_epoch_itr(shuffle=True, fix_batches_to_gpus=False, set_dataset_epoch=True)[source]

Return a new iterator over the dataset.

Parameters:
  • shuffle (bool, optional) – shuffle batches before returning the iterator (default: True).
  • fix_batches_to_gpus (bool, optional) – ensure that batches are always allocated to the same shards across epochs. Requires that dataset supports prefetching (default: False).
  • set_dataset_epoch (bool, optional) – update the wrapped Dataset with the new epoch number (default: True).
state_dict()[source]

Returns a dictionary containing a whole state of the iterator.
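A hedged, toy end-to-end sketch; in real training the dataset, collater and batches come from a fairseq task, and some constructor details may vary across versions:

import torch
from fairseq.data import EpochBatchIterator

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, items):
        self.items = items
    def __getitem__(self, index):
        return self.items[index]
    def __len__(self):
        return len(self.items)

dataset = ToyDataset([torch.tensor([i]) for i in range(8)])
batches = [[0, 1], [2, 3], [4, 5], [6, 7]]        # precomputed batches of indices

epoch_itr = EpochBatchIterator(
    dataset=dataset,
    collate_fn=torch.stack,                       # merges a list of samples into a batch
    batch_sampler=batches,
    seed=1,
)
for _ in range(2):                                # the iterator is reused across epochs
    itr = epoch_itr.next_epoch_itr(shuffle=True)  # batches are reshuffled between epochs
    for batch in itr:
        pass                                      # forward/backward would go here
    assert epoch_itr.end_of_epoch()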

class fairseq.data.GroupedIterator(iterable, chunk_size, skip_remainder_batch=False)[source]

Wrapper around an iterable that returns groups (chunks) of items.

Parameters:
  • iterable (iterable) – iterable to wrap
  • chunk_size (int) – size of each chunk
  • skip_remainder_batch (bool, optional) – if set, discard the last grouped batch in each training epoch, as the last grouped batch is usually smaller than local_batch_size * distributed_world_size * chunk_size (default: False).
n

number of elements consumed from this iterator

Type:int
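For example (a sketch; the wrapped iterable here is a plain list):

from fairseq.data import GroupedIterator

g = GroupedIterator(list(range(7)), chunk_size=3)
for chunk in g:
    print(chunk)          # [0, 1, 2], then [3, 4, 5], then the remainder [6]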
class fairseq.data.ShardedIterator(iterable, num_shards, shard_id, fill_value=None, skip_remainder_batch=None)[source]

A sharded wrapper around an iterable, padded to length.

Parameters:
  • iterable (iterable) – iterable to wrap
  • num_shards (int) – number of shards to split the iterable into
  • shard_id (int) – which shard to iterate over
  • fill_value (Any, optional) – padding value when the iterable doesn’t evenly divide num_shards (default: None).
n

number of elements consumed from this iterator

Type:int
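For example, sharding a toy iterable three ways (a sketch; shards shorter than the longest one are padded with fill_value):

from fairseq.data import ShardedIterator

shard = ShardedIterator(list(range(10)), num_shards=3, shard_id=2, fill_value=None)
print([x for x in shard])   # [2, 5, 8, None] – padded so every shard has the same length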

Modules

Fairseq provides several stand-alone torch.nn.Module classes that may be helpful when implementing a new BaseFairseqModel.

class fairseq.modules.AdaptiveInput(vocab_size: int, padding_idx: int, initial_dim: int, factor: float, output_dim: int, cutoff: List[int], q_noise: float = 0, qn_block_size: int = 8)[source]
forward(input: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

weights_for_band(band: int)[source]
class fairseq.modules.AdaptiveSoftmax(vocab_size, input_dim, cutoff, dropout, factor=4.0, adaptive_inputs=None, tie_proj=False, q_noise=0, qn_block_size=8)[source]

This is an implementation of the efficient softmax approximation for graphics processing units (GPUs), described in the paper “Efficient softmax approximation for GPUs” (http://arxiv.org/abs/1609.04309).

adapt_target(target)[source]

In order to be efficient, the AdaptiveSoftMax does not compute the scores for all the words of the vocabulary for all the examples. It is thus necessary to call the method adapt_target of the AdaptiveSoftMax layer inside each forward pass.

forward(input, target)[source]
Parameters:
  • input – (b x t x d)
  • target – (b x t)
Returns:

output for each cutoff section and new targets by cut off

Return type:

2 lists

get_log_prob(input, target)[source]

Computes the log probabilities for all the words of the vocabulary, given a 2D tensor of hidden vectors.

upgrade_state_dict_named(state_dict, name)[source]
class fairseq.modules.BaseLayer(args)[source]
balanced_assignment(scores)[source]
forward(input_features, *args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

greedy_assignment(scores, k=1)[source]
inverse_sort(order)[source]
load_assignment()[source]
class fairseq.modules.BeamableMM(beam_size=None)[source]

This module provides an optimized MM for beam decoding with attention.

It leverages the fact that the source-side of the input is replicated beam times and the target-side of the input is of width one. This layer speeds up inference by replacing the inputs {(bsz x 1 x nhu), (bsz x sz2 x nhu)} with smaller inputs {(bsz/beam x beam x nhu), (bsz/beam x sz2 x nhu)}.

forward(input1, input2)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

set_beam_size(beam_size)[source]
class fairseq.modules.CharacterTokenEmbedder(vocab: fairseq.data.dictionary.Dictionary, filters: List[Tuple[int, int]], char_embed_dim: int, word_embed_dim: int, highway_layers: int, max_char_len: int = 50, char_inputs: bool = False)[source]
forward(input: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

padding_idx
prepare_for_onnx_export_()[source]
reset_parameters()[source]
set_vocab(vocab, max_char_len)[source]
class fairseq.modules.ConvTBC(in_channels, out_channels, kernel_size, padding=0)[source]

1D convolution over an input of shape (time x batch x channel)

The implementation uses gemm to perform the convolution. This implementation is faster than cuDNN for small kernel sizes.

conv_tbc(input: torch.Tensor)[source]
forward(input: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_parameters()[source]
fairseq.modules.cross_entropy(logits, target, ignore_index=-100, reduction='mean')[source]
class fairseq.modules.DownsampledMultiHeadAttention(out_channels, embed_dim, num_heads, dropout=0.0, bias=True, project_input=True, gated=False, downsample=False)[source]

Multi-headed attention with Gating and Downsampling

forward(query, key, value, mask_future_timesteps=False, key_padding_mask=None, use_scalar_bias=False)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.DynamicConv1dTBC(input_size, kernel_size=1, padding_l=None, num_heads=1, weight_dropout=0.0, weight_softmax=False, renorm_padding=False, bias=False, conv_bias=False, query_size=None, in_proj=False)[source]

Dynamic lightweight convolution taking T x B x C inputs.

Parameters:
  • input_size – # of channels of the input
  • kernel_size – size of the convolution kernel
  • padding_l – padding to the left when using “same” padding
  • num_heads – number of heads used. The weight is of shape (num_heads, 1, kernel_size)
  • weight_dropout – the drop rate of the DropConnect to drop the weight
  • weight_softmax – normalize the weight with softmax before the convolution
  • renorm_padding – re-normalize the filters to ignore the padded part (only the non-padding parts sum up to 1)
  • bias – use bias
  • conv_bias – bias of the convolution
  • query_size – specified when feeding a different input as the query
  • in_proj – project the input and generate the filter together

Shape:
Input: T x B x C, i.e. (timesteps, batch_size, input_size)
Output: T x B x C, i.e. (timesteps, batch_size, input_size)
weight

the learnable weights of the module of shape (num_heads, 1, kernel_size)

bias

the learnable bias of the module of shape (input_size)

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x, incremental_state=None, query=None, unfold=None)[source]

Takes an input x of shape T x B x C and produces an output of shape T x B x C.

Parameters:
  • x – input of shape T x B x C, i.e. (timesteps, batch_size, input_size)
  • incremental_state – a dict to keep the state
  • unfold – unfold the input or not. If not, we use the matrix trick instead
  • query – use the specified query to predict the conv filters

in_proj
reorder_incremental_state(incremental_state, new_order)[source]
reset_parameters()[source]
fairseq.modules.DynamicConv(input_size, kernel_size=1, padding_l=None, num_heads=1, weight_dropout=0.0, weight_softmax=False, renorm_padding=False, bias=False, conv_bias=False, query_size=None, in_proj=False)[source]
class fairseq.modules.DynamicCRF(num_embedding, low_rank=32, beam_size=64)[source]

Dynamic CRF layer, used to approximate a traditional Conditional Random Field (CRF):

$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i s(y_i, x) + \sum_i t(y_{i-1}, y_i, x)\Big)$

where the emission scores $s$ are assumed to be given and the transition score is a $|V| \times |V|$ matrix $M$.

It differs from a traditional CRF in two aspects:
  1. it uses a low-rank approximation for the transition matrix: $M = E_1 E_2^T$
  2. it uses a beam to estimate the normalizing factor $Z(x)$
extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(emissions, targets, masks, beam=None)[source]

Compute the conditional log-likelihood of a sequence of target tokens given emission scores

Parameters:
  • emissions (~torch.Tensor) – Emission score are usually the unnormalized decoder output (batch_size, seq_len, vocab_size). We assume batch-first
  • targets (~torch.LongTensor) – Sequence of target token indices (batch_size, seq_len)
  • masks (~torch.ByteTensor) – Mask tensor with the same size as targets
Returns:

approximated log-likelihood

Return type:

~torch.Tensor

forward_decoder(emissions, masks=None, beam=None)[source]

Find the most likely output sequence using Viterbi algorithm.

Parameters:
  • emissions (~torch.Tensor) – Emission score are usually the unnormalized decoder output (batch_size, seq_len, vocab_size). We assume batch-first
  • masks (~torch.ByteTensor) – Mask tensor with the same size as targets
Returns:

decoded sequence from the CRF model

Return type:

~torch.LongTensor

class fairseq.modules.EMAModule(model, config: fairseq.modules.ema_module.EMAModuleConfig, device=None, skip_keys=None)[source]

Exponential Moving Average of Fairseq Models

build_fp32_params(state_dict=None)[source]

Store a copy of the EMA params in fp32. If a state dict is passed, the EMA params are copied from the provided state dict. Otherwise, they are copied from the current EMA model parameters.

get_decay()[source]
restore(state_dict, build_fp32_params=False)[source]

Load data from a model spec into EMA model

reverse(model)[source]

Load the model parameters from EMA model. Useful for inference or fine-tuning from the EMA model.

set_decay(decay)[source]
step(new_model)[source]
class fairseq.modules.EMAModuleConfig(_name: Union[str, NoneType] = None, ema_decay: float = 0.9999, ema_fp32: bool = False)[source]
ema_decay = 0.9999
ema_fp32 = False
class fairseq.modules.FairseqDropout(p, module_name=None)[source]
forward(x, inplace: bool = False)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_generation_fast_(name: str, retain_dropout: bool = False, retain_dropout_modules: Optional[List[str]] = None, **kwargs)[source]
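A minimal sketch of using FairseqDropout as a drop-in replacement for nn.Dropout (the module_name is used for logging and for selectively retaining dropout via make_generation_fast_):

import torch
from fairseq.modules import FairseqDropout

drop = FairseqDropout(p=0.1, module_name="toy_layer")
drop.train()
y = drop(torch.randn(2, 8))   # ordinary dropout while training
drop.eval()
z = drop(torch.randn(2, 8))   # identity at inference unless dropout is explicitly retained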
class fairseq.modules.Fp32BatchNorm(sync=False, *args, **kwargs)[source]
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.Fp32GroupNorm(*args, **kwargs)[source]
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.Fp32LayerNorm(*args, **kwargs)[source]
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.Fp32InstanceNorm(*args, **kwargs)[source]
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

fairseq.modules.gelu(x: torch.Tensor) → torch.Tensor[source]
fairseq.modules.gelu_accurate(x)[source]
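Both activations are plain tensor-in, tensor-out functions, e.g.:

import torch
from fairseq.modules import gelu, gelu_accurate

x = torch.randn(4)
print(gelu(x))            # GELU activation
print(gelu_accurate(x))   # a tanh-based variant; values differ only slightly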
class fairseq.modules.GradMultiply(*args, **kwargs)[source]
static backward(ctx, grad)[source]

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, x, scale)[source]

Performs the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.
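GradMultiply is applied like any torch.autograd.Function; a small sketch showing the gradient scaling:

import torch
from fairseq.modules import GradMultiply

x = torch.randn(3, requires_grad=True)
y = GradMultiply.apply(x, 0.5)   # forward returns x unchanged (as a new tensor)
y.sum().backward()
print(x.grad)                    # gradients are scaled by 0.5 (all entries equal 0.5 here)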

class fairseq.modules.GumbelVectorQuantizer(dim, num_vars, temp, groups, combine_groups, vq_dim, time_first, activation=GELU(approximate=none), weight_proj_depth=1, weight_proj_factor=1)[source]
codebook()[source]
forward(x, produce_targets=False)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_idx(x)[source]
get_codebook_indices()[source]
sample_from_codebook(b, n)[source]
set_num_updates(num_updates)[source]
to_codebook_index(indices)[source]
class fairseq.modules.KmeansVectorQuantizer(dim, num_vars, groups, combine_groups, vq_dim, time_first, gamma=0.25)[source]
expand_embedding
forward(x, produce_targets=False)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward_idx(x)[source]
class fairseq.modules.LayerDropModuleList(p, modules=None)[source]

A LayerDrop implementation based on torch.nn.ModuleList.

We refresh the choice of which layers to drop every time we iterate over the LayerDropModuleList instance. During evaluation we always iterate over all layers.

Usage:

layers = LayerDropModuleList(p=0.5, modules=[layer1, layer2, layer3])
for layer in layers:  # this might iterate over layers 1 and 3
    x = layer(x)
for layer in layers:  # this might iterate over all layers
    x = layer(x)
for layer in layers:  # this might not iterate over any layers
    x = layer(x)
Parameters:
  • p (float) – probability of dropping out each layer
  • modules (iterable, optional) – an iterable of modules to add
fairseq.modules.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, export=False)[source]
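Roughly speaking, LayerNorm is a small factory: it returns an optimized fused implementation when one is available (and export is False), and falls back to torch.nn.LayerNorm otherwise. For example:

import torch
from fairseq.modules import LayerNorm

ln = LayerNorm(64)           # normalized_shape=64
x = torch.randn(10, 2, 64)   # e.g. T x B x C activations
print(ln(x).shape)           # torch.Size([10, 2, 64])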
class fairseq.modules.LearnedPositionalEmbedding(num_embeddings: int, embedding_dim: int, padding_idx: int)[source]

This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on padding_idx or by setting padding_idx to None and ensuring that the appropriate position ids are passed to the forward function.

forward(input: torch.Tensor, incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]] = None, positions: Optional[torch.Tensor] = None)[source]

Input is expected to be of size [bsz x seqlen].

class fairseq.modules.LightweightConv1dTBC(input_size, kernel_size=1, padding_l=None, num_heads=1, weight_dropout=0.0, weight_softmax=False, bias=False)[source]

Lightweight Convolution assuming the input is T x B x C.

Parameters:
  • input_size – # of channels of the input
  • kernel_size – size of the convolution kernel
  • padding_l – padding to the left when using “same” padding
  • num_heads – number of heads used. The weight is of shape (num_heads, 1, kernel_size)
  • weight_dropout – the drop rate of the DropConnect to drop the weight
  • weight_softmax – normalize the weight with softmax before the convolution
  • bias – use bias

Shape:
Input: T x B x C, i.e. (timesteps, batch_size, input_size)
Output: T x B x C, i.e. (timesteps, batch_size, input_size)
weight

the learnable weights of the module of shape (num_heads, 1, kernel_size)

bias

the learnable bias of the module of shape (input_size)

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(x, incremental_state=None, unfold=False)[source]

Takes an input x of shape T x B x C and produces an output of shape T x B x C.

Parameters:
  • x – input of shape T x B x C, i.e. (timesteps, batch_size, input_size)
  • incremental_state – a dict to keep the state
  • unfold – unfold the input or not. If not, we use the matrix trick instead

prepare_for_onnx_export_()[source]
reorder_incremental_state(incremental_state, new_order)[source]
reset_parameters()[source]
fairseq.modules.LightweightConv(input_size, kernel_size=1, padding_l=None, num_heads=1, weight_dropout=0.0, weight_softmax=False, bias=False)[source]
class fairseq.modules.LinearizedConvolution(in_channels, out_channels, kernel_size, **kwargs)[source]

An optimized version of nn.Conv1d.

At training time, this module uses ConvTBC, which is an optimized version of Conv1d. At inference time, it optimizes incremental generation (i.e., one time step at a time) by replacing the convolutions with linear layers. Note that the input order changes from training to inference.

forward(input, incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]] = None)[source]
Parameters:incremental_state – Used to buffer signal; if not None, then input is expected to contain a single frame. If the input order changes between time steps, call reorder_incremental_state.
Input:
Time x Batch x Channel during training Batch x Time x Channel during inference
reorder_incremental_state(incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]], new_order)[source]
state_dict(destination=None, prefix='', keep_vars=False)[source]

Returns a dictionary containing a whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Parameters:
  • destination (dict, optional) – If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
  • prefix (str, optional) – a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
  • keep_vars (bool, optional) – by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
Returns:

a dictionary containing a whole state of the module

Return type:

dict

Example:

>>> module.state_dict().keys()
['bias', 'weight']
upgrade_state_dict_named(state_dict, name)[source]
class fairseq.modules.LocationAttention(attn_dim, encoder_dim, decoder_dim, attn_state_kernel_size, conv_dim, conv_kernel_size, scaling=2.0)[source]

Attention-Based Models for Speech Recognition https://arxiv.org/pdf/1506.07503.pdf

Parameters:
  • encoder_dim (int) – # projection-units of encoder
  • decoder_dim (int) – # units of decoder
  • attn_dim (int) – attention dimension
  • conv_dim (int) – # channels of attention convolution
  • conv_kernel_size (int) – filter size of attention convolution
clear_cache()[source]
forward(encoder_out, encoder_padding_mask, decoder_h, attn_state)[source]
Parameters:
  • encoder_out (torch.Tensor) – padded encoder hidden state B x T x D
  • encoder_padding_mask (torch.Tensor) – encoder padding mask
  • decoder_h (torch.Tensor) – decoder hidden state B x D
  • attn_prev (torch.Tensor) – previous attention weight B x K x T
Returns:
  • attention-weighted encoder state (B x D) (torch.Tensor)
  • previous attention weights (B x T) (torch.Tensor)

class fairseq.modules.LSTMCellWithZoneOut(prob: float, input_size: int, hidden_size: int, bias: bool = True)[source]

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations https://arxiv.org/abs/1606.01305

forward(x, h)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

zoneout(h, next_h, prob)[source]
class fairseq.modules.MultiheadAttention(embed_dim, num_heads, kdim=None, vdim=None, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, self_attention=False, encoder_decoder_attention=False, q_noise=0.0, qn_block_size=8, xformers_att_config: Optional[str] = None, xformers_blocksparse_layout: Optional[torch.Tensor] = None, xformers_blocksparse_blocksize: Optional[int] = 16)[source]

Multi-headed attention.

See “Attention Is All You Need” for more details.

apply_sparse_mask(attn_weights, tgt_len: int, src_len: int, bsz: int)[source]
forward(query, key: Optional[torch.Tensor], value: Optional[torch.Tensor], key_padding_mask: Optional[torch.Tensor] = None, incremental_state: Optional[Dict[str, Dict[str, Optional[torch.Tensor]]]] = None, need_weights: bool = True, static_kv: bool = False, attn_mask: Optional[torch.Tensor] = None, before_softmax: bool = False, need_head_weights: bool = False) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Input shape: Time x Batch x Channel

Parameters:
  • key_padding_mask (ByteTensor, optional) – mask to exclude keys that are pads, of shape (batch, src_len), where padding elements are indicated by 1s.
  • need_weights (bool, optional) – return the attention weights, averaged over heads (default: False).
  • attn_mask (ByteTensor, optional) – typically used to implement causal attention, where the mask prevents the attention from looking forward in time (default: None).
  • before_softmax (bool, optional) – return the raw attention weights and values before the attention softmax.
  • need_head_weights (bool, optional) – return the attention weights for each head. Implies need_weights. Default: return the average attention weights over all heads.
prepare_for_onnx_export_()[source]
reorder_incremental_state(incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]], new_order: torch.Tensor)[source]

Reorder buffered internal state (for incremental generation).

reset_parameters()[source]
set_beam_size(beam_size)[source]

Used for efficient beamable enc-dec attention

upgrade_state_dict_named(state_dict, name)[source]
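A minimal self-attention sketch (Time x Batch x Channel inputs, as noted above); the exact shape returned for the weights depends on need_head_weights:

import torch
from fairseq.modules import MultiheadAttention

attn = MultiheadAttention(embed_dim=16, num_heads=4, dropout=0.0, self_attention=True)
x = torch.randn(7, 2, 16)                     # Time x Batch x Channel
out, attn_weights = attn(query=x, key=x, value=x, need_weights=True)
print(out.shape)                              # torch.Size([7, 2, 16])
print(attn_weights.shape)                     # attention weights averaged over heads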
fairseq.modules.PositionalEmbedding(num_embeddings: int, embedding_dim: int, padding_idx: int, learned: bool = False)[source]
class fairseq.modules.SamePad(kernel_size, causal=False)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.ScalarBias(*args, **kwargs)[source]

Adds a vector of scalars, used in the self-attention mechanism to allow the model to optionally attend to this vector instead of the past

static backward(ctx, grad)[source]

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non-tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, input, dim, bias_init)[source]

Performs the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.

class fairseq.modules.SinusoidalPositionalEmbedding(embedding_dim, padding_idx, init_size=1024)[source]

This module produces sinusoidal positional embeddings of any length.

Padding symbols are ignored.

forward(input, incremental_state: Optional[Any] = None, timestep: Optional[torch.Tensor] = None, positions: Optional[Any] = None)[source]

Input is expected to be of size [bsz x seqlen].

static get_embedding(num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None)[source]

Build sinusoidal embeddings.

This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of “Attention Is All You Need”.

prepare_for_onnx_export_()[source]
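For example (a sketch; inputs are token indices of shape [bsz x seqlen], and padding positions receive the padding embedding):

import torch
from fairseq.modules import SinusoidalPositionalEmbedding

pos = SinusoidalPositionalEmbedding(embedding_dim=16, padding_idx=1, init_size=64)
tokens = torch.tensor([[5, 6, 7, 1]])   # [bsz x seqlen]; index 1 is the pad symbol
out = pos(tokens)
print(out.shape)                        # torch.Size([1, 4, 16])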
class fairseq.modules.TransformerSentenceEncoderLayer(embedding_dim: int = 768, ffn_embedding_dim: int = 3072, num_attention_heads: int = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, activation_fn: str = 'relu', export: bool = False, q_noise: float = 0.0, qn_block_size: int = 8, init_fn: Callable = None)[source]

Implements a Transformer Encoder Layer used in BERT/XLM style pre-trained models.

build_fc1(input_dim, output_dim, q_noise, qn_block_size)[source]
build_fc2(input_dim, output_dim, q_noise, qn_block_size)[source]
build_self_attention(embed_dim, num_attention_heads, dropout, self_attention, q_noise, qn_block_size)[source]
forward(x: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None, self_attn_padding_mask: Optional[torch.Tensor] = None)[source]

LayerNorm is applied either before or after the self-attention/ffn modules similar to the original Transformer implementation.

class fairseq.modules.TransformerSentenceEncoder(padding_idx: int, vocab_size: int, num_encoder_layers: int = 6, embedding_dim: int = 768, ffn_embedding_dim: int = 3072, num_attention_heads: int = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, layerdrop: float = 0.0, max_seq_len: int = 256, num_segments: int = 2, use_position_embeddings: bool = True, offset_positions_by_padding: bool = True, encoder_normalize_before: bool = False, apply_bert_init: bool = False, activation_fn: str = 'relu', learned_pos_embedding: bool = True, embed_scale: float = None, freeze_embeddings: bool = False, n_trans_layers_to_freeze: int = 0, export: bool = False, traceable: bool = False, q_noise: float = 0.0, qn_block_size: int = 8)[source]

Implementation for a Bi-directional Transformer based Sentence Encoder used in BERT/XLM style pre-trained models.

This first computes the token embedding using the token embedding matrix, position embeddings (if specified) and segment embeddings (if specified). After applying the specified number of TransformerEncoderLayers, it outputs all the internal states of the encoder as well as the final representation associated with the first token (usually CLS token).

Input:
  • tokens: B x T matrix representing sentences
  • segment_labels: B x T matrix representing segment label for tokens
Output:
  • a tuple of the following:
    • a list of internal model states used to compute the predictions where each tensor has shape T x B x C
    • sentence representation associated with first input token in format B x C.
build_embedding(vocab_size, embedding_dim, padding_idx)[source]
build_transformer_sentence_encoder_layer(embedding_dim, ffn_embedding_dim, num_attention_heads, dropout, attention_dropout, activation_dropout, activation_fn, export, q_noise, qn_block_size)[source]
forward(tokens: torch.Tensor, segment_labels: torch.Tensor = None, last_state_only: bool = False, positions: Optional[torch.Tensor] = None, token_embeddings: Optional[torch.Tensor] = None, attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.TransformerDecoderLayer(args, no_encoder_attn=False, add_bias_kv=False, add_zero_attn=False)[source]
build_encoder_attention(embed_dim, args)[source]
build_self_attention(embed_dim, args, add_bias_kv=False, add_zero_attn=False)[source]
class fairseq.modules.TransformerEncoderLayer(args)[source]
build_self_attention(embed_dim, args)[source]
class fairseq.modules.TransposeLast(deconstruct_idx=None)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class fairseq.modules.VGGBlock(in_channels, out_channels, conv_kernel_size, pooling_kernel_size, num_conv_layers, input_dim, conv_stride=1, padding=None, layer_norm=False)[source]

VGG-motivated CNN module https://arxiv.org/pdf/1409.1556.pdf

Parameters:
  • in_channels – (int) number of input channels (typically 1)
  • out_channels – (int) number of output channels
  • conv_kernel_size – kernel size of the convolution layers
  • pooling_kernel_size – the size of the pooling window to take a max over
  • num_conv_layers – (int) number of convolution layers
  • input_dim – (int) input dimension
  • conv_stride – the stride of the convolving kernel. Can be a single number or a tuple (sH, sW) Default: 1
  • padding – implicit paddings on both sides of the input. Can be a single number or a tuple (padH, padW). Default: None
  • layer_norm – (bool) if layer norm is going to be applied. Default: False
Shape:
Input: B x C x T x feat, i.e. (batch_size, input_size, timesteps, features)
Output: B x C x T x feat, i.e. (batch_size, input_size, timesteps, features)
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

fairseq.modules.unfold1d(x, kernel_size, padding_l, pad_value=0)[source]

unfold T x B x C to T x B x C x K
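For example (a sketch of the shape transformation):

import torch
from fairseq.modules import unfold1d

x = torch.randn(5, 2, 8)                            # T x B x C
patches = unfold1d(x, kernel_size=3, padding_l=1)   # gather K=3 neighbouring timesteps
print(patches.shape)                                # torch.Size([5, 2, 8, 3]), i.e. T x B x C x K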

class fairseq.modules.RelPositionMultiHeadedAttention(n_feat, n_head, dropout, zero_triu=False)[source]

Multi-Head Attention layer with relative position encoding. Paper: https://arxiv.org/abs/1901.02860

Parameters:
  • n_head – the number of heads
  • n_feat – the number of features
  • dropout – dropout rate
  • zero_triu – whether to zero the upper triangular part of the attention matrix

forward(query, key, value, pos_emb, key_padding_mask=None, **kwargs)[source]

Compute scaled dot-product attention.

Parameters:
  • query – query tensor T x B x C
  • key – key tensor T x B x C
  • value – value tensor T x B x C
  • pos_emb – positional embedding tensor B x 2T-1 x C
  • key_padding_mask – mask tensor T x B

Returns:Output tensor T X B X C.
Return type:torch.Tensor
rel_shift(x)[source]

Compute relative positional encoding.

Parameters:
  • x – input tensor B x n_head x T x 2T-1

Returns:Output tensor.
Return type:torch.Tensor
class fairseq.modules.RelPositionalEncoding(max_len, d_model)[source]

Relative positional encoding module (new implementation).

Parameters:
  • d_model – Embedding dimension.
  • dropout_rate – Dropout rate.
  • max_len – Maximum input length.
extend_pe(x)[source]

Reset the positional encodings.

forward(x: torch.Tensor)[source]

Add positional encoding.

Parameters:
  • x – input tensor T x B x C

Returns:Encoded tensor T X B X C.
Return type:torch.Tensor
class fairseq.modules.RotaryPositionalEmbedding(dim, base=10000, precision=torch.float16)[source]
forward(x, seq_len=None)[source]
Parameters:
  • x – Input x with T X B X C
  • seq_len – Sequence length of input x
class fairseq.modules.RotaryPositionMultiHeadedAttention(n_feat, n_head, dropout, precision, rotary_emd_base=10000)[source]
forward(query, key, value, key_padding_mask=None, **kwargs)[source]

Compute rotary position attention.

Parameters:
  • query – query tensor T x B x C
  • key – key tensor T x B x C
  • value – value tensor T x B x C
  • key_padding_mask – mask tensor T x B

Returns:Output tensor T X B X D.
Return type:torch.Tensor

Notes

Assumes self-attention.
