torchtext

The torchtext package consists of data processing utilities and popular datasets for natural language.

torchtext.data

The data module provides the following:

  • Ability to define a preprocessing pipeline
  • Batching, padding, and numericalizing (including building a vocabulary object)
  • Wrapper for dataset splits (train, validation, test)
  • Loader for a custom NLP dataset

Dataset, Batch, and Example

Dataset

TabularDataset

Batch

Example

Fields

RawField

Field

ReversibleField

SubwordField

NestedField

Iterators

Iterator

BucketIterator

BPTTIterator

Pipeline

Pipeline

Functions

batch

pool

get_tokenizer

interleave_keys

torchtext.datasets

All datasets are subclasses of torchtext.data.Dataset, which inherits from torch.utils.data.Dataset, i.e., they have split and iters methods implemented.

General use cases are as follows:

Approach 1, splits:

# imports used in the examples below
from torchtext import data, datasets
from torchtext.vocab import GloVe

# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)

Approach 2, iters:

# use default configurations
train_iter, test_iter = datasets.IMDB.iters(batch_size=4)

The following datasets are available:

Language Modeling

Language modeling datasets are subclasses of the LanguageModelingDataset class.
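
For instance, a minimal sketch of loading a language modeling corpus, assuming the WikiText2 dataset and the Field/BPTTIterator API shown above:

# a sketch only, using the legacy data/datasets modules
from torchtext import data, datasets

# a single text field covers the whole corpus
TEXT = data.Field(lower=True)

# download the corpus and create train/validation/test splits
train, valid, test = datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train)

# BPTTIterator yields contiguous chunks of length bptt_len with shifted targets
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=32, bptt_len=30, device=0)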

Machine Translation

Machine translation datasets are subclasses of the TranslationDataset class.
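
For instance, a minimal sketch using the Multi30k German-English pairs (an assumption; any TranslationDataset subclass works the same way):

# continuing with the data/datasets imports from above
DE = data.Field(init_token='<sos>', eos_token='<eos>')
EN = data.Field(init_token='<sos>', eos_token='<eos>')

# exts selects the source/target files by extension; fields maps them to src/trg
train, val, test = datasets.Multi30k.splits(
    exts=('.de', '.en'), fields=(DE, EN))

DE.build_vocab(train, min_freq=2)
EN.build_vocab(train, min_freq=2)

train_iter, val_iter = data.BucketIterator.splits(
    (train, val), batch_size=32, device=0)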

Sequence Tagging

Sequence tagging datasets are subclasses of the SequenceTaggingDataset class.
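
For instance, a minimal sketch using UDPOS, assuming its three-column layout (word, UD tag, PTB tag):

# continuing with the data/datasets imports from above
WORD = data.Field(init_token='<bos>', eos_token='<eos>')
UD_TAG = data.Field(init_token='<bos>', eos_token='<eos>')
PTB_TAG = data.Field(init_token='<bos>', eos_token='<eos>')

# each tuple maps one column of the CoNLL-style file to a named Field
train, val, test = datasets.UDPOS.splits(
    fields=(('word', WORD), ('udtag', UD_TAG), ('ptbtag', PTB_TAG)))

WORD.build_vocab(train)
UD_TAG.build_vocab(train)
PTB_TAG.build_vocab(train)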

torchtext.vocab

Vocab

SubwordVocab

Vectors

Pretrained Word Embeddings

GloVe

FastText

CharNGram
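
As a brief sketch of how these classes are typically used (the name and dimension here are illustrative), pretrained vectors can be queried directly by token or passed to build_vocab as in the IMDB example above:

from torchtext.vocab import GloVe

# download and cache the 100-dimensional 6B-token GloVe vectors
glove = GloVe(name='6B', dim=100)

# indexing by token returns that token's embedding vector
vec = glove['language']

# the same vectors can be attached to a Field's vocabulary at build time:
# TEXT.build_vocab(train, vectors=glove)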

Misc.

_default_unk_index

torchtext.utils

reporthook

download_from_url

Examples

Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:

from torchtext import data, datasets

pos = data.TabularDataset(
    path='data/pos/pos_wsj_train.tsv', format='tsv',
    fields=[('text', data.Field()),
            ('labels', data.Field())])

sentiment = data.TabularDataset(
    path='data/sentiment/train.json', format='json',
    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
            'sentiment_gold': ('labels', data.Field(sequential=False))})

Ability to define a preprocessing pipeline:

src = data.Field(tokenize=my_custom_tokenizer)
trg = data.Field(tokenize=my_custom_tokenizer)
mt_train = datasets.TranslationDataset(
    path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    fields=(src, trg))

Batching, padding, and numericalizing (including building vocabulary object):

# continuing from above
mt_dev = datasets.TranslationDataset(
    path='data/mt/newstest2014', exts=('.en', '.de'),
    fields=(src, trg))
src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)
# mt_dev shares the fields, so it shares their vocab objects

train_iter = data.BucketIterator(
    dataset=mt_train, batch_size=32,
    sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
# usage
>>> next(iter(train_iter))
<data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>

Wrapper for dataset splits (train, validation, test):

TEXT = data.Field()
LABELS = data.Field()

train, val, test = data.TabularDataset.splits(
    path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    validation='_dev.tsv', test='_test.tsv', format='tsv',
    fields=[('text', TEXT), ('labels', LABELS)])

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(16, 256, 256),
    sort_key=lambda x: len(x.text), device=0)

TEXT.build_vocab(train)
LABELS.build_vocab(train)
