torchtext

The torchtext package consists of data processing utilities and popular datasets for natural language processing.
torchtext.data

The data module provides the following (each is demonstrated under Examples below):

- Ability to define a preprocessing pipeline
- Batching, padding, and numericalizing (including building a vocabulary object)
- Wrapper for dataset splits (train, validation, test)
- Loader for a custom NLP dataset
torchtext.datasets

All datasets are subclasses of torchtext.data.Dataset, which inherits from torch.utils.data.Dataset, i.e., they have split and iters methods implemented.
General use cases are as follows:

Approach 1, splits:
from torchtext import data, datasets
from torchtext.vocab import GloVe

# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)

# make iterators for the splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)
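Once built, each batch exposes the fields under the names assigned above; because the TEXT field was created with include_lengths=True, its batch attribute is a (padded tensor, lengths) pair:

batch = next(iter(train_iter))
text, lengths = batch.text  # include_lengths=True -> (padded tensor, lengths)
labels = batch.label        # one label per example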
Approach 2, iters:
# use default configurations
train_iter, test_iter = datasets.IMDB.iters(batch_size=4)
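The iters shortcut downloads the data, builds the vocabularies, and constructs the iterators with default settings in a single call, so the result can be consumed directly; a minimal sketch:

for batch in train_iter:
    # batch.text and batch.label are already numericalized tensors
    print(batch.text.shape, batch.label.shape)
    break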
The following datasets are available:
Language Modeling

Language modeling datasets are subclasses of the LanguageModelingDataset class.
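A minimal sketch, assuming the WikiText2 dataset bundled with this release (BPTTIterator serves contiguous windows of text for backpropagation through time):

TEXT = data.Field(lower=True)
train, valid, test = datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train)
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=20, bptt_len=35, device=0)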
Machine Translation

Machine translation datasets are subclasses of the TranslationDataset class.
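A minimal sketch, assuming the Multi30k German-English dataset bundled with this release:

SRC = data.Field(init_token='<sos>', eos_token='<eos>')
TRG = data.Field(init_token='<sos>', eos_token='<eos>')
train, valid, test = datasets.Multi30k.splits(
    exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train, min_freq=2)
TRG.build_vocab(train, min_freq=2)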
Sequence Tagging

Sequence tagging datasets are subclasses of the SequenceTaggingDataset class.
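A minimal sketch, assuming the UDPOS dataset bundled with this release; its columns (tokens, universal dependency tags, PTB tags) are matched positionally to the fields tuple:

TEXT = data.Field(lower=True)
UD_TAGS = data.Field(unk_token=None)   # a tag vocabulary needs no <unk>
PTB_TAGS = data.Field(unk_token=None)
train, valid, test = datasets.UDPOS.splits(
    fields=(('text', TEXT), ('udtags', UD_TAGS), ('ptbtags', PTB_TAGS)))
TEXT.build_vocab(train)
UD_TAGS.build_vocab(train)
PTB_TAGS.build_vocab(train)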
Examples
Ability to describe declaratively how to load a custom NLP dataset that’s in a “normal” format:
pos = data.TabularDataset(
    path='data/pos/pos_wsj_train.tsv', format='tsv',
    fields=[('text', data.Field()),
            ('labels', data.Field())])

sentiment = data.TabularDataset(
    path='data/sentiment/train.json', format='json',
    fields={'sentence_tokenized': ('text', data.Field(sequential=True)),
            'sentiment_gold': ('labels', data.Field(sequential=False))})
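Each loaded example then carries one attribute per field name given above:

ex = pos.examples[0]
print(ex.text, ex.labels)  # tokens and tags from the first TSV row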
Ability to define a preprocessing pipeline:
src = data.Field(tokenize=my_custom_tokenizer)
trg = data.Field(tokenize=my_custom_tokenizer)

mt_train = datasets.TranslationDataset(
    path='data/mt/wmt16-ende.train', exts=('.en', '.de'),
    fields=(src, trg))
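Here my_custom_tokenizer stands for any callable that maps a raw string to a list of tokens, and must be defined before the fields; a minimal sketch (the regex is illustrative only):

import re

def my_custom_tokenizer(text):
    # split into word runs and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())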
Batching, padding, and numericalizing (including building vocabulary object):
# continuing from above
mt_dev = datasets.TranslationDataset(
    path='data/mt/newstest2014', exts=('.en', '.de'),
    fields=(src, trg))
src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)
# mt_dev shares the fields, so it shares their vocab objects

train_iter = data.BucketIterator(
    dataset=mt_train, batch_size=32,
    sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))

# usage
>>> next(iter(train_iter))
<data.Batch(batch_size=32, src=[LongTensor (32, 25)], trg=[LongTensor (32, 28)])>
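interleave_keys interleaves the bits of the two lengths so that sorting groups examples that are similar in both source and target length, keeping padding to a minimum; consuming the iterator is then simply:

for batch in train_iter:
    src, trg = batch.src, batch.trg  # padded, numericalized LongTensors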
Wrapper for dataset splits (train, validation, test):
TEXT = data.Field()
LABELS = data.Field()

train, val, test = data.TabularDataset.splits(
    path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    validation='_dev.tsv', test='_test.tsv', format='tsv',
    fields=[('text', TEXT), ('labels', LABELS)])

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(16, 256, 256),
    sort_key=lambda x: len(x.text), device=0)

TEXT.build_vocab(train)
LABELS.build_vocab(train)