Persephone (beta version)

NOTE: This codebase is not actively maintained and development efforts are being placed elsewhere. If you’re interested in training a speech recognition model using ELAN files, consider using Elpis (https://github.com/CoEDL/elpis). If you’re interested in phonetic transcription using an existing multilingual speech recognition model, consider trying https://www.dictate.app/.

Persephone (/pərˈsɛfəni/) is an automatic phoneme transcription tool. Traditional speech recognition tools require a large pronunciation lexicon (describing how words are pronounced) and large amounts of training data so that the system can learn to output orthographic transcriptions. In contrast, Persephone is designed for situations where training data is limited, perhaps as little as an hour of transcribed speech. Such limitations on data are common in the documentation of low-resource languages. It is possible to use such small amounts of data to train a model that can aid transcription, yet such technology has not been widely adopted.

The speech recognition tool presented here is named after the goddess who was abducted by Hades and must spend one half of each year in the Underworld. Which of linguistics or computer science is Hell, and which the joyful world of spring and light? For each it’s the other, of course. — Alexis Michaud

The goal of Persephone is to make state-of-the-art phonemic transcription accessible to people involved in language documentation. Creating an easy-to-use user interface is central to this. The user interface and APIs are a work in progress and currently Persephone must be run via a command line.

The tool is implemented in Python/Tensorflow with extensibility in mind. Currently just one model is implemented, which uses bidirectional long short-term memory networks (LSTMs) and the connectionist temporal classification (CTC) loss function.

Contributors

Persephone has been built based on the code contributions of:

Citation

If you use this code in a publication, please cite Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation:

@inproceedings{adams18evaluating,
title = {Evaluating phonemic transcription of low-resource tonal languages for language documentation},
author = {Adams, Oliver and Cohn, Trevor and Neubig, Graham and Cruz, Hilaria and Bird, Steven and Michaud, Alexis},
booktitle = {Proceedings of LREC 2018},
year = {2018}
}

Quickstart

This guide is written to help you get the tool working on your machine. We will use an example setup that involves training a phoneme transcription tool for Yongning Na. For this we use a small (even by language documentation standards) sub-sample of elicited speech of Yongning Na, a language of Southwestern China.

The example we will run works on most personal computers without a graphics processing unit (GPU), since I’ve made the settings less computationally demanding than they would be for optimal transcription quality. Ideally you’d have access to a server with more memory and a GPU, but this isn’t necessary.

The code has been tested on Mac and Linux systems. It can be run on Windows using the Docker container described below.

For now you must open up a terminal to enter commands at the command line. (The commands below are prefixed with a "$". Don’t enter the "$", just whatever comes afterwards.)

1. Installation

Installation option 1: Using the Docker container

To simplify setup and system dependencies, a Docker container has been created. This just requires Docker to be installed. Once you have installed Docker you can fetch our container with:

$ docker pull oadams/persephone

Then run it in interactive mode:

$ docker run -it oadams/persephone

This will place you in an environment where Persephone and its dependencies have been installed, along with the example Na data.

Installation option 2: A “native” install

Ensure Python 3 is installed. Python 3.5 or Python 3.6 is required. Currently Python 3.7 is not supported because we depend on Tensorflow which currently does not support Python 3.7.

You will also need to install some system dependencies. For your convenience we have an install script for dependencies for Ubuntu. To install the Ubuntu binaries, run ./ubuntu_bootstrap.sh to install ffmpeg packages. On MacOS we suggest installing via Homebrew with brew install ffmpeg.

We now need to set up a virtual environment and install the library.

$ python3 -m virtualenv -p python3 persephone-venv
$ source persephone-venv/bin/activate
$ pip install -U pip
$ pip install persephone
$ pip install ipython

(This library can be installed system-wide but it is recommended to install in a virtualenv.)

I’ve uploaded an example dataset that includes some Yongning Na data that has already been preprocessed. We’ll use this example dataset in this tutorial. Once we confirm that the software itself is working on your computer, we can discuss preprocessing of your own data.

Create a working directory for storage of the data and running experiments:

mkdir persephone-tutorial/
cd persephone-tutorial/
mkdir data

Get the data here

Unzip na_example_small.zip. There should now be a directory na_example/, with subdirectories wav/ and label/. You can put na_example anywhere, but for the rest of this tutorial I assume it is in the working directory: persephone-tutorial/data/na_example/.

2. Training a toy Na model

One way to conduct experiments is to run the code from the iPython interpreter. Back to the terminal:

$ ipython
> from persephone import corpus
> corp = corpus.Corpus("fbank", "phonemes", "data/na_example")
> from persephone import experiment
> experiment.train_ready(corp)

You should now see something like:

Number of training utterances: 1024
Batch size: 16
Batches per epoch: 64
2018-01-18 10:30:22.290964: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
exp_dir ./exp/0, epoch 0
    Batch...0...1...2...3...

The message may vary a bit depending on your CPU but if it says something like this then training is very likely working. Contact me if you have any trouble getting to this point, or if you had to deviate from the above instructions to get to this point.

On the current settings it will train through at least 10 “epochs”, very likely more. If you don’t have a GPU then this will take quite a while, though you should notice it converging in performance within a couple hours on most personal computers.

After a few epochs you can see how it’s going by opening exp/<experiment_number>/train_log.txt. This will show you the error rates on the training set and the held-out validation set. In the exp/<experiment_number>/decoded subdirectory, you’ll see the validation set reference in refs and the model hypotheses for each epoch in epoch<epoch_num>_hyps.

Currently the tool assumes each utterance is in its own audio file, and that for each utterance in the training set there is a corresponding transcription file with phonemes (or perhaps characters) delimited by spaces.

3. Using your own data

If you have gotten this far, congratulations! You’re now ready to start using your own data. The example setup we created with the Na data illustrates a couple key points, including how your data should be formatted, and how you make the system read that data. In fact, if you format your data in the same way, you can create your own Persephone Corpus object with:

corp = corpus.Corpus("fbank", "phonemes", "<your-corpus-directory>")

where the second argument (the label type) is “txt”, “phonemes”, “tones”, or whatever extension your transcription files have after the dot.
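
For instance, if your transcription files end in .tones, the corresponding call would look like this (the corpus directory name here is hypothetical):

from persephone import corpus
corp = corpus.Corpus("fbank", "tones", "data/my_corpus")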

If you are using the Docker container then to get data in and out of the container you need to create a “volume” that shares data between your computer (the host) and the container. If your data is stored in /home/username/mydata on your machine and in the container you want to store it in /persephone/mydata then run:

docker run -it -v /home/username/mydata:/persephone/mydata oadams/persephone

This is simply an extension of the earlier command to run docker, which additionally specifies the portal with which data is transferred to and from the container. If Persephone—abducted by Hades—is the queen of the underworld, then you might consider this volume to be the gates of hell.

Formatting your data

Interfacing with data is a key bottleneck in the usability of speech recognition systems. Providing a simple and flexible interface to your data is currently the most important priority for Persephone. This is a work in progress.

Current data formatting requirements:

  • Audio files are stored in <your-corpus>/wav/. The WAV format is supported. Persephone will automatically convert WAVs to 16-bit mono at 16,000 Hz.
  • Transcriptions are stored in text files in <your-corpus>/label/
  • Each audio file is short (ideally no longer than 10 seconds). There is a script added by Ben Foley, persephone/scripts/split_eafs.py, to split audio files into utterance-length units based on ELAN input files.
  • Each audio file in wav/ has a corresponding transcription file in label/ with the same prefix (the bit of the filename before the extension). For example, if there is wav/utterance_one.wav then there should be label/utterance_one.<extension>. <extension> can be whatever you want, but it should describe how the labelling is done. For example, if it is phonemic then wav/utterance_one.phonemes is a meaningful filename.
  • Each transcription file contains a space-delimited list of labels that the model should learn to transcribe (see the sketch after this list). For example:
    • data/na_example/label/crdo-NRU_F4_ACCOMP_PFV.0.phonemes contains l e dz ɯ z e l e dz ɯ z e
    • data/na_example/label/crdo-NRU_F4_ACCOMP_PFV.0.phonemes_and_tones might contain: l e ˧ dz ɯ ˥ z e ˩ | l e ˧ dz ɯ ˥ z e ˩
  • Persephone is agnostic to what your chosen labels are. It simply tries to figure out how to map speech to that labelling. These labels can be multiple characters long: the spaces demarcate labels. Labels can be any unicode character(s).
  • Spaces are used to delimit the units that the tool predicts. Typically these units are phonemes or tones, however they could also just be orthographic characters (though performance is likely to be a bit lower: consider trying to transcribe “$100”). The model can’t tell the difference between digraphs and unigraphs as long as they’re tokenized in this format, demarcated with spaces.
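
Below is a small sketch (not part of Persephone itself) that checks a corpus directory against these requirements: every WAV has a matching, non-empty, space-delimited label file. The corpus path and label extension are hypothetical; adjust them to your setup.

from pathlib import Path

corpus_dir = Path("data/na_example")   # hypothetical corpus location
label_ext = "phonemes"                 # whatever follows the dot in your label files

for wav in sorted((corpus_dir / "wav").glob("*.wav")):
    label = corpus_dir / "label" / (wav.stem + "." + label_ext)
    if not label.exists():
        print("Missing transcription for", wav.name)
        continue
    tokens = label.read_text(encoding="utf-8").split()
    if tokens:
        print(wav.stem, "->", len(tokens), "labels")
    else:
        print("Empty transcription:", label.name)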

If your data observes this format then you can load it via the Corpus class. If your data does not observe this format, you have two options:

  1. Do your own separate preprocessing to get the data in this format. If you’re not a programmer this is probably the best option for you. If you have ELAN files, this probably means using persephone/scripts/split_eafs.py.
  2. Create a Python class that inherits from persephone.corpus.Corpus and does all your preprocessing (a minimal sketch follows this list). The API (and thus documentation) for this is a work in progress, but the key point is that <corpusobject>.train_prefixes, <corpusobject>.valid_prefixes, and <corpusobject>.test_prefixes are lists of prefixes for the relevant subset of the data. For an example on a full dataset, see persephone/datasets/na.py (beware: here be dragons).
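
The following is a minimal sketch of option 2, assuming you write your own preprocessing that fills <tgt_dir>/wav/ and <tgt_dir>/label/ in the format described above before handing over to the parent constructor. The helper functions here are hypothetical placeholders for your own code.

from pathlib import Path
from persephone.corpus import Corpus

class MyCorpus(Corpus):
    def __init__(self, tgt_dir: Path) -> None:
        # Hypothetical preprocessing helpers: write <prefix>.wav files into
        # tgt_dir/wav/ and space-delimited <prefix>.phonemes files into
        # tgt_dir/label/.
        convert_my_audio(tgt_dir / "wav")
        segment_my_transcriptions(tgt_dir / "label")
        super().__init__("fbank", "phonemes", tgt_dir)
        # The parent constructor populates train_prefixes, valid_prefixes and
        # test_prefixes; override them here if you need a custom split.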

Creating validation and test sets

Currently Corpus splits the supplied data into three sets (training, validation and test) in a 95:5:5 ratio. The training set is what your model is exposed to during training. The validation set is held out and used during training to gauge how well the model is performing. The test set is used to quantitatively assess model performance after training is complete.

When you first load your corpus, Corpus randomly allocates files to each of these subsets. If you’d like to change which utterances are in each set, modify <your-corpus>/valid_prefixes.txt and <your-corpus>/test_prefixes.txt. The training set consists of all available utterances that appear in neither of these files.
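
As a small sketch, these files can be written from Python; the split below is hypothetical and uses prefixes from the example Na corpus.

from pathlib import Path

corpus_dir = Path("data/na_example")
valid_prefixes = ["crdo-NRU_F4_ACCOMP_PFV.0"]   # hypothetical split
test_prefixes = ["crdo-NRU_F4_ACCOMP_PFV.1"]

(corpus_dir / "valid_prefixes.txt").write_text("\n".join(valid_prefixes) + "\n")
(corpus_dir / "test_prefixes.txt").write_text("\n".join(test_prefixes) + "\n")
# Utterances listed in neither file are used for training.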

4. Saving and loading models; transcribing untranscribed data

So far, the tutorial has described how to load a Corpus object and perform training and testing with a single function, experiment.train_ready(corp), which hid some details. This section exposes more of the interface so that you can describe models more fully, save and load models, and apply them to untranscribed data. I’d like to hear people’s thoughts on this interface.

CorpusReaders and Models

The Corpus object exposes the files in the corpus (among several other things). Of relevance here are the .get_train_fns(), .get_valid_fns() and .get_test_fns() methods, which provide lists of files in the training, validation and test sets respectively. There is additionally a .get_untranscribed_fns() method, which returns a list of files representing speech that has not been transcribed. .get_untranscribed_fns() fetches prefixes of utterances from untranscribed_prefixes.txt, which you can put in the corpus data directory (at the same level as the feat/ and label/ subdirectories).
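
For example, assuming untranscribed_prefixes.txt lists one utterance prefix per line (mirroring the other prefix files), it could be created like this; the prefixes shown are hypothetical:

from pathlib import Path

corpus_dir = Path("data/na_example")
untranscribed = ["new_recording.0", "new_recording.1"]   # hypothetical prefixes
(corpus_dir / "untranscribed_prefixes.txt").write_text("\n".join(untranscribed) + "\n")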

To fetch data from your Corpus, a CorpusReader is used. The CorpusReader regulates how much data is to be read from the corpus, as well as the size of the “batches” that are fed to the model during training. You create a CorpusReader by feeding it a corpus (here the example na_corpus):

from persephone import corpus
na_corpus = corpus.Corpus("fbank", "phonemes", "data/na_example/")
from persephone import corpus_reader
na_reader = corpus_reader.CorpusReader(na_corpus, num_train=512, batch_size=16)

Here, na_reader is an interface to the corpus which will read from the corpus files 512 training utterances, in batches of 16 utterances. We can now feed data to a Model:

from persephone import rnn_ctc
model = rnn_ctc.Model(exp_dir, na_reader, num_layers=2, hidden_size=250)

where exp_dir is a directory in which experimental results and logging will be stored. In creating an rnn_ctc.Model (a recurrent neural network with a connectionist temporal classification loss function) we have also specified which corpus to read from, how many layers there are in the neural network, and the number of “neurons” in those layers. We can now train the model with:

model.train()

After training, we can transcribe untranscribed data with:

model.transcribe()

which depends on untranscribed_prefixes.txt existing before corpus creation (though there’s no reason why this can’t be changed to simply transcribe the utterances with feature files in <data-dir>/feat/ that don’t have corresponding transcriptions in <data-dir>/label/).

During training, the model will store the model that performs best on the validation set in <exp_dir>/model, across a few different files prefixed with model_best.ckpt. If you later want to load this model to transcribe untranscribed data, you create a model with the same hyperparameters and call model.transcribe() with the restore_model_path keyword argument:

model = rnn_ctc.Model(<new-exp-dir>, na_reader, num_layers=2, hidden_size=250)
model.transcribe(restore_model_path="<old-exp-dir>/model/model_best.ckpt")

This will load a previous model and perform transcription with it.

5. FAQs

Which installation option should I use for my operating system?
  • If using MacOS or Linux, do a local install (Option 2 in the quickstart), though you can also use Docker.
  • If using Windows, use Docker (Option 1 in the quickstart).

No module named virtualenv?

Run this: pip3 install virtualenv

I get bash: $: command not found

Do not copy the $ and > symbols verbatim. The $ sign denotes the start of a command, while the > denotes the continuation of a command from a previous line.

Python 3 not installed?

Install Python 3 from python.org

What is exp_dir?

Replace this with the name of the directory you want to store the model and its results in (in quotes), e.g. “exp/na_test”.

I’ve run out of memory

You’ll need to run with a smaller batch size. Go to Section 4 (CorpusReaders and Models) and follow the commands, but set batch_size to 4 when running na_reader = corpus_reader.CorpusReader(…

How do I use the command line?

To do the basics, you need to know how to change directories with “cd”, view the current working directory with “pwd”, view the contents of files with “less”, and move files with “mv”. This is covered by many command-line tutorials online.

How do I look at the output?

Inside the exp/ directory there should be a decoded/ subdirectory. Inside here you will find “refs”, a list of ground truth transcriptions, as well as “hyps” files, which are the model hypotheses. Open these files with a text editor.

How do I choose an appropriate label granularity?

Question:

Suprasegmentals like tone, glottalization, nasalization, and length are all phonemic in the language I am using. Do they belong in one grouping or separately?

Answer:

I’m wary of making sweeping claims about the best approach to handling all these sorts of phenomena, which will realise themselves differently between languages, since I’m neither a linguist nor do I have a strong understanding of what features the model will learn in each situation. (Regarding tones, the literature on this is also inconclusive in general.) The best thing is to empirically test both approaches:

  1. Having features as part of the phoneme token. For example, a nasalized /o/ becomes /õ/.
  2. Having a separate token that follows the phoneme. For example, a high tone /o˥/ becomes two tokens: /o ˥/.

Since there are many ways you can mix and match these, one consideration to keep in mind is how much larger the label vocabulary becomes by merging two tokens into one. You don’t want this vocabulary to become too big, because then it’s harder to learn features common to different tokens, and the model is less likely to pick the right one even if it’s on the right track. In the case of vowel nasalization, you may only double the number of vowels, so it might be worth having merged tokens for that. If there are 5 different tones though, you might make the vowel vocabulary about 5 times bigger by combining them into one token, so it’s less likely to be a good idea (though who knows, it might still yield performance improvements).
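
As a toy illustration of this vocabulary-size trade-off (the inventories below are hypothetical):

vowels = ["a", "e", "i", "o", "u"]
tones = ["˥", "˧", "˩"]

# Approach 2: tones as separate tokens that follow the vowel.
separate_vocab = set(vowels) | set(tones)              # 5 + 3 = 8 labels
# Approach 1: tones merged into the vowel token.
merged_vocab = {v + t for v in vowels for t in tones}  # 5 * 3 = 15 labels

print(len(separate_vocab), len(merged_vocab))          # 8 15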

The Persephone API

In this section we discuss the application program interface (API) exposed by Persephone. We begin with descriptions of the fundamental classes included in the tool. Model training pipelines are described by instantiating these classes. Consider the following example for a preliminary look at how this works:

# Create a corpus from data that has already been preprocessed.
# Among other things, this will divide the corpus into training,
# validation and test sets.
from persephone.corpus import Corpus
corpus = Corpus(feat_type="fbank",
                 label_type="phonemes",
                 tgt_dir="/path/to/preprocessed/data")

# Create an object that reads the corpus data in batches.
from persephone.corpus_reader import CorpusReader
corpus_reader = CorpusReader(corpus, batch_size=64)

# Create a neural network model (LSTM/CTC model) and train
# it on the corpus.
from persephone.rnn_ctc import Model
model = Model("/path/to/experiment/directory",
              corpus_reader,
              num_layers=3,
              hidden_size=250)
model.train()

This will train and evaluate a model, storing information related to the specific experiment in /path/to/experiment/directory.

In the next section we take a closer look at the classes that comprise this example, and reveal additional functionality, such as loading the speech and transcriptions from ELAN files and how preprocessing of the raw transcription text is specified.

On the horizon, but still to be implemented, is the description of these pipelines and the interaction between classes in a way that is compatible with the YAML files of the eXtensible Neural Machine Translation toolkit (XNMT).

Fundamental classes

The four key classes are the Utterance, Corpus, CorpusReader, and Model classes. Utterance instances comprise Corpus instances, which are loaded by CorpusReader instances and fed into Model instances.

class persephone.utterance.Utterance

An immutable object that represents a single utterance.

Utterance instances capture key data about short segments of speech in the corpus. Their most important role is in representing transcriptions in various states of preprocessing. For instance, Utterance instances may be created when reading from a linguist’s transcription files, in which case their text attribute is a raw, unpreprocessed transcription. These Utterance instances may then be fed to a function that preprocesses the text, returning new Utterance instances with, say, phonemes delimited with spaces so that they are in an appropriate format for model training.

Note that Utterance instances are not required as arguments to Corpus constructors. They exist to aid in preprocessing.

org_media_path

A pathlib.Path to the original source audio that contains the utterance (which may comprise many utterances).

org_transcription_path

A pathlib.Path to the source of the transcription of the utterance (which may comprise many utterances in the case of, say, ELAN files).

prefix

A string identifier for the utterance which is used to prefix the target wav and transcription files, which are called <prefix>.wav, <prefix>.phonemes, etc.

start_time

An integer denoting the offset, in milliseconds, of the utterance in the original media file found in org_media_path.

end_time

An integer denoting the endpoint, in milliseconds, of the utterance in the original media file found in org_media_path.

text

A string representation of the transcription.

speaker

A string representation of the speaker of the utterance.

class persephone.corpus.Corpus(feat_type: str, label_type: str, tgt_dir: pathlib.Path, *, labels: Optional[Set[str]] = None, max_samples: int = 1000, speakers: Optional[Sequence[str]] = None)[source]

Represents a preprocessed corpus that is ready to be used in model training.

Construction of a Corpus instance involves preprocessing data if the data has not already been preprocessed. The extent of the preprocessing depends on which constructor is used. If the default constructor, __init__(), is used, transcriptions are assumed to already be preprocessed and only speech feature extraction from WAV files is performed. In other constructors such as from_elan(), preprocessing of the transcriptions is performed. See the documentation of the relevant constructors for more information.

Once a Corpus object is created it should be considered immutable. At this point feature extraction from WAVs will have been performed, with feature files in tgt_dir/feat/. Transcriptions will have been segmented into appropriate tokens (labels) and will be stored in tgt_dir/label/.

__init__(feat_type: str, label_type: str, tgt_dir: pathlib.Path, *, labels: Optional[Set[str]] = None, max_samples: int = 1000, speakers: Optional[Sequence[str]] = None) → None[source]

Construct a Corpus instance from preprocessed data.

Assumes that the corpus data has been preprocessed and is structured as follows: (1) WAVs for each utterance are found in <tgt_dir>/wav/ with the filename <prefix>.wav, where prefix is some string uniquely identifying the utterance; (2) For each WAV file, there is a corresponding transcription found in <tgt_dir>/label/ with the filename <prefix>.<label_type>, where label_type is some string describing the type of label used (for example, “phonemes” or “tones”).

If the data is found in this format, WAV normalization and speech feature extraction will be performed during Corpus construction, and the utterances will be randomly divided into training, validation and test sets. If you would like to define these datasets yourself, include files named train_prefixes.txt, valid_prefixes.txt and test_prefixes.txt in <tgt_dir>. Each file should be a list of prefixes (utterance IDs), one per line. If these are found during Corpus construction, those sets will be used instead (see the example after the parameter list below).

Parameters:
  • feat_type – A string describing the input speech features. For example, “fbank” for log Mel filterbank features.
  • label_type – A string describing the transcription labels. For example, “phonemes” or “tones”.
  • labels – A set of strings representing labels (tokens) used in transcription. For example: {“a”, “o”, “th”, …}. If this parameter is not provided the experiment directory is scanned for labels present in the transcription files.
  • max_samples – The maximum number of samples an utterance in the corpus may have. If an utterance is longer than this, it is not included in the corpus.
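
For example, a Corpus might be constructed with an explicit label set rather than letting Persephone scan the transcription files; the path and labels here are hypothetical, based on the Na example above.

from pathlib import Path
from persephone.corpus import Corpus

na_labels = {"l", "e", "dz", "ɯ", "z", "˧", "˥", "˩"}
corp = Corpus(feat_type="fbank",
              label_type="phonemes_and_tones",
              tgt_dir=Path("data/na_example"),
              labels=na_labels)
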
classmethod from_elan(org_dir: pathlib.Path, tgt_dir: pathlib.Path, feat_type: str = 'fbank', label_type: str = 'phonemes', *, utterance_filter: Callable[[persephone.utterance.Utterance], bool] = None, label_segmenter: Optional[persephone.preprocess.labels.LabelSegmenter] = None, speakers: List[str] = None, lazy: bool = True, tier_prefixes: Tuple[str, ...] = ('xv', 'rf')) → CorpusT[source]

Construct a Corpus from ELAN files.

Parameters:
  • org_dir – A path to the directory containing the unpreprocessed data.
  • tgt_dir – A path to the directory where the preprocessed data will be stored.
  • feat_type – A string describing the input speech features. For example, “fbank” for log Mel filterbank features.
  • label_type – A string describing the transcription labels. For example, “phonemes” or “tones”.
  • utterance_filter – A function that returns False if an utterance should not be included in the corpus and True otherwise. This can be used to remove undesirable utterances for training, such as codeswitched utterances.
  • label_segmenter – An object that has an attribute segment_labels, which creates new Utterance instances from old ones by segmenting the tokens in their text attribute. Note that LabelSegmenter might be better as a function; the only issue is that it needs to carry with it a list of labels. This could potentially be a function attribute.
  • speakers – A list of speakers to filter for. If None, utterances from all speakers are included.
  • tier_prefixes – A collection of strings that prefix ELAN tiers to filter for. For example, if this is (“xv”, “rf”), then tiers named “xv”, “xv@Mark”, “rf@Rose” would be extracted if they existed.
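
A hedged usage sketch of from_elan, assuming ELAN files under data/elan/ and that we only want non-empty utterances from tiers starting with “xv”; the paths and the filter are hypothetical. Depending on your data you will likely also want to pass a label_segmenter (see LabelSegmenter below).

from pathlib import Path
from persephone.corpus import Corpus
from persephone.utterance import Utterance

def keep_utterance(utterance: Utterance) -> bool:
    # Hypothetical filter: drop utterances with empty transcriptions.
    return bool(utterance.text.strip())

corp = Corpus.from_elan(org_dir=Path("data/elan"),
                        tgt_dir=Path("data/my_elan_corpus"),
                        feat_type="fbank",
                        label_type="phonemes",
                        utterance_filter=keep_utterance,
                        tier_prefixes=("xv",))
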
class persephone.corpus_reader.CorpusReader(corpus, num_train=None, batch_size=None, max_samples=None, rand_seed=0)[source]

Interfaces to the preprocessed corpora to read in train, valid, and test set features and transcriptions. This interface is common to all corpora. It is the responsibility of <corpora-name>.py to preprocess the data into a valid structure of <corpus-name>/[mam-train|mam-valid<seed>|mam-test].

__init__(corpus, num_train=None, batch_size=None, max_samples=None, rand_seed=0)[source]

Construct a new CorpusReader instance.

Parameters:
  • corpus – The Corpus object that interfaces with a given corpus.
  • num_train – The number of training instances from the corpus used.
  • batch_size – The size of the batches to yield. If None, then it is num_train / 32.0.
  • max_samples – The maximum length of utterances measured in samples. Longer utterances are filtered out.
  • rand_seed – The seed for the random number generator. If None, then no randomization is used.
class persephone.model.Model(exp_dir: Union[pathlib.Path, str], corpus_reader: persephone.corpus_reader.CorpusReader)[source]

Generic model for our ASR tasks.

exp_dir

Path where the experiment directory is located.

corpus_reader

CorpusReader object that provides access to the corpus this model is being trained on.

log_softmax

log softmax function

batch_x

A batch of input features. (“x” is the typical notation in ML papers on this topic denoting model input)

batch_x_lens

The lengths of the utterances in the batch. This is used by Tensorflow to know how much to pad utterances that are shorter than the longest in the batch.

batch_y

Reference labels for a batch (“y” is the typical notation in ML papers on this topic denoting training labels)

optimizer

The gradient descent method being used. (Typically we use Adam because it has provided good results but any stochastic gradient descent method could be substituted here)

ler

Label error rate.

dense_decoded

Dense representation of the model transcription output.

dense_ref

Dense representation of the reference transcription.

saved_model_path

Path to where the Tensorflow model is being saved on disk.

__init__(exp_dir: Union[pathlib.Path, str], corpus_reader: persephone.corpus_reader.CorpusReader) → None[source]

Initialize self. See help(type(self)) for accurate signature.

train(*, early_stopping_steps: int = 10, min_epochs: int = 30, max_valid_ler: float = 1.0, max_train_ler: float = 0.3, max_epochs: int = 100, restore_model_path: Optional[str] = None, epoch_callback: Optional[Callable[[Dict[KT, VT]], None]] = None) → None[source]

Train the model.

Parameters:
  • min_epochs – Minimum number of epochs to run training for.
  • max_epochs – Maximum number of epochs to run training for.
  • early_stopping_steps – Stop training after this number of steps if no LER improvement has been made.
  • max_valid_ler – Maximum LER for the validation set. Training will continue until this is met or another stopping condition occurs.
  • max_train_ler – Maximum LER for the training set. Training will continue until this is met or another stopping condition occurs.
  • restore_model_path – The path to restore a model from.
  • epoch_callback – A callback that is called at the end of each training epoch. The parameters passed to the callable will be the epoch number, the current training LER and the current validation LER. This can be useful for progress reporting.
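
For example, building on the Quickstart’s Na corpus, training might be configured with custom stopping criteria like this (the experiment directory is hypothetical):

from persephone import corpus, corpus_reader, rnn_ctc

na_corpus = corpus.Corpus("fbank", "phonemes", "data/na_example")
na_reader = corpus_reader.CorpusReader(na_corpus, batch_size=16)
model = rnn_ctc.Model("exp/na_custom", na_reader, num_layers=2, hidden_size=250)
model.train(min_epochs=10, max_epochs=50, early_stopping_steps=5)
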
transcribe(restore_model_path: Optional[str] = None) → None[source]

Transcribes an untranscribed dataset. Similar to eval() except that no reference transcription is assumed, so no LER is calculated.

Preprocessing

persephone.preprocess.elan.utterances_from_dir(eaf_dir: pathlib.Path, tier_prefixes: Tuple[str, ...]) → List[persephone.utterance.Utterance][source]

Returns the utterances found in ELAN files in a directory.

Recursively explores the directory, gathering ELAN files and extracting utterances from them for tiers that start with the specified prefixes.

Parameters:
  • eaf_dir – A path to the directory to be searched
  • tier_prefixes – Strings matching the start of ELAN tier names that are to be extracted. For example, if you want to extract from tiers “xv-Jane” and “xv-Mark”, then tier_prefixes = [“xv”] would do the job.
Returns:

A list of Utterance objects.
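
A brief usage sketch, assuming a directory of ELAN files at data/elan/ (a hypothetical path):

from pathlib import Path
from persephone.preprocess.elan import utterances_from_dir

utterances = utterances_from_dir(Path("data/elan"), tier_prefixes=("xv", "rf"))
for utter in utterances[:3]:
    print(utter.prefix, utter.start_time, utter.end_time, utter.text)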

class persephone.preprocess.labels.LabelSegmenter

An immutable object that segments the phonemes of an utterance. This could arguably have a __call__ implementation, but that won’t work because namedtuples can’t have special methods. Perhaps it could instead just be a function to which we give a labels attribute. Perhaps that obfuscates things a bit, but it could be okay.

segment_labels

A function that takes an Utterance and returns another Utterance where the text field has been changed to be phonemically segmented, using spaces as delimiters. E.g. “this is” -> “th i s i s”.

labels

A set of labels (eg. phonemes or tones) relevant for segmenting.

persephone.preprocess.wav.extract_wavs(utterances: List[persephone.utterance.Utterance], tgt_dir: pathlib.Path, lazy: bool) → None[source]

Extracts WAVs from the media files associated with a list of Utterance objects and stores them in a target directory.

Parameters:
  • utterances – A list of Utterance objects, which include information about the source media file, and the offset of the utterance in the media_file.
  • tgt_dir – The directory in which to write the output WAVs.
  • lazy – If True, then existing WAVs will not be overwritten if they have the same name

Models

class persephone.rnn_ctc.Model(exp_dir: Union[str, pathlib.Path], corpus_reader, num_layers: int = 3, hidden_size: int = 250, beam_width: int = 100, decoding_merge_repeated: bool = True)[source]

An acoustic model with a LSTM/CTC architecture.

write_desc() → None[source]

Writes a description of the model to the exp_dir.

Distance measurements

persephone.distance.min_edit_distance(source: Sequence[T], target: Sequence[T], ins_cost: Callable[[...], int] = <function <lambda>>, del_cost: Callable[[...], int] = <function <lambda>>, sub_cost: Callable[[...], int] = <function <lambda>>) → int[source]

Calculates the minimum edit distance between two sequences.

Uses the Levenshtein weighting as a default, but offers keyword arguments to supply functions to measure the costs for editing with different elements.

Parameters:
  • ins_cost – A function describing the cost of inserting a given char
  • del_cost – A function describing the cost of deleting a given char
  • sub_cost – A function describing the cost of substituting one char for another
Returns:

The edit distance between the two input sequences.

persephone.distance.min_edit_distance_align(source, target, ins_cost=<function <lambda>>, del_cost=<function <lambda>>, sub_cost=<function <lambda>>)[source]

Finds a minimum cost alignment between two strings.

Uses the Levenshtein weighting as a default, but offers keyword arguments to supply functions to measure the costs for editing with different characters. Note that the alignment may not be unique.

Parameters:
  • ins_cost – A function describing the cost of inserting a given char
  • del_cost – A function describing the cost of deleting a given char
  • sub_cost – A function describing the cost of substituting one char for another
Returns:

A sequence of tuples representing character level alignments between the source and target strings.

persephone.distance.word_error_rate(ref: Sequence[T], hyp: Sequence[T]) → float[source]

Calculate the word error rate of a sequence against a reference.

Parameters:
  • ref – The gold-standard reference sequence
  • hyp – The hypothesis to be evaluated against the reference.
Returns:

The word error rate of the supplied hypothesis with respect to the reference string.

Raises:

persephone.exceptions.EmptyReferenceException – If the length of the reference sequence is 0.
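
A quick sketch of these helpers on space-delimited phoneme sequences; the hypothesis below is invented, and word_error_rate is assumed to return the edit distance normalised by the reference length.

from persephone import distance

ref = "l e dz ɯ z e".split()
hyp = "l e dz i z e".split()

print(distance.min_edit_distance(ref, hyp))   # 1: one substitution
print(distance.word_error_rate(ref, hyp))     # assumed to be 1/6 ≈ 0.17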

Exceptions

exception persephone.exceptions.PersephoneException[source]

Base class for all exceptions raised by the Persephone library

exception persephone.exceptions.NoPrefixFileException[source]

Thrown if files like train_prefixes.txt, test_prefixes.txt can’t be found.

exception persephone.exceptions.DirtyRepoException[source]

An exception that is raised if the current working directory is in a dirty state according to Git.

exception persephone.exceptions.EmptyReferenceException[source]

When calculating word error rates, the reference string must be of length >= 1. Otherwise, this exception will be thrown.

Installation

The Persephone library strives to be Python-only as much as possible, to keep installation easy. However, due to the nature of processing sound files, we have to interact with various utilities that are non-Python.

The Persephone library requires you have Python 3.5 or Python 3.6 installed. Currently Python 3.7 is not supported because we depend on Tensorflow which currently does not support Python 3.7 (see the relevant issue thread)

Installation from PyPi

The Persephone library is available on PyPi: https://pypi.org/project/persephone/

The easiest way to install is via the pip package manager:

pip install persephone

External binaries

The library depends on a few binaries being installed:

  • FFMPEG
  • SOX
  • Kaldi (Optional, required for pitch features support)

There are some bootstrap scripts that are used to provision a development environment which will install the required system packages from apt.

See the Configuration section for how to configure Persephone to use these binaries.

Configuration

The library requires various binaries to be available and directories to be present in order to work. Defaults are defined in persephone/config.py. To override any of these, create a file called settings.ini at the same base path from which you invoke Persephone.

Binaries

Once you have the binaries in the External binaries section installed you may need to configure the paths to them. Here is an example of how to specify the path to required binaries in the settings.ini file:

[PATHS]
SOX_PATH = "sox"
FFMPEG_PATH = "ffmpeg"
KALDI_ROOT = "/home/oadams/tools/kaldi"

Here “sox” and “ffmpeg” must be available on the system path, while KALDI_ROOT is given as an absolute path. The sox and ffmpeg paths can also be specified as absolute paths if you wish.

Paths

There’s a variety of filesystem paths that will be used for storage of data. Here is an example of how to specify paths in the settings.ini file:

[PATHS]
CORPORA_BASE_PATH = "./ourdata/original/"
TGT_DIR = "./preprocessed_data"
EXP_DIR = "./experiments"

CORPORA_BASE_PATH will specify the base for paths that contain the original un-preprocessed source corpora. The default for this is ./data/org/.

TGT_DIR will specify the target directory to store preprocessed data in. The default for this is ./data.

EXP_DIR will specify the directory where experiment results are saved in. The default for this is ./exp.

Support

We are happy to offer direct help to anyone who wants to use the tool. We also welcome thoughts, constructive criticism, and help with design, development and documentation, along with any bug reports or pull requests you may have.

If you find an issue or bug with this code please open an issue on the issues tracker. Please use the discussion mailing list to discuss other questions regarding this project.

If you are having trouble, contact Oliver Adams at oliver.adams@gmail.com.
