Welcome to SpeechCorpusTools’s documentation!

Contents:

Introduction

Speech Corpus Tools is an application for working with speech datasets, with a focus on large-scale speech corpora. It uses PolyglotDB as the underlying data storage, which allows for consistent queries across a range of possible input formats. This documentation describes the motivation and design of SCT, as well as its application in a case study of speech from 12 languages.

This site consists of two parts:

  1. Tutorial: a stand-alone page containing full instructions for installation of SCT, and worked examples using a sample dataset, providing an introduction to SCT’s basic functionality. We recommend the tutorial for first-time users of SCT.
  2. Documentation: the rest of the site, beginning at Navigation Tour, provides documentation of SCT’s functionality. The documentation is still in progress (July 2016), and questions not addressed here should be sent to michael.mcauliffe@gmail.com.

Tutorial

Speech Corpus Tools is a system for going from a raw speech corpus to a data file (CSV) ready for further analysis (e.g. in R), which conceptually consists of a pipeline of four steps:

  1. Import the corpus into SCT
    • Result: a structured database of linguistic objects (words, phones, discourses).
  2. Enrich the database
    • Result: Further linguistic objects (utterances, syllables), and information about objects (e.g. speech rate, word frequencies).
  3. Query the database
    • Result: A set of linguistic objects of interest (e.g. utterance-final words ending with a stop).
  4. Export the results
    • Result: A CSV file containing information about the set of objects of interest

Ideally, the import and enrichment steps are only performed once for a given corpus. The typical use case of SCT is performing a query and export corresponding to some linguistic question(s) of interest.

This tutorial is structured as follows:

  • Installation: Install necessary software – Neo4j and SCT.

  • LibriSpeech database: Obtain a database for the LibriSpeech corpus where the import and enrichment steps have been completed, either by using a premade version or by doing the import and enrichment steps yourself.

  • Examples:
    • Two worked examples (1, 2) illustrating the Query and Export steps, as well as (optional) basic analysis of the resulting data files (CSVs) in R.
    • One additional example (3) left as an exercise.

Installation

Installing Neo4j

SCT currently requires that Neo4j version 3.0 be installed locally and running. To install Neo4j, download the installer for your operating system from the Neo4j download page.

Once downloaded, run the installer; it installs the database software that SCT uses locally.

SCT currently requires a change to Neo4j's configuration. Do the following once, before running SCT:

  • Open the Neo4j application/executable (Mac/Windows)
  • Click on Options ...
  • Click on the Edit... button for Database configuration
  • Replace the text in the window that comes up with the following, then save the file:
#***************************************************************
# Server configuration
#***************************************************************

# This setting constrains all `LOAD CSV` import files to be under the `import` directory. Remove or comment it out to
# allow files to be loaded from anywhere in the filesystem; this introduces possible security problems. See the `LOAD CSV`
# section of the manual for details.
#dbms.directories.import=import

# Require (or disable the requirement of) auth to access Neo4j
dbms.security.auth_enabled=false

#
# Bolt connector
#
dbms.connector.bolt.type=BOLT
dbms.connector.bolt.enabled=true
dbms.connector.bolt.tls_level=OPTIONAL
# To have Bolt accept non-local connections, uncomment this line:
# dbms.connector.bolt.address=0.0.0.0:7687

#
# HTTP Connector
#
dbms.connector.http.type=HTTP
dbms.connector.http.enabled=true
#dbms.connector.http.encryption=NONE
# To have HTTP accept non-local connections, uncomment this line:
#dbms.connector.http.address=0.0.0.0:7474

#
# HTTPS Connector
#
# To enable HTTPS, uncomment these lines:
#dbms.connector.https.type=HTTP
#dbms.connector.https.enabled=true
#dbms.connector.https.encryption=TLS
#dbms.connector.https.address=localhost:7476

# Certificates directory
# dbms.directories.certificates=certificates

#*****************************************************************
# Administration client configuration
#*****************************************************************


# Comma separated list of JAX-RS packages containing JAX-RS resources, one
# package name for each mountpoint. The listed package names will be loaded
# under the mountpoints specified. Uncomment this line to mount the
# org.neo4j.examples.server.unmanaged.HelloWorldResource.java from
# neo4j-examples under /examples/unmanaged, resulting in a final URL of
# http://localhost:${default.http.port}/examples/unmanaged/helloworld/{nodeId}
#dbms.unmanaged_extension_classes=org.neo4j.examples.server.unmanaged=/examples/unmanaged

#*****************************************************************
# HTTP logging configuration
#*****************************************************************

# HTTP logging is disabled. HTTP logging can be enabled by setting this
# property to 'true'.
dbms.logs.http.enabled=false

# Logging policy file that governs how HTTP log output is presented and
# archived. Note: changing the rollover and retention policy is sensible, but
# changing the output format is less so, since it is configured to use the
# ubiquitous common log format
#org.neo4j.server.http.log.config=neo4j-http-logging.xml

# Enable this to be able to upgrade a store from an older version.
#dbms.allow_format_migration=true

# The amount of memory to use for mapping the store files, in bytes (or
# kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g').
# If Neo4j is running on a dedicated server, then it is generally recommended
# to leave about 2-4 gigabytes for the operating system, give the JVM enough
# heap to hold all your transaction state and query context, and then leave the
# rest for the page cache.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 50% of RAM minus the max Java heap size.
#dbms.memory.pagecache.size=10g

#*****************************************************************
# Miscellaneous configuration
#*****************************************************************

# Enable this to specify a parser other than the default one.
#cypher.default_language_version=3.0

# Determines if Cypher will allow using file URLs when loading data using
# `LOAD CSV`. Setting this value to `false` will cause Neo4j to fail `LOAD CSV`
# clauses that load data from the file system.
dbms.security.allow_csv_import_from_file_urls=true

# Retention policy for transaction logs needed to perform recovery and backups.
dbms.tx_log.rotation.retention_policy=false

# Enable a remote shell server which Neo4j Shell clients can log in to.
#dbms.shell.enabled=true
# The network interface IP the shell will listen on (use 0.0.0.0 for all interfaces).
#dbms.shell.host=127.0.0.1
# The port the shell will listen on, default is 1337.
#dbms.shell.port=1337

# Only allow read operations from this Neo4j instance. This mode still requires
# write access to the directory for lock purposes.
#dbms.read_only=false

# Comma separated list of JAX-RS packages containing JAX-RS resources, one
# package name for each mountpoint. The listed package names will be loaded
# under the mountpoints specified. Uncomment this line to mount the
# org.neo4j.examples.server.unmanaged.HelloWorldResource.java from
# neo4j-server-examples under /examples/unmanaged, resulting in a final URL of
# http://localhost:7474/examples/unmanaged/helloworld/{nodeId}
#dbms.unmanaged_extension_classes=org.neo4j.examples.server.unmanaged=/examples/unmanaged

Installing SCT

Once Neo4j is set up as above, the latest version of SCT can be downloaded from the SCT releases page. As of 12 July 2016, the most current release is v0.5.

Windows
  1. Download the zip archive for Windows
  2. Extract the folder
  3. Double click on the executable to run SCT.
Mac
  1. Download the DMG file.
  2. Double-click on the DMG file.
  3. Drag the SCT icon to your Applications folder.
  4. Double click on the SCT application to run.

LibriSpeech database

The examples in this tutorial use a subset of the LibriSpeech ASR corpus, a corpus of read English speech prepared by Vassil Panayotov, Daniel Povey, and collaborators. The subset used here is the test-clean subset, consisting of 5.4 hours of speech from 40 speakers. This subset was force-aligned using the Montreal Forced Aligner and the pronunciation dictionary provided with the corpus. This procedure results in one Praat TextGrid per sentence in the corpus, with phone and word boundaries. We refer to the resulting dataset as the LibriSpeech dataset: 5.4 hours of read sentences with force-aligned phone and word boundaries.

The examples require constructing a PolyglotDB database for the LibriSpeech dataset, in two steps: [2]

  1. Importing the LibriSpeech dataset using SCT, into a database containing information about words, phones, speakers, and files.
  2. Enriching the database to include additional information about other linguistic objects (utterances, syllables) and properties of objects (e.g. speech rate).

Instructions are below for using a premade copy of the LibriSpeech database, where steps (1) and (2) have been carried out for you. Instructions for making your own are coming soon. (For the BigPhon 2016 tutorial, just use the pre-made copy.)

Use pre-made database

Make sure you have opened the SCT application and started Neo4j, at least once. This creates folders for Neo4j databases and for all SCT’s local files (including SQL databases):

  • OS X: /Users/username/Documents/Neo4j, /Users/username/Documents/SCT
  • Windows: C:\Users\username\Documents\Neo4j, C:\Users\username\Documents\SCT

Download and unzip the librispeechDatabase.zip file. It contains two folders, librispeech.graphdb and LibriSpeech. Move these (using Finder on OS X, or File Explorer on Windows) to the Neo4j and SCT folders. After doing so, these directories should exist:

  • /Users/username/Documents/Neo4j/librispeech.graphdb
  • /Users/username/Documents/SCT/LibriSpeech

The next time you start the Neo4j server, select the librispeech.graphdb folder rather than the default database folder.

Some important information about the database (to replicate if you are building your own):

  • Utterances have been defined as speech chunks separated by non-speech (pauses, disfluencies, other person talking) chunks of at least 150 msec.
  • Syllabification has been performed using maximal onset.

Building your own Librispeech database

Coming soon! Some general information on building a database in SCT (= importing data) is here.

Examples

Several worked examples follow, which demonstrate the workflow of SCT and how to construct queries and exports. You should be able to complete each example by following the steps listed in bold. The examples are designed to be completed in order.

Each example results in a CSV file containing data, which you should then be able to use to visualize the results. Instructions for basic visualization in R are given.

Example 1 : Factors affecting vowel duration

Example 2 : Polysyllabic shortening

Example 3 : Menzerath’s Law

Example 1: Factors affecting vowel duration

Motivation

A number of factors affect the duration of vowels, including:

  1. Following consonant voicing (voiced > voiceless)
  2. Speech rate
  3. Word frequency
  4. Neighborhood density

Factor 1 is said to be particularly strong in varieties of English compared to other languages (e.g. Chen, 1970). Here we are interested in examining whether these factors all affect vowel duration, and in particular in seeing how large and reliable the effect of consonant voicing is compared to the other factors.

Step 1: Creating a query profile

Based on the motivation above, we want to make a query for:

  • All vowels in CVC words (fixed syllable structure)
  • Only words where the second C is a stop (to examine following C voicing)
  • Only words at the end of utterances (fixed prosodic position)

To perform a query, you need a query profile. This consists of:

  • The type of linguistic object being searched for (currently: phone, word, syllable, utterance)
  • Filters which restrict the set of objects returned by the query

Once a query profile has been constructed, it can be saved (“Save query profile”). Thus, to carry out a query, you can either create a new one or select an existing one (under “Query profiles”). We’ll assume here that a new profile is being created:

  1. Make a new profile: Under “Query profiles”, select “New Query”.

  2. Find phones: Select “phone” under “Linguistic objects to find”. The screen should now look like:

    (screenshot)
  3. Add filters to the query. A single filter is added by pressing "+" and constructing it by making selections from the drop-down menus that appear. For more information on filters and the intuition behind them, see here.

The first three filters are:

(screenshot)

These do the following:

  • Restrict to utterance-final words:

    • word: the word containing the phone
    • alignment: something about the word’s alignment with respect to a higher unit
    • Right aligned with, utterance: the word should be right-aligned with its containing utterance
  • Restrict to syllabic phones (vowels and syllabic consonants):

    • subset: refer to a “phone subset”, which has been previously defined. Those available in this example include syllabics and consonants.
    • ==, syllabic: this phone should be a syllabic.
  • Restrict to phones followed by a stop (i.e., not a syllabic)

    • following: refer to the following phone
    • manner_of_articulation: refer to a property of phones, which has been previously defined. Those available here include “manner_of_articulation” and “place_of_articulation”
    • ==, stop: the following phone should be a stop.

Then, add three more filters:

(screenshot)

These do the following:

  • Restrict to phones preceded by a consonant

  • Restrict to phones which are the second phone in a word

    • previous: refer to the preceding phone
    • alignment, left aligned with, word: the preceding phone should be left-aligned with (= begin at the same time as) the word containing the target phone. (In this case, this ensures that the vowel is preceded by a word-initial C in the same word: #CV.)
  • Restrict to phones which precede a word-final phone

These filters together form a query corresponding to the desired set of linguistic objects (vowels in utterance-final CVC words, where C2 is a stop).

You should now:

  1. Save the query: Select "Save query profile" and enter a name, such as "LibriSpeech CVC".
  2. Run the query: Select "Run query".
Step 2: Creating an export profile

The next step is to export information about each vowel token as a CSV file. We would like the vowel’s duration and identity, as well as the following factors which are expected to affect the vowel’s duration:

  • Voicing of the following consonant
  • The word’s frequency and neighborhood density
  • The utterance’s speech rate

In addition, we want some identifying information (to debug, and potentially for building statistical models):

  • What speaker each token is from
  • The time where the token occurs in the file
  • The orthography of the word.

Each of these nine variables to be exported corresponds to one row in the export profile.

To create a new export profile:

  1. Select “New export profile” from the “Export query results” menu.

  2. Add one row per variable to be exported, as follows: [1]
    • Press “+” (create a new row)
    • Make selections from drop-down menus to describe the variable.
    • Put the name of the variable in the Output name field. (This will be the name of the corresponding column in the exported CSV. You can use whatever name makes sense to you.)

The nine rows to be added for the variables above result in the following export profile:

(screenshot)

Some explanation of these rows, for a single token: (We use the [u] in /but/ as a running example)

  • Rows 1, 2, and 8 are the duration, label, and beginning time (time) of the phone object (the [u]) in the containing file.
  • Row 3 refers to the voicing of the following phone object (the [t])
    • Note that "following" automatically means "following phone" (i.e., phone does not need to be put after following) because the linguistic objects being found are phones. If the linguistic objects being found were syllables (as in Example 2 below), "following" would automatically mean "following syllable".
  • Rows 4, 5, and 9 refer to properties of the word which contains the phone object: its frequency, neighborhood density, and label (= orthography, here “boot”)
  • Row 6 refers to the utterance which contains the phone: its speech_rate, defined as syllables per second over the utterance.
  • Row 7 refers to the speaker (their name) whose speech contains this phone.

Each case can be thought of as a property (shown in teletype) of a linguistic object or organizational unit (shown in italics).

You can now:

  1. Save the export profile: Select "Save as...", then enter a name, such as "LibriSpeech CVC export".
  2. Perform the export: Select "Run". You will be prompted to enter a filename to export to; make sure it ends in .csv (e.g. librispeechCvc.csv).
Step 3: Examine the data

Here are the first few rows of the resulting data file, in Excel:

(screenshot)

We will load the data and do basic visualization in R. (Make sure that you have the ggplot2 library.)

First, load the data file:

cvc <- read.csv("librispeechCvc.csv")
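
Before plotting, it is worth checking that the columns match the Output names chosen in the export profile (the plots below assume the names from the profile shown above). This also loads ggplot2, which the plots use:

library(ggplot2)
str(cvc)   # column names should match the Output names from the export profile
head(cvc)  # inspect the first few rows
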
Voicing

A plot of the basic voicing effect, by vowel:

ggplot(aes(x=following_consonant_voicing, y=vowel_duration), data=cvc) + geom_boxplot() +
facet_wrap(~vowel, scales = "free_y") + xlab("Consonant voicing") + ylab("Vowel duration (sec)")

It looks like there is generally an effect in the expected direction, but the size of the effect may differ by vowel.

Speech rate

A plot of the basic speech rate effect, divided up by consonant voicing:

ggplot(aes(x=speech_rate, y=vowel_duration), data=cvc) +
  geom_smooth(aes(color=following_consonant_voicing)) +
  geom_point(aes(color=following_consonant_voicing), alpha=0.1, size=1) +
  xlab("Speech rate (sylls/sec)") + ylab("Vowel duration")

There is a large (and possibly nonlinear) speech rate effect. The size of the voicing effect is small compared to speech rate, and the voicing effect may be modulated by speech rate.

Frequency

A plot of the basic frequency effect, divided up by consonant voicing:

ggplot(aes(x=word_frequency, y=vowel_duration), data=cvc) +
  geom_smooth(aes(color=following_consonant_voicing), method="lm") +
  geom_point(aes(color=following_consonant_voicing), alpha=0.1, size=1) +
  xlab("Word frequency (log)") + ylab("Vowel duration") + scale_x_log10()

(Note that we have forced a linear trend here, to make the effect clearer given the presence of more tokens for more frequent words. This turns out to be what the “real” effect looks like, once token frequency is accounted for.)

The basic frequency effect is as expected: shorter duration for higher frequency words. The voicing effect is (again) small in comparison, and may be modulated by word frequency: more frequent words (more reduced?) show a smaller effect.

Neighborhood density

In contrast, there is no clear effect of neighborhood density:

ggplot(aes(x=word_neighborhood_density, y=vowel_duration), data=cvc) +
geom_smooth(aes(color=following_consonant_voicing)) +
geom_point(aes(color=following_consonant_voicing), alpha=0.1, size=1) +
xlab("Neighborhood density") + ylab("Vowel duration")

This turns out to be not unexpected, given previous work: while word duration and vowel quality (e.g., centralization) depend on neighborhood density (e.g. Gahl & Yao, 2011), vowel duration has not been consistently found to depend on neighborhood density (e.g. Munson, 2007).
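
For a rough numeric comparison of these factors beyond visualization, here is a minimal regression sketch. It is illustrative only: it assumes the column names used above, and it is not a substitute for a proper mixed-effects analysis with speaker and word as grouping factors.

# Illustrative only: ordinary least squares, ignoring speaker/word grouping.
# Assumes word_frequency is positive (as in the log-scale plot above).
cvc$log_frequency <- log10(cvc$word_frequency)
m <- lm(vowel_duration ~ following_consonant_voicing + speech_rate +
        log_frequency + word_neighborhood_density, data = cvc)
summary(m)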

Example 2: Polysyllabic shortening

Motivation

Polysyllabic shortening refers to the "same" rhythmic unit (syllable or vowel) becoming shorter as the size of the containing domain (word or prosodic domain) increases. Two classic examples:

  • English: stick, sticky, stickiness (Lehiste, 1972)
  • French: pâte, pâté, pâtisserie (Grammont, 1914)

Polysyllabic shortening is often – but not always – defined as being restricted to accented syllables. (As in the English, but not the French example.) Using SCT, we can check whether a couple of simple versions of polysyllabic shortening hold in the LibriSpeech corpus:

  1. Considering all utterance-final words, does the initial syllable duration decrease as word length increases?
  2. Considering just utterance-final words with primary stress on the initial syllable, does the initial syllable duration decrease as word length increases?

We show (1) here, and leave (2) as an exercise.

Step 1: Query profile

In this case, we want to make a query for:

  • Word-initial syllables
  • ...only in words at the end of utterances (fixed prosodic position)

For this query profile:

  • “Linguistic objects to find” = “syllables”

  • Filters are needed to restrict to:
    • Word-initial syllables
    • Utterance-final words

This corresponds to the following query profile, which has been saved (in this screenshot) as “PSS: first syllable” in SCT:

(screenshot)

The first and second filters are similar to those in Example 1:

  • Restrict to word-initial syllables

    • alignment: something about the syllable’s alignment
    • left aligned with, word: the syllable should be left-aligned with (begin at the same time as) its containing word
  • Restrict to utterance-final words

    • word: word containing the syllable
    • right aligned with, utterance: the word and utterance have the same end time.

You should input this query profile, then run it (optionally saving first).

Step 2: Export profile

This query has found all word-initial syllables of words in utterance-final position. We now want to export information about these linguistic objects to a CSV file, for which we need to construct an export profile. (You should now Start a new export profile.)

We want it to contain everything we need to examine how syllable duration (in seconds) depends on word length (which could be defined in several ways):

  • The duration of the syllable
  • Various measures of word length: the number of syllables and number of phones in the word containing the syllable, as well as the duration (in seconds) of the word.

We also export other information which may be useful (as in Example 1): the syllable label, the speaker name, the time the token occurs in the file, the word label (its orthography), and the word’s stress pattern.

The following export profile contains these nine variables:

(screenshot)

After you enter these rows in the export profile, run the export (optionally saving the export profile first). I exported it as librispeechCvc.csv.

Step 3: Examine the data

In R: load in the data:

pss <- read.csv("librispeechCvc.csv")

There are very few words with 6+ syllables:

library(dplyr)
group_by(pss, word_num_syllables) %>% summarise(n_distinct(word))
## Source: local data frame [6 x 2]
##
##   word_num_syllables n_distinct(word)
##                (int)            (int)
## 1                  1             1019
## 2                  2             1165
## 3                  3              612
## 4                  4              240
## 5                  5               60
## 6                  6                7

So let’s just exclude words with 6+ syllables:

pss <- subset(pss, word_num_syllables<6)

Plot of the duration of the word-initial syllable as a function of word length (in syllables):

library(ggplot2)
ggplot(aes(x=factor(word_num_syllables), y=syllable_duration), data=pss) +
geom_boxplot() + xlab("Number of syllables in word") +
ylab("Duration of initial syllable (stressed)") + scale_y_sqrt()

Here we see a clear polysyllabic shortening effect from 1 to 2 syllables, and possibly one from 2 to 3 and 3 to 4 syllables.

This plot suggests that the effect is pretty robust across speakers (at least for 1–3 syllables):

ggplot(aes(x=factor(word_num_syllables), y=syllable_duration), data=pss) +
geom_boxplot() + xlab("Number of syllables in word") +
ylab("Duration of initial syllable (stressed)") + scale_y_sqrt() + facet_wrap(~speaker)

Example 3: Menzerath’s Law

Motivation: Menzerath’s Law (Menzerath 1928, 1954) refers to the general finding that segments and syllables are shorter in longer words, both in terms of

  • duration per unit
  • number of units (i.e. segments per syllable)

(Menzerath’s Law is related to polysyllabic shortening, but not the same.)

For example, Menzerath’s Law predicts that for English:

  1. The average duration of syllables in a word (in seconds) should decrease as the number of syllables in the word increases.
  2. Likewise, the average duration of segments in a word should decrease as the number of syllables in the word increases.
  3. The average number of phones per syllable in a word should decrease as the number of syllables in the word increases.

Exercise: Build a query profile and export profile to export a data file which lets you test Menzerath’s law for the LibriSpeech corpus. For example, for prediction (1), you could:

  • Find all utterance-final words (to hold prosodic position somewhat constant)
  • Export word duration (seconds), number of syllables, anything else necessary.

(This exercise should be possible using pieces covered in Examples 1 and 2, or minor extensions.)
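
As a hint for analyzing the exported file for prediction (1), here is a minimal R sketch. The file name and the column names (word_duration, word_num_syllables) are hypothetical; substitute whatever Output names you choose in your export profile.

library(ggplot2)
mz <- read.csv("librispeechMenzerath.csv")  # hypothetical file name

# Average syllable duration per word = word duration / number of syllables
mz$mean_syllable_duration <- mz$word_duration / mz$word_num_syllables

ggplot(aes(x = factor(word_num_syllables), y = mean_syllable_duration), data = mz) +
  geom_boxplot() + xlab("Number of syllables in word") +
  ylab("Mean syllable duration (sec)")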

[1] Note that it is also possible to input some of these rows automatically, using the checkboxes in the Simple exports tab.
[2] Technically, this database consists of two sub-databases: a Neo4j database (which contains the hierarchical representation of discourses), and a SQL database (which contains lexical and featural information, and cached acoustic measurements).

Connection

To see an example connection, go to Connection example

In the connection tab, there are various features.

(screenshot)

These are detailed below.

IP address (or localhost)

This is the address of the Neo4j server. In most cases, it will be 'localhost'.

Port

This is the port through which a connection to the Neo4j server is made. By default, it is 7474. It must always match the port shown in the Neo4j window.

(screenshot)

Username and Password

These are by default not required, but they are available should you need authentication for your Neo4j server.

Connect

This button will actually connect the user to the specified server.

Find local audio files

Pressing this allows the user to browse their file system for directories containing audio files that correspond to files in a corpus.

Corpora

The user selects a corpus (for running queries, viewing discourses, enrichment, etc.) by clicking that corpus in the "Available corpora" menu. The selected corpus will be highlighted in blue or grey.

Import local corpus

This is strictly for constructing a new database in Neo4j that does not already exist. Any corpus that has already been imported can be accessed by pressing "Connect" and selecting it instead. Re-importing the same corpus will overwrite the previous corpus of the same name, as well as remove any enrichment the user has done on the corpus.

When importing a new corpus, the user selects the directory of the overall corpus, not specific files or subdirectories.

Building Queries

In this panel, the user constructs queries by adding filters (these will be explained more thoroughly in a moment). There are two key concepts that drive a query in SCT:

  • Linguistic Object A linguistic object can be an utterance, word, syllable, or phone. By selecting a linguistic object, the user is specifying the set of elements over which the query is to be made. For example, selecting "phones" will cause the program to look for phones with properties specified by the user (if "words" were selected, then the program would look for words, etc.)

  • Filters Filters are statements that limit the data returned to a specific set. Each filter added provides another constraint on the data. Click here for more information on filters. Here’s an example of a filter:

    (screenshot)

This filter specifies all the objects (utterances, phones, syllables) that are followed by an object of the same type that shares its rightmost boundary with a word.

Now you're ready to start building queries. Here's an overview of what each dropdown item signifies:

Linguistic Objects

  • Utterance: An utterance is (loosely) a group of sounds delimited by relatively long pauses on either side. This could be a clause, sentence, or phrase. Note that utterances need to be encoded before they are available.
  • Syllable: Syllables currently have to be encoded before this option is available. The encoding is done using maximum attested onset.
  • Word: A word is a collection of phones that form a single meaningful element.
  • Phone: A phone is a single speech segment.

The following is available only for the TIMIT database:

  • surface_transcription This is the phonetic transcription of the utterance

Filters

Filters are conditions that must be satisfied for data to pass through. For example, the following is a filter:

(screenshot)

Many filters have dropdown menus. These look like this:

(screenshot)

Generally speaking, the first dropdown menu is used to target a property. These properties are available without enrichment for all databases:

  • alignment The position of the object in a super-object (i.e. a word in an utterance, a phone in a word...)
  • following Specifies the object after the current object
  • previous Specifies the object before the current object
  • subset Used to delineate classes of phones and words. Certain classes come premade. Others are available through enrichment.
  • duration How much time the object occupies
  • begin The start of the object in time (seconds)
  • end The end of the object in time (seconds)
  • label The orthographic contents of an object
  • word Specifies a word (only available for Utterance, Syllable, and Phone)
  • syllable Specifies a syllable
  • phone Specifies a phone
  • speaker Specifies the speaker
  • discourse Specifies the discourse, or file
  • category Only available for words, specifies the word category
  • transcription Only available for words, specifies the phonetic transcription of the word in the corpus

These are available after enrichment:

  • utterance Available for all objects except utterance, specifies the utterance that the object came from
  • syllable_position Only available for phones, specifies the phone’s position in a syllable
  • num_phones Only available for words, specifies the number of phones in a word
  • num_syllables Only available for words, specifies the number of syllables in a word
  • position_in_utterance Only available for words, specifies the word’s index in the utterance

These are only available for force-aligned databases:

  • manner_of_articulation Only available for phones
  • place_of_articulation Only available for phones
  • voicing Only available for phones
  • vowel_backness Only available for phones
  • vowel_rounding Only available for phones
  • vowel_height Only available for phones
  • frequency Only available for words, specifies the word frequency in the corpus
  • neighborhood_density Only available for words, specifies the number of phonological neighbours of a given word.
  • stress_pattern Only available for words, specifies the stress pattern for a word
The options in the second dropdown menu depend on what you chose in the first column. For example, if you chose phone, you will get all of the phone options specified above; however, if you chose label, you will be presented with a different type of dropdown menu. This section covers some of these possibilities.
  • alignment

    • right aligned with This will filter for objects whose rightmost boundary lines up with the rightmost boundary of the object you will select in the third column of dropdown menus (utterance, syllable, word, or phone).
    • left aligned with This will filter for objects whose leftmost boundary lines up with the leftmost boundary of the object you will select in the third column of dropdown menus (utterance, syllable, word, or phone).
    • not right aligned with This will exclude objects whose rightmost boundary lines up with the rightmost boundary of the object you will select in the third column of dropdown menus (utterance, syllable, word, or phone).
    • not left aligned with This will exclude objects whose leftmost boundary lines up with the leftmost boundary of the object you will select in the third column of dropdown menus (utterance, syllable, word, or phone).
  • subset
    • == This will filter for objects that are in the class that you select in the third dropdown menu.
  • begin / end / num_phones / num_syllables / position_in_utterance / frequency / neighborhood_density / duration

    • == This will filter for objects whose property is equal to what you have specified in the text box following this menu.
    • != This will exclude objects whose property is equal to what you have specified in the text box following this menu.
    • >= This will filter for objects whose property is greater than or equal to what you have specified in the text box following this menu.
    • <= This will filter for objects whose property is less than or equal to what you have specified in the text box following this menu.
    • > This will filter for objects whose property is greater than what you have specified in the text box following this menu.
    • < This will filter for objects whose property is less than what you have specified in the text box following this menu.
  • stress_pattern / category / label / speaker + name / discourse + name / transcription / vowel_height / vowel_backness / vowel_rounding / manner_of_articulation / place_of_articulation / voicing

    • == This will filter for objects whose property is equivalent to what you have specified in the text box or dropdown menu following this menu.
    • != This will exclude objects whose property name is equivalent to what you have specified in the text box or dropdown menu following this menu.
    • regex This option allows you to input a regular expression to match certain properties.

Experiment with combining these filters. Remember that each time you add a filter, you are applying further constraints on the data.

Some complex queries come pre-made. These include “all vowels in mono-syllabic words” and “phones before word-final consonants”. Translating from English to filters can be complicated, so here we’ll cover which filters constitute these two queries.
  • All vowels in mono-syllabic words

    • Since we’re looking for vowels, we know that the linguistic object to search for must be “phones”

    • To get mono-syllabic words, we have to go through three phases of enrichment

      • First, we need to encode syllabic segments
      • Second, we need to encode syllables
      • Finally, we can encode the hierarchical property: count of syllables in word
    • Now that we have this property, we can add a filter to look for monosyllabic words: word: count_of_syllable_in_word == 1

      • Notice that we had to select “word” for “count_of_syllable_in_word” to be available
    • The next filter we want to add would be to get only the vowels from this word. subset == syllabic

      • This will get the syllabic segments (vowels) that we encoded earlier
  • Phones before word-final consonants

    • Once again, it is clear that we are looking for “phones” as our linguistic object.

    • The word “before” should tip you off that we will need to use the “following” or “previous” property.

    • We start by getting all phones that are in the penultimate position in a word. following phone right-aligned with word

      • This will ensure that the phone after the one we are looking for is the word-final phone
    • Now we need to limit it to consonants: following phone subset != syllabic

      • This will further limit the results to only phones before non-syllabic word-final segments (word-final consonants)

Exporting Queries

While getting in-app results can be a quick way to visualize data, most often the user will want to further manipulate the data (e.g. in R, MATLAB, etc.). To this end, there is the "Export query results" feature. It allows the user to specify the information that is exported by adding columns to the final output file. This is somewhat similar to building queries, but not quite the same. Instead of filters, pressing the "+" button will add a column to the exported file.

For example, if the user wanted the timing information (begin/end) and labels for the object found and the object before it, the export profile would look like:

(screenshot)

Perhaps a researcher would be interested in knowing whether word-initial segments in some word categories are longer than in others. To get related information (phone timing information and label, word category) into a .csv file, the export profile would be something like:

(screenshot)

Here, “phone” has been selected as the linguistic object to find (since that is what we’re interested in) so any property without a preceding dropdown menu is a property of the target phone – in this case, alignment would have been used to specify “word-initial phones”.

Another option is to use the “simple export” window.

(screenshot)

Here, there are several common options that can be selected by checking them. Once checked, they will appear as columns in the export profile:

(screenshot)

While many of the column options are the same as those available for building queries, there are some differences:

  • “alignment” and “subset” are not valid column options

  • column options do not change depending on the linguistic object that was chosen earlier

    • instead, you can select “word” and then “label” (or some other option) or “phone” + options, etc.
  • you can edit the column name by typing what you would like to call it in the “Output name:” box. These names are by default very descriptive, but perhaps too long for the user’s purposes.

Since the options are similar but not all identical, here is a full list of all the options available:

  • following Specifies the object after the current object. There will be another dropdown menu to select a property of this following object.

  • previous Specifies the object before the current object. There will be another dropdown menu to select a property of this preceding object.

  • duration Adds how much time the object occupies as a column

  • begin Adds the start of the object in time (seconds) as a column

  • end Adds the end of the object in time (seconds) as a column

  • label Adds the orthographic contents of an object as a column

  • word Specifies a word (another dropdown menu will become available to specify another property to add as a column). The following are only available if “word” is selected either as the original object to search for, or as the first property in a column.

    • category Adds the word category as a column
    • transcription Adds the underlying phonetic transcription of the word in the corpus as a column
    • surface_transcription Adds the surface transcription of the word in the corpus as a column
    • utterance Specifies the utterance that the word came from (another dropdown menu will become available to specify another property to add as a column)
  • phone Specifies a phone (another dropdown menu will become available to specify another property to add as a column)

  • speaker Specifies the speaker (another dropdown menu will become available to specify another property to add as a column)

  • discourse Specifies the discourse, or file (another dropdown menu will become available to specify another property to add as a column)

Once the profile is ready, pressing “run” will open the following window:

(screenshot)

Here the user can pick a name and location for the final file. After pressing save, the query will run and the results will be written in the desired columns to the file.

Viewing Discourses

After completing a query, it might be useful to take a closer look at the discourse, or file, that a result came from. To this end, SCT has the ‘Discourse’ window on the bottom left.

(screenshot)

The user is presented with two windows inside of the ‘Discourse’ window. The top one shows the waveform of the file as well as the transcriptions of words and phones.

(screenshot)

The bottom window is a spectrogram. This maps time and frequency on the X and Y axes respectively, while the darkness of an area indicates amplitude. Lines generated by the software also indicate pitch and formants when available.

(screenshot)

Pitch and formants only become available after selecting "Analyze acoustics" in the enrichment menu. You can view a discourse's acoustic information by clicking on that discourse either in the "Discourse" tab of the top right window (right next to "Connection"),

(screenshot)
or by double-clicking on a result from a query in the “Query #” tab*.
(screenshot)

* NB A successful query must first be run for this tab to appear.

Now something like this should be displayed:

(screenshot)

The waveform is displayed, with annotations

(screenshot)

as well as the spectrogram, whose features can be toggled on and off by clicking on them.

  • Spectrogram On

    (screenshot)
  • Spectrogram Off (just formants and pitch)

    (screenshot)

Viewing Results

Having run a query, a user will want to make sense of the results. These can be found in the "Query #" tab that will appear as soon as the query has finished running.

(screenshot)

Within this tab, based on the linguistic objects the user was searching for (utterance, word, phone, or syllable), there will be different columns*. Here is a list of the default columns:

Utterance

  • begin
  • end
  • discourse
  • speaker

Word

  • begin
  • category * only in Buckeye
  • end
  • label
  • surface_transcription * only in Buckeye
  • transcription
  • discourse
  • speaker

Phone

  • begin
  • end
  • label
  • discourse
  • speaker

Syllable

  • begin
  • end
  • label
  • discourse
  • speaker

* NB Scrolling horizontally may be required to view all of these options.

Example: Connecting to Servers

If you already have Neo4j open and started, you’re ready to start connecting to servers.

Go to the upper right panel in SCT

(screenshot)

You’re not connected to the Neo4j graph database the first time you start the program. Let’s fix that. Make sure that the port is the same as in your Neo4j window.

(screenshot)

If they match, you’re ready to proceed. Press connect. Because it is your first time using the program, nothing will appear in “Available Corpora”, but the “Reset Local Cache” button should now be clickable.

Next, go to “Import Local Corpus” at the bottom center and click on it.

(screenshot)

Press “Buckeye Corpus”. This was included with the tutorial. Go to the tutorial folder and select “buckeyeDataForTutorial”. You will have to wait for the corpus to be imported.

When the process has completed, you are ready to make some queries. Simply select the corpus by clicking on it under “Available corpora” and begin adding filters.

Enrichment

Databases can be enriched by encoding various elements. Usually, the database starts off with just words and phones, but the enrichment options make a diverse range of additional information available to the user. Here are some of the options:

  • Encode non-speech elements this allows the user to specify for a given database what should not count as speech

  • Encode utterances After encoding non-speech elements, we can use them to define utterances (segments of speech separated by a .15-.5 second pause)

  • Encode syllabic segments This allows the user to specify which segments in the corpus are counted as syllabic

  • Encode syllables if the user has encoded syllabic segments, syllables can now be encoded using maximum attested onset

  • Encode hierarchical properties These allow the user to encode such properties as number of syllables in each utterance, or rate of syllables per second

  • Enrich lexicon This allows the user to assign certain properties to specific words using a CSV file. For example, the user might want to encode word frequency. This can be done by having words in one column and corresponding frequencies in the other column of a comma-delimited text file (see the sketch at the end of this list for an example).

  • Enrich phonological inventory Similarly to lexical enrichment, this allows the user to add certain helpful features to phonological properties – for example, adding ‘fricative’ to ‘manner_of_articulation’ for some phones

  • Enrich speakers Like phonological and lexical enrichment, this allows the user to add speaker metadata, such as sex and age, from a CSV file.

  • Encode subsets Similarly to how syllabic phones were encoded into subsets, the user can encode other phones in the corpus into subsets as well

  • Encode relativized measures This permits the user to encode the following statistics

    • Phone
      • Mean duration
      • Median duration
      • Standard deviation of duration
    • Word
      • Mean duration
      • Median duration
      • Standard deviation of duration
      • Baseline duration - this is the sum of the mean durations of the constituent phones
    • Syllable
      • Mean duration
      • Median duration
      • Standard deviation of duration
    • Speaker
      • Average speech rate
  • Encode stress/tone Certain corpus alphabets will come with stress or tone information embedded in vowel characters. For example, in some CMUdict corpora primary stress on the vowel “AA” is represented by “AA1”. This enrichment function allows the user to specify a regular expression to split this information off of the vowel and encode it onto the syllable. The default expressions are for LibriSpeech (stress) and GlobalPhone (tone)

  • Analyze acoustics This will encode pitch and formants into the corpus. This is necessary to view the waveforms and spectrogram.
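
As an illustration of the lexicon enrichment file mentioned under "Enrich lexicon" above, here is a minimal sketch of preparing such a comma-delimited file in R. The layout (words in one column, frequencies in the other) follows the description above; the column headers and the frequency values are placeholders, not something prescribed by SCT.

# Placeholder words and frequencies, for illustration only; replace with real counts.
lexicon <- data.frame(word = c("boot", "cat", "reasons"),
                      frequency = c(1041, 4522, 352))
write.csv(lexicon, "word_frequencies.csv", row.names = FALSE)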

Filters Explained

So far, there has been a lot of talk about objects, filters, and alignment, but these can be difficult concepts to grasp. The illustrated examples below might be helpful in gleaning a better understanding of what is meant by "object", "filter", and "alignment".

The easiest way to start is with an example. Let's say the user wanted to search for word-final fricatives in utterance-initial words.

While to a person this seems like a fairly simple task that can be accomplished at a glance, for SCT it has to be broken up into its constituent steps. Let’s see how this works on this sample sentence:

(screenshot)

Here, each level (utterance, word, phone) corresponds to an object. Since we are ultimately looking for fricatives, we would want to select “phones” as our linguistic object to find.

Right now we have all phones selected, since we haven’t added any filters. Let’s limit these phones by adding the first part of our desired query: word-final phones. To accomplish this, we need to grasp the idea of alignment.

Each object (utterances, words, phones) has two boundaries, left and right. These are represented by the walls of the boxes containing each object in the picture. To be “aligned”, two objects must share a boundary. For example, the non-opaque objects in the next 2 figures are all aligned. Shared boundaries are indicated by thick black lines. Parent objects (for example, words in which a target phone is found) are outlined in dashed lines. In the first picture, the words and phones are “left aligned” with the utterance (their left boundaries are the same as that of the utterance) and in the second image, words and phones are “right aligned” with the utterance.

(screenshots)

Now that we understand alignment, we can use it to filter for word-final phones, by adding in this filter:

(screenshot)

By specifying that we only want phones which share a right boundary with a word, we are getting all word-final phones.

(screenshot)

However, recall that our query asked for word-final fricatives, and not all phones. This can easily be remedied by adding another filter *:

(screenshot)

* NB the “fricative” property is only available through enrichment.

Now the following phones are found:

(screenshot)

Finally, in our query we wanted to specify only utterance-initial words. This will again be done with alignment. Since English reads left to right, the first word in an utterance will be the leftmost word, or the word which shares its leftmost boundary with the utterance. To get this, we add the filter:

(screenshot)

This gives us the result we are looking for: word-final fricatives in utterance-initial words.

(screenshot)

Another thing we can do is specify previous and following words/phones and their properties. For example: what if we wanted the final segment of the second word in an utterance?

(screenshot)

This is where the “following” and “previous” options come into play. We can use “previous” to specify the object before the one we are looking for. If we wanted the last phone of the second word in our sample utterance (the “s” in “reasons”) we would want to specify something about the previous word’s alignment. If we wanted to get the final phone of the words in this position, our filters would be:

(screenshot)

For a full list of filters and their uses, see the section on building queries.

Build your own database

Import

SCT currently supports the following corpus formats:

  • Buckeye

  • TIMIT

  • Force-aligned TextGrids
    • FAVE (multiple talkers okay)
    • LaBB-CAT TextGrid export
    • Prosodylab

To import one of those corpora, press the “Import local corpus” button below the “Available corpora” list. Once it has been pressed, select one of the three main options to import. From there, you will have to select where on the local computer the corpus files live and they will be imported into the local server.

At the moment, importing ignores any connections to remote servers, and requires that a local version of Neo4j is running. Sound files will be detected based on sharing a name with a text file or TextGrid. If the location of the sound files is changed, you can update where SCT thinks they are through the “Find local audio files” button.

See Speech Corpus Tools: Tutorial and examples for more details on how to further enrich your database.
