Snips Natural Language Understanding

Welcome to Snips NLU’s documentation.

Snips NLU is a Natural Language Understanding Python library that allows you to parse sentences written in natural language and extract structured information.

It is the library that powers the NLU engine used in the Snips Console, which you can use to create awesome and private-by-design voice assistants.

Let’s look at the following example, to illustrate the main purpose of this lib:

"What will be the weather in paris at 9pm?"

Properly trained, the Snips NLU engine will be able to extract structured data such as:

{
   "intent": {
      "intentName": "searchWeatherForecast",
      "probability": 0.95
   },
   "slots": [
      {
         "value": "paris",
         "entity": "locality",
         "slotName": "forecastLocality"
      },
      {
         "value": {
            "kind": "InstantTime",
            "value": "2018-02-08 20:00:00 +00:00"
         },
         "entity": "snips/datetime",
         "slotName": "forecastStartDatetime"
      }
   ]
}

Note

The exact output is a bit richer; the point here is to give a glimpse of the kind of information that can be extracted.

This documentation is divided into several parts. It is recommended to start with the first two.

The Installation part will get you set up. Then, the Quickstart section will help you build a toy example.

After this, you can either start the Tutorial which will guide you through the steps to create your own NLU engine and start parsing sentences, or you can alternatively check the Key Concepts & Data Model to know more about the NLU concepts used in this lib.

If you want to dive into the codebase or customize some parts, you can use the API reference documentation or alternatively check the GitHub repository.

Installation

System requirements

  • 64-bit Linux, MacOS >= 10.11, 64-bit Windows
  • Python 2.7 or Python >= 3.5
  • RAM: Snips NLU will typically use between 100MB and 200MB of RAM, depending on the language and the size of the dataset.

Install Snips NLU

It is recommended to use a virtual environment and activate it before installing Snips NLU in order to manage your project dependencies properly.

Snips NLU can be installed via pip with the following command:

pip install snips-nlu

We currently have pre-built binaries (wheels) for snips-nlu and its dependencies for MacOS (10.11 and later), Linux x86_64 and Windows 64-bit. If you use a different architecture or OS, you will need to build these dependencies from source, which means you will need to install setuptools_rust and Rust before running the pip install snips-nlu command.

Language resources

Snips NLU relies on language resources which must be downloaded beforehand. To fetch the resources for a specific language, run the following command:

python -m snips_nlu download <language>

Or simply:

snips-nlu download <language>

The list of supported languages is described here.

Extra dependencies

Metrics

If at some point you want to compute metrics, you will need some extra dependencies that can be installed via:

pip install 'snips-nlu[metrics]'

Tests

To install the dependencies required to run the tests:

pip install 'snips-nlu[test]'

Documentation

To install the dependencies required to build the documentation:

pip install 'snips-nlu[doc]'

Quickstart

In this section, we assume that you have installed snips-nlu and loaded resources for English:

pip install snips-nlu
python -m snips_nlu download en

The Snips NLU engine, in its default configuration, needs to be trained on some data before it can start extracting information. Thus, the first thing to do is to build a dataset that can be fed into Snips NLU. For now, we will use this sample dataset which contains data for two intents:

  • sampleGetWeather -> "What will be the weather in Tokyo tomorrow?"
  • sampleTurnOnLight -> "Turn on the light in the kitchen"

The format used here is JSON, so let’s load it into a Python dict:

import io
import json

with io.open("path/to/sample_dataset.json") as f:
    sample_dataset = json.load(f)

Now that we have our dataset, we can move forward to the next step which is building a SnipsNLUEngine. This is the main object of this lib.

from snips_nlu import SnipsNLUEngine

nlu_engine = SnipsNLUEngine()

Now that we have our engine object created, we need to feed it with our sample dataset. In general, this action will require some machine learning, so we will actually fit the engine:

nlu_engine.fit(sample_dataset)

Our NLU engine is now trained to recognize new utterances that extend beyond what is strictly contained in the dataset: it is able to generalize.

Let’s try to parse something now!

import json

parsing = nlu_engine.parse("What will be the weather in San Francisco next week?")
print(json.dumps(parsing, indent=2))

You should get something that looks like this:

{
  "input": "What will be the weather in San Francisco next week?",
  "intent": {
    "intentName": "sampleGetWeather",
    "probability": 0.641227710154331
  },
  "slots": [
    {
      "range": {
        "start": 28,
        "end": 41
      },
      "rawValue": "San Francisco",
      "value": {
        "kind": "Custom",
        "value": "San Francisco"
      },
      "entity": "location",
      "slotName": "weatherLocation"
    },
    {
      "range": {
        "start": 42,
        "end": 51
      },
      "rawValue": "next week",
      "value": {
        "type": "value",
        "grain": "week",
        "precision": "exact",
        "latent": false,
        "value": "2018-02-12 00:00:00 +01:00"
      },
      "entity": "snips/datetime",
      "slotName": "weatherDate"
    }
  ]
}

Congrats, you parsed your first intent!

Tutorial

In this section, we will build an NLU assistant for home automation tasks. It will be able to understand queries about lights and thermostats. More precisely, our assistant will contain three intents:

  • turnLightOn
  • turnLightOff
  • setTemperature

The first two intents will be about turning on and off the lights in a specific room. These intents will have one Slot which will be the room. The third intent will let you control the temperature of a specific room. It will have two slots: the roomTemperature and the room.

The first step is to create an appropriate dataset for this task.

Training Data

Check the Training Dataset Format section for more details about the format used to describe the training data.

In this tutorial, we will create our dataset using the YAML format, and create a dataset.yaml file with the following content:

# turnLightOn intent
---
type: intent
name: turnLightOn
slots:
  - name: room
    entity: room
utterances:
  - Turn on the lights in the [room](kitchen)
  - give me some light in the [room](bathroom) please
  - Can you light up the [room](living room) ?
  - switch the [room](bedroom)'s lights on please

# turnLightOff intent
---
type: intent
name: turnLightOff
slots:
  - name: room
    entity: room
utterances:
  - Turn off the lights in the [room](entrance)
  - turn the [room](bathroom)'s light out please
  - switch off the light the [room](kitchen), will you?
  - Switch the [room](bedroom)'s lights off please

# setTemperature intent
---
type: intent
name: setTemperature
slots:
  - name: room
    entity: room
  - name: roomTemperature
    entity: snips/temperature
utterances:
  - Set the temperature to [roomTemperature](19 degrees) in the [room](bedroom)
  - please set the [room](living room)'s temperature to [roomTemperature](twenty two degrees celsius)
  - I want [roomTemperature](75 degrees fahrenheit) in the [room](bathroom) please
  - Can you increase the temperature to [roomTemperature](22 degrees) ?

# room entity
---
type: entity
name: room
automatically_extensible: no
values:
- bedroom
- [living room, main room, lounge]
- [garden, yard, backyard]

Here, we put all the intents and entities in the same file, but we could have split them into dedicated files as well.

The setTemperature intent references a roomTemperature slot which relies on the snips/temperature entity. This entity is a builtin entity; it allows temperature values to be resolved properly.

The room entity makes use of synonyms by defining lists like [living room, main room, lounge]. In this case, main room and lounge will point to living room, the first item of the list, which is the reference value.

Besides, this entity is marked as not automatically extensible, which means that the NLU will only output values that we have defined and will not try to match other values.

We are now ready to generate our dataset using the CLI:

snips-nlu generate-dataset en dataset.yaml > dataset.json

Note

We used en as the language here, but other languages are supported; please check the Supported languages section to learn more.

Now that we have our dataset ready, let’s move to the next step which is to create an NLU engine.

The Snips NLU Engine

The main API of Snips NLU is an object called a SnipsNLUEngine. This engine is the one you will train and use for parsing.

The simplest way to create an NLU engine is the following:

from snips_nlu import SnipsNLUEngine

default_engine = SnipsNLUEngine()

In this example the engine was created with default parameters which, in many cases, will be sufficient.

However, in some cases it may be required to tune the engine a bit and provide a customized configuration. Typically, different languages may require different sets of features. You can check the NLUEngineConfig to get more details about what can be configured.

We have built a list of default configurations, one per supported language, that include some language-specific enhancements. In this tutorial we will use the English one.

import io
import json

from snips_nlu import SnipsNLUEngine
from snips_nlu.default_configs import CONFIG_EN

engine = SnipsNLUEngine(config=CONFIG_EN)

At this point, we can try to parse something:

engine.parse("Please give me some lights in the entrance !")

That will raise a NotTrained error, as we did not train the engine with the dataset that we created.
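If you want to handle this case programmatically, you can catch the exception. Here is a minimal sketch, assuming the NotTrained exception is importable from snips_nlu.exceptions:

from snips_nlu.exceptions import NotTrained  # assumed import path

try:
    engine.parse("Please give me some lights in the entrance !")
except NotTrained:
    print("The engine must be fitted before parsing")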

Training the engine

In order to use the engine we created, we need to train it or fit it with the dataset we generated earlier:

with io.open("dataset.json") as f:
    dataset = json.load(f)

engine.fit(dataset)

Note that, by default, training of the NLU engine is non-deterministic: training and testing multiple times on the same data may produce different outputs.

Reproducible trainings can be achieved by passing a random seed to the engine:

seed = 42
engine = SnipsNLUEngine(config=CONFIG_EN, random_state=seed)
engine.fit(dataset)
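As a quick sanity check, two engines created with the same seed and trained on the same dataset should then produce identical parsing outputs. This is only an illustrative sketch of the reproducibility described above:

engine_1 = SnipsNLUEngine(config=CONFIG_EN, random_state=seed)
engine_2 = SnipsNLUEngine(config=CONFIG_EN, random_state=seed)
engine_1.fit(dataset)
engine_2.fit(dataset)

# both trainings used the same seed and the same data
assert engine_1.parse("Hey, lights on in the lounge !") \
    == engine_2.parse("Hey, lights on in the lounge !")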

Note

Due to a scikit-learn bug fixed in version 0.21, we can’t guarantee any deterministic behavior if you’re using a Python version < 3.5, since scikit-learn >= 0.21 is only available starting from Python 3.5.

Parsing

We are now ready to parse:

parsing = engine.parse("Hey, lights on in the lounge !")
print(json.dumps(parsing, indent=2))

You should get the following output (with a slightly different probability value):

{
  "input": "Hey, lights on in the lounge !",
  "intent": {
    "intentName": "turnLightOn",
    "probability": 0.4879843917522865
  },
  "slots": [
    {
      "range": {
        "start": 22,
        "end": 28
      },
      "rawValue": "lounge",
      "value": {
        "kind": "Custom",
        "value": "living room"
      },
      "entity": "room",
      "slotName": "room"
    }
  ]
}

Notice that the lounge slot value points to living room as defined earlier in the entity synonyms of the dataset.

Now, let’s say the intent is already known and provided by the context of the application, but the slots must still be extracted. A second parsing API allows you to extract the slots while providing the intent:

parsing = engine.get_slots("Hey, lights on in the lounge !", "turnLightOn")
print(json.dumps(parsing, indent=2))

This will give you only the extracted slots:

[
  {
    "range": {
      "start": 22,
      "end": 28
    },
    "rawValue": "lounge",
    "value": {
      "kind": "Custom",
      "value": "living room"
    },
    "entity": "room",
    "slotName": "room"
  }
]

Finally, there is another method that allows you to run only the intent classification and get the list of intents along with their scores:

intents = engine.get_intents("Hey, lights on in the lounge !")
print(json.dumps(intents, indent=2))

This should give you something like below:

[
  {
    "intentName": "turnLightOn",
    "probability": 0.6363648460343694
  },
  {
    "intentName": null,
    "probability": 0.2580088944934134
  },
  {
    "intentName": "turnLightOff",
    "probability": 0.22791834836267366
  },
  {
    "intentName": "setTemperature",
    "probability": 0.181781583254962
  }
]

You will notice that the second intent is null. This intent is what we call the None intent and is explained in the next section.

Important

Even though the term "probability" is used here, the values should rather be considered as confidence scores as they do not sum to 1.0.
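In practice, this means you may want to apply your own confidence threshold on top of get_intents(). Here is a small sketch; the 0.5 threshold is an arbitrary value chosen for illustration:

intents = engine.get_intents("Hey, lights on in the lounge !")
best = intents[0]  # results are ordered by decreasing probability
if best["intentName"] is not None and best["probability"] >= 0.5:
    print("Detected intent: {}".format(best["intentName"]))
else:
    print("No intent detected with enough confidence")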

The None intent

On top of the intents that you have declared in your dataset, the NLU engine generates an implicit intent to cover utterances that do not correspond to any of your intents. We refer to it as the None intent.

The NLU engine is trained to recognize when the input corresponds to the None intent. Here is the kind of output you should get if you try parsing "foo bar" with the engine we previously created:

{
  "input": "foo bar",
  "intent": {
    "intentName": null,
    "probability": 0.552122
  },
  "slots": []
}

The None intent is represented by a None value in Python, which translates into a null value in JSON.
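In Python code, this means the None intent can be detected with a simple None check; a minimal sketch using the engine trained above:

parsing = engine.parse("foo bar")
if parsing["intent"]["intentName"] is None:
    print("The input does not match any of the trained intents")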

Intents Filters

In some cases, you may have some extra information regarding the context in which the parsing occurs, and you may already know that some intents won’t be triggered. To leverage that, you can use intents filters and restrict the parsing output to a given list of intents:

parsing = engine.parse("Hey, lights on in the lounge !",
                        intents=["turnLightOn", "turnLightOff"])

This will improve the accuracy of the predictions, as the NLU engine will exclude the other intents from the classification task.

Persisting

As a final step, we will persist the engine into a directory. That may be useful in various contexts, for instance if you want to train on a machine and parse on another one.

You can persist the engine with the following API:

engine.persist("path/to/directory")

And load it:

loaded_engine = SnipsNLUEngine.from_path("path/to/directory")

loaded_engine.parse("Turn lights on in the bathroom please")

Alternatively, you can persist/load the engine as a bytearray:

engine_bytes = engine.to_byte_array()
loaded_engine = SnipsNLUEngine.from_byte_array(engine_bytes)
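The byte array is convenient when you want to store the engine somewhere other than the filesystem, e.g. in a database or a cache. As a minimal sketch, here is how you could write it to a binary file and load it back (the engine.bin filename is just an example):

engine_bytes = engine.to_byte_array()
with open("engine.bin", "wb") as f:
    f.write(engine_bytes)

with open("engine.bin", "rb") as f:
    loaded_engine = SnipsNLUEngine.from_byte_array(bytearray(f.read()))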

Key Concepts & Data Model

This section is meant to explain the concepts and data model that we use to represent input and output data.

The main task that this lib performs is Information Extraction, or Intent Parsing, to be even more specific. At this point, the output of the engine may still not be very clear to you.

The task of parsing intents is actually two-fold. The first step is to understand which intent the sentence is about. The second step is to extract the parameters, a.k.a. the slots, of the sentence.

Intent

In the context of information extraction, an intent corresponds to the action or intention contained in the user’s query, which can be more or less explicit.

Let’s consider, for instance, the following sentences:

"Turn on the light"
"It's too dark in this room, can you fix this?"

They both express the same intent which is switchLightOn, but they are expressed in two very different ways.

Thus, the first task in intent parsing is to be able to detect the intent of the sentence, or, said differently, to classify sentences based on their underlying intent.

In Snips NLU, this is represented within the parsing output in this way:

{
    "intentName": "switchLightOn",
    "probability": 0.87
}

So you get an additional piece of information, which is the probability that the extracted intent corresponds to the actual one.

As explained in the tutorial, on top of the intents you have declared there is another implicit intent handled internally, called the None intent. Any input which corresponds to none of the intents you have declared will be classified as a None intent. In this case the parsing output looks like this:

{
  "input": "foo bar",
  "intent": null,
  "slots": null
}

Slot

The second part of the task, once the intent is known, is to extract the parameters that may be contained in the sentence. We call them slots.

For example, let’s consider this sentence:

"Turn on the light in the kitchen"

As before, the intent is switchLightOn; however, there is now an additional piece of information contained in the word kitchen.

This intent contains one slot, which is the room in which the light is to be turned on.

Let’s consider another example:

"Find me a flight from Paris to Tokyo"

Here the intent would be searchFlight, and there are now two slots in the sentence, contained in "Paris" and "Tokyo". These two values are of the same type, as they both correspond to a location, but they have different roles: Paris is the departure and Tokyo is the arrival.

In this context, we call location a slot type (or entity), while departure and arrival are slot names.

Note

We may refer to either slot type or entity to describe the same concept.

Slot type vs. slot name

A slot type or entity is to NLU what a type is to coding. It describes the nature of the value. In a piece of code, multiple variables can be of the same type while having different purposes, usually reflected in their names. All variables of the same type share some common characteristics: for instance, they have the same methods, they may be comparable, etc.

In information extraction, a slot type corresponds to a class of values that fall into the same category. In our previous example, the location slot type corresponds to all values that correspond to a place, a city, a country or anything that can be located.

The slot name can be thought of as the role played by the entity in the sentence.
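To push the coding analogy a bit further, here is a hypothetical snippet in which two variables share the same type but play different roles, just like two slots can share the same slot type while having different slot names:

# same type (str), different roles, reflected in the variable names
departure = "Paris"   # role: departure
arrival = "Tokyo"     # role: arrival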

In Snips NLU, extracted slots are represented within the output in this way:

[
  {
    "rawValue": "Paris",
    "value": {
      "kind": "Custom",
      "value": "Paris"
    },
    "entity": "location",
    "slotName": "departure",
    "range": {
      "start": 28,
      "end": 41
    }
  },
  {
    "rawValue": "Tokyo",
    "value": {
      "kind": "Custom",
      "value": "Tokyo"
    },
    "entity": "location",
    "slotName": "arrival",
    "range": {
      "start": 28,
      "end": 41
    }
  }
]

In this example, the slot value contains a "kind" attribute whose value here is "Custom". There are two classes of slot types, or entities:

  • Builtin entities
  • Custom entities

Builtin Entities and resolution

Snips NLU actually goes a bit further than simply extracting slots. Let’s illustrate this with another example:

"What will be the weather tomorrow at 10am?"

This sentence contains a slot, "tomorrow at 10am", which is a datetime. Here is what the slot extracted by Snips NLU would look like in this case:

{
  "rawValue": "tomorrow at 10am",
  "value": {
    "kind": "InstantTime",
    "value": "2018-02-10 10:00:00 +00:00",
    "grain": "Hour",
    "precision": "Exact"
  },
  "range": {
    "start": 20,
    "end": 36
  },
  "entity": "snips/datetime",
  "slotName": "weatherDate"
}

As you can see, the "value" field here contains more information than in the previous example. This is because the entity used here, "snips/datetime", is what we call a Builtin Entity.

Snips NLU supports multiple builtin entities, which are typically strongly typed entities such as dates, temperatures, or numbers, and for which a specific extractor is available.

These entities have special labels starting with "snips/". Making use of them when appropriate will not only give better results, but will also provide some entity resolution, such as an ISO format for a date.
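For instance, assuming an engine trained with a weatherDate slot backed by the snips/datetime entity, as in the example above, the resolved value can be read directly from the parsing output; a sketch:

parsing = engine.parse("What will be the weather tomorrow at 10am?")
for slot in parsing["slots"]:
    if slot["entity"] == "snips/datetime":
        # resolved value: an ISO-formatted datetime along with its grain
        print(slot["value"]["value"], slot["value"]["grain"])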

Builtin entities and their underlying extractors are maintained by the Snips team. You can find the list of all the builtin entities supported per language in the builtin entities section. Snips NLU relies on the powerful Rustling library to extract builtin entities from text.

On the other hand, entities that are declared by the developer are called custom entities.

Custom Entities

As soon as you use a slot type which is not part of the Snips builtin entities, you are using a custom entity. There are several things you can do to customize it and make it fit your use case.

Entity Values & Synonyms

The first thing you can do is add a list of possible values for your entity.

By providing a list of example values for your entity, you help Snips NLU grasp what the entity is about.

Let’s say you are creating an assistant whose purpose is to let you set the color of your connected light bulbs. What you will do is define a "color" entity. On top of that you can provide a list of sample colors by editing the entity in your dataset as follows:

{
  "color": {
    "automatically_extensible": true,
    "use_synonyms": true,
    "data": [
      {
        "value": "white",
        "synonyms": []
      },
      {
        "value": "yellow",
        "synonyms": []
      },
      {
        "value": "pink",
        "synonyms": []
      },
      {
        "value": "blue",
        "synonyms": []
      }
    ],
    "matching_strictness": 1.0
  }
}
Or, equivalently, using the YAML dataset format:

---
type: entity
name: color
values:
  - white
  - yellow
  - pink
  - blue

Now imagine that you want to allow some variations around these values, e.g. using "pinky" instead of "pink". You could add these variations to the list as new values; however, in this case what you want is to tell the NLU to consider "pinky" as a synonym of "pink":

{
  "value": "pink",
  "synonyms": ["pinky"]
}
Or, in YAML:

- [pink, pinky]

In this context, Snips NLU will map "pinky" to its reference value, "pink", in its output.

Let’s consider this sentence:

Please make the light pinky

Here is the kind of NLU output that you would get in this context:

{
  "input": "Please make the light pinky",
  "intent": {
    "intentName": "setLightColor",
    "probability": 0.95
  },
  "slots": [
    {
      "rawValue": "pinky",
      "value": {
        "kind": "Custom",
        "value": "pink"
      },
      "entity": "color",
      "slotName": "lightColor",
      "range": {
        "start": 22,
        "end": 27
      }
    }
  ]
}

The "rawValue" field contains the color value as written within the input, but now the "value" field has been resolved and it contains the reference color, "pink", that the synonym refers to.

Automatically Extensible Entities

On top of declaring color values and color synonyms, you can also decide how Snips NLU reacts to unknown entity values.

In the light color assistant example, one of the first things to do would be to check which colors are supported by the bulb, for instance:

["white", "yellow", "red", "blue", "green", "pink", "purple"]

As you can only handle these colors, you can force Snips NLU to filter out slot values that are not part of this list, so that the output always contains valid values, i.e. supported colors.

On the contrary, let’s say you want to build a smart music assistant that will let you control your speakers and play any artist you want.

Obviously, you can’t list all the artists and songs that you might want to listen to at some point. This means that your dataset will contain some examples of such artists, but you expect Snips NLU to extend beyond these values and extract any other artist or song that appears in the same context.

Your entity must be automatically extensible.

Now in practice, there is a flag in the dataset that lets you choose whether or not your custom entity is automatically extensible:

{
  "my_custom_entity": {
    "automatically_extensible": true,
    "use_synonyms": true,
    "data": [],
    "matching_strictness": 1.0
  }
}
Or, in YAML:

---
type: entity
name: my_custom_entity
automatically_extensible: yes

Training Dataset Format

The Snips NLU library leverages machine learning algorithms and some training data in order to produce a powerful intent recognition engine.

The better your training data is, the more accurate your NLU engine will be. Thus, it is worth spending a bit of time creating a dataset that matches your use case well.

Snips NLU accepts two different dataset formats. The first one, which relies on YAML, is the preferred option if you want to create or edit a dataset manually. The other dataset format uses JSON and should rather be used if you plan to create or edit datasets programmatically.

YAML format

The YAML dataset format allows you to define intents and entities using the YAML syntax.

Entity

Here is what an entity file looks like:

# City Entity
---
type: entity # allows to differentiate between entities and intents files
name: city # name of the entity
values:
  - london # single entity value
  - [new york, big apple] # entity value with a synonym
  - [paris, city of lights]

You can specify entity values either using single YAML scalars (e.g. london), or using lists if you want to define some synonyms (e.g. [paris, city of lights]).

Here is a more comprehensive example which contains additional attributes that are optional:

# City Entity
---
type: entity
name: city
automatically_extensible: false # default value is true
use_synonyms: false # default value is true
matching_strictness: 0.8 # default value is 1.0
values:
  - london
  - [new york, big apple]
  - [paris, city of lights]

Intent

Here is the format used to describe an intent:

# searchFlight Intent
---
type: intent
name: searchFlight # name of the intent
utterances:
  - find me a flight from [origin:city](Paris) to [destination:city](New York)
  - I need a flight leaving [date:snips/datetime](this weekend) to [destination:city](Berlin)
  - show me flights to go to [destination:city](new york) leaving [date:snips/datetime](this evening)

We use a standard markdown-like annotation syntax to annotate slots within utterances. The [origin:city](Paris) chunk describes a slot with its three components:

  • origin: the slot name
  • city: the slot type
  • Paris: the slot value

Note that different slot names can share the same slot type. This is the case for the origin and destination slot names in the previous example, which have the same slot type city.

If you are going to write more than just three utterances, you can specify the slot mapping explicitly in the intent file and remove it from the utterances. This results in simpler annotations:

# searchFlight Intent
---
type: intent
name: searchFlight
slots:
  - name: origin
    entity: city
  - name: destination
    entity: city
  - name: date
    entity: snips/datetime
utterances:
  - find me a flight from [origin](Paris) to [destination](New York)
  - I need a flight leaving [date](this weekend) to [destination](Berlin)
  - show me flights to go to [destination](new york) leaving [date](this evening)

Important

If one of your utterances starts with [, you must put it between double quotes to respect the YAML syntax: "[origin] to [destination]".

Dataset

You are free to organize the YAML documents as you want: either have one YAML file for each intent and each entity, or gather several documents together (e.g. all entities together, or all intents together) in the same YAML file. Here is the YAML file corresponding to the previous city entity and searchFlight intent merged together:

# searchFlight Intent
---
type: intent
name: searchFlight
slots:
  - name: origin
    entity: city
  - name: destination
    entity: city
  - name: date
    entity: snips/datetime
utterances:
  - find me a flight from [origin](Paris) to [destination](New York)
  - I need a flight leaving [date](this weekend) to [destination](Berlin)
  - show me flights to go to [destination](new york) leaving [date](this evening)

# City Entity
---
type: entity
name: city
values:
  - london
  - [new york, big apple]
  - [paris, city of lights]

Important

If you plan to have more than one entity or intent in a YAML file, you must separate them using the YAML document separator: ---

Implicit entity values and slot mapping

In order to make the annotation process even easier, there is a mechanism that automatically populates entity values based on the entity values that are already provided.

This results in a much simpler dataset file:

# searchFlight Intent
---
type: intent
name: searchFlight
slots:
  - name: origin
    entity: city
  - name: destination
    entity: city
  - name: date
    entity: snips/datetime
utterances:
  - find me a flight from [origin] to [destination]
  - I need a flight leaving [date] to [destination]
  - show me flights to go to [destination] leaving [date]

# City Entity
---
type: entity
name: city
values:
  - london
  - [new york, big apple]
  - [paris, city of lights]

For this to work, you need to provide at least one value for each custom entity. This can be done either through an entity file, or simply by providing an entity value in one of the annotated utterances. Entity values are automatically generated for builtin entities.

Here is a final example of a valid YAML dataset leveraging implicit entity values as well as implicit slot mapping:

# searchFlight Intent
---
type: intent
name: searchFlight
utterances:
  - find me a flight from [origin:city](Paris) to [destination:city]
  - I need a flight leaving [date:snips/datetime] to [destination]
  - show me flights to go to [destination] leaving [date]

Note that the city entity was not provided here, but one value (Paris) was provided in the first annotated utterance. The mapping between slot name and entity is also inferred from the first two utterances.

Once your intents and entities are created using the YAML format described previously, you can produce a dataset using the Command Line Interface (CLI):

snips-nlu generate-dataset en city.yaml searchFlight.yaml > dataset.json

Or alternatively if you merged the yaml documents into a single file:

snips-nlu generate-dataset en dataset.yaml > dataset.json

This will generate a JSON dataset and write it in the dataset.json file. The format of the generated file is the second allowed format that is described in the JSON format section.

JSON format

The JSON format is the format which is eventually used by the training API. It was designed to be easy to parse.

We created a sample dataset that you can check to better understand the format.

There are three attributes at the root of the JSON document:

  • "language": the language of the dataset in ISO format
  • "intents": a dictionary mapping between intents names and intents data
  • "entities": a dictionary mapping between entities names and entities data

Here is how the entities are represented in this format:

{
  "entities": {
    "snips/datetime": {},
    "city": {
      "data": [
        {
          "value": "london",
          "synonyms": []
        },
        {
          "value": "new york",
          "synonyms": [
            "big apple"
          ]
        },
        {
          "value": "paris",
          "synonyms": [
            "city of lights"
          ]
        }
      ],
      "use_synonyms": true,
      "automatically_extensible": true,
      "matching_strictness": 1.0
    }
  }
}

Note that the "snips/datetime" entity data is empty as it is a builtin entity.

The intent utterances are defined using the following format:

{
  "data": [
    {
      "text": "find me a flight from "
    },
    {
      "text": "Paris",
      "entity": "city",
      "slot_name": "origin"
    },
    {
      "text": " to "
    },
    {
      "text": "New York",
      "entity": "city",
      "slot_name": "destination"
    }
  ]
}
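Since this format is plain JSON, it is also easy to build programmatically. Here is a minimal sketch of a dataset assembled as a Python dict and written to a file, reusing the searchFlight intent and city entity from above:

import io
import json

dataset = {
    "language": "en",
    "intents": {
        "searchFlight": {
            "utterances": [
                {
                    "data": [
                        {"text": "find me a flight from "},
                        {"text": "Paris", "entity": "city",
                         "slot_name": "origin"},
                        {"text": " to "},
                        {"text": "New York", "entity": "city",
                         "slot_name": "destination"}
                    ]
                }
            ]
        }
    },
    "entities": {
        "city": {
            "data": [
                {"value": "london", "synonyms": []},
                {"value": "new york", "synonyms": ["big apple"]}
            ],
            "use_synonyms": True,
            "automatically_extensible": True,
            "matching_strictness": 1.0
        }
    }
}

with io.open("dataset.json", mode="w", encoding="utf8") as f:
    json.dump(dataset, f, indent=2)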

Once you have created a JSON dataset, either directly or with YAML files, you can use it to train an NLU engine. To do so, you can use the CLI as documented here, or the python API.

Custom Processing Units

The Snips NLU library provides a default NLU pipeline containing built-in processing units such as the LookupIntentParser or the ProbabilisticIntentParser.

However, it is possible to define custom processing units and use them in a SnipsNLUEngine.

The main processing unit of the Snips NLU processing pipeline is the SnipsNLUEngine. This engine relies on a list of IntentParser objects that are called successively until one of them manages to extract an intent. By default, two parsers are used by the engine: a LookupIntentParser and a ProbabilisticIntentParser.

Let’s focus on the probabilistic intent parser. This parser parses text in two steps: first, it classifies the intent using an IntentClassifier; then, once the intent is known, it uses a SlotFiller to extract the slots.

For the purpose of this tutorial, let’s build a custom alternative to the CRFSlotFiller which is the default slot filler used by the probabilistic intent parser.

Our custom slot filler will extract slots by relying on a very simple and naive keyword matching logic:

import json

from snips_nlu.common.utils import json_string
from snips_nlu.preprocessing import tokenize
from snips_nlu.result import unresolved_slot
from snips_nlu.slot_filler import SlotFiller


@SlotFiller.register("keyword_slot_filler")
class KeywordSlotFiller(SlotFiller):
    def __init__(self, config=None, **shared):
        super(KeywordSlotFiller, self).__init__(config, **shared)
        self.slots_keywords = None
        self.language = None

    @property
    def fitted(self):
        return self.slots_keywords is not None

    def fit(self, dataset, intent):
        self.language = dataset["language"]
        self.slots_keywords = dict()
        utterances = dataset["intents"][intent]["utterances"]
        for utterance in utterances:
            for chunk in utterance["data"]:
                if "slot_name" in chunk:
                    text = chunk["text"]
                    self.slots_keywords[text] = [
                        chunk["entity"],
                        chunk["slot_name"]
                    ]
        return self

    def get_slots(self, text):
        tokens = tokenize(text, self.language)
        slots = []
        for token in tokens:
            value = token.value
            if value in self.slots_keywords:
                entity = self.slots_keywords[value][0]
                slot_name = self.slots_keywords[value][1]
                slot = unresolved_slot((token.start, token.end), value,
                                       entity, slot_name)
                slots.append(slot)
        return slots

    def persist(self, path):
        model = {
            "language": self.language,
            "slots_keywords": self.slots_keywords,
            "config": self.config.to_dict()
        }
        with path.open(mode="w", encoding="utf8") as f:
            f.write(json_string(model))

    @classmethod
    def from_path(cls, path, **shared):
        with path.open(encoding="utf8") as f:
            model = json.load(f)
        slot_filler = cls()
        slot_filler.language = model["language"]
        slot_filler.slots_keywords = model["slots_keywords"]
        slot_filler.config = cls.config_type.from_dict(model["config"])
        return slot_filler

Our custom slot filler is registered to the list of available processing units by the use of a class decorator: @SlotFiller.register("keyword_slot_filler").

Now that we have created our keyword slot filler, we can create a specific NLUEngineConfig which will make use of it:

from snips_nlu import SnipsNLUEngine
from snips_nlu.pipeline.configs import (
    ProbabilisticIntentParserConfig, NLUEngineConfig)
from snips_nlu.slot_filler.keyword_slot_filler import KeywordSlotFiller

slot_filler_config = KeywordSlotFiller.default_config()
parser_config = ProbabilisticIntentParserConfig(
    slot_filler_config=slot_filler_config)
engine_config = NLUEngineConfig([parser_config])
nlu_engine = SnipsNLUEngine(engine_config)

Custom processing unit configuration

So far, our keyword slot filler is very simple, especially because it is not configurable.

Now, let’s imagine that we would like to perform a normalization step before matching keywords, which would consist of lowercasing the values. We could hardcode this behavior in our unit, but what we want instead is a way to configure it. This can be done through the use of the config attribute of our keyword slot filler. Let’s add a boolean parameter to the config, so that our KeywordSlotFiller implementation now looks like this:

import json

from snips_nlu.common.utils import json_string
from snips_nlu.preprocessing import tokenize
from snips_nlu.result import unresolved_slot
from snips_nlu.slot_filler import SlotFiller


@SlotFiller.register("keyword_slot_filler")
class KeywordSlotFiller(SlotFiller):
    def __init__(self, config=None, **shared):
        super(KeywordSlotFiller, self).__init__(config, **shared)
        self.slots_keywords = None
        self.language = None

    @property
    def fitted(self):
        return self.slots_keywords is not None

    def fit(self, dataset, intent):
        self.language = dataset["language"]
        self.slots_keywords = dict()
        utterances = dataset["intents"][intent]["utterances"]
        for utterance in utterances:
            for chunk in utterance["data"]:
                if "slot_name" in chunk:
                    text = chunk["text"]
                    if self.config.get("lowercase", False):
                        text = text.lower()
                    self.slots_keywords[text] = [
                        chunk["entity"],
                        chunk["slot_name"]
                    ]
        return self

    def get_slots(self, text):
        tokens = tokenize(text, self.language)
        slots = []
        for token in tokens:
            normalized_value = token.value
            if self.config.get("lowercase", False):
                normalized_value = normalized_value.lower()
            if normalized_value in self.slots_keywords:
                entity = self.slots_keywords[normalized_value][0]
                slot_name = self.slots_keywords[normalized_value][1]
                slot = unresolved_slot((token.start, token.end), token.value,
                                       entity, slot_name)
                slots.append(slot)
        return slots

    def persist(self, path):
        model = {
            "language": self.language,
            "slots_keywords": self.slots_keywords,
            "config": self.config.to_dict()
        }
        with path.open(mode="w", encoding="utf8") as f:
            f.write(json_string(model))

    @classmethod
    def from_path(cls, path, **shared):
        with path.open(encoding="utf8") as f:
            model = json.load(f)
        slot_filler = cls()
        slot_filler.language = model["language"]
        slot_filler.slots_keywords = model["slots_keywords"]
        slot_filler.config = cls.config_type.from_dict(model["config"])
        return slot_filler

With this updated implementation, we can now define a more specific configuration for our slot filler:

from snips_nlu import SnipsNLUEngine
from snips_nlu.pipeline.configs import (
    ProbabilisticIntentParserConfig, NLUEngineConfig)
from snips_nlu.slot_filler.keyword_slot_filler import KeywordSlotFiller

slot_filler_config = {
    "unit_name": "keyword_slot_filler",  # required in order to identify the processing unit
    "lower_case": True
}
parser_config = ProbabilisticIntentParserConfig(
    slot_filler_config=slot_filler_config)
engine_config = NLUEngineConfig([parser_config])
nlu_engine = SnipsNLUEngine(engine_config)

You can now train this engine, parse intents, persist it and load it from disk.
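For instance, reusing the dataset.json file generated in the Tutorial section, an end-to-end sketch could look like this (the persistence path is arbitrary, and the module defining KeywordSlotFiller must be imported so that the unit is registered before loading):

import io
import json

with io.open("dataset.json") as f:
    dataset = json.load(f)

nlu_engine.fit(dataset)
print(nlu_engine.parse("Turn on the lights in the kitchen"))

nlu_engine.persist("path/to/custom_engine")
loaded_engine = SnipsNLUEngine.from_path("path/to/custom_engine")
print(loaded_engine.parse("Turn on the lights in the kitchen"))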

Note

The client code is responsible for persisting and loading the unit configuration as done in the implementation example. This will ensure that the proper configuration is used when deserializing the processing unit.

Supported languages

Snips NLU supports various languages. The language is specified in the dataset in the "language" attribute.

Here is the list of supported languages along with their ISO codes:

Language               ISO code
German                 de
English                en
Spanish                es
French                 fr
Italian                it
Japanese               ja
Korean                 ko
Portuguese (Brazil)    pt_br
Portuguese (Portugal)  pt_pt

Support for additional languages will come in the future, stay tuned :)

Supported builtin entities

Builtin entities are entities that have built-in support in Snips NLU. These entities are associated with specific builtin entity parsers which provide an extra resolution step. Typically, dates written in natural language ("in three days") are resolved into ISO-formatted dates ("2019-08-12 00:00:00 +02:00").

Here is the list of supported builtin entities:

Entity Identifier Category Supported Languages
AmountOfMoney snips/amountOfMoney Grammar Entity de, en, es, fr, it, ja, ko, pt_br, pt_pt
Duration snips/duration Grammar Entity de, en, es, fr, it, ja, ko, pt_br, pt_pt
Number snips/number Grammar Entity de, en, es, fr, it, ja, ko, pt_br, pt_pt
Ordinal snips/ordinal Grammar Entity de, en, es, fr, it, ja, ko, pt_br, pt_pt
Temperature snips/temperature Grammar Entity de, en, es, fr, it, ja, ko, pt_br, pt_pt
Datetime snips/datetime Grammar Entity de, en, es, fr, it, ja, ko, pt_br, pt_pt
Date snips/date Grammar Entity en
Time snips/time Grammar Entity en
DatePeriod snips/datePeriod Grammar Entity en
TimePeriod snips/timePeriod Grammar Entity en
Percentage snips/percentage Grammar Entity de, en, es, fr, it, ja, pt_br, pt_pt
MusicAlbum snips/musicAlbum Gazetteer Entity de, en, es, fr, it, ja, pt_br, pt_pt
MusicArtist snips/musicArtist Gazetteer Entity de, en, es, fr, it, ja, pt_br, pt_pt
MusicTrack snips/musicTrack Gazetteer Entity de, en, es, fr, it, ja, pt_br, pt_pt
City snips/city Gazetteer Entity de, en, es, fr, it, ja, pt_br, pt_pt
Country snips/country Gazetteer Entity de, en, es, fr, it, ja, pt_br, pt_pt
Region snips/region Gazetteer Entity de, en, es, fr, it, ja, pt_br, pt_pt

The entity identifier (second column above) is what is used in the dataset to reference a builtin entity.

Grammar Entity

Grammar entities, in the context of Snips NLU, correspond to entities which contain significant compositionality. The semantic meaning of such an entity is determined by the meanings of its constituent expressions and the rules used to combine them. Modern semantic parsers for these entities are often based on defining a formal grammar. In the case of Snips NLU, the parser used to handle these entities is Rustling, a Rust adaptation of Facebook’s duckling.

Gazetteer Entity

Gazetteer entities correspond to all the builtin entities which do not contain any semantic structure, as opposed to the grammar entities. For such entities, a gazetteer entity parser is used to perform the parsing.

Results Examples

The following sections provide examples for each builtin entity.

AmountOfMoney

Input examples:

[
  "$10",
  "six euros",
  "around 5€",
  "ten dollars and five cents"
]

Output examples:

[
  {
    "kind": "AmountOfMoney",
    "value": 10.05,
    "precision": "Approximate",
    "unit": "€"
  }
]

Duration

Input examples:

[
  "1h",
  "during two minutes",
  "for 20 seconds",
  "3 months",
  "half an hour",
  "8 years and two days"
]

Output examples:

[
  {
    "kind": "Duration",
    "years": 0,
    "quarters": 0,
    "months": 3,
    "weeks": 0,
    "days": 0,
    "hours": 0,
    "minutes": 0,
    "seconds": 0,
    "precision": "Exact"
  }
]

Number

Input examples:

[
  "2001",
  "twenty one",
  "three hundred and four"
]

Output examples:

[
  {
    "kind": "Number",
    "value": 42.0
  }
]

Ordinal

Input examples:

[
  "1st",
  "the second",
  "the twenty third"
]

Output examples:

[
  {
    "kind": "Ordinal",
    "value": 2
  }
]

Temperature

Input examples:

[
  "70K",
  "3°C",
  "Twenty three degrees",
  "one hundred degrees fahrenheit"
]

Output examples:

[
  {
    "kind": "Temperature",
    "value": 23.0,
    "unit": "celsius"
  },
  {
    "kind": "Temperature",
    "value": 60.0,
    "unit": "fahrenheit"
  }
]

Datetime

Input examples:

[
  "Today",
  "at 8 a.m.",
  "4:30 pm",
  "in 1 hour",
  "the 3rd tuesday of June"
]

Output examples:

[
  {
    "kind": "InstantTime",
    "value": "2017-06-13 18:00:00 +02:00",
    "grain": "Hour",
    "precision": "Exact"
  },
  {
    "kind": "TimeInterval",
    "from": "2017-06-07 18:00:00 +02:00",
    "to": "2017-06-08 00:00:00 +02:00"
  }
]

Date

Input examples:

[
  "today",
  "on Wednesday",
  "March 26th",
  "saturday january 19",
  "monday 15th april 2019",
  "the day after tomorrow"
]

Output examples:

[
  {
    "kind": "InstantTime",
    "value": "2017-06-13 00:00:00 +02:00",
    "grain": "Day",
    "precision": "Exact"
  }
]

Time

Input examples:

[
  "now",
  "at noon",
  "at 8 a.m.",
  "4:30 pm",
  "in one hour",
  "for ten o'clock",
  "at ten in the evening"
]

Output examples:

[
  {
    "kind": "InstantTime",
    "value": "2017-06-13 18:00:00 +02:00",
    "grain": "Hour",
    "precision": "Exact"
  }
]

DatePeriod

Input examples:

[
  "january",
  "2019",
  "from monday to friday",
  "from wednesday 27th to saturday 30th",
  "this week"
]

Output examples:

[
  {
    "kind": "TimeInterval",
    "from": "2017-06-07 00:00:00 +02:00",
    "to": "2017-06-09 00:00:00 +02:00"
  }
]

TimePeriod

Input examples:

[
  "until dinner",
  "from five to ten",
  "by the end of the day"
]

Output examples:

[
  {
    "kind": "TimeInterval",
    "from": "2017-06-07 18:00:00 +02:00",
    "to": "2017-06-07 20:00:00 +02:00"
  }
]

Percentage

Input examples:

[
  "25%",
  "twenty percent",
  "two hundred and fifty percents"
]

Output examples:

[
  {
    "kind": "Percentage",
    "value": 20.0
  }
]

MusicAlbum

Input examples:

[
  "Discovery"
]

Output examples:

[
  {
    "kind": "MusicAlbum",
    "value": "Discovery"
  }
]

MusicArtist

Input examples:

[
  "Daft Punk"
]

Output examples:

[
  {
    "kind": "MusicArtist",
    "value": "Daft Punk"
  }
]

MusicTrack

Input examples:

[
  "Harder Better Faster Stronger"
]

Output examples:

[
  {
    "kind": "MusicTrack",
    "value": "Harder Better Faster Stronger"
  }
]

City

Input examples:

[
  "San Francisco",
  "Los Angeles",
  "Beijing",
  "Paris"
]

Output examples:

[
  {
    "kind": "City",
    "value": "Paris"
  }
]

Country

Input examples:

[
  "France"
]

Output examples:

[
  {
    "kind": "Country",
    "value": "France"
  }
]

Region

Input examples:

[
  "California",
  "Washington"
]

Output examples:

[
  {
    "kind": "Region",
    "value": "California"
  }
]

Evaluation

The snips-nlu library provides two CLI commands to compute metrics and evaluate the quality of your NLU engine.

Cross Validation Metrics

You can compute cross validation metrics on a given dataset by running the following command:

snips-nlu cross-val-metrics path/to/dataset.json path/to/metrics.json --include_errors

This will produce a JSON metrics report that will be stored in the path/to/metrics.json file. Since the --include_errors flag is passed here, the report will also include the parsing errors along with the metrics.

You can check the CLI help for the exhaustive list of options:

snips-nlu cross-val-metrics --help

Train / Test metrics

Alternatively, you can compute metrics in a classical train / test fashion by running the following command:

snips-nlu train-test-metrics path/to/train_dataset.json path/to/test_dataset.json path/to/metrics.json

This will produce a similar metrics report to the one before.

You can check the CLI help for the exhaustive list of options:

snips-nlu train-test-metrics --help

Command Line Interface

The easiest way to test the abilities of the Snips NLU library is through the command line interface (CLI). The CLI is installed with the python package and is typically used by running snips-nlu <command> [args] or alternatively python -m snips_nlu <command> [args].

Creating a dataset

As seen in the tutorial section, a command allows you to generate a dataset from a language and a list of YAML files containing data for intents and entities:

snips-nlu generate-dataset en my_first_intent.yaml my_second_intent.yaml my_entity.yaml

Note

You don’t have to use separate files for each intent and entity. You could for instance merge all intents together in a single intents.yaml file, or even merge all intents and entities in a single dataset.yaml file.

This will print a JSON string to the standard output. If you want to store the dataset directly in a JSON file, you just have to redirect the output of the previous command to a file, as below:

snips-nlu generate-dataset en my_first_intent.yaml my_second_intent.yaml my_entity.yaml > dataset.json

Check the Training Dataset Format section for more details about the format used to describe the training data.

Training

Once you have built a proper dataset, you can use the CLI to train an NLU engine:

snips-nlu train path/to/dataset.json path/to/persisted_engine

The first parameter corresponds to the path of the dataset file. The second parameter is the directory where the engine should be saved after training. The CLI takes care of creating this directory. You can enable logs by adding a -v flag.

Parsing

Finally, you can use the parsing command line to test interactively the parsing abilities of a trained NLU engine:

snips-nlu parse path/to/persisted_engine

This will run a prompt allowing you to parse queries interactively. You can also pass a single query using an optional parameter:

snips-nlu parse path/to/persisted_engine -q "my query"

In both previous examples, you can perform parsing using intents filters by providing a comma-separated list of intents:

snips-nlu parse path/to/persisted_engine -f intent1,intent3

Evaluation

The CLI provides two commands that will help you evaluate the performance of your NLU engine. These commands are detailed in this dedicated section.

Versions

Two simple commands allow you to print the version of the library and the version of the NLU model:

snips-nlu version
snips-nlu model-version

API reference

This part of the documentation covers the most important interfaces of the Snips NLU package.

NLU engine

class SnipsNLUEngine(config=None, **shared)

Main class to use for intent parsing

A SnipsNLUEngine relies on a list of IntentParser objects to parse intents, calling them successively and using the first positive output.

With the default parameters, it will use the following two intent parsers, in this order:

  • LookupIntentParser
  • ProbabilisticIntentParser

The logic behind this is to first use a conservative parser which has a very good precision while its recall is modest, so simple patterns will be caught, and then fall back on a second parser which is machine-learning based and is able to parse unseen utterances while ensuring good precision and recall.

The NLU engine can be configured by passing an NLUEngineConfig.
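As an illustration, here is a sketch reproducing this default ordering with an explicit configuration. It assumes that LookupIntentParserConfig and ProbabilisticIntentParserConfig are both exposed in snips_nlu.pipeline.configs, as the other configuration classes used in this documentation are:

from snips_nlu import SnipsNLUEngine
from snips_nlu.pipeline.configs import (
    LookupIntentParserConfig, NLUEngineConfig, ProbabilisticIntentParserConfig)

# the lookup parser is tried first, then the probabilistic parser
config = NLUEngineConfig(
    [LookupIntentParserConfig(), ProbabilisticIntentParserConfig()])
engine = SnipsNLUEngine(config=config)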

config_type

alias of snips_nlu.pipeline.configs.nlu_engine.NLUEngineConfig

intent_parsers = None

list of IntentParser

fitted

Whether or not the nlu engine has already been fitted

fit(**kwargs)

Fits the NLU engine

Parameters:
  • dataset (dict) – A valid Snips dataset
  • force_retrain (bool, optional) – If False, will not retrain intent parsers when they are already fitted. Defaults to True.
Returns:

The same object, trained.

parse(**kwargs)

Performs intent parsing on the provided text by calling its intent parsers successively

Parameters:
  • text (str) – Input
  • intents (str or list of str, optional) – If provided, reduces the scope of intent parsing to the provided list of intents. The None intent is never filtered out, meaning that it can be returned even when using an intents scope.
  • top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain less than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.
Returns:

the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type:

dict or list

Raises:
  • NotTrained – When the nlu engine is not fitted
  • InvalidInputError – When input type is not unicode
get_intents(**kwargs)

Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

Note

The probabilities returned along with each intent are not guaranteed to sum to 1.0. They should be considered as scores between 0 and 1.

get_slots(**kwargs)

Extracts slots from a text input, with the knowledge of the intent

Parameters:
  • text (str) – input
  • intent (str) – the intent which the input corresponds to
Returns:

the list of extracted slots

Return type:

list

Raises:
  • IntentNotFoundError – When the intent was not part of the training data
  • InvalidInputError – When input type is not unicode
persist(path, *args, **kwargs)

Persists the NLU engine at the given directory path

Parameters: path (str or pathlib.Path) – the location at which the nlu engine must be persisted. This path must not exist when calling this function.
Raises: PersistingError – when persisting to a path which already exists
classmethod from_path(path, **shared)

Loads a SnipsNLUEngine instance from a directory path

The data at the given path must have been generated using persist()

Parameters:

path (str) – The path where the nlu engine is stored

Raises:
  • LoadingError – when some files are missing
  • IncompatibleModelError – when trying to load an engine model which is not compatible with the current version of the lib

Intent Parser

class IntentParser(config, **shared)

Abstraction which performs intent parsing

A custom intent parser must inherit this class to be used in a SnipsNLUEngine

fit(dataset, force_retrain)

Fit the intent parser with a valid Snips dataset

Parameters:
  • dataset (dict) – valid Snips NLU dataset
  • force_retrain (bool) – specify whether or not sub units of the intent parser that may already be trained should be retrained
parse(text, intents, top_n)

Performs intent parsing on the provided text

Parameters:
  • text (str) – input
  • intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
  • top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain less than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.
Returns:

the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type:

dict or list

get_intents(text)

Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

Note

The probabilities returned along with each intent are not guaranteed to sum to 1.0. They should be considered as scores between 0 and 1.

get_slots(text, intent)

Extract slots from a text input, with the knowledge of the intent

Parameters:
  • text (str) – input
  • intent (str) – the intent which the input corresponds to
Returns:

the list of extracted slots

Return type:

list

Raises:

IntentNotFoundError – when the intent was not part of the training data

class DeterministicIntentParser(config=None, **shared)

Intent parser using pattern matching in a deterministic manner

This intent parser is very strict by nature, and tends to have a very good precision but a low recall. For this reason, it is interesting to use it first before potentially falling back to another parser.

The deterministic intent parser can be configured by passing a DeterministicIntentParserConfig

config_type

alias of snips_nlu.pipeline.configs.intent_parser.DeterministicIntentParserConfig

patterns

Dictionary of patterns per intent

fitted

Whether or not the intent parser has already been trained

fit(**kwargs)

Fits the intent parser with a valid Snips dataset

parse(**kwargs)

Performs intent parsing on the provided text

Intent and slots are extracted simultaneously through pattern matching

Parameters:
  • text (str) – input
  • intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
  • top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain less than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.
Returns:

the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type:

dict or list

Raises:

NotTrained – when the intent parser is not fitted

get_intents(*args, **kwargs)

Returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

get_slots(*args, **kwargs)

Extracts slots from a text input, with the knowledge of the intent

Parameters:
  • text (str) – input
  • intent (str) – the intent which the input corresponds to
Returns:

the list of extracted slots

Return type:

list

Raises:

IntentNotFoundError – When the intent was not part of the training data

persist(path, *args, **kwargs)

Persists the object at the given path

classmethod from_path(path, **shared)

Loads a DeterministicIntentParser instance from a path

The data at the given path must have been generated using persist()

to_dict()

Returns a json-serializable dict

classmethod from_dict(unit_dict, **shared)

Creates a DeterministicIntentParser instance from a dict

The dict must have been generated with to_dict()
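
For instance, a serialization round trip through a dict might look like this sketch, assuming dataset is a valid Snips dataset dict and that the parser is importable from snips_nlu.intent_parser:

>>> from snips_nlu.intent_parser import DeterministicIntentParser
>>> parser = DeterministicIntentParser().fit(dataset)
>>> parser_dict = parser.to_dict()  # json-serializable dict
>>> reloaded_parser = DeterministicIntentParser.from_dict(parser_dict)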

class LookupIntentParser(config=None, **shared)

A deterministic Intent parser implementation based on a dictionary

This intent parser is very strict by nature, and tends to have a very good precision but a low recall. For this reason, it is interesting to use it first before potentially falling back to another parser.

The lookup intent parser can be configured by passing a LookupIntentParserConfig

config_type

alias of snips_nlu.pipeline.configs.intent_parser.LookupIntentParserConfig

fitted

Whether or not the intent parser has already been trained

fit(**kwargs)

Fits the intent parser with a valid Snips dataset

parse(**kwargs)

Performs intent parsing on the provided text

Intent and slots are extracted simultaneously through pattern matching

Parameters:
  • text (str) – input
  • intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
  • top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain less than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.
Returns:

the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type:

dict or list

Raises:

NotTrained – when the intent parser is not fitted

get_intents(*args, **kwargs)

Returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

get_slots(*args, **kwargs)

Extracts slots from a text input, with the knowledge of the intent

Parameters:
  • text (str) – input
  • intent (str) – the intent which the input corresponds to
Returns:

the list of extracted slots

Return type:

list

Raises:

IntentNotFoundError – When the intent was not part of the training data

persist(path, *args, **kwargs)

Persists the object at the given path

classmethod from_path(path, **shared)

Loads a LookupIntentParser instance from a path

The data at the given path must have been generated using persist()

to_dict()

Returns a json-serializable dict

classmethod from_dict(unit_dict, **shared)

Creates a LookupIntentParser instance from a dict

The dict must have been generated with to_dict()

class ProbabilisticIntentParser(config=None, **shared)

Intent parser which consists of two steps: intent classification, then slot filling

The probabilistic intent parser can be configured by passing a ProbabilisticIntentParserConfig

config_type

alias of snips_nlu.pipeline.configs.intent_parser.ProbabilisticIntentParserConfig

fitted

Whether or not the intent parser has already been fitted

fit(**kwargs)

Fits the probabilistic intent parser

Parameters:
  • dataset (dict) – A valid Snips dataset
  • force_retrain (bool, optional) – If False, will not retrain intent classifier and slot fillers when they are already fitted. Default to True.
Returns:

The same instance, trained

Return type:

ProbabilisticIntentParser

parse(**kwargs)

Performs intent parsing on the provided text by first classifying the intent and then using the corresponding slot filler to extract slots

Parameters:
  • text (str) – input
  • intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
  • top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain less than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.
Returns:

the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type:

dict or list

Raises:

NotTrained – when the intent parser is not fitted

get_intents(*args, **kwargs)

Returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

get_slots(*args, **kwargs)

Extracts slots from a text input, with the knowledge of the intent

Parameters:
  • text (str) – input
  • intent (str) – the intent which the input corresponds to
Returns:

the list of extracted slots

Return type:

list

Raises:

IntentNotFoundError – When the intent was not part of the training data

persist(path, *args, **kwargs)

Persists the object at the given path

classmethod from_path(path, **shared)

Loads a ProbabilisticIntentParser instance from a path

The data at the given path must have been generated using persist()
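
As a sketch, assuming dataset is a valid Snips dataset dict, sub-units that have already been fitted can be reused by disabling force_retrain:

>>> from snips_nlu.intent_parser import ProbabilisticIntentParser
>>> parser = ProbabilisticIntentParser().fit(dataset)
>>> parser = parser.fit(dataset, force_retrain=False)  # already fitted classifier and slot fillers are kept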

Intent Classifier

class IntentClassifier(config, **shared)

Abstraction which performs intent classification

A custom intent classifier must inherit this class to be used in a ProbabilisticIntentParser

fit(dataset)

Fit the intent classifier with a valid Snips dataset

get_intent(text, intents_filter)

Performs intent classification on the provided text

Parameters:
  • text (str) – Input
  • intents_filter (str or list of str) – When defined, the most likely intent is searched within this list; otherwise the whole list of intents defined in the dataset is used
Returns:

The most likely intent along with its probability or None if no intent was found. See intent_classification_result() for the output format.

Return type:

dict or None

get_intents(text)

Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

Note

The probabilities returned along with each intent are not guaranteed to sum to 1.0. They should be considered as scores between 0 and 1.

class LogRegIntentClassifier(config=None, **shared)

Intent classifier which uses a Logistic Regression underneath

The LogReg intent classifier can be configured by passing a LogRegIntentClassifierConfig

config_type

alias of snips_nlu.pipeline.configs.intent_classifier.LogRegIntentClassifierConfig

fitted

Whether or not the intent classifier has already been fitted

fit(**kwargs)

Fits the intent classifier with a valid Snips dataset

Returns:The same instance, trained
Return type:LogRegIntentClassifier
get_intent(*args, **kwargs)

Performs intent classification on the provided text

Parameters:
  • text (str) – Input
  • intents_filter (str or list of str) – When defined, the most likely intent is searched within this list; otherwise the whole list of intents defined in the dataset is used
Returns:

The most likely intent along with its probability or None if no intent was found

Return type:

dict or None

Raises:

snips_nlu.exceptions.NotTrained – When the intent classifier is not fitted

get_intents(*args, **kwargs)

Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability

The length of the returned list is exactly the number of intents in the dataset + 1 for the None intent

Raises:snips_nlu.exceptions.NotTrained – when the intent classifier is not fitted
persist(path, *args, **kwargs)

Persists the object at the given path

classmethod from_path(path, **shared)

Loads a LogRegIntentClassifier instance from a path

The data at the given path must have been generated using persist()
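
A minimal classification sketch, assuming dataset is a valid Snips dataset dict containing a searchFlight intent:

>>> from snips_nlu.intent_classifier import LogRegIntentClassifier
>>> classifier = LogRegIntentClassifier().fit(dataset)
>>> best_intent = classifier.get_intent("find me a flight to Lima")
>>> filtered = classifier.get_intent("find me a flight to Lima", intents_filter=["searchFlight"])
>>> ranked_intents = classifier.get_intents("find me a flight to Lima")  # all intents plus None, by decreasing score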

class Featurizer(config=None, **shared)

Feature extractor for text classification relying on ngram tfidf features and, optionally, word cooccurrence features

config_type

alias of snips_nlu.pipeline.configs.intent_classifier.FeaturizerConfig

feature_index_to_feature_name

Maps the feature indices of the feature matrix to printable feature names. Mainly useful for debugging.

Returns:a dict mapping feature indices to printable features names
Return type:dict
class TfidfVectorizer(config=None, **shared)

Wrapper of the scikit-learn TfidfVectorizer

config_type

alias of snips_nlu.pipeline.configs.intent_classifier.TfidfVectorizerConfig

fit(x, dataset)

Fits the idf of the vectorizer on the given utterances, after enriching them with builtin entity matches, custom entity matches and potential word cluster matches

Parameters:
  • x (list of dict) – list of utterances
  • dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)
Returns:

The fitted vectorizer

Return type:

TfidfVectorizer

fit_transform(x, dataset)

Fits the idf of the vectorizer on the given utterances, after enriching them with builtin entity matches, custom entity matches and potential word cluster matches. Returns the featurized utterances.

Parameters:
  • x (list of dict) – list of utterances
  • dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)
Returns:

A sparse matrix X of shape (len(x), len(self.vocabulary)) where X[i, j] contains the tfidf of the ngram of index j of the vocabulary in the utterance i

Return type:

scipy.sparse.csr_matrix

transform(*args, **kwargs)

Featurizes the given utterances after enriching them with builtin entity matches, custom entity matches and potential word cluster matches

Parameters:x (list of dict) – list of utterances
Returns:A sparse matrix X of shape (len(x), len(self.vocabulary)) where X[i, j] contains the tfidf of the ngram of index j of the vocabulary in the utterance i
Return type:scipy.sparse.csr_matrix
Raises:NotTrained – when the vectorizer is not fitted
limit_vocabulary(*args, **kwargs)

Restrict the vectorizer vocabulary to the given ngrams

Parameters:ngrams (iterable of str or tuples of str) – ngrams to keep
Returns:The vectorizer with limited vocabulary
Return type:TfidfVectorizer
class CooccurrenceVectorizer(config=None, **shared)

Featurizer that takes utterances and extracts an ordered word cooccurrence feature matrix from them

config_type

alias of snips_nlu.pipeline.configs.intent_classifier.CooccurrenceVectorizerConfig

fit(x, dataset)

Fits the CooccurrenceVectorizer

Given a list of utterances the CooccurrenceVectorizer will extract word pairs appearing in the same utterance. The order in which the words appear is kept. Additionally, if self.config.window_size is not None then the vectorizer will only look in a context window of self.config.window_size after each word.

Parameters:
  • x (iterable) – list of utterances
  • dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)
Returns:

The fitted vectorizer

Return type:

CooccurrenceVectorizer

fitted

Whether or not the vectorizer is fitted

fit_transform(x, dataset)

Fits the vectorizer and returns the feature matrix

Parameters:
  • x (iterable) – iterable of 3-tuples of the form (tokenized_utterances, builtin_entities, custom_entities)
  • dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)
Returns:

A sparse matrix X of shape (len(x), len(self.word_pairs)) where X[i, j] = 1.0 if x[i][0] contains the words cooccurrence (w1, w2) and if self.word_pairs[(w1, w2)] = j

Return type:

scipy.sparse.csr_matrix

transform(*args, **kwargs)

Computes the cooccurrence feature matrix.

Parameters:x (list of dict) – list of utterances
Returns:A sparse matrix X of shape (len(x), len(self.word_pairs)) where X[i, j] = 1.0 if x[i][0] contains the words cooccurrence (w1, w2) and if self.word_pairs[(w1, w2)] = j
Return type:scipy.sparse.csr_matrix
Raises:NotTrained – when the vectorizer is not fitted
limit_word_pairs(*args, **kwargs)

Restrict the vectorizer word pairs to the given word pairs

Parameters:word_pairs (iterable of 2-tuples (str, str)) – word_pairs to keep
Returns:The vectorizer with limited word pairs
Return type:CooccurrenceVectorizer

Slot Filler

class SlotFiller(config, **shared)

Abstraction which performs slot filling

A custom slot filler must inherit this class to be used in a ProbabilisticIntentParser

fit(dataset, intent)

Fit the slot filler with a valid Snips dataset

get_slots(text)

Performs slot extraction (slot filling) on the provided text

Returns:The list of extracted slots. See unresolved_slot() for the output format of a slot
Return type:list of dict
class CRFSlotFiller(config=None, **shared)

Slot filler which uses Linear-Chain Conditional Random Fields underneath

Check https://en.wikipedia.org/wiki/Conditional_random_field to learn more about CRFs

The CRF slot filler can be configured by passing a CRFSlotFillerConfig

config_type

alias of snips_nlu.pipeline.configs.slot_filler.CRFSlotFillerConfig

features

List of Feature used by the CRF

labels

List of CRF labels

These labels differ from the slot names as they contain an additional prefix which depends on the TaggingScheme that is used (BIO by default).

fitted

Whether or not the slot filler has already been fitted

fit(**kwargs)

Fits the slot filler

Parameters:
  • dataset (dict) – A valid Snips dataset
  • intent (str) – The specific intent of the dataset to train the slot filler on
Returns:

The same instance, trained

Return type:

CRFSlotFiller

get_slots(*args, **kwargs)

Extracts slots from the provided text

Returns:The list of extracted slots
Return type:list of dict
Raises:NotTrained – When the slot filler is not fitted
compute_features(tokens, drop_out=False)

Computes features on the provided tokens

The drop_out parameter allows activating drop out on features that have a positive drop out ratio. This should only be used during training.

get_sequence_probability(*args, **kwargs)

Gives the joint probability of a sequence of tokens and CRF labels

Parameters:
  • tokens (list of Token) – list of tokens
  • labels (list of str) – CRF labels with their tagging scheme prefix (“B-color”, “I-color”, “O”, etc)

Note

The absolute value returned here is generally not very useful, however it can be used to compare a sequence of labels relatively to another one.

log_weights(*args, **kwargs)

Returns logs for both the label-to-label and label-to-features weights

persist(path, *args, **kwargs)

Persists the object at the given path

classmethod from_path(path, **shared)

Loads a CRFSlotFiller instance from a path

The data at the given path must have been generated using persist()
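
As a sketch, assuming dataset is a valid Snips dataset dict containing a searchFlight intent, a CRF slot filler is trained for a single intent and then used for slot extraction:

>>> from snips_nlu.slot_filler import CRFSlotFiller
>>> slot_filler = CRFSlotFiller().fit(dataset, "searchFlight")
>>> slots = slot_filler.get_slots("find me a flight from Oslo to Lima")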

Feature

class Feature(base_name, func, offset=0, drop_out=0)

CRF Feature which is used by CRFSlotFiller

base_name

str – Feature name (e.g. ‘is_digit’, ‘is_first’ etc)

func

function – The actual feature function, for example:

def is_first(tokens, token_index):
    return "1" if token_index == 0 else None

offset

int, optional – Token offset to consider when computing the feature (e.g. -1 for computing the feature on the previous word)

drop_out

float, optional – Drop out to use when computing the feature during training

Note

The easiest way to add additional features to the existing ones is to create a CRFFeatureFactory
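
For illustration, a Feature can also be instantiated directly from the attributes above; the import path below is indicative and may need to be adjusted to your installed version:

>>> from snips_nlu.slot_filler.feature import Feature
>>> def is_first(tokens, token_index):
...     return "1" if token_index == 0 else None
>>> previous_is_first = Feature("is_first", is_first, offset=-1, drop_out=0.1)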

Feature Factories

class CRFFeatureFactory(factory_config, **shared)

Abstraction to implement to build CRF features

A CRFFeatureFactory is initialized with a dict which describes the feature; it must contain the following three keys:

  • ‘factory_name’
  • ‘args’: the parameters of the feature, if any
  • ‘offsets’: the offsets to consider when using the feature in the CRF. An empty list corresponds to no feature.

In addition, a ‘drop_out’ to use at training time can be specified.

classmethod from_config(factory_config, **shared)

Retrieves the CRFFeatureFactory corresponding to the provided config

Raises:NotRegisteredError – when the factory is not registered
fit(dataset, intent)

Fit the factory, if needed, with the provided dataset and intent

build_features()

Build a list of Feature
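
Putting this together, here is a hedged sketch of building features from a factory config; the factory name "ngram" refers to the NgramFactory described below, the argument values are arbitrary illustrations rather than defaults, and the import path is indicative:

>>> from snips_nlu.slot_filler.feature_factory import CRFFeatureFactory
>>> ngram_factory_config = {
...     "factory_name": "ngram",
...     "args": {"n": 2, "use_stemming": False, "common_words_gazetteer_name": None},
...     "offsets": [-1, 0, 1],
...     "drop_out": 0.1
... }
>>> factory = CRFFeatureFactory.from_config(ngram_factory_config)
>>> factory.fit(dataset, "searchFlight")  # dataset: a valid Snips dataset dict
>>> features = factory.build_features()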

class SingleFeatureFactory(factory_config, **shared)

A CRF feature factory which produces only one feature

class IsDigitFactory(factory_config, **shared)

Feature: is the considered token a digit?

class IsFirstFactory(factory_config, **shared)

Feature: is the considered token the first in the input?

class IsLastFactory(factory_config, **shared)

Feature: is the considered token the last in the input?

class PrefixFactory(factory_config, **shared)

Feature: a prefix of the considered token

This feature has one parameter, prefix_size, which specifies the size of the prefix

class SuffixFactory(factory_config, **shared)

Feature: a suffix of the considered token

This feature has one parameter, suffix_size, which specifies the size of the suffix

class LengthFactory(factory_config, **shared)

Feature: the length (characters) of the considered token

class NgramFactory(factory_config, **shared)

Feature: the n-gram consisting of the considered token and potentially the following ones

This feature has several parameters:

  • ‘n’ (int): Corresponds to the size of the n-gram. n=1 corresponds to a unigram, n=2 is a bigram etc
  • ‘use_stemming’ (bool): Whether or not to stem the n-gram
  • ‘common_words_gazetteer_name’ (str, optional): If defined, use a gazetteer of common words and replace out-of-corpus ngrams with the alias ‘rare_word’
class ShapeNgramFactory(factory_config, **shared)

Feature: the shape of the n-gram consisting of the considered token and potentially the following ones

This feature has one parameter, n, which corresponds to the size of the n-gram.

Possible types of shape are:

  • ‘xxx’ -> lowercased
  • ‘Xxx’ -> Capitalized
  • ‘XXX’ -> UPPERCASED
  • ‘xX’ -> None of the above
class WordClusterFactory(factory_config, **shared)

Feature: The cluster which the considered token belongs to, if any

This feature has several parameters:

  • ‘cluster_name’ (str): the name of the word cluster to use
  • ‘use_stemming’ (bool): whether or not to stem the token before looking for its cluster

Typical word clusters are the Brown Clusters, in which words are clustered into a binary tree, resulting in clusters of the form ‘100111001’. See https://en.wikipedia.org/wiki/Brown_clustering

class CustomEntityMatchFactory(factory_config, **shared)

Features: does the considered token belong to the values of one of the entities in the training dataset

This factory builds as many features as there are entities in the dataset, one per entity.

It has the following parameters:

  • ‘use_stemming’ (bool): whether or not to stem the token before looking for it among the (stemmed) entity values
  • ‘tagging_scheme_code’ (int): Represents a TaggingScheme. This allows giving more information about the match.
  • ‘entity_filter’ (dict): a filter applied to select the custom entities for which the custom match feature will be computed. The available filter is ‘automatically_extensible’: if True, only automatically extensible entities are selected; if False, only non automatically extensible entities are selected
class BuiltinEntityMatchFactory(factory_config, **shared)

Features: is the considered token part of a builtin entity such as a date, a temperature etc

This factory builds as many features as there are builtin entities available in the considered language.

It has one parameter, tagging_scheme_code, which represents a TaggingScheme. This allows giving more information about the match.

Configurations

class NLUEngineConfig(intent_parsers_configs=None, random_seed=None)

Configuration of a SnipsNLUEngine object

Parameters:intent_parsers_configs (list) – List of intent parser configs (ProcessingUnitConfig). The order in the list determines the order in which each parser will be called by the nlu engine.
class DeterministicIntentParserConfig(max_queries=100, max_pattern_length=1000, ignore_stop_words=False)

Configuration of a DeterministicIntentParser

Parameters:
  • max_queries (int, optional) – Maximum number of regex patterns per intent. 100 by default.
  • max_pattern_length (int, optional) – Maximum length of regex patterns.
  • ignore_stop_words (bool, optional) – If True, stop words will be removed before building patterns.

This makes it possible to deactivate the use of regular expressions when they become too big, in order to avoid an explosion in time and memory

Note

In the future, an FST will be used instead of regexps, removing the need for all this

class LookupIntentParserConfig(ignore_stop_words=False)

Configuration of a LookupIntentParser

Parameters:ignore_stop_words (bool, optional) – If True, stop words will be removed before building patterns.
class ProbabilisticIntentParserConfig(intent_classifier_config=None, slot_filler_config=None)

Configuration of a ProbabilisticIntentParser object

Parameters:
  • intent_classifier_config (ProcessingUnitConfig) – The configuration of the underlying intent classifier, by default it uses a LogRegIntentClassifierConfig
  • slot_filler_config (ProcessingUnitConfig) – The configuration that will be used for the underlying slot fillers, by default it uses a CRFSlotFillerConfig
class LogRegIntentClassifierConfig(data_augmentation_config=None, featurizer_config=None, noise_reweight_factor=1.0)

Configuration of a LogRegIntentClassifier

Parameters:
  • data_augmentation_config (IntentClassifierDataAugmentationConfig) – Defines the strategy of the underlying data augmentation
  • featurizer_config (FeaturizerConfig) – Configuration of the Featurizer used underneath
  • noise_reweight_factor (float, optional) – this parameter allows changing the weight of the None class. By default, the class weights are computed using a “balanced” strategy. The noise_reweight_factor allows deviating from this strategy.
class CRFSlotFillerConfig(feature_factory_configs=None, tagging_scheme=None, crf_args=None, data_augmentation_config=None)

Configuration of a CRFSlotFiller

Parameters:
  • feature_factory_configs (list, optional) – List of configurations that specify the list of CRFFeatureFactory to use with the CRF
  • tagging_scheme (TaggingScheme, optional) – Tagging scheme to use to enrich CRF labels (default=BIO)
  • crf_args (dict, optional) – Allows overwriting the parameters of the CRF defined in sklearn_crfsuite, see sklearn_crfsuite.CRF (default={“c1”: .1, “c2”: .1, “algorithm”: “lbfgs”})
  • data_augmentation_config (dict or SlotFillerDataAugmentationConfig, optional) – Specify how to augment data before training the CRF, see the corresponding config object for more details.
  • random_seed (int, optional) – If specified, makes the CRF training deterministic and reproducible (default=None)
class FeaturizerConfig(tfidf_vectorizer_config=None, cooccurrence_vectorizer_config=None, pvalue_threshold=0.4, added_cooccurrence_feature_ratio=0)

Configuration of a Featurizer object

Parameters:
  • tfidf_vectorizer_config (TfidfVectorizerConfig, optional) – configuration of the featurizer’s tfidf_vectorizer
  • cooccurrence_vectorizer_config (CooccurrenceVectorizerConfig, optional) – configuration of the featurizer’s cooccurrence_vectorizer
  • pvalue_threshold (float) – after fitting the training set to extract tfidf features, a univariate feature selection is applied. Features are tested for independence using a Chi-2 test, under the null hypothesis that each feature should be equally present in each class. Only features having a p-value lower than the threshold are kept
  • added_cooccurrence_feature_ratio (float, optional) – proportion of cooccurrence features to add with respect to the number of tfidf features. For instance with a ratio of 0.5, if 100 tfidf features are remaining after feature selection, a maximum of 50 cooccurrence features will be added
class CooccurrenceVectorizerConfig(window_size=None, unknown_words_replacement_string=None, filter_stop_words=True, keep_order=True)

Configuration of a CooccurrenceVectorizer object

Parameters:
  • window_size (int, optional) – if provided, word cooccurrences will be taken into account only in a context window of size window_size. If the window size is 3 then given a word w[i], the vectorizer will only extract the following pairs: (w[i], w[i + 1]), (w[i], w[i + 2]) and (w[i], w[i + 3]). Defaults to None, which means that we consider all words
  • unknown_words_replacement_string (str, optional) –
  • filter_stop_words (bool, optional) – if True, stop words are ignored when computing cooccurrences
  • keep_order (bool, optional) – if True then cooccurrences are computed taking the word order into account, which means the pairs (w1, w2) and (w2, w1) will count as two separate features. Defaults to True.
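
To tie these configuration objects together, here is a hedged sketch of building an engine with a customized probabilistic parser; the parameter values are arbitrary and the config classes are assumed to be importable from snips_nlu.pipeline.configs:

>>> from snips_nlu import SnipsNLUEngine
>>> from snips_nlu.pipeline.configs import (
...     NLUEngineConfig, ProbabilisticIntentParserConfig,
...     LogRegIntentClassifierConfig, CRFSlotFillerConfig)
>>> parser_config = ProbabilisticIntentParserConfig(
...     intent_classifier_config=LogRegIntentClassifierConfig(noise_reweight_factor=0.5),
...     slot_filler_config=CRFSlotFillerConfig(crf_args={"c1": 0.2, "c2": 0.2, "algorithm": "lbfgs"}))
>>> engine_config = NLUEngineConfig([parser_config])
>>> engine = SnipsNLUEngine(config=engine_config)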

Dataset

class Dataset(language, intents, entities)

Dataset used in the main NLU training API

Consists of intents and entities data. This object can be built either from text files (Dataset.from_files()) or from YAML files (Dataset.from_yaml_files()).

language

str – language of the intents

intents

list of Intent – intents data

entities

list of Entity – entities data

classmethod from_yaml_files(language, filenames)

Creates a Dataset from a language and a list of YAML files or streams containing intents and entities data

Each file need not correspond to a single entity or intent; a single file can contain several entities and intents merged together.

Parameters:
  • language (str) – language of the dataset (ISO639-1)
  • filenames (iterable) – filenames or stream objects corresponding to intents and entities data.

Example

A dataset can be defined with a YAML document following the schema illustrated in the example below:

>>> import io
>>> from snips_nlu.common.utils import json_string
>>> dataset_yaml = io.StringIO('''
... # searchFlight Intent
... ---
... type: intent
... name: searchFlight
... slots:
...   - name: origin
...     entity: city
...   - name: destination
...     entity: city
...   - name: date
...     entity: snips/datetime
... utterances:
...   - find me a flight from [origin](Oslo) to [destination](Lima)
...   - I need a flight leaving to [destination](Berlin)
...
... # City Entity
... ---
... type: entity
... name: city
... values:
...   - london
...   - [paris, city of lights]''')
>>> dataset = Dataset.from_yaml_files("en", [dataset_yaml])
>>> print(json_string(dataset.json, indent=4, sort_keys=True))
{
    "entities": {
        "city": {
            "automatically_extensible": true,
            "data": [
                {
                    "synonyms": [],
                    "value": "london"
                },
                {
                    "synonyms": [
                        "city of lights"
                    ],
                    "value": "paris"
                }
            ],
            "matching_strictness": 1.0,
            "use_synonyms": true
        }
    },
    "intents": {
        "searchFlight": {
            "utterances": [
                {
                    "data": [
                        {
                            "text": "find me a flight from "
                        },
                        {
                            "entity": "city",
                            "slot_name": "origin",
                            "text": "Oslo"
                        },
                        {
                            "text": " to "
                        },
                        {
                            "entity": "city",
                            "slot_name": "destination",
                            "text": "Lima"
                        }
                    ]
                },
                {
                    "data": [
                        {
                            "text": "I need a flight leaving to "
                        },
                        {
                            "entity": "city",
                            "slot_name": "destination",
                            "text": "Berlin"
                        }
                    ]
                }
            ]
        }
    },
    "language": "en"
}
Raises:
  • DatasetFormatError – When one of the documents present in the YAML files has a wrong ‘type’ attribute, which is not ‘entity’ nor ‘intent’
  • IntentFormatError – When the YAML document of an intent does not correspond to the expected intent format
  • EntityFormatError – When the YAML document of an entity does not correspond to the expected entity format
json

Dataset data in json format

class Intent(intent_name, utterances, slot_mapping=None)

Intent data of a Dataset

intent_name

str – name of the intent

utterances

list of IntentUtterance – annotated intent utterances

slot_mapping

dict – mapping between slot names and entities

classmethod from_yaml(yaml_dict)

Build an Intent from its YAML definition object

Parameters:yaml_dict (dict or IOBase) – object containing the YAML definition of the intent. It can be either a stream, or the corresponding python dict.

Examples

An intent can be defined with a YAML document following the schema illustrated in the example below:

>>> import io
>>> from snips_nlu.common.utils import json_string
>>> intent_yaml = io.StringIO('''
... # searchFlight Intent
... ---
... type: intent
... name: searchFlight
... slots:
...   - name: origin
...     entity: city
...   - name: destination
...     entity: city
...   - name: date
...     entity: snips/datetime
... utterances:
...   - find me a flight from [origin](Oslo) to [destination](Lima)
...   - I need a flight leaving to [destination](Berlin)''')
>>> intent = Intent.from_yaml(intent_yaml)
>>> print(json_string(intent.json, indent=4, sort_keys=True))
{
    "utterances": [
        {
            "data": [
                {
                    "text": "find me a flight from "
                },
                {
                    "entity": "city",
                    "slot_name": "origin",
                    "text": "Oslo"
                },
                {
                    "text": " to "
                },
                {
                    "entity": "city",
                    "slot_name": "destination",
                    "text": "Lima"
                }
            ]
        },
        {
            "data": [
                {
                    "text": "I need a flight leaving to "
                },
                {
                    "entity": "city",
                    "slot_name": "destination",
                    "text": "Berlin"
                }
            ]
        }
    ]
}
Raises:IntentFormatError – When the YAML dict does not correspond to the expected intent format
json

Intent data in json format

class Entity(name, utterances=None, automatically_extensible=True, use_synonyms=True, matching_strictness=1.0)

Entity data of a Dataset

This class can represent either a custom or a builtin entity. When the entity is a builtin one, only the name attribute is relevant.

name

str – name of the entity

utterances

list of EntityUtterance – entity utterances (only for custom entities)

automatically_extensible

bool – whether or not the entity can be extended to values not present in the data (only for custom entities)

use_synonyms

bool – whether or not to map entity values using synonyms (only for custom entities)

matching_strictness

float – controls the matching strictness of the entity (only for custom entities). Must be between 0.0 and 1.0.

classmethod from_yaml(yaml_dict)

Build an Entity from its YAML definition object

Parameters:yaml_dict (dict or IOBase) – object containing the YAML definition of the entity. It can be either a stream, or the corresponding python dict.

Examples

An entity can be defined with a YAML document following the schema illustrated in the example below:

>>> import io
>>> from snips_nlu.common.utils import json_string
>>> entity_yaml = io.StringIO('''
... # City Entity
... ---
... type: entity
... name: city
... automatically_extensible: false # default value is true
... use_synonyms: false # default value is true
... matching_strictness: 0.8 # default value is 1.0
... values:
...   - london
...   - [new york, big apple]
...   - [paris, city of lights]''')
>>> entity = Entity.from_yaml(entity_yaml)
>>> print(json_string(entity.json, indent=4, sort_keys=True))
{
    "automatically_extensible": false,
    "data": [
        {
            "synonyms": [],
            "value": "london"
        },
        {
            "synonyms": [
                "big apple"
            ],
            "value": "new york"
        },
        {
            "synonyms": [
                "city of lights"
            ],
            "value": "paris"
        }
    ],
    "matching_strictness": 0.8,
    "use_synonyms": false
}
Raises:EntityFormatError – When the YAML dict does not correspond to the expected entity format
json

Returns the entity in json format

Result and output format

intent_classification_result(intent_name, probability)

Creates an intent classification result to be returned by IntentClassifier.get_intent()

Example

>>> intent_classification_result("GetWeather", 0.93)
{'intentName': 'GetWeather', 'probability': 0.93}
unresolved_slot(match_range, value, entity, slot_name)

Creates an internal slot yet to be resolved

Example

>>> from snips_nlu.common.utils import json_string
>>> slot = unresolved_slot([0, 8], "tomorrow", "snips/datetime",
... "startDate")
>>> print(json_string(slot, indent=4, sort_keys=True))
{
    "entity": "snips/datetime",
    "range": {
        "end": 8,
        "start": 0
    },
    "slotName": "startDate",
    "value": "tomorrow"
}
custom_slot(internal_slot, resolved_value=None)

Creates a custom slot with resolved_value being the reference value of the slot

Example

>>> s = unresolved_slot([10, 19], "earl grey", "beverage", "beverage")
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(custom_slot(s, "tea"), indent=4, sort_keys=True))
{
    "entity": "beverage",
    "range": {
        "end": 19,
        "start": 10
    },
    "rawValue": "earl grey",
    "slotName": "beverage",
    "value": {
        "kind": "Custom",
        "value": "tea"
    }
}
builtin_slot(internal_slot, resolved_value)

Creates a builtin slot with resolved_value being the resolved value of the slot

Example

>>> rng = [10, 32]
>>> raw_value = "twenty degrees celsius"
>>> entity = "snips/temperature"
>>> slot_name = "beverageTemperature"
>>> s = unresolved_slot(rng, raw_value, entity, slot_name)
>>> resolved = {
...     "kind": "Temperature",
...     "value": 20,
...     "unit": "celsius"
... }
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(builtin_slot(s, resolved), indent=4))
{
    "entity": "snips/temperature",
    "range": {
        "end": 32,
        "start": 10
    },
    "rawValue": "twenty degrees celsius",
    "slotName": "beverageTemperature",
    "value": {
        "kind": "Temperature",
        "unit": "celsius",
        "value": 20
    }
}
resolved_slot(match_range, raw_value, resolved_value, entity, slot_name)

Creates a resolved slot

Parameters:
  • match_range (dict) – Range of the slot within the sentence (ex: {“start”: 3, “end”: 10})
  • raw_value (str) – Slot value as it appears in the sentence
  • resolved_value (dict) – Resolved value of the slot
  • entity (str) – Entity which the slot belongs to
  • slot_name (str) – Slot type
Returns:

The resolved slot

Return type:

dict

Example

>>> resolved_value = {
...     "kind": "Temperature",
...     "value": 20,
...     "unit": "celsius"
... }
>>> slot = resolved_slot({"start": 10, "end": 19}, "earl grey",
... resolved_value, "beverage", "beverage")
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(slot, indent=4, sort_keys=True))
{
    "entity": "beverage",
    "range": {
        "end": 19,
        "start": 10
    },
    "rawValue": "earl grey",
    "slotName": "beverage",
    "value": {
        "kind": "Temperature",
        "unit": "celsius",
        "value": 20
    }
}
parsing_result(input, intent, slots)

Create the final output of SnipsNLUEngine.parse() or IntentParser.parse()

Example

>>> text = "Hello Bill!"
>>> intent_result = intent_classification_result("Greeting", 0.95)
>>> internal_slot = unresolved_slot([6, 10], "Bill", "name",
... "greetee")
>>> slots = [custom_slot(internal_slot, "William")]
>>> res = parsing_result(text, intent_result, slots)
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(res, indent=4, sort_keys=True))
{
    "input": "Hello Bill!",
    "intent": {
        "intentName": "Greeting",
        "probability": 0.95
    },
    "slots": [
        {
            "entity": "name",
            "range": {
                "end": 10,
                "start": 6
            },
            "rawValue": "Bill",
            "slotName": "greetee",
            "value": {
                "kind": "Custom",
                "value": "William"
            }
        }
    ]
}
extraction_result(intent, slots)

Create the items in the output of SnipsNLUEngine.parse() or IntentParser.parse() when called with a defined top_n value

This differs from parsing_result() in that the input is omitted.

Example

>>> intent_result = intent_classification_result("Greeting", 0.95)
>>> internal_slot = unresolved_slot([6, 10], "Bill", "name",
... "greetee")
>>> slots = [custom_slot(internal_slot, "William")]
>>> res = extraction_result(intent_result, slots)
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(res, indent=4, sort_keys=True))
{
    "intent": {
        "intentName": "Greeting",
        "probability": 0.95
    },
    "slots": [
        {
            "entity": "name",
            "range": {
                "end": 10,
                "start": 6
            },
            "rawValue": "Bill",
            "slotName": "greetee",
            "value": {
                "kind": "Custom",
                "value": "William"
            }
        }
    ]
}
is_empty(result)

Check if a result is empty

Example

>>> res = empty_result("foo bar", 1.0)
>>> is_empty(res)
True
empty_result(input, probability)

Creates an empty parsing result of the same format as the one of parsing_result()

An empty result is typically returned by a SnipsNLUEngine or IntentParser when neither an intent nor slots were found.

Example

>>> res = empty_result("foo bar", 0.8)
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(res, indent=4, sort_keys=True))
{
    "input": "foo bar",
    "intent": {
        "intentName": null,
        "probability": 0.8
    },
    "slots": []
}
parsed_entity(entity_kind, entity_value, entity_resolved_value, entity_range)

Create the items in the output of snips_nlu.entity_parser.EntityParser.parse()

Example

>>> resolved_value = dict(age=28, role="datascientist")
>>> range = dict(start=0, end=6)
>>> ent = parsed_entity("snipster", "adrien", resolved_value, range)
>>> import json
>>> print(json.dumps(ent, indent=4, sort_keys=True))
{
    "entity_kind": "snipster",
    "range": {
        "end": 6,
        "start": 0
    },
    "resolved_value": {
        "age": 28,
        "role": "datascientist"
    },
    "value": "adrien"
}
