Building a Language User Interface? Let Genie Generate It For You!

How do you teach your brand new virtual assistant to understand language? How do you represent the user's input? How do you acquire training data cheaply, before you have any users at all? In this blog post, we present our latest tool, called Genie, which answers all these questions. Genie allows developers to bootstrap new virtual assistants at significantly lower cost, without thousands of employees annotating data: it generates a training set, and uses it to train a state-of-the-art neural network model[1].

Genie is presented in our latest paper, at the ACM SIGPLAN conference on Programming Language Design and Implementation[2]. This is joint work with Silei Xu (equal contribution), Mehrad Moradshahi, Richard Socher, and Monica Lam.

Semantic Parsing 101

Advanced virtual assistants, including commercial ones like Alexa, rely on semantic parsing to interpret the user's input into executable actions. Semantic parsing is a machine learning task that takes natural language as input and produces a formal representation of it, in some formal language like SQL, Prolog, or Python.
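To make the task concrete, here is a toy sketch of the input/output contract: English in, code out. This is purely illustrative; real semantic parsers, including Almond's, are learned models rather than hand-written patterns, and the weather query below is invented:

```python
import re

# Toy illustration of semantic parsing: map a narrow family of English
# questions to a SQL-like formal representation. A real parser is a
# learned model, but the contract is the same.
def toy_semantic_parser(utterance):
    m = re.match(r"what is the weather in (\w+)\??$", utterance.lower())
    if m:
        return f"SELECT forecast FROM weather WHERE city = '{m.group(1)}'"
    raise ValueError("utterance not understood")

print(toy_semantic_parser("What is the weather in Paris?"))
# SELECT forecast FROM weather WHERE city = 'paris'
```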

Alexa, for example, uses the Alexa Meaning Representation Language (AMRL)[3], which focuses on matching the natural language strictly. This is problematic for a number of reasons, chief among them the fact that different commands can have different syntactic structures yet identical executions. In AMRL, commands must be annotated manually with their formal representation. To do so, Amazon employs 10,000 people, who listen to users' conversations and annotate them by hand[4]. It is a slow and expensive process, and the manpower required makes it impractical for anyone not called Amazon or Google.

Almond, on the other hand, uses ThingTalk as the formal representation language. ThingTalk has well defined executable semantics, which makes it easy to reason about what each command means, whether Almond can do it or not, what the representation should be, etc. Here is an example of how Almond interprets a command:

Example of Almond interpreting a command. The command, "Get a cat picture and post it on Facebook with caption funny cat.", is combined with all the services in Thingpedia, and translated to `now => @com.thecatapi.get() => @com.facebook.post_picture(picture_url=picture_url, caption="funny cat")`

Every command that Almond receives is translated into ThingTalk, using what is essentially a translation neural network, trained on the Thingpedia services. The resulting ThingTalk code is interpreted to execute the command.

What's So Hard About Semantic Parsing?

One would think that semantic parsing is easy. After all, neural networks can successfully translate from English to French, German, Italian, and what not. ThingTalk has a smaller vocabulary, a smaller set of supported concepts, and far fewer idiosyncrasies than French. Surely semantic parsing is easier than human language translation?

Well, as with everything machine learning, the problem is data - specifically, corresponding pairs of English and translated text. You need a lot of data: the benchmark WMT English-French dataset contains a whopping 36 million pairs of English and French sentences[5][6]. For human language translation, people have looked at pre-existing corpora. Common examples include the Bible, which has been translated into pretty much every known human language. As another example, the European Union agencies translate every single document they produce, point by point, into each of the 24 official languages of the EU.

Unfortunately, there is no Bible of ThingTalk. ThingTalk is a brand new language, designed specifically for Almond, and there is no existing corpus of English and ThingTalk that we can use to train. In fact, there is no existing corpus we can even use for validation, because until Almond works, people will not start typing commands into it.

How Does Everybody Else Do It?

Previously, Wang, Berant, and Liang proposed leveraging crowdsourcing to acquire data quickly and cheaply, with what they called the Overnight methodology[7]. They noted that while generating sentences in human languages is very hard, generating programs in a formal language is easy, because a formal grammar and type system specify exactly which strings are valid and which are meaningless. Furthermore, they noted that both natural and formal languages are compositional: from a limited set of primitives, you can write an exponential number of possible commands.

They propose that, given any program, one can mechanically generate a unique canonical representation in pseudo-English, by replacing the program constructs with their English descriptions and rearranging them to fit English grammar. This canonical representation is verbose and clunky, so it is not good training data. At the same time, it is good enough to be understood by someone who knows English but not programming. That someone, hired on a crowdsourcing platform like Mechanical Turk, can paraphrase the canonical representation into a genuinely natural sentence, which can then be used for training and validation.
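The mechanical verbalization step can be sketched as follows. Each program construct carries a fixed English description, and the canonical form is produced by a recursive walk over the program. The construct names and descriptions below are invented for illustration, not the paper's actual grammar:

```python
# Sketch of the Overnight canonical-form idea: each program construct has a
# fixed English description, so any program can be mechanically verbalized
# into a unique (if clunky) pseudo-English form. All names are invented.
DESCRIPTIONS = {
    "list_folder": "the files in the Dropbox folder",
    "filter_gt": "whose {field} is greater than {value}",
}

def canonical_form(program):
    # program is a tuple: ("list_folder",) or ("filter_gt", base, field, value)
    if program[0] == "list_folder":
        return DESCRIPTIONS["list_folder"]
    if program[0] == "filter_gt":
        _, base, field, value = program
        return (canonical_form(base) + " "
                + DESCRIPTIONS["filter_gt"].format(field=field, value=value))
    raise ValueError(f"unknown construct: {program[0]}")

print(canonical_form(("filter_gt", ("list_folder",), "size", "10 MB")))
# the files in the Dropbox folder whose size is greater than 10 MB
```

A crowd worker would then paraphrase this clunky output into something like "my Dropbox files bigger than 10 MB".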

The key step in the Overnight methodology is that someone who has zero expertise in either machine learning or the domain under consideration can read the sentence, understand what it means, and come up with a better way to say the same thing. This turns out to be the case in many domains, such as question answering for restaurants or sports, and the same methodology was used in the WikiSQL dataset[8].

Unfortunately, executing complex event-driven commands across over 100 services is a different beast than just asking who won the NBA championship in 2018. Because the canonical form is unique and the domain is so large and hard to capture, crowd workers do not introduce enough variety in the training set to capture the real space of natural language. You could paraphrase over and over again, and never improve the accuracy of the model on real-world data. Even worse, because Overnight trains and tests on the same paraphrase data, and that paraphrase data eventually becomes easy, you might be fooled into thinking the model works - until you deploy it in production and it fails miserably.

Enter Genie

Compositionality is a great principle, but unique canonical forms are not sufficient to generate good training data, and any sort of generated data overestimates real-world accuracy. So what do we do? Genie proposes that developers data-program their natural language support.

Like data programming in other contexts, the methodology begins with acquiring a high-quality validation set that is representative of real-world data. This validation set must be obtained in some way that does not bias whoever writes it, and must be manually annotated. Even better, it could be an existing source of real, unbiased data, like IFTTT is for Almond. Manual annotation is expensive, but the validation set is small (around 1500 sentences for Almond), so this is still feasible.

Then we propose that instead of unique canonical forms, developers represent the training set using templates. These templates are associated with arbitrary semantic functions, and can decouple the composition operators in program space from the composition of natural language primitives. This allows developers to succinctly represent more ways to express the same commands; Genie then combines this representation with existing sources of data and crowdsourced paraphrases to generate a large, high-quality training set. On this training set, Genie trains a model, and evaluates it on the validation set. The developers can then iterate, adding templates or crowdsourcing more paraphrases, until good validation accuracy is achieved.

Using Genie: The Synthesized Set

When using Genie, developers start by writing primitive templates, which associate natural language fragments with composable snippets of ThingTalk code. Here are some examples of primitive templates for Dropbox:

| Natural language | Cat. | ThingTalk Code |
|------------------|------|----------------|
| my Dropbox files | np | `query := @com.dropbox.list_folder();` |
| files in my Dropbox | np | `query := @com.dropbox.list_folder();` |
| my Dropbox files that changed most recently | np | `query := @com.dropbox.list_folder(order_by=enum(modified_time_decreasing));` |
| my Dropbox files that changed this week | np | `query := @com.dropbox.list_folder(order_by=enum(modified_time_decreasing)) filter modified_time > start_of(week);` |
| files in my Dropbox folder $x | np | `query (x : Entity(tt:path_name)) := @com.dropbox.list_folder(folder_name=x);` |
| when I modify a file in Dropbox | wp | `stream := monitor @com.dropbox.list_folder();` |
| when I create a file in Dropbox | wp | `stream := monitor @com.dropbox.list_folder() on new [file_name];` |
| the download URL of $x | np | `query (x : Entity(tt:path_name)) :=;` |
| a temporary link to $x | np | `query (x : Entity(tt:path_name)) :=;` |
| open $x | vp | `query (x : Entity(tt:path_name)) :=;` |

In the table, np stands for noun-phrase, wp for when-phrase, and vp for verb-phrase. Note that the same command can have multiple natural language fragments associated with it (first two lines in the table), and they don't necessarily have the same grammatical category in English (last two lines). Also, a single function can have many uses: for example, listing files, sorting them, monitoring new or updated files, etc. You can find the latest list of primitive templates on the Dropbox account page (hover over each command to see the code).

In addition to primitive templates, developers using Genie provide construct templates, which combine the primitives to form larger programs. Construct templates look like this:

// `query`, `get_command`, `action`, and `stream` are the non-terminals corresponding
// to primitive templates (noun-phrase, verb-phrase query, verb-phrase action, and when-phrase, resp.)

// define a non-terminal `forward_get_do` with one single rule
// the rule generates sentences of the form:
// - get [a cat picture] and then [post $picture_url on twitter]
// - take [my dropbox files] then [send $message on slack]
// - get [my instagram pictures] then [post $picture_url on facebook]
// etc.
// (brackets added for explanation only)
forward_get_do = {
  ('get' | 'take' | 'retrieve') query ('and then' | 'then') action
    => new Ast.Statement.Command(query, [action]);
};

// similarly, this generates sentences of the form
// - [upload $url on onedrive] after getting [a temporary link to PATH_NAME_0 on dropbox]
// - [post $status on twitter] after you retrieve [my facebook posts]
backward_get_do = {
  action 'after' ('getting' | 'taking' | 'you get' | 'you retrieve') query
    => new Ast.Statement.Command(query, [action]);
};

// rules for passing from get to do are omitted from this introductory blog (due to the complex syntax)
// at a high level, they generate:
// - get [a cat picture] and then [post [the picture] on twitter]
// - take [my dropbox files] then [send [the link] on slack]
// - get [my instagram pictures] then [post [them] on facebook]
// - [upload [the link] on onedrive] after getting [a temporary link to PATH_NAME_0 on dropbox]
// - [post [the text] on twitter] after you retrieve [my facebook posts]
// etc.

// we can define new grammar/construct categories with different rules
// these rules generate
// - [the author] equal to [USERNAME_0]
// - [their sender] equal to [EMAIL_ADDRESS_0]
// - [the title] containing [QUOTED_STRING_0]
// - [QUOTED_STRING_1] in [its text]
// - [the temperature] higher than [NUMBER_0 F]
// etc.
with_filter = {
    out_param_Any 'equal to' constant_Any
       => C.makeFilter($options, out_param_Any, '==', constant_Any);

    out_param_String ('containing' | 'including') constant_String
       => C.makeFilter($options, out_param_String, '=~', constant_String);
    constant_String ('in the' | 'in its' | 'in their') out_param_String
      => C.makeFilter($options, out_param_String, '=~', constant_String);

    out_param_Numeric ('higher' | 'larger' | 'bigger') 'than' constant_Numeric
       => C.makeFilter($options, out_param_Numeric, '>=', constant_Numeric);
    out_param_Numeric ('smaller' | 'lower') 'than' constant_Numeric
       => C.makeFilter($options, out_param_Numeric, '<=', constant_Numeric);
};

// we can also compose the primitive templates to make more primitives
// for example by adding more rules to `query`
// this lets us generate rules of the form:
// - [recent tweets] with [the author equal to USERNAME_0]
// - [my emails] having [the sender equal to EMAIL_ADDRESS_0]
// - [my emails] with [QUOTED_STRING_0 in the subject]
// - [my dropbox files] having [QUOTED_STRING_0 in the file name]
// - [my dropbox files] having [the size larger than NUMBER_0 megabytes]
// etc.
query = {
   query ('with' | 'having') with_filter => {
        if (!query.schema.is_list || !C.checkFilter(query, with_filter))
            return null;
        return C.addFilter(query, with_filter, $options);
   };
};

$root = {
    // when we're satisfied with the parameter passing, convert the command into a program
    forward_get_do => C.makeProgram(forward_get_do);
    backward_get_do => C.makeProgram(backward_get_do);

    // primitive programs have no parameter passing, and are defined directly at the root
    // when => notify
    ('monitor' | 'watch') query => {
        if (!query.schema.is_monitorable)
            return null;
        return C.makeProgram(new Ast.Statement.Rule(
           new Ast.Stream.Monitor(query, null, query.schema),
           [C.notifyAction()]));
    };
    // now => query => notify
    // multiple ways to ask questions / retrieve data; this generates
    // - get [my dropbox files]
    // - tell me [the current weather]
    // - show me [websites matching QUOTED_STRING_0]
    // - pull up [my latest emails]
    // - hey almond search [QUOTED_STRING_0 on bing]
    // - hey almond search [tweets matching QUOTED_STRING_0]
    // - hey almond what is [the translation of QUOTED_STRING_0 to GENERIC_ENTITY_tt:iso_lang_code_0] ?
    // etc.
    ( 'get' query
    | ('tell me' | 'give me' | 'show me' | 'present' | 'retrieve' | 'pull up') query
    | ('hey almond' | '') ('search' | 'find' | 'i want' | 'i need') query
    | ('hey almond' | '') 'what is' query ('' | '?')
    ) => C.makeProgram(new Ast.Statement.Command(query, [C.notifyAction()]));

    // specialized now => query => notify templates for lists
    // this generates:
    // - hey almond enumerate [my emails]
    // - please list [my dropbox files]
    // etc.
    // but NOT
    // - list [the current weather]
    ('hey almond' | 'please' | '') ('list' | 'enumerate') query => {
        if (!query.schema.is_list)
            return null;
        return C.makeProgram(new Ast.Statement.Command(query, [C.notifyAction()]));
    };
};
Specifically, construct templates define a context-free grammar of natural language, with terminals such as get or tell me and non-terminals such as query, action, or forward_get_do. Each rule is associated with a semantic function, which computes the program value for the generated natural language. By recursively expanding this grammar, Genie synthesizes the first part of the training set. (The current set of templates can be found on our GitHub; the examples here have been modified for readability.)
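The expansion itself amounts to enumerating every combination the grammar allows, pairing each generated sentence with the program its semantic functions build. Here is a minimal Python sketch with a hard-coded toy grammar standing in for Genie's real template files:

```python
# Toy sketch of grammar expansion: each primitive pairs a natural-language
# fragment with a ThingTalk snippet; a construct template combines them.
# The grammar here is hard-coded for illustration, not Genie's real API.
PRIMITIVES = {
    "query": [("my Dropbox files", "@com.dropbox.list_folder()"),
              ("a cat picture", "@com.thecatapi.get()")],
    "action": [("post it on Facebook", "@com.facebook.post_picture()")],
}

def expand_get_do():
    """Expand one construct template: ('get'|'retrieve') query 'and then' action."""
    for verb in ("get", "retrieve"):
        for q_nl, q_code in PRIMITIVES["query"]:
            for a_nl, a_code in PRIMITIVES["action"]:
                yield (f"{verb} {q_nl} and then {a_nl}",
                       f"now => {q_code} => {a_code}")

pairs = list(expand_get_do())
print(len(pairs))   # 2 verbs x 2 queries x 1 action = 4 pairs
print(pairs[0][0])  # get my Dropbox files and then post it on Facebook
```

Even this tiny grammar yields every cross-product of verbs, queries, and actions; with hundreds of primitives and recursive rules, the synthesized set grows into millions of sentence/program pairs.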

Crowdsourced Data Augmentation

Synthesized data is large and cheap, but it is not quite representative of real-world inputs, because the way commands combine is limited by the construct templates, and commands repeat the same words over and over again. To overcome this problem, Genie augments the synthesized dataset with a small set of paraphrases, obtained from crowdsourcing. These paraphrases are expensive to obtain, and therefore their coverage of the space of commands is very sparse, but they introduce variance in the training set, by showing different forms of the same command.

Genie also uses large datasets of string parameter values and entities to replace each parameter value. This expands the dataset further, and lets the trained semantic parser understand parameters that are not quoted in the sentence. Finally, Genie also uses standard data augmentation with PPDB.
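The parameter replacement step can be sketched as follows, assuming a small corpus of real string values; the placeholder convention mirrors the QUOTED_STRING_0-style slots shown earlier, and the function itself is illustrative rather than Genie's actual API:

```python
import random

# Sketch of parameter-value augmentation: clone each (sentence, code) pair
# several times, substituting real string values for the placeholder slot.
# CAPTIONS stands in for Genie's large corpora of string parameter values.
CAPTIONS = ["funny cat", "vacation pics", "look at this"]

def replace_parameters(sentence, code, values, n=3, seed=0):
    """Return n copies of the pair with sampled parameter values filled in."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        v = rng.choice(values)
        out.append((sentence.replace("QUOTED_STRING_0", v),
                    code.replace("QUOTED_STRING_0", f'"{v}"')))
    return out

pairs = replace_parameters(
    "post QUOTED_STRING_0 on facebook",
    "now => @com.facebook.post(status=QUOTED_STRING_0)",
    CAPTIONS)
print(pairs[0])
```

Because the substituted values appear unquoted in the sentence, the trained parser learns to recognize parameters even when the user does not quote them.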

The Model

Genie's model is a modified version of the Multi-Task Question Answering Network (MQAN), a model developed by Salesforce for a number of NLP tasks[1]. Here is the high-level model architecture Genie uses:

The architecture of the MQAN model, as used in Genie

First, the sentence is combined with the current context, and encoded with bidirectional recurrent and self-attention layers. Then the result is decoded with a recurrent and self-attentive auto-regressive decoder. The decoder makes use of a language model layer, which is pre-trained on a large unsupervised automatically generated set of programs. This exposes the model to programs outside of the training set. For the details, please refer to our paper.
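The self-attention layers mentioned above can be illustrated in isolation. This is not the MQAN model itself, just a NumPy sketch of the scaled dot-product self-attention building block, in which each token's new representation is a weighted mixture of all tokens:

```python
import numpy as np

# Sketch of one scaled dot-product self-attention layer (without the learned
# projections a real model would have): pairwise similarity scores, a row-wise
# softmax, then a weighted mixture of all token representations.
def self_attention(X):
    """X: (sequence_length, d) matrix of token representations."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row
    return weights @ X                               # mix tokens by attention

X = np.random.RandomState(0).randn(5, 8)             # 5 tokens, 8 dimensions
Y = self_attention(X)
print(Y.shape)   # (5, 8): same shape, but each token now attends to all others
```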

How Well Does It Work?

To evaluate how well Genie learns to understand ThingTalk commands, we designed four experiments, corresponding to four increasingly difficult evaluation sets. The first experiment evaluates only on paraphrase data, collected in the same way as the paraphrase training data. Evaluating on data collected the same way as the training data is standard methodology in machine learning. Yet, we know that paraphrases are not necessarily as hard as real data, so we also evaluate on realistic data. Some of this data comes from If This Then That: we found the most popular recipes that Thingpedia supports. The rest was acquired using the cheatsheet methodology: crowd workers are shown the Almond cheatsheet, and are asked to combine two commands from it. We saved some of the realistic data for validation: this is the data we used to tune our templates and our model; the rest became our test sets.

Here are the results of Genie on this data:

Accuracy of the Genie model on the four test sets, compared with training just with paraphrases and just with synthesized data

We first observe that paraphrases are relatively easy: Genie achieves an accuracy of 87%, indicating it can fit the training data well. Realistic data, on the other hand, is (unsurprisingly) much harder. Yet Genie still achieves an accuracy of 68% on the validation data, 62% on the cheatsheet test data, and 63% on the IFTTT test data. This underscores the need to acquire realistic data for evaluation. At the same time, it is possible to achieve good accuracy without realistic data in training. 68% accuracy is not yet production ready, but it is sufficient for a beta deployment, which in turn leads to acquiring real data.

For comparison with the existing work that used paraphrases exclusively, we also trained a model with just paraphrase data, and just synthesized data. Training with paraphrase data alone is sufficient to achieve good results on paraphrases (82% accuracy) but not on realistic data, showing a drop between 13% and 16% in accuracy on the three realistic evaluation sets. Training with synthesized data fares a little better on realistic data, because the synthesized set can be made much larger and reduce overfitting, but the model is still not robust enough to handle realistic data.

In the paper, we also discuss the changes and improvements we had to make to ThingTalk to achieve this result. Without those changes, our paraphrase accuracy would have been a meager 48%. Almond was really not usable back then! For all the details, please refer to our paper.

We also evaluated Genie on 3 additional target languages (extensions of ThingTalk): a skill specialized to Spotify, the ThingTalk Access Control Language, and an aggregation extension (to say "how many X" or "what is the sum of X"). In all experiments we found that Genie improves the baseline by at least 19%. This shows that Genie is not just useful for ThingTalk, but can be used by developers to build specialized virtual assistants.

Where Can I Get Genie?

Genie is now part of the Almond platform, and it is available on GitHub and NPM. It can be used standalone, or it can be used as part of Thingpedia, by opening a Thingpedia Developer Account. We also welcome contributions to Genie, whether bug fixes, new construct templates, or support for languages other than English. If you're interested in using Genie, please reach out to us on our Community Forum.

Our datasets and our trained models are all freely available. We hope that Genie will prove useful for developers to create cost-effective semantic parsers for their own domains. By collecting contributions in template constructs, Thingpedia entries, and natural language sentences from developers in different domains, we can potentially grow Almond's model to be the best publicly available semantic parsing model.

(Cover image courtesy of Brick Resort, CC BY-NC-ND 2.0.)

  1. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv preprint arXiv:1806.08730.

  2. Giovanni Campagna, Silei Xu, Mehrad Moradshahi, Richard Socher, and Monica S. Lam. 2019. Genie: A Generator of Natural Language Semantic Parsers for Virtual Assistant Commands. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (to appear), Phoenix, AZ, June 2019.

  3. Thomas Kollar, Danielle Berry, Lauren Stuart, Karolina Owczarzak, Tagyoung Chung, Lambert Mathias, Michael Kayser, Bradford Snow, and Spyros Matsoukas. 2018. The Alexa Meaning Representation Language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). Association for Computational Linguistics.

  4. Vittorio Perera, Tagyoung Chung, Thomas Kollar, and Emma Strubell. 2018. Multi-task learning for parsing the Alexa Meaning Representation Language. In American Association for Artificial Intelligence (AAAI). 181–224.

  5. Shared Task: Machine Translation. ACL 2014 Ninth Workshop on Statistical Machine Translation (WMT 14).

  6. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.

  7. Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a Semantic Parser Overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 1332–1342.

  8. Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.