ECMAScript Action Tags for JSGF

Proposal for discussion
2 September 1999

Bruce Lucas (IBM)
Will Walker (Sun Microsystems)
Andrew Hunt (Sun Microsystems)


This document describes a proposed mechanism that allows grammars written in the Java™ Speech API Grammar Format (JSGF) to use the JSGF tagging mechanism together with ECMAScript (the standardized version of Netscape's JavaScript) to specify a transformation from an utterance to information that is meaningful to the application. The information is returned in the form of ECMAScript values, such as strings and sets of attribute-value pairs (ECMAScript objects).

By embedding semantic interpretation into the syntactic definition of grammars, this proposal is intended to address the following technical challenges in developing and using speech recognition applications.


"Rule-based" or "phrase-structure" grammars in general, and JSGF in particular, by themselves only allow an application developer to specify the legal utterances - sequences of words - that the user may say.  However, the sequence of words is typically not by itself very useful to an application.  Consider the following examples:

Utterance                                      Application needs

five thousand three hundred and six            5306

December 24th, 1998                            "1998/12/24" or {year:1998, month:12, day:24}
the day before Christmas, last year

I want to fly from Boston to Chicago.          {action:"fly", from:"Boston", to:"Chicago"}
Hikoki-de, Boston-kara, Chicago-made ikitai

Note that this table illustrates two kinds of values that are useful to applications: simple values such as numbers or strings, and sets of attribute-value pairs.  Simple values are useful, for example, in grammars for basic types such as numbers, dates, and times.  Simple values are also useful in simple command & control applications and in directed-dialog applications in which the user is asked a question and is then expected to supply a single piece of information. Sets of attribute-value pairs are useful in more complex command & control applications and in more sophisticated dialog applications, in which any utterance may simultaneously provide several pieces of information to the application.

The remainder of this document discusses a proposed method for embedding ECMAScript in JSGF tags that transforms utterances into information meaningful to an application. ECMAScript is a relatively powerful and flexible object-oriented programming language. It provides, for example, means to construct arrays, fill multiple optional slots, construct objects within objects within objects, perform trivial and complex numerical operations, manipulate dates/strings and other standard object types and much more. The ECMAScript specification is also thorough, which helps to eliminate different behavior between implementations. Finally, because ECMAScript is becoming more commonly used in web page development, the learning curve for developing JSGF grammars with ECMAScript Action Tags can be greatly reduced.

Parse trees

The ECMAScript action tag mechanism will be described with reference to the parse tree corresponding to an utterance. Together, a grammar and an utterance define a parse tree (assuming no ambiguity).  A parse tree can be viewed as a reduced version of the grammar that preserves only the non-terminals, terminals, tags, and sequences from the original grammar that correspond to the content of the utterance, and in which a separate copy of each non-terminal has been made for each use of the non-terminal. For the purposes of this document, parse trees will be represented in outline form. For example, consider the following grammar:

    <city> = New York {this.$value="NYC"} | Boston {this.$value="BOS"};

    public <top> = I want to (fly {this.action="fly"} | drive {this.action="drive"})
                   from <city> {this.from=$}
                   to   <city> {this.to=$city};

The utterance "I want to fly from New York to Boston", when parsed against this grammar, produces the following parse tree:


As we will see below, when evaluated this parse tree will produce the ECMAScript value {action:"fly", from:"NYC", to:"BOS"} for the application to use.

In our parse trees, all JSGF structure other than non-terminal references is flattened.  In particular, parenthesized expressions, optional items, repeated items, and tagged items are flattened to a single level in the tree.

Parse tree evaluation

The purpose of parse tree evaluation is to recursively compute a value for each non-terminal in the tree.  The value for each parent non-terminal is computed by the action tags contained in the parent non-terminal, possibly using the values computed for its child non-terminals in the tree.

Thus, for a well-written grammar we should define action tags so that each non-terminal will return a value that is a computer-understandable transformation of the spoken tokens that match the non-terminal.

For each non-terminal in a tree the action tag mechanism allocates a new object that represents the value of the non-terminal.  The purpose of action tags is to construct the non-terminal value object in which they are directly contained by assigning values to fields of the object.  The set of action tags for a non-terminal taken together act something like the body of a constructor for the object associated with the non-terminal:

The object constructed by the action tags for a child non-terminal may then be used in the enclosing parent non-terminal to construct its value object by referring to the child non-terminal in one of two ways:
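The grammars in this document refer to child values using two forms, `$` and `$name` (for example `$city`). The following is a minimal sketch of how those references might resolve, assuming that `$` denotes the most recently matched child non-terminal and `$name` the most recently matched child with that name, and assuming that the last tag of `<top>` assigns `this.to`. The `makeRefs` helper and the use of plain strings for the child values are hypothetical simplifications, not part of the proposal.

```javascript
// Hypothetical tracker for child non-terminal values (illustration only).
function makeRefs() {
  const refs = { $: undefined };
  // Record a matched child so later tags can read it as `$` or `$<name>`.
  function match(name, value) {
    refs.$ = value;
    refs["$" + name] = value;
  }
  return { refs, match };
}

// Walk the items of <top> for "I want to fly from New York to Boston".
const top = {};                       // value object for <top>
const { refs, match } = makeRefs();

top.action = "fly";                   // {this.action="fly"}
match("city", "NYC");                 // from <city>  (New York)
top.from = refs.$;                    // {this.from=$}
match("city", "BOS");                 // to <city>    (Boston)
top.to = refs.$city;                  // {this.to=$city}  (assumed final tag)

console.log(JSON.stringify(top));
```

Because each match overwrites both `$` and `$city`, a tag always sees the child that precedes it, which is why the two `<city>` references yield different values.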

While the value of each non-terminal is an object, it is also useful in some cases for a non-terminal to be treated as a simple value such as a number or string.  The standard ECMAScript toString and valueOf object methods allow this to be accomplished.  To provide a simple value for a non-terminal, its action tags may assign to a special field, this.$value, in the non-terminal's value object.  The action tag mechanism supplies for each non-terminal's value object a toString and a valueOf method that return this.$value.  The ECMAScript interpreter automatically calls the toString and valueOf methods when a reference is made to the non-terminal value in a context where a simple value such as a number or string is required.
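The conversion behavior described above can be sketched in plain ECMAScript. The `makeValueObject` factory is a hypothetical stand-in for whatever the action tag mechanism would actually supply; only `$value`, `toString`, and `valueOf` come from the text.

```javascript
// Hypothetical factory for a non-terminal's value object (illustration only).
function makeValueObject(initialValue) {
  return {
    $value: initialValue,
    // Both conversions defer to the $value field, as the proposal describes.
    toString() { return String(this.$value); },
    valueOf() { return this.$value; },
  };
}

const digit = makeValueObject("6");   // e.g. <0to9> matched the token 6

const padded = "0" + digit;           // string concatenation uses $value
const doubled = digit * 2;            // arithmetic converts $value to a number
const label = `digit ${digit}`;       // template literals call toString

console.log(padded, doubled, label);
```

This is why a tag such as `{this.$value="0"+$0to9}` can treat `$0to9` as if it were a string even though it is an object.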

The default value for this.$value (and therefore for the non-terminal value object when used in a context where a simple value is needed) is a string which is the concatenation of the string values of all the items in the non-terminal, separated by spaces.

In addition, the action tag mechanism computes for each non-terminal object a special field, this.$tokens, that contains an array of strings containing the words (terminals) used by the non-terminal and any non-terminals that it directly or indirectly references.   In summary each object has the following special fields:
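As a rough illustration of the defaults for `$value` and `$tokens` described above, the sketch below computes both for a small tree that has no action tags. The node shape (`{ name, items }`), the rule name `phrase`, and the function name `evaluateDefaults` are assumptions made for this example.

```javascript
// Assumed node shape: { name, items }, where each item is either a word
// (a string) or a child node of the same shape.
function evaluateDefaults(node) {
  const value = { $tokens: [], $value: "" };
  const parts = [];
  for (const item of node.items) {
    if (typeof item === "string") {
      // A terminal contributes itself to both the tokens and the value.
      value.$tokens.push(item);
      parts.push(item);
    } else {
      // A child non-terminal contributes all of its tokens and its own value.
      const child = evaluateDefaults(item);
      value.$tokens.push(...child.$tokens);
      parts.push(child.$value);
    }
  }
  // Default $value: string values of the items, separated by spaces.
  value.$value = parts.join(" ");
  return value;
}

const who = { name: "who", items: ["world"] };
const phrase = { name: "phrase", items: ["hello", "there", who] };
const result = evaluateDefaults(phrase);

console.log(result.$value, result.$tokens);
```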


Hello World

The following grammar:

    <hi> = yo {this.$value="hello"} | hello;
    <who> = world | fred;
    public <helloworld> = <hi> <who> {this.greeting=$hi; this.recipient=$who};

when used to parse the utterance

    yo world

produces the following parse tree:

        {this.greeting=$hi; this.recipient=$who}

which when evaluated produces the ECMAScript value

    {greeting:"hello", recipient:"world"}
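One way to picture the evaluation of this example is the recursive sketch below. It is not the proposed implementation: the tree encoding (`{ name, items }`), the representation of action tags as functions taking `(self, refs)`, and the `evaluate` function itself are all assumptions made for illustration. Note that in this sketch `greeting` and `recipient` hold value objects, which convert to the strings shown above.

```javascript
// Assumed node shape: { name, items }. An item is a word (string), a child
// node, or an action tag written as a function (self, refs) => void.
function evaluate(node) {
  const self = {
    $tokens: [],
    toString() { return String(this.$value); },
    valueOf() { return this.$value; },
  };
  const refs = {};   // holds `$` and `$name` references to child values
  const parts = [];  // string values of the items, for the default $value
  for (const item of node.items) {
    if (typeof item === "string") {
      self.$tokens.push(item);
      parts.push(item);
    } else if (typeof item === "function") {
      item(self, refs);              // run the action tag in order
    } else {
      const child = evaluate(item);  // recurse into the child non-terminal
      self.$tokens.push(...child.$tokens);
      parts.push(String(child));
      refs.$ = child;
      refs["$" + item.name] = child;
    }
  }
  // Fall back to the default only if no tag assigned $value.
  if (self.$value === undefined) self.$value = parts.join(" ");
  return self;
}

// Parse of "yo world" against the <helloworld> grammar.
const tree = {
  name: "helloworld",
  items: [
    { name: "hi", items: ["yo", (self) => { self.$value = "hello"; }] },
    { name: "who", items: ["world"] },
    (self, refs) => { self.greeting = refs.$hi; self.recipient = refs.$who; },
  ],
};

const result = evaluate(tree);
console.log(String(result.greeting), String(result.recipient));
```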

This illustrates three points concerning action tags:


The following simple number grammar accepts spoken number phrases less than one million and returns a string containing the number in numeric form.  (A portion of the <10to99> rule has been omitted from the version shown here for the sake of brevity.)

    <1to9> = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ;

    <0to9> = oh {this.$value="0"} | 0 | <1to9>;

    <10to99> = 10 | 11 | 12 | ... | 99 ;

    <1to99> = <1to9> | <10to99>;

    <00to99> = [oh] <0to9> {this.$value="0"+$0to9} | <10to99>;

    <1to999>
        = <1to9> [hundred [and]] <00to99> {this.$value = $1to9 + $00to99}
        | <1to9> hundred                  {this.$value = $1to9 + "00"}
        | <1to99>                         {this.$value = $1to99};

    <000to999>
        = <0to9> [hundred [and]] <00to99> {this.$value = $0to9 + $00to99}
        | <0to9> hundred                  {this.$value = $0to9 + "00"}
        | <00to99>                        {this.$value = "0" + $00to99};

    <1to999999>
        = <1to999> thousand <000to999>    {this.$value = $1to999 + $000to999}
        | <1to999> thousand and <00to99>  {this.$value = $1to999 + "0" + $00to99}
        | <1to999> thousand               {this.$value = $1to999 + "000"}
        | <1to99> hundred [and] <00to99>  {this.$value = $1to99 + $00to99}
        | <1to99> <10to99>                {this.$value = $1to99 + $10to99}
        | <1to99> hundred                 {this.$value = $1to99 + "00"}
        | <1to99>                         {this.$value = $1to99};

    public <number> = oh {this.$value="0"} | 0 | <1to999999>;
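To make the string concatenation in these tags concrete, the sketch below traces the utterance "five thousand three hundred and six" through the relevant alternatives. It assumes that the digit tokens in the grammar stand for the corresponding spoken words, and it replaces the value objects with plain strings. The function names are hypothetical; each one mirrors the action tag of a single alternative.

```javascript
// Each function mirrors one alternative's action tag. Arguments stand for
// the string values of the child rules referenced by that tag.

// <00to99> = [oh] <0to9> {this.$value="0"+$0to9}
const rule00to99 = ($0to9) => "0" + $0to9;

// <0to9> [hundred [and]] <00to99> {this.$value = $0to9 + $00to99}
const rule000to999 = ($0to9, $00to99) => $0to9 + $00to99;

// <1to999> thousand <000to999> {this.$value = $1to999 + $000to999}
const rule1to999999 = ($1to999, $000to999) => $1to999 + $000to999;

// "five thousand three hundred and six"
const six = rule00to99("6");                      // "six" as two digits
const threeOhSix = rule000to999("3", six);        // "three hundred and six"
const number = rule1to999999("5", threeOhSix);    // "five thousand ..."

console.log(six, threeOhSix, number);
```

The zero padding in `<00to99>` and `<000to999>` is what keeps each group the right width, so that simple concatenation yields the correct digits.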

This grammar illustrates:


The following grammar illustrates the use of action tags for a simple mixed-initiative form-filling dialog system for making appointments.

    <ondate> = [on] <> {date=$};
    <attime> = [at] <test.time> {time=$};
    <gorp>   = [(I'd|I) (like|want) to] (make|schedule) (an appointment|a meeting);

    <appt> = <gorp> [<ondate> [<attime>]]
           | <gorp> <attime> [<ondate>]
           | <ondate> [<gorp> [<attime>]]
           | <ondate> <attime> [<gorp>]
           | <attime> [<gorp> [<ondate>]]
           | <attime> <ondate> [<gorp>];

    public <appointment> =
        <NULL> {var date, time} <appt> {this.date=date; this.time=time};

This grammar allows the user to take the initiative (by making a complete or partial request) or to respond when the computer takes the initiative (by prompting the user for missing information) as illustrated by the following table:

Utterance                                                         Returned value

I'd like to make an appointment on January third at two o'clock   {date:"1/3", time:"2:00"}
schedule an appointment on the fourth of February                 {date:"2/4"}
at five thirty                                                    {time:"5:30"}

This grammar illustrates:

Airline reservation

The following grammar (the airline reservation grammar that was presented above):

    <city> = New York {this.$value="NYC"} | Boston {this.$value="BOS"};

    public <top> = I want to (fly {this.action="fly"} | drive {this.action="drive"})
                   from <city> {this.from=$}
                   to   <city> {this.to=$city};

illustrates a few points concerning action tags:

Pizza toppings

The following grammar for ordering pizza:

     <topping> = mushrooms | pepperoni | onions | anchovies;
     <toppings> = <NULL>           {this.toppings = new Array()}
                  <topping>        {this.toppings=this.toppings.concat($topping)}
                  ([and] <topping> {this.toppings=this.toppings.concat($topping)})*;

     public <ask> = I would like a pizza with <toppings>
                    {this.item="pizza"; this.toppings=$toppings.toppings};

when used to parse the utterance

     I would like a pizza with onions, mushrooms and pepperoni

produces the following parse tree:

             {this.toppings = new Array()}
             {this.toppings = this.toppings.concat($topping)}
             {this.toppings = this.toppings.concat($topping)}
             {this.toppings = this.toppings.concat($topping)}
         {this.item="pizza"; this.toppings=$toppings.toppings}

The value of each successive instance of <topping> is the word spoken in that instance (e.g., "onions", "mushrooms", "pepperoni"). The <toppings> non-terminal gets a value that is an ECMAScript indexed array:

     ["onions", "mushrooms", "pepperoni"]

Finally, the <ask> non-terminal returns a compound ECMAScript object with two named attributes:

     {item: "pizza",  toppings: ["onions", "mushrooms", "pepperoni"]}
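The accumulation performed by the `<toppings>` tags can be sketched by running them in order in plain ECMAScript. As a simplifying assumption, each `$topping` is represented here by the spoken word as a plain string rather than by a value object; the variable names `spoken`, `toppingsValue`, and `ask` are hypothetical.

```javascript
// Words matched by the successive <topping> references in the utterance.
const spoken = ["onions", "mushrooms", "pepperoni"];

// Value object for <toppings>.
const toppingsValue = {};
toppingsValue.toppings = new Array();            // {this.toppings = new Array()}
for (const word of spoken) {
  const $topping = word;                         // assumed: a plain string value
  // {this.toppings=this.toppings.concat($topping)}
  toppingsValue.toppings = toppingsValue.toppings.concat($topping);
}

// Value object for <ask>.
const ask = {};
const $toppings = toppingsValue;
ask.item = "pizza";                              // {this.item="pizza"; ...
ask.toppings = $toppings.toppings;               //  this.toppings=$toppings.toppings}

console.log(JSON.stringify(ask));
```

Since `concat` returns a new array rather than modifying the existing one, each tag has to assign the result back to `this.toppings`, as the grammar does.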