------------------------------------------------------------

MUSIIC

Multimodal User Supervised Interface and Intelligent Control

------------------------------------------------------------

Pseudo Natural Language [PNL] Semantics for MUSIIC

Draft: for internal circulation

Zunaid Kazi

1.0 User Interface

User-to-system communication has two main destinations:


The knowledge base, for initializing, updating, instructive, and reactive purposes.
The robot subsystem, directly through actions specified by speech and indirectly via the vision system through gesture deictics.

2.0 The Planner/Executor Subsystem [PES]

The goal of the PES is to facilitate communication between the user, the robot, and the knowledge bases, and to satisfy the user's intentions. While a natural language interface is ideal, the current state of the art in natural language research precludes the use of such interfaces. A multimodal combination of speech and gesture deictics is a better alternative for an assistive device, where the input speech is a restricted subset of natural language, a pseudo-natural language [PNL]. We can then apply model-based procedural semantics [Winograd, Suppes, Crangle], where words are interpreted as procedures that operate on a model of the robot's physical environment. One of the major questions in procedural semantics has been the choice of candidate procedures: without any constraints, one procedural account might be preferred over another, and there will be no shortage of candidate procedures. The restrictive PNL and the finite set of manipulatable objects in the robot's domain provide this much-needed constraint. Following the argument of Crangle and Suppes, we need to focus on the user's communicative intentions in order to evaluate the adequacy of any semantic account extended to include procedural encodings. Following their approach, we need to:


* Define a class of intentions
* Define a set of procedures that will satisfy the user intentions
* Define satisfaction conditions to determine successful intention fulfillment
* Construct proofs that these conditions can be met.

2.1 User Intentions

Consider the user command:


Put the book on the table.
The exact mode of approaching the book and the path followed by the robot are not essential to satisfying the user's intention. While the details of the actual procedures invoked to satisfy user intentions are not required by the user, expressed intentions carry with them conditions that may restrict the procedures actually being invoked. These conditions are not given in advance; they depend on the context in which the procedures are invoked. Satisfaction thus entails satisfying the user's intention as well as the equally important associated conditions that are not necessarily specified directly by the user.

Let us consider the intentions that the user wishes to communicate to the robotic arm:

Meta intentions:


That the robot perform a sequence of operations
That the robot pursue two or more goals simultaneously
That the robot pursue a goal only if a condition is met
That the robot pursue one goal repeatedly
That the robot pursue one goal repeatedly until a given condition is met
That the robot pursue a goal when a condition is met

That the robot perform a specified operation slower than normal
That the robot perform a specified operation faster than normal
That the robot go to a given region
That the robot avoid a given region
That the robot approach an object
That the robot grasp an object
That the robot pick an object up
That the robot move an object to a certain location
That the robot insert an object into another
That the robot rotate an object to a given orientation

There are also certain conditions that the arm should be able to detect:

That the arm is in a given region

That the gripper is free

That the arm is not touching anything

2.2 Robot level procedures:

The robot's low-level operations can be classified into the following four categories:

2.2.1 Control procedures:


(Series A1 A2 A3...)
Execute A1 and so on in sequence
(Parallel A1 A2)
Execute A1 and A2 simultaneously
(If Condition A1)
If Condition is true execute A1
(Repeat A1)
Repeatedly execute A1
(Repeat-until A1 Condition)
Repeatedly execute A1 until Condition is true
(Whenever Condition A1)
Execute A1 whenever Condition becomes true
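
A minimal Python sketch of how a subset of these control procedures might be realized as higher-order functions over action and condition thunks (our own illustration, not the MUSIIC implementation; move_to and arm_in_location in the commented example are hypothetical robot routines):

# Illustrative sketch only: control procedures as higher-order functions.
# Actions and conditions are zero-argument callables (thunks).
import threading

def series(*actions):
    # (Series A1 A2 A3 ...): execute A1, A2, ... in sequence.
    for act in actions:
        act()

def parallel(*actions):
    # (Parallel A1 A2): execute the actions simultaneously, one thread each.
    threads = [threading.Thread(target=act) for act in actions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def if_(condition, action):
    # (If Condition A1): execute A1 only if Condition is currently true.
    if condition():
        action()

def repeat_until(action, condition):
    # (Repeat-until A1 Condition): execute A1 until Condition becomes true.
    while not condition():
        action()

# Example: (Repeat-until (Move-to location) (Arm-in-location?)), where
# move_to and arm_in_location are hypothetical robot routines:
# repeat_until(lambda: move_to(location), lambda: arm_in_location(location))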

2.2.2 Motion procedures:


(Move location distance path)
(Move-to location)
(Rotate orientation)
(Gripper open)
(Gripper close)
(Pause)
(Resume)
(Home)
(Slow)
(Fast)
(Stop)

2.2.3 Test procedures:


(Arm-in-location?)
(Gripper-open?)
(Touching?)

2.2.4 Cognitive and perceptual procedures:

Routines that interact with the USER, the knowledge base, and the vision system to build a perceptual and cognitive model of the domain. The three different sets of routines will be elaborated in a later section. These procedures obtain object attributes, properties, and relational information. Currently we have:


(Object-type A)
Returns the type of object A at the defined level of specialization, e.g., my-cup, a-cup, etc.
(Property object property-type)
Returns the property value defined by property-type of the object. Property-type might be color and the property value might be red.
(Relationship object1 object2)
Returns the geometric relationship between the two objects, such as object1 being above object2.

These procedures need to be further formalized.
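
As a first step towards such a formalization, the following Python sketch shows the three query procedures running over a toy attribute store (the object names, attributes, and positions are hypothetical examples, not the actual knowledge base):

# Illustrative sketch: a toy perceptual/cognitive model backing the
# (Object-type ...), (Property ...) and (Relationship ...) procedures.
objects = {
    "my-cup": {"type": "cup", "color": "red", "position": (0.4, 0.2, 0.0)},
    "book-1": {"type": "book", "color": "blue", "position": (0.4, 0.2, 0.05)},
}

def object_type(obj):
    # Returns the type of the object at the defined level of specialization.
    return objects[obj]["type"]

def property_of(obj, property_type):
    # Returns the value of the named property, e.g. "color" -> "red".
    return objects[obj].get(property_type)

def relationship(obj1, obj2):
    # Returns a geometric relationship between the two objects,
    # here only "above"/"below" derived from the z coordinate.
    z1 = objects[obj1]["position"][2]
    z2 = objects[obj2]["position"][2]
    return "above" if z1 > z2 else "below"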

2.2.5 Illustration

More complex motion routines can be built up as a function of the previously described motion, test and control procedures.

As an example: To put an object in a certain location, we might have:

(Series

(Move-to location)

(Gripper open)

(Home)

(Gripper close))

3.0 Interpreting Multimodal Instructions:

3.1 The multimodal interface

Speech provides categorical (object), property (color, etc.), and qualifying/quantifying information.

Gestures provide both shape and spatial information. However, if the gestures are deictic, then we can only obtain the spatial configuration of elements.

The information that is obtained from this multimodal input can be categorized into


* shape, appearance and property (visual)
* spatial and temporal information
* categorical or abstract information
* task information

3.1.1 Semantic Interpretation for Robot Control

This covers the semantic interpretation of USER-PLANNER communication for robot control.

Consider a typical MUSIIC instruction for the robot (words in [] imply combined speech and gesture deictics):


Move [that] [there].

Analyzing the components:

Move -> TASK specification / ACTION; semantic analogue of a verb in NL.

[that] -> Deictic that gets instantiated to an OBJECT/THING; the NL direct object.

[there] -> Deictic that gets instantiated to a LOCATION.

From a purely speech input, we may have an instruction such as:


Push slowly the blue book next to the red cup 2 feet towards me.

Mapping the major syntactic components of this sentence to their corresponding semantic elements, we obtain:


Push -> TASK/ACTION
slowly -> TASK-QUALIFIER
the blue book -> THING (TASK FOCUS)
next to the red cup -> LOCATION (SOURCE)
2 feet -> QUANTITY
towards me -> LOCATION (DESTINATION)

In essence a typical instruction would have the following semantic format:


TASK
TASK-QUALIFIER
TASK-FOCUS
SOURCE-LOCATION
QUANTITY
DEST-LOCATION
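
A minimal sketch of this semantic format as a data structure, filled in for the example sentence above (the Python field names are our own; the actual internal representation may differ):

# Illustrative sketch: the semantic format of an instruction as a record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instruction:
    task: str                       # TASK
    task_qualifier: Optional[str]   # TASK-QUALIFIER
    task_focus: Optional[str]       # TASK-FOCUS (a THING)
    source_location: Optional[str]  # SOURCE-LOCATION
    quantity: Optional[str]         # QUANTITY
    dest_location: Optional[str]    # DEST-LOCATION

# "Push slowly the blue book next to the red cup 2 feet towards me."
example = Instruction(
    task="push",
    task_qualifier="slowly",
    task_focus="the blue book",
    source_location="next to the red cup",
    quantity="2 feet",
    dest_location="towards me",
)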

While a complete natural language mechanism is not desired at this point, a syntactic structure that simulates, to a certain extent, the (restricted) syntax of natural language would make the user feel more comfortable with the system.

The Semantic Units (SU) being used are:

TASK: The action that is to be performed.

TASK-QUALIFIER: Qualifies how the action is to be invoked, e.g., slowly or fast.

TASK-FOCUS: The THING on which the TASK is invoked.

SOURCE-LOCATION: Of type LOCATION

QUANTITY: Spatial or temporal extent of the TASK.

DESTINATION-LOCATION: Of type LOCATION

THING: An SU similar to a noun phrase in NL. Elements of THING are an optional {ART}, an optional {ADJ}, and an {OBJECT}.


ART: a, an, the, that, this
ADJ: Object qualifier; properties such as weight, color, size, surface.
OBJECT: The actual manipulatable object. Both abstract as well as specific.
LOCATION: An SU that maps to an OBJECT's position in the world with respect to a certain frame of reference. What is also needed are location functions (LF) to define locational relationships such as "in", "inside", "above", "below", etc. An LF takes a locational relationship and a THING and maps them to a LOCATION (see the sketch after this list).

QUANTITY: A spatial or temporal quantity.
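
A minimal sketch of a location function, assuming LOCATIONs are represented as 3-D coordinates and relationships as fixed offsets (both assumptions are ours, purely for illustration):

# Illustrative sketch: a location function (LF) mapping a locational
# relationship and a THING's position to a LOCATION (a 3-D coordinate).
def location_function(relation, thing_position):
    x, y, z = thing_position
    offsets = {
        "above":  (0.0, 0.0, 0.10),
        "below":  (0.0, 0.0, -0.10),
        "left":   (-0.10, 0.0, 0.0),
        "right":  (0.10, 0.0, 0.0),
        "inside": (0.0, 0.0, 0.0),
    }
    dx, dy, dz = offsets[relation]
    return (x + dx, y + dy, z + dz)

# e.g. "above the red cup" -> location_function("above", red_cup_position),
# where red_cup_position would come from the perceptual model.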

3.2 Semantic interpretation for knowledge-base interaction

The knowledge base contains information about the robot's perceptual environment, updated both from vision data and from user interaction. The semantics of the user's interaction with the knowledge base also needs to be specified. In addition, learning, instruction, and plan correction entail user interaction with the knowledge base. The class of user instructions in this case is sufficiently different that we look at this interaction separately.

3.2.1 Knowledge base initializing dialogues

We assume that a basic object hierarchy has already been defined, and the user needs to interact at most at the level where generic object types are spawned off the shape-based hierarchy.


(Update)
Instruction that sets the context for knowledge base interaction
(Spawn type parent)
Spawning an object or class from parent.
(Name type)
Naming the object or class
(Add-property type attribute value)
Adding a property attribute-value pair for type.
(Edit-property object attribute value)
Editing a property attribute-value pair for the object.
(Add-action)
This adds an action to the knowledge base. This invokes the context for instruction dialogues.
(Edit-action)
This edits an action in the knowledge base. This invokes the context for instruction dialogues.
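
A toy Python sketch of the object hierarchy behind these dialogue primitives (class and method names are our own assumptions, not the actual MUSIIC knowledge base):

# Illustrative sketch: a toy object hierarchy supporting Spawn/Add-property/
# Edit-property style interactions.
class KnowledgeBase:
    def __init__(self):
        # Pre-defined shape-based root classes.
        self.types = {"cylindrical": {}, "cuboid": {}, "spherical": {}, "amorphous": {}}
        self.parent = {}

    def spawn(self, new_type, parent):
        # Spawn a new object or class from an existing parent class,
        # inheriting its properties (copied at spawn time in this toy version).
        self.types[new_type] = dict(self.types[parent])
        self.parent[new_type] = parent

    def add_property(self, type_name, attribute, value):
        # Add a property attribute-value pair to a type or object.
        self.types[type_name][attribute] = value

    edit_property = add_property  # editing simply overwrites the stored value

kb = KnowledgeBase()
kb.spawn("cup", "cylindrical")
kb.spawn("my-cup", "cup")
kb.add_property("my-cup", "color", "red")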

3.2.2 Instruction dialogues

Each instruction dialogue is a sequence of trials and/or steps; this needs to be better formalized. The instructions can be provided off-line, i.e. when the Action Base is being updated, or on-line when a new skill is being taught on the fly. Feedback needs to be provided to the PES during on-the-fly instruction. Both on-line and off-line instruction require a two-way dialogue between the USER and the PES.

3.2.2.1 On-line instruction

In this method the robot arm is physically controlled by the user, again through the PNL. The whole sequence of actions is then encoded under an action name. The PES then needs to generalize the sequence into an efficient plan.

Initializing Context


(Name-action name)
Providing the name for the action to be specified
(Action-type type)
Type is whether this is a plan-fragment or a complete plan
(Action-object type)
Type specifies a general object or specific object for which this plan is valid


Setting up the actions



General feedback
(reset n)
Undo last n instructions, where n can be zero
(reset-all)
Undo all instructions, i.e. restart from start
(repeat TASK-QUALIFIER)
Repeat the last instruction with the TASK-QUALIFIER

Positional feedback


(bit LOCATION)
Move a "bit" to LOCATION, where "bit" is context dependent and is encoded in the actual plan.
(bit TASK-QUALIFIER)
Move a "bit" with the given TASK-QUALIFIER, where "bit" is context dependent and is encoded in the actual plan.
(more LOCATION)
Self-explanatory
(more TASK-QUALIFIER)
Self-explanatory
(ok)
Successful completion feedback

Finalize


(Consolidate)
Update and generalize the plan.
(Done)
Instruction completed.
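
A simplified sketch of how an on-line instruction sequence might be recorded under an action name, with (reset n)- and (Consolidate)-style operations (our own simplification of the dialogue, not the MUSIIC PES; the recorded step strings and the action name are hypothetical):

# Illustrative sketch: recording on-line instruction steps under an action name.
class InstructionRecorder:
    def __init__(self, action_name):
        self.action_name = action_name
        self.steps = []            # recorded PNL steps, in order

    def record(self, step):
        self.steps.append(step)

    def reset(self, n):
        # Undo the last n instructions (n may be zero).
        if n:
            del self.steps[-n:]

    def consolidate(self):
        # Placeholder for plan generalization; here we just freeze the steps.
        return {"name": self.action_name, "plan": tuple(self.steps)}

rec = InstructionRecorder("pour-drink")
rec.record("(Move-to [there])")
rec.record("(Gripper open)")
rec.reset(1)                       # undo the last step
plan = rec.consolidate()
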
3.2.2.2 Off-line instruction:

An action needs to be defined. Step-by-step instructions, encoded as a sequence of PNL inputs, are provided. The whole sequence is then given a name by which the robot can be instructed at a later date.

3.2.3 Corrective/reactive/learning dialogues

This dialogue is initiated when the PES fails to satisfy a user intention. This is also a two-way dialogue. The syntax and semantics are still being worked out.

3.3 The grammar and the lexicon

Given the procedural semantic interpretation of the PNL, calls to the previously defined robot routines are encoded in the system's grammar and lexicon. The lexical entry for each semantic unit [SU] can be thought of as a robot plan.

In general each plan must stipulate:


* Which of the robot routines must be invoked to respond to the appropriate command and what kind of temporal, spatial and logical constraints come into play during the execution, and
* The number of parameters each routine has and what parts of the user instruction map to arguments of the specified routines.

Communication between the PNL system and the knowledge bases, the robot arm, and the vision system is supervised by the PES. The PES must invoke and monitor all the procedures that are ultimately invoked. The overall subsystem architecture is shown in the following figure.

3.3.1 Parser

The parser applies the grammar rules of the PNL to a USER input sentence and generates the syntactic and semantic components.

3.3.2 Grammar

Encoded as CFGs.
3.3.2.1 Terminal Symbols

TASK (T)
TASK-QUALIFIER (TQ)
ART
ADJ
OBJ
QUANTITY (Q)
LOCATION-FUNCTION (LF)
3.3.2.2 Non-terminal symbols:

SENTENCE (S)
QUALIFIED-TASK (QT)
THING
SOURCE-LOCATION (SL)
DESTINATION-LOCATION (DL)
3.3.2.3 Production Rules

S -> QT
S -> QT + Q
S -> QT + LF
S -> QT + THING
S -> QT + THING + DL
S -> QT + THING + SL + DL
QT -> T
QT -> T + TQ
THING -> OBJ
THING -> ART + OBJ
THING -> ART + ADJ + OBJ
DL -> LF
SL -> LF
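
The production rules can be encoded directly as a table. The following toy Python sketch checks whether a sequence of terminal tags can be derived from a non-terminal; it is an illustration of the CFG above, not the actual MUSIIC parser:

# Illustrative sketch: the PNL production rules as a table, plus a naive
# check that a tag sequence can be derived from a given symbol.
PRODUCTIONS = {
    "S":     [["QT"], ["QT", "Q"], ["QT", "LF"], ["QT", "THING"],
              ["QT", "THING", "DL"], ["QT", "THING", "SL", "DL"]],
    "QT":    [["T"], ["T", "TQ"]],
    "THING": [["OBJ"], ["ART", "OBJ"], ["ART", "ADJ", "OBJ"]],
    "DL":    [["LF"]],
    "SL":    [["LF"]],
}

def derives(symbol, tags):
    # True if `symbol` can derive the given sequence of terminal tags.
    if symbol not in PRODUCTIONS:
        return len(tags) == 1 and tags[0] == symbol
    return any(matches(rhs, tags) for rhs in PRODUCTIONS[symbol])

def matches(rhs, tags):
    # Try every way of splitting `tags` among the symbols of `rhs`.
    if not rhs:
        return not tags
    head, rest = rhs[0], rhs[1:]
    return any(derives(head, tags[:i]) and matches(rest, tags[i:])
               for i in range(1, len(tags) - len(rest) + 1))

# "Move [that] [there]" tags as T OBJ LF and is a valid S.
print(derives("S", ["T", "OBJ", "LF"]))   # True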

3.3.3 The Domain: Object and task specifications

3.3.3.1 Object Set

Shape Based: Cylindrical


Soda Can
Cup

Shape Based: Cuboid


Hard Cover Book
Small Box (Matchbox)
Measuring Scale

Shape Based: Spherical


Ball

Shape Based: Amorphous


Piece of Rock

Property Based


Plasticine Lump
Paperback Book
Straw

Surface Texture Based


Drinking Glass

Task Dependent


Telephone
Plate
Saucer
Spoon
Fork
Knife

3.3.4 Lexicon

TASKS:


Approach
Grasp
Pick-up
Rotate
Insert
Stop
Pause
Resume
Goback
Push

Update
Spawn
Name
Add-property
Edit-property
Add-action
Edit-action

Name-action
Action-type
Action-object
reset
reset-all
repeat
bit
more
ok

TASK-FOCUS


THING:

User Defined


ART:

a
an
the
that
this
those


ADJ:

Color: standard
Weight: heavy, light
Size: huge, big, medium, small, tiny

LOCATIONS:


Location Functions (LF):

inside
in
above
behind
right
left
forwards
backwards


Location arguments:

Distance measures: meters, inches, centimeters
TASK-FOCUS

QUANTITY:


n n n
n THING

TASK-QUALIFIERS:


Slowly
Fast

3.3.5 Semantic Interpretation

Semantic functions are attached to the production rules and may invoke calls to perceptual and cognitive routines. Current thoughts: a choice between a phrase-attribute grammar and the simpler approach of an extra slot, associated with each SU, encoding the procedure; a sketch of the latter follows.
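
A toy sketch of the simpler option, an extra slot per SU holding the procedure it encodes (all names here are our own assumptions, not the actual MUSIIC encoding):

# Illustrative sketch: each SU carries a slot holding the procedure it
# encodes; interpretation invokes the slot of every SU the parse filled.
def do_task(value):        # stand-in for invoking a robot routine
    return ("invoke-task", value)

def resolve_thing(value):  # stand-in for a perceptual/cognitive lookup
    return ("resolve-object", value)

def resolve_location(value):
    return ("resolve-location", value)

SEMANTIC_SLOT = {
    "TASK": do_task,
    "TASK-FOCUS": resolve_thing,
    "DEST-LOCATION": resolve_location,
}

def interpret(frame):
    # `frame` maps SU names to the strings the parser extracted.
    return [SEMANTIC_SLOT[su](value) for su, value in frame.items()
            if su in SEMANTIC_SLOT and value is not None]

print(interpret({"TASK": "move", "TASK-FOCUS": "that", "DEST-LOCATION": "there"}))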

------------------------------------------------------------


Last Updated: March 5, by Zunaid Kazi <kazi@asel.udel.edu>