Multimodal User Supervised Interface and Intelligent Control
Draft: for internal circulation
Zunaid Kazi
User-to-system communication has two main destinations.
The goal of the PES is to facilitate communication between the user, the robot,
and the knowledge bases, and to satisfy the user's intentions. While a natural
language interface is ideal, the current state of the art in natural language
research precludes the use of such interfaces. A multimodal combination of
speech and deictic gesture is a better alternative for use in an assistive
device, where the input speech is a restricted subset of natural language, a
pseudo-natural language (PNL). We can then apply model-based procedural
semantics [Winograd, Suppes, Crangle], where words are interpreted as procedures
that operate on the model of the robot's physical environment. One of the major
questions in procedural semantics has been the choice of candidate procedures:
without any constraints, one procedural account might be preferred over another,
and there will be no shortage of candidate procedures. The restricted PNL and
the finite set of manipulatable objects in the robot's domain provide this
much-needed constraint. Following the argument of Crangle and Suppes, we need to
focus on user intentions in communication in order to evaluate the adequacy of
any semantic account extended to include procedural encodings. Following their
approach we need to:
Consider the user command:
The exact manner of approaching the book and the path followed by the robot are
not essential to satisfying the user's intention. While the details of the
actual procedures invoked to satisfy user intentions are not required by the
user, expressed intentions carry with them conditions that may restrict the
procedures actually invoked. These conditions are not given in advance; they
depend on the context in which the procedures are invoked. Satisfaction
therefore entails satisfying the user's intention as well as the equally
important associated conditions, which are not necessarily specified directly
by the user.
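To make the procedural reading concrete, here is a minimal sketch, in Python, of how a PNL verb might be bound to a procedure that operates on the model of the robot's environment while also checking an implicit condition. The names (WorldModel, approach, gripper_free) are assumptions made for illustration, not part of the MUSIIC implementation.

# Minimal sketch only: names and data layout are assumed for illustration.
class WorldModel:
    """Perceptual/cognitive model of the robot's physical environment."""
    def __init__(self, objects):
        self.objects = objects          # e.g. {"book": (x, y, z)}
        self.gripper_free = True

def approach(world, object_name):
    # Procedural interpretation of a verb such as "get": the path taken is
    # left to the planner; the user's intention and the implicit conditions
    # are what gets checked.
    if object_name not in world.objects:
        raise ValueError("no such object in the model: " + object_name)
    if not world.gripper_free:          # implicit condition, never stated by the user
        raise RuntimeError("gripper is not free")
    return ("move-to", world.objects[object_name])   # planner chooses the actual path

world = WorldModel({"book": (0.4, 0.1, 0.0)})
print(approach(world, "book"))          # -> ('move-to', (0.4, 0.1, 0.0))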
Let us consider the intentions that the user wishes to communicate to the
robotic arm:
There are also certain conditions that the arm should be able to detect (a sketch of such test predicates follows this list):
That the arm is in a given region
That the gripper is free
That the arm is not touching anything
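A minimal sketch of how such test routines might look, assuming a simple ArmState record; the field names and the contact threshold are illustrative, not the actual arm interface.

# Illustrative test predicates; ArmState and its fields are assumptions.
from dataclasses import dataclass

@dataclass
class ArmState:
    position: tuple        # (x, y, z) of the end effector
    gripper_open: bool
    contact_force: float   # reading from a force/touch sensor

def in_region(state, lo, hi):
    # True if the end effector lies inside the axis-aligned box [lo, hi].
    return all(l <= p <= h for p, l, h in zip(state.position, lo, hi))

def gripper_free(state):
    return state.gripper_open

def not_touching(state, threshold=0.05):
    return state.contact_force < threshold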
The robot's low-level operations can be classified into the following four
categories:
Routines that interact with the USER, the knowledge base, and the vision system
to build a perceptual and cognitive model of the domain. The three different
sets of routines will be elaborated in a later section. The procedures might be
used to obtain object attributes, properties, and relational information.
Currently we have:
These procedures need to be further formalized.
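Until they are formalized, the following sketch merely suggests the kind of query routine intended; the knowledge-base layout and the function names are assumptions made for illustration.

# Sketch of perceptual/cognitive query routines over an assumed knowledge base.
knowledge_base = {
    "book": {"color": "red",  "shape": "box",      "on": "table"},
    "cup":  {"color": "blue", "shape": "cylinder", "on": "table"},
}

def attribute_of(obj, attribute):
    # Return a single attribute (e.g. color, shape) of a known object.
    return knowledge_base[obj].get(attribute)

def related_objects(relation, reference):
    # Return the objects standing in the given relation to a reference object.
    return [o for o, props in knowledge_base.items() if props.get(relation) == reference]

print(attribute_of("book", "color"))    # -> 'red'
print(related_objects("on", "table"))   # -> ['book', 'cup']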
More complex motion routines can be built up as functions of the previously
described motion, test, and control procedures.
As an example, to put an object in a certain location, we might have:
(Series
  (Move-to location)
  (Gripper-open)
  (Home)
  (Gripper-close))
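The same composition can be sketched in code: Series simply runs its sub-plans in order, and the primitives below only print what they would do. This is an illustration of the composition idea, not the robot's actual control interface.

# Sketch of plan composition from primitives; primitives are stubs that print.
def series(*steps):
    def run():
        for step in steps:
            step()
    return run

def move_to(location):  return lambda: print("move to", location)
def gripper_open():     return lambda: print("open gripper")
def gripper_close():    return lambda: print("close gripper")
def home():             return lambda: print("return home")

put_object_at = series(move_to("location"), gripper_open(), home(), gripper_close())
put_object_at()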
Speech provides categorical information (objects), property information (color,
etc.), and qualifying/quantifying information.
Gestures provide both shape and spatial information. However, if the gestures
are deictic, then we can only obtain the spatial configuration of elements.
The information that is obtained from this multimodal input can be categorized
into
Semantic Interpretation for USER-PLANNER communication for Robot Control
Looking at a typical MUSIIC instruction syntax for the robot, "Move [that]
[there]" (the words in [ ] imply both speech and deictic gesture):
Analyzing the components:
Move -> TASK specification / ACTION; semantic analog of a verb in NL.
[that] -> Deictic that gets instantiated to an OBJECT/THING; NL subject.
[there] -> Deictic that gets instantiated to a LOCATION.
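The analysis above can be pictured as follows: the spoken verb selects the TASK, while the deictic markers are instantiated from the pointing data into a THING and a LOCATION. The gesture format and the function names below are assumptions made only for illustration.

# Sketch of instantiating "Move [that] [there]"; data formats are assumed.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def resolve_deictic(marker, gesture_points, world_objects):
    # Map a deictic marker plus a pointing gesture to an OBJECT or a LOCATION.
    point = gesture_points[marker]
    if marker == "that":   # nearest known object -> THING
        return min(world_objects, key=lambda o: dist(world_objects[o], point))
    return point           # bare coordinates -> LOCATION

world_objects  = {"book": (0.4, 0.1, 0.0), "cup": (0.9, 0.5, 0.0)}
gesture_points = {"that": (0.42, 0.12, 0.0), "there": (1.0, 0.0, 0.0)}

task        = "move"                                                    # TASK from speech
task_focus  = resolve_deictic("that", gesture_points, world_objects)    # -> 'book'
destination = resolve_deictic("there", gesture_points, world_objects)   # -> (1.0, 0.0, 0.0)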
From a purely speech input, we may have an instruction such as:
Mapping the major syntactic components of this sentence to their corresponding
semantic elements, we obtain:
In essence a typical instruction would have the following semantic format:
While a complete natural language mechanism is not desired at this point, a
syntactic structure that simulates to some extent the (restricted) syntax of
natural language would make the user feel more comfortable with the system.
The Semantic Units (SU) being used are:
TASK: The action that is to be performed.
TASK-QUALIFIER: Qualifies how the action is to be invoked, e.g., slowly or
fast.
TASK-FOCUS: The THING on which the TASK is invoked
SOURCE-LOCATION: Of type LOCATION
QUANTITY: Spatial/Temporal duration of the TASK
DESTINATION-LOCATION: Of type LOCATION
THING: An SU similar to a noun phrase in NL. Elements of THING are
{ART}1, {ADJ}2, and {OBJECT}
LOCATION: An SU that maps to an OBJECT position in the world with respect to a
certain frame of reference. What is also needed is a location function (LF) to
define locational relationships such as "in", "inside", "above", "below", etc.
The LF takes a locational relationship and a THING and maps them to a LOCATION.
QUANTITY: A spatial or temporal quantity.
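A sketch of how these semantic units and the location function might be represented; the field names and the fixed offset used for "above" are assumptions made only for illustration.

# Sketch of semantic-unit structures; THING/LOCATION fields are assumed.
from dataclasses import dataclass, field

@dataclass
class Thing:                 # {ART} {ADJ} {OBJECT}
    obj: str
    adjectives: list = field(default_factory=list)
    article: str = "the"

@dataclass
class Location:              # a position plus a frame of reference
    position: tuple
    frame: str = "world"

def location_function(relation, thing, world):
    # LF: a locational relationship and a THING map to a LOCATION.
    x, y, z = world[thing.obj]
    if relation == "above":
        return Location((x, y, z + 0.10))
    if relation in ("in", "inside", "on"):
        return Location((x, y, z))
    raise ValueError("unknown relation: " + relation)

world = {"cup": (0.9, 0.5, 0.0)}
print(location_function("above", Thing("cup", ["blue"]), world))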
The knowledge base contains information about the robot's perceptual
environment, updated both from vision data and from user interaction. The
semantics of the user's interaction with the knowledge base also needs to be
specified. In addition, learning, instruction, and plan correction entail user
interaction with the knowledge base. The class of user instructions in this
case is sufficiently different that this interaction is examined separately.
We assume that a basic object hierarchy has already been defined and that the
user needs to interact, at most, at the level where generic object types are
spawned off the shape-based hierarchy.
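A minimal sketch of the kind of hierarchy assumed, with a generic object type spawned off a shape-based class; the particular classes are illustrative only.

# Sketch of a shape-based object hierarchy; the classes are placeholders.
class PhysicalObject:
    graspable = True

class Cylinder(PhysicalObject):   # shape-based level of the hierarchy
    shape = "cylinder"

class Cup(Cylinder):              # generic object type spawned off the shape level
    typical_use = "drinking"

print(Cup.shape, Cup.graspable)   # shape and property information are inherited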
Each instruction dialogue is a sequence of trials and/or steps; this needs to
be better formalized. The instructions can be provided off-line, i.e., when the
Action Base is being updated, or on-line, when a new skill is being taught on
the fly. Feedback needs to be provided to the PES during on-the-fly
instruction. Both on-line and off-line instruction require a two-way dialogue
between the USER and the PES.
In this method the robot arm is physically controlled by the user, again
through the use of PNL. The whole sequence of actions is then encoded under an
action name. The PES then needs to generalize the sequence into an efficient
plan.
An action needs to be defined. Step-by-step instructions, encoded as a sequence
of PNL inputs, are provided. The whole sequence is then given a name by which
the robot can be instructed at a later date.
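The teaching step can be pictured as recording the PNL-driven sequence under a name and replaying it on demand; the sketch below, with the Action Base as a plain dictionary, is only an illustration of that flow, prior to any generalization by the PES.

# Sketch of naming a demonstrated sequence; the Action Base is a plain dict here.
action_base = {}

def teach(name, pnl_steps):
    # Store a step-by-step PNL sequence under a new action name.
    action_base[name] = list(pnl_steps)

def invoke(name):
    # Later, the named action can be invoked like any primitive.
    for step in action_base[name]:
        print("execute:", step)

teach("pour", ["move to [that]", "grasp", "move above [there]", "tilt", "untilt"])
invoke("pour")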
This dialogue is initiated when the PES fails to satisfy a user intention. This
is also a two-way dialogue. The syntax and semantics are still being worked
out.
Given the procedural semantic interpretation of the PNL, calls to the
previously defined robot routines are encoded in the system's grammar and
lexicon. The lexical entry for each semantic unit (SU) can be thought of as a
robot plan.
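One way to picture this is a lexicon in which each entry carries the plan that realizes it; the representation below is an assumption made for illustration, not the actual grammar encoding.

# Sketch of lexical entries whose semantic content is a robot plan.
lexicon = {
    "move":  {"su": "TASK",     "plan": ["move-to <LOCATION>"]},
    "get":   {"su": "TASK",     "plan": ["move-to <THING>", "gripper-close", "home"]},
    "there": {"su": "LOCATION", "plan": ["resolve-deictic <gesture>"]},
}

def plan_for(word):
    return lexicon[word]["plan"]

print(plan_for("get"))   # the lexical entry expands to a robot plan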
In general each plan must stipulate:
Communication between the PNL system and the knowledge bases, the robot arm,
and the vision system is supervised by the PES. The PES must invoke and monitor
all the procedures that are ultimately invoked. The overall subsystem
architecture is shown in the following figure.
The parser applies the grammar rules of the PNL to a USER input sentence and
generates the syntactic and semantic components.
Encoded as CFGs.
Semantic functions are attached to the production rules. Semantic functions may
invoke calls to perceptual and cognitive routines. Current thoughts: a choice
between a phrase-attribute grammar or the simpler mode of an extra slot
associated with each SU that encodes the procedure.
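For the simpler alternative, an extra slot per production holding the procedure, a rule might be sketched as follows; the rule format and the attached semantic function are assumptions made only for illustration.

# Sketch of a production rule with an attached semantic function (the extra slot).
def make_move_plan(task_focus, destination):
    # Semantic function for: INSTRUCTION -> TASK THING LOCATION.
    return [("grasp", task_focus), ("move-to", destination)]

grammar = [
    {"lhs": "INSTRUCTION",
     "rhs": ["TASK", "THING", "LOCATION"],
     "semantics": make_move_plan},   # the extra slot encoding the procedure
]

rule = grammar[0]
print(rule["semantics"]("book", (1.0, 0.0, 0.0)))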
Last Updated: March 5, by
Zunaid Kazi
<kazi@asel.udel.edu>