Multimodal User Supervised Interface and Intelligent Control



The primary objective of MUSIIC is the development of a multimodally controlled assistive robot that operates in an unstructured environment. The long-term goal of the project is to attach the assistive robotic arm to a wheelchair, freeing the user from the limitations of a fixed workcell.


The main philosophy of MUSIIC is based on the synergy between humans and machines. In spite of our physical limitations, we excel in perception, reasoning and high-level decision making.

Machines, on the other hand, excel in precision, repeatability and the tireless execution of low-level tasks.

We contend that by harnessing the best abilities of both human and machine, we can arrive at an assistive robot that is feasible, usable and, most importantly, practical.


MUSIIC has four main components:

Stereovision Component

A pair of cameras takes a stereoscopic snapshot of the domain of interest and obtains information about the shape, pose and location of objects in the domain.
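For intuition, depth recovery from a stereo pair can be sketched with the standard pinhole model, where depth Z = f * B / d for focal length f, camera baseline B and pixel disparity d. The numbers below are illustrative, not MUSIIC's calibration:

```python
# Minimal stereo-depth sketch under the parallel-camera pinhole model.
# Parameter values are illustrative assumptions, not the actual system's.
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (metres) of a point matched in both camera images."""
    if disparity_px <= 0:
        raise ValueError("zero or negative disparity: point at infinity or bad match")
    return focal_px * baseline_m / disparity_px

# A 700 px focal length, 12 cm baseline and 35 px disparity give a depth of 2.4 m.
print(depth_from_disparity(700.0, 0.12, 35.0))  # 2.4
```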

Multimodal Interface

The user instructs the robot arm by means of a multimodal interface that combines speech and gesture. For gesture, we currently focus on deictic gesture, i.e., pointing. This allows identification of the user's focus of interest without having to perform the computationally difficult process of real-time object identification.

The speech input is a restricted subset of Natural Language [NL] and is called Pseudo Natural Language [PNL]. The basic grammar has the following structure:

Verb Object Subject

Since this is a manipulative domain, the language structure is essentially an action on an object to/at a location. As an example, we may have inputs like:

Put that right of the blue book.
Insert the straw into the cup.

A parser then derives the semantics of the user's intention from the combined speech and gesture input, and this semantic interpretation is passed on to the planner.
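The fusion of a PNL utterance with a deictic gesture might be sketched as follows. The grammar subset, vocabulary and function names here are illustrative assumptions, not the actual MUSIIC parser:

```python
# Hypothetical sketch of PNL parsing with deictic resolution; the
# vocabulary and structure are assumptions for illustration only.
VERBS = {"put", "insert", "move"}
PREPOSITIONS = {"into", "onto", "right of", "left of", "on", "in"}

def parse_pnl(utterance, pointed_object=None):
    """Parse a Verb-Object-Location command into a semantic frame.

    `pointed_object` stands in for the gesture channel: it resolves
    demonstratives such as "that" to the object the user pointed at.
    """
    words = utterance.lower().rstrip(".").split()
    verb = words[0]
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    rest = " ".join(words[1:])
    # Split the object phrase from the location phrase at the first
    # preposition; check longer prepositions first ("into" before "in").
    for prep in sorted(PREPOSITIONS, key=len, reverse=True):
        marker = f" {prep} "
        if marker in rest:
            obj, loc = rest.split(marker, 1)
            break
    else:
        obj, loc, prep = rest, None, None
    # Resolve the deictic reference using the gesture input.
    if obj.strip() in {"that", "this"} and pointed_object is not None:
        obj = pointed_object
    return {"verb": verb, "object": obj.strip(),
            "relation": prep, "location": loc}

print(parse_pnl("Put that right of the blue book.", pointed_object="red cup"))
# {'verb': 'put', 'object': 'red cup', 'relation': 'right of',
#  'location': 'the blue book'}
```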

Adaptive Planner

The adaptive planner executes user intentions by generating the appropriate robot control commands. It reduces the cognitive load on the user by taking over most high-level as well as all low-level planning tasks. The planner can also adapt to changing situations and events, and it either replans on its own or interacts with the user when it is unable to construct a plan by itself.
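The replan-or-ask behavior described above can be sketched as a control loop. The function names (`make_plan`, `execute`, `ask_user`) are placeholders, not MUSIIC's actual interfaces:

```python
# Hedged sketch of an adaptive replan-or-ask loop; all callbacks are
# hypothetical stand-ins for the real planner, executor and interface.
def run_intention(intention, world, make_plan, execute, ask_user, max_replans=3):
    """Try to satisfy a user intention, replanning on failure and
    deferring to the user when autonomous replanning is exhausted."""
    for _ in range(max_replans):
        plan = make_plan(intention, world)
        if plan is None:
            break                       # cannot plan at all: ask the user
        ok, world = execute(plan, world)
        if ok:
            return world                # intention satisfied
        # Execution failed (the world changed); loop to replan with the
        # updated world state.
    # Autonomous planning exhausted: interact with the user for guidance.
    intention = ask_user(intention, world)
    plan = make_plan(intention, world)
    if plan is None:
        raise RuntimeError("no plan found even with user guidance")
    ok, world = execute(plan, world)
    return world
```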

Object Oriented Knowledge Bases

Underlying MUSIIC are a pair of knowledge bases. One is a user-extensible knowledge base of actions, both primitive and complex. The other is a knowledge base of objects, which organizes information about objects in a four-tier abstraction hierarchy.

Generic Blob
This is the top tier of the hierarchy and contains just enough information to allow the construction of correct plans based solely on what is obtained from the vision system, with no a priori information about the objects in question.
Shape Based
At this level, objects are classified in terms of general shapes, such as cuboid, cylindrical, conical, etc. This classification was chosen because, in a manipulative domain, the shape of objects affects the planning process the most.
Object Type
This tier is derived from the previous one and groups together classes of objects, such as cups, straws, pencils and boxes. Information at this tier is used to make plans that more closely reflect user intentions; it also constrains planning parameters such as approach position, grasp position and orientation to maintain. The remaining information needed for correct plan construction is obtained from the vision system.
Instantiated Object
This is the final tier of the abstraction hierarchy and contains information about actual objects in the domain.

The knowledge bases allow the planner not only to make plans for manipulating objects about which it knows nothing a priori except what the vision system provides, but also to make plans that interpret user intentions more accurately when additional information is available.
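The four tiers above can be sketched as a class hierarchy, with each tier refining the one above it. The class and attribute names are assumptions for illustration, not the MUSIIC schema:

```python
# Illustrative sketch of the four-tier object abstraction hierarchy;
# class and field names are hypothetical, not the actual knowledge base.
from dataclasses import dataclass, field

@dataclass
class GenericBlob:
    """Top tier: only what the stereo vision system reports."""
    pose: tuple          # (x, y, z, roll, pitch, yaw) from vision
    bounding_box: tuple  # rough extents; no a priori knowledge

@dataclass
class ShapeBased(GenericBlob):
    """Second tier: classified by general shape (cuboid, cylinder, ...)."""
    shape: str = "unknown"

@dataclass
class ObjectType(ShapeBased):
    """Third tier: object classes (cup, straw, box) that constrain planning."""
    type_name: str = "unknown"
    grasp_hints: dict = field(default_factory=dict)  # e.g. approach, orientation

@dataclass
class InstantiatedObject(ObjectType):
    """Final tier: an actual object observed in the current domain."""
    object_id: int = 0

# An instantiated cup combines vision data with type-level grasp constraints.
cup = InstantiatedObject(
    pose=(0.3, 0.1, 0.0, 0, 0, 0),
    bounding_box=(0.08, 0.08, 0.12),
    shape="cylinder",
    type_name="cup",
    grasp_hints={"approach": "side", "keep_upright": True},
    object_id=7,
)
print(cup.type_name, cup.grasp_hints["keep_upright"])
```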

Test Bed

A test bed has been constructed to validate our proof of concept. The backbone computing machine for the vision interface is an SGI XS-24 IRIS Indigo. Pictures are taken by two VCP-920 color CCD cameras with a resolution of 450 TV lines and 768×494 picture elements; each camera is equipped with a motorized TV zoom lens, model Computer M61212MSP. The cameras are connected to the SGI Galileo graphics board, which provides up to three input channels; in this system we use two channels with S-Video inputs. The Noesis Visilog-4 software package installed on the SGI machine serves as the image-processing engine for developing the vision interface software. The speech system is Dragon Dictate running on a PC. A six-degree-of-freedom robot manipulator, the Zebra ZERO, is employed as the manipulation device. The planner and knowledge bases reside on a Sun SPARCstation 5, and communication between the planner and the subsystems is supported by the RPC (Remote Procedure Call) protocol.
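The planner-to-subsystem call pattern can be sketched as follows. The original test bed used RPC between a Sun and the other machines; here Python's xmlrpc module serves as a modern illustrative stand-in, and the subsystem function is a made-up example:

```python
# Illustrative stand-in for planner/subsystem RPC: the original system
# used classic RPC; xmlrpc here just sketches the same call pattern.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def locate_object(name):
    """Pretend vision subsystem: return a pose for a named object."""
    return {"name": name, "x": 0.3, "y": 0.1, "z": 0.0}

# Start a subsystem server on an OS-assigned port in a background thread.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(locate_object)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The planner calls the vision subsystem remotely, as it would over RPC.
vision = ServerProxy(f"http://127.0.0.1:{port}")
pose = vision.locate_object("blue book")
print(pose["name"], pose["x"])
server.shutdown()
```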


We claim that the MUSIIC approach is one of the better ways to extend the functionality of a person with a disability, given that our system endows the user with high-level supervisory control of the robot while the system itself handles the low-level details of perception and planning.




Last Updated: June 15, 1996 Zunaid Kazi <kazi@asel.udel.edu>