Enabling Robots to Understand Incomplete Natural Language Instructions Using Commonsense Reasoning
Enabling robots to understand instructions provided via spoken natural language would facilitate interaction between robots and people in a variety of settings in homes and workplaces. However, natural language instructions are often missing information that would be obvious to a human based on environmental context and common sense, and hence does not need to be explicitly stated. In this paper, we introduce Language-Model-based Commonsense Reasoning (LMCR), a new method which enables a robot to listen to a natural language instruction from a human, observe the environment around it, and automatically fill in information missing from the instruction using environmental context and a new commonsense reasoning approach. Our approach first converts an instruction provided as unconstrained natural language into a form that a robot can understand by parsing it into verb frames. Our approach then fills in missing information in the instruction by observing objects in its vicinity and leveraging commonsense reasoning. To learn commonsense reasoning automatically, our approach distills knowledge from large unstructured textual corpora by training a language model. Our results show the feasibility of a robot learning commonsense knowledge automatically from web-based textual corpora, and the power of learned commonsense reasoning models in enabling a robot to autonomously perform tasks based on incomplete natural language instructions.
Natural language is inherently unstructured and often reliant on common sense to understand, which makes it challenging for robots to correctly and precisely interpret natural language. Consider a scenario in a home setting in which a robot is holding a bottle of water and there are scissors, a plate, some bell peppers, and a cup on a table (see Fig. 1). A human gives an instruction, “pour me some water”, to the robot. This instruction is incomplete from the robot’s perspective since it does not specify where the water should be poured, but for a human, it might be obvious that the water should be poured into the cup. A robot that has the common sense to automatically resolve such incompleteness in natural language instructions, just as humans do intuitively, will allow humans to interact with it more naturally and increase its overall usefulness. To this end, we introduce Language-Model-based Commonsense Reasoning (LMCR), a new approach which enables a robot to listen to a natural language instruction from a human, observe the environment around it, automatically resolve missing information in the instruction, and then autonomously perform the specified task.
The core problem we are addressing is enabling a robot to understand incomplete natural language instructions with the help of commonsense reasoning, particularly handling cases in which an argument of the instruction’s verb is missing. Solving this problem requires two steps: (1) identify if and how an instruction is incomplete, and (2) complete the instruction using knowledge of the objects in the robot’s environment.
For the first step (identifying incomplete instructions), we parse the natural language instruction into a structured representation, referred to as a verb frame. A verb frame is a tuple containing a predicate (i.e., a verb or verb phrase) and a set of semantic roles and their associated content [jurafsky2014speech]. For example, LMCR automatically parses the instruction “pour me some water” to the verb frame (pour, Theme: water, Destination: ?), where “pour” is the predicate, “water” and “?” are arguments that help complete the meaning of a predicate, and “Theme” and “Destination” are semantic roles which specify the underlying relationship between arguments and the predicate. The empty tag ? indicates that the argument of Destination is missing. Under such a representation, incomplete instructions can be easily identified as not all roles in the verb frame are filled with content from the instruction. The task of resolving incomplete instructions then becomes filling the missing role with objects in the environment.
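The verb frame representation described above can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the class name `VerbFrame`, its methods, and the use of `None` for the empty tag "?" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: None stands in for the empty tag "?" used in the text.
MISSING = None


@dataclass
class VerbFrame:
    """A predicate plus a mapping from semantic roles to arguments."""
    predicate: str
    roles: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_roles(self) -> List[str]:
        """Return the roles whose argument was not stated in the instruction."""
        return [role for role, arg in self.roles.items() if arg is MISSING]

    def is_complete(self) -> bool:
        """A frame is complete when every role has an argument."""
        return not self.missing_roles()


# "pour me some water" -> (pour, Theme: water, Destination: ?)
frame = VerbFrame("pour", {"Theme": "water", "Destination": MISSING})
print(frame.missing_roles())
```

Under this representation, checking whether an instruction is incomplete reduces to checking whether `missing_roles()` is non-empty.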
For the second step (completing an incomplete instruction), we note that people are more likely to omit information from an instruction if it is obvious to the listener, so the correct role filler should be the one that yields a complete verb frame with the highest probability among all possible combinations. Inspired by this, LMCR uses a neural network based language model, which acts as a probability distribution over sequences of words. After training on textual corpora containing descriptions of common household tasks, our language model is able to assign higher probabilities to candidate verb frames that correspond to more common complete instructions, such as (pour, Theme: water, Destination: cup). This language model, combined with limiting the missing arguments to objects in the robot’s vicinity, enables the robot to automatically fill in missing information in natural language instructions via common sense.
We incorporate the above language understanding pipeline into a robot as shown in Fig. 2. The robot gets instructions and environmental information via the Speech Recognition and Detection modules respectively, processes the inputs via LMCR, and executes the specified task via the Motion Planning module. A video of an LMCR-enabled robot is provided in Supplementary Materials. We also quantitatively evaluate LMCR on a human-annotated dataset collected as part of this work. We compile a novel dataset as existing datasets on commonsense reasoning such as [chao2015mining, warren2015comprehending, pylkkanen2007meg] consist mostly of general-purpose verbs and nouns and are not aimed specifically at robot manipulation applications, making them unsuitable for evaluating our method. In this work we focus on kitchen assistance tasks (e.g., blend, pour, sprinkle), but the same pipeline can be extended to other scenarios given relevant training data. The results show that incorporating commonsense knowledge via a language model approach enables a robot to understand and perform a task based on incomplete instructions, enabling more natural human-robot interaction.
II Related Work
Reasoning using commonsense knowledge to understand incomplete natural language instruction has been studied in a variety of contexts. For example, Bolt et al. [bolt1980put] presented a robotic system that could leverage deictic reference or pointing gestures to understand human instructions in a situated human-robot interaction setting. Recent years have seen systems like Prac [nyga2018cloud] and RoboBrain [saxena2014robobrain] that have the ability to leverage world knowledge to understand natural language instructions. However, these systems tend to rely on graph-based knowledge representations. For example, Prac considered a similar commonsense reasoning problem as ours, aiming at inferring the most probable executable action in a given context, but the knowledge is encoded in a Prac knowledge base, which is constructed from manually annotated clauses found in natural language recipes. LMCR, by contrast, uses a neural network language model and is based on the intuition that world knowledge is implicitly encoded in textual corpora. The idea is adapted from recent works in neural language models such as ELMo [peters2018deep], OpenAI GPT [radford2018improving], and BERT [devlin2018bert], using a pre-trained language model to improve the performance of various downstream applications, including commonsense reasoning. These applications show that neural network language models are well suited to encoding and extracting knowledge that exists in large language corpora.
To understand natural language instructions, a robot has to extract a semantically meaningful representation of natural language and ground it to the perceptual elements and actions in its environment. This process is referred to as language grounding [matuszek2018IJCAI]. Several approaches have been proposed for language grounding, which can be broadly divided into probabilistic models [howard2014natural, hemachandra2015learning, paul2017grounding, paul2018efficient] and deterministic models [thomas2012roboframenet, misra2016tell, misra2015environment, thomason2019improving]. These approaches seek to find an intermediate representation in order to bridge natural language and machine commands. To bridge this gap, the probabilistic models employ a probabilistic graphical model approach, while the deterministic models employ a frame-like structure. Our proposed model falls into the deterministic model category. However, the related work mentioned above does not consider grounding unstated concepts with the help of commonsense world knowledge. Recently, due to the advancement of deep neural networks, several works use sequence learning and reinforcement learning to directly map text to actions, skipping the need for an intermediate representation of instructions [janner2018representation, blukis2018mapping, shah2018follownet, wang2018reinforced, das2017embodied]. However, they either consider only navigation tasks, or a simple simulated environment, where the possible actions are limited. In contrast, our method generalizes to any task domain as we can easily extend the set of our verb frame representations by adding more frames to our training corpus.
Affordance can be defined as knowledge of an object’s functionality, and understanding affordances is crucial for a robot to recognize human activities, interact with the environment, and achieve its goals [chao2015mining]. Previous research on affordance can be primarily divided into two categories, namely, visual affordance and semantic affordance. Our work is closely related to semantic affordance [zhu2014reasoning, chao2015mining], which seeks to model the possible actions that can be conducted on an object. However, these works only model single verb-object pairs. We extend the dependency by using verb frames, which allows us to make inferences on object affordances conditioned on both the predicate and other roles.
III-A Problem Definition
The robot receives a spoken instruction from the user as input. Our Speech Recognition module, shown in Fig. 2, transcribes the audio of spoken language into text, which we specify as a sequence of tokens representing words, $s = (w_1, \ldots, w_T)$. We use the Google Cloud API [speech2text] for the transcription. The robot also receives input from its RGB-D sensors. Our Detection module in Fig. 2 detects instances of certain classes of objects and their positions in the input RGB-D image. This module can be implemented by an object detector, such as Mask R-CNN [he2017mask]. The Detection module outputs a list $O$ of relevant objects in the vicinity of the robot, along with their associated positions.
We represent actions that the robot can perform using verb frames. Following the convention in the frame semantic parsing literature [das2014frame, hermann2014semantic], we define a verb frame as $f = (v, r_1{:}\,a_1, \ldots, r_m{:}\,a_m)$, where $v$ denotes the predicate and $r_i$ and $a_i$ denote the $i$'th role and its argument, respectively. The predicate $v \in V$ represents an action, where $V$ is the set of actions that the robot can perform (e.g., “pour”, “brush”, as summarized in the left column of Table I for our robot). We focus on predicates (actions) that take two arguments, so we simplify the verb frame to its two-argument specification $f = (v, r_1{:}\,a_1, r_2{:}\,a_2)$. In our work, the roles $r_i$ are drawn from a fixed, pre-defined set of role labels and are a function of the predicate. At the same time, each argument $a_i$ is drawn from a fixed vocabulary (i.e., a set of words) $W$. In our experiments, the labels (e.g., ‘apple’, ‘banana’) of detected objects in a testing scenario come from this vocabulary $W$.
The problem we study is translating the possibly incomplete input instruction into a complete verb frame with all arguments filled in, using the detected object list to help fill in the missing arguments. The robot’s motion planner can then execute this verb frame. Below, we first describe our approach for identifying missing arguments using verb frames (Sec. III-B). We focus on the case where the human-provided instruction is missing one of the two roles. We then introduce our approach to completing an incomplete verb frame using common sense via a neural-network-based language model (Sec. III-C), which enables the robot to plan a motion to accomplish the desired task (Sec. III-D).
III-B Identifying Incomplete Instructions
To identify if and how an instruction is incomplete, we parse the natural language instruction into a sequence of verb frames. The Predicate-Argument Parsing module in Fig. 2 takes the sequence of tokens as input and outputs a sequence of verb frames. We use an off-the-shelf semantic role labeling (SRL) model [he2017deep] to parse the sentence into verb frames, which provide us with a predicate-argument structure. Since some arguments may be missing from the instruction, we augment the vocabulary to include an empty token, which is used to indicate a missing argument. Using parsed verb frames, an incomplete instruction can be identified as one having an empty token for one of its roles. The problem of resolving an incomplete instruction then becomes filling the missing role with an object from the environment.
III-C Completing an Incomplete Instruction Using Common Sense
Given an incomplete verb frame with one missing role and a list of objects in the robot’s environment, we formalize the task of commonsense reasoning as finding the most appropriate role filler and outputting a complete verb frame. This problem can be further treated as ranking a list of complete verb frames, as we can easily iterate over the object list to create all possible candidate verb frames that are feasible in the current environment. Thus, we implement commonsense reasoning as a scoring function $S(f)$, where $f$ is a complete verb frame. From the list of candidate verb frames, we pick the one with the highest score as the output verb frame. We refer to the score as a plausibility score. The job of the commonsense reasoning method is then to define the scoring function $S$.
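The ranking formulation above can be sketched in a few lines. This is an illustrative sketch only: the tuple-based frame type, the function names, and the hand-written `toy_score` table are assumptions; in LMCR the score would come from the language model rather than a lookup table.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a frame is (predicate, {role: argument}), with None for a missing argument.
Frame = Tuple[str, Dict[str, Optional[str]]]


def candidate_frames(frame: Frame, objects: List[str]) -> List[Frame]:
    """Fill the single missing role with each detected object in turn."""
    predicate, roles = frame
    missing = [role for role, arg in roles.items() if arg is None]
    if len(missing) != 1:
        raise ValueError("expected exactly one missing role")
    role = missing[0]
    return [(predicate, {**roles, role: obj}) for obj in objects]


def complete_frame(frame: Frame, objects: List[str],
                   score: Callable[[Frame], float]) -> Frame:
    """Return the candidate frame with the highest plausibility score."""
    return max(candidate_frames(frame, objects), key=score)


# Hypothetical plausibility scores, for illustration only.
TOY_SCORES = {"cup": 0.9, "plate": 0.3, "scissors": 0.01}


def toy_score(frame: Frame) -> float:
    return TOY_SCORES.get(frame[1]["Destination"], 0.0)


incomplete = ("pour", {"Theme": "water", "Destination": None})
detected = ["scissors", "plate", "bell pepper", "cup"]
best = complete_frame(incomplete, detected, toy_score)
print(best)
```

Restricting candidates to detected objects keeps the ranking small and guarantees that the chosen frame is executable in the current scene.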
To compute the plausibility score, we note that people are more likely to omit information from an instruction if it is obvious to the listener, so the correct role filler should be the one that yields a complete verb frame with the highest probability among all possible combinations. To this end, we use a language model (LM) whose goal is to predict the probability of a word sequence (we assume that a word sequence with a higher probability of appearing is more plausible). A language model factorizes the probability according to the chain rule. Using $s = (w_1, \ldots, w_T)$ to denote an entire sentence with $T$ tokens, the chain rule can be written as
$$P(s) = P(w_1, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \ldots, w_{i-1}),$$
where $P(w_i \mid w_1, \ldots, w_{i-1})$ is the conditional probability of the word $w_i$ given the previous words. In the Language Model Reasoning module of our work, we model this conditional probability using a recurrent neural network (RNN) [mikolov2010recurrent].
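To make the chain-rule factorization concrete, the sketch below scores sentences as a sum of log conditional probabilities. It deliberately swaps in a count-based bigram model with add-one smoothing in place of the RNN, so each word is conditioned only on the previous word rather than on the full history; the class name `BigramLM` and the three-sentence corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List


class BigramLM:
    """Toy stand-in for the RNN LM: P(w_i | history) is approximated by P(w_i | w_{i-1})."""

    def __init__(self, corpus: List[List[str]]):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"</s>"}
        for sentence in corpus:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.vocab.update(sentence)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def cond_prob(self, cur: str, prev: str) -> float:
        # Add-one smoothing so unseen word pairs keep a small non-zero probability.
        return (self.bigram_counts[(prev, cur)] + 1) / (
            self.context_counts[prev] + len(self.vocab))

    def log_prob(self, sentence: List[str]) -> float:
        # Chain rule in log space: log P(s) = sum_i log P(w_i | previous words).
        tokens = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(self.cond_prob(cur, prev))
                   for prev, cur in zip(tokens, tokens[1:]))


corpus = [
    "pour water into the cup".split(),
    "pour milk into the cup".split(),
    "cut pepper on the plate".split(),
]
lm = BigramLM(corpus)
plausible = lm.log_prob("pour water into the cup".split())
implausible = lm.log_prob("pour water into the scissors".split())
print(plausible > implausible)
```

Working in log space avoids numerical underflow for long sequences and leaves the ranking of candidates unchanged.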
Following recent progress in the study of language models, we also tried more advanced pre-trained language models such as ELMo [peters2018deep] and BERT [devlin2018bert]. However, we did not empirically find a significant difference between these language models when used in our approach, so we adopt the simplest, RNN-based LM as our model here.
Note that the language model operates on a sequence of words, but verb frames are a structured representation of language. We thus need to serialize the candidate complete verb frames into a sequence of words, a process known as linearization [filippova2009tree, konstas2017neural]. We propose two linearization methods in this work. The first is to concatenate the predicate and all arguments directly, i.e., to treat $(v, a_1, a_2)$ as a sequence. This results in unnatural-sounding word sequences. The second is to make a more natural sentence from the frame using a rule-based approach. With these two approaches, (pour, Theme: water, Destination: cup) is converted to “pour water cup” with the former approach and “pour water to the cup” with the latter. We refer to the LM trained and tested with the former approach as the frame-based LM and the latter as the sentence-based LM. For both, the sequence format needs to be consistent during training and inference to get the best performance. As our training corpus contains natural language sentences, we can use them to train the sentence-based LM directly, while the frame-based LM requires predicate-argument parsing on the entire training corpus as a pre-processing step.
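The two linearization strategies can be sketched as follows. The paper states only that the sentence-based variant is rule-based, so the per-predicate template table `SENTENCE_TEMPLATES` and both function names are hypothetical; the single template is chosen to reproduce the example given in the text.

```python
from typing import Dict, Sequence


def linearize_frame(predicate: str, arguments: Sequence[str]) -> str:
    """Frame-based linearization: concatenate the predicate and its arguments."""
    return " ".join([predicate, *arguments])


# Assumption: one hand-written template per predicate.
SENTENCE_TEMPLATES: Dict[str, str] = {
    "pour": "pour {0} to the {1}",
}


def linearize_sentence(predicate: str, arguments: Sequence[str]) -> str:
    """Sentence-based linearization: fill a rule-based template."""
    return SENTENCE_TEMPLATES[predicate].format(*arguments)


# (pour, Theme: water, Destination: cup)
frame_text = linearize_frame("pour", ["water", "cup"])
sentence_text = linearize_sentence("pour", ["water", "cup"])
print(frame_text)
print(sentence_text)
```

Whichever format is used to linearize candidates at inference time should match the format of the sequences the LM was trained on.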
III-D Motion Planning for a Complete Verb Frame
The motion planner takes as input a complete verb frame $f$ and the positions of relevant objects in the detected object list $O$ and computes a motion for the robot that executes the task specified by the verb frame. For each predicate $v$, we define a motion planner parameterized by its arguments. In our implementation, each motion planner is defined by a series of waypoints for the end-effector. Each waypoint is defined in a coordinate system relative to the position of a task-relevant object in $O$ [Bowen2015_TASE], which enables the robot to plan motions that are robust to the movement of the objects in the environment. Reaching these waypoints in sequence executes the action. We use the motion planning toolkit MoveIt! [moveit2018] to compute the movement of the robot arm given the relative waypoints.
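The idea of object-relative waypoints can be sketched as below. This is a simplified illustration under strong assumptions: waypoints are translation-only offsets (orientation is ignored), the offsets in `POUR_WAYPOINTS` are made-up values, and the resolved positions would still need to be passed to a real planner such as MoveIt!, which is not called here.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
# Assumption: each waypoint is (role of the reference object, offset from that object).
RelativeWaypoint = Tuple[str, Vec3]

# Hypothetical end-effector waypoints for "pour", relative to the Destination object.
POUR_WAYPOINTS: List[RelativeWaypoint] = [
    ("Destination", (0.0, 0.0, 0.25)),      # approach above the container
    ("Destination", (0.0, 0.125, 0.125)),   # lower and shift to the pouring pose
]


def resolve_waypoints(relative: List[RelativeWaypoint],
                      role_to_object: Dict[str, str],
                      positions: Dict[str, Vec3]) -> List[Vec3]:
    """Convert object-relative waypoints into world-frame positions."""
    world = []
    for role, (dx, dy, dz) in relative:
        x, y, z = positions[role_to_object[role]]
        world.append((x + dx, y + dy, z + dz))
    return world


arguments = {"Theme": "water", "Destination": "cup"}
path = resolve_waypoints(POUR_WAYPOINTS, arguments, {"cup": (0.5, 0.25, 0.0)})
moved = resolve_waypoints(POUR_WAYPOINTS, arguments, {"cup": (0.0, 0.0, 0.0)})
print(path)
```

Because the waypoints are stored relative to the object, moving the cup changes the resolved path without any change to the stored action definition.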
III-E Comparison Methods for Commonsense Reasoning Evaluation
As described above, LMCR gives a score to each complete verb frame in a generated list, and the frame with the highest score is chosen as the output. We use $S(f)$ to denote the score that a method assigns to a complete verb frame $f$. In Sec. V, we compare the scoring function of our method LMCR against those of co-occurrence, Word2Vec, and ConceptNet, described below.
Co-occur. This is shorthand for “co-occurrence”. Chao et al. [chao2015mining] used co-occurrence in a textual corpus to determine the relatedness of a verb-object pair. We extend this to determine the relatedness of a verb frame, scoring a frame by the total normalized co-occurrence of its elements in the training text corpora.
The normalized co-occurrence of two words is computed from the number of occurrences of each word individually in the corpus and the count of the two words co-occurring in the same sentence.
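The counts that feed the co-occurrence baseline can be gathered as sketched below. The counting follows the description above (individual occurrences and same-sentence co-occurrences), but the normalization used here, dividing the joint count by the product of the individual counts, is an assumption made for illustration and may differ from the one used in the paper; counting each word at most once per sentence is likewise an assumption.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def count_occurrences(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count per-sentence word occurrences and same-sentence co-occurrences."""
    single, pair = Counter(), Counter()
    for sentence in sentences:
        words = set(sentence)  # assumption: count each word once per sentence
        single.update(words)
        pair.update(frozenset(p) for p in combinations(sorted(words), 2))
    return single, pair


def normalized_cooccurrence(x: str, y: str, single: Counter, pair: Counter) -> float:
    """ASSUMED normalization: joint count divided by the product of individual counts."""
    denominator = single[x] * single[y]
    return pair[frozenset((x, y))] / denominator if denominator else 0.0


corpus = [
    "pour water into the cup".split(),
    "pour milk into the cup".split(),
    "cut pepper on the plate".split(),
]
single, pair = count_occurrences(corpus)
print(normalized_cooccurrence("pour", "cup", single, pair))
print(normalized_cooccurrence("pour", "plate", single, pair))
```

A frame-level score could then combine these pairwise values over the frame's elements.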
Word2Vec. Chao et al. [chao2015mining] also used Word2Vec as one of their affordance mining methods. Similarly, we extend it to work on verb frames by defining the scoring function in terms of the Euclidean distance between the word embeddings of the frame’s elements. We use GloVe embeddings [pennington2014glove] for this comparison.
ConceptNet. Systems like PRAC and RoboBrain use knowledge graphs and conduct probabilistic inference on the graph for instruction completion. Similarly, we use ConceptNet [speer2017conceptnet], which is a large-scale commonsense knowledge graph. We use the relatedness score provided by the ConceptNet API [conceptnet5api], and compute the score for a frame from the ConceptNet relatedness scores [speer2017conceptnet] of the frame’s elements.
Training Data for the Language Model
The training data for LMCR’s language model comes from textual corpora, which can be treated as the knowledge source of the method. We use YouCook2 [zhou2017procnets] and Now You’re Cooking (NYC) [nyc2013] as training corpora. YouCook2 is a large instructional video dataset designed to facilitate video captioning research. The cooking steps for each video are annotated with temporal boundaries and described by imperative English sentences, resulting in around raw descriptions of cooking actions. NYC contains over recipes, each containing a step-by-step description of how to execute the recipe. Although NYC is much larger in size, it contains unrelated information such as ingredient lists and comments, which is more similar to what we can get directly by crawling web data. In our experiment, we deliberately keep this extraneous information in order to determine if the language model can distill commonsense knowledge required in human-robot interaction from noisy textual corpora.
Testing Data Based on Human Judgments
In order to quantitatively evaluate LMCR’s commonsense reasoning for robotic assistance instructions, we created a new human-generated dataset, since existing datasets on commonsense reasoning [chao2015mining, warren2015comprehending, pylkkanen2007meg] are not specific to our domain of filling in missing information in instructions for robotic assistance tasks. We show our data collection process in Fig. 3. We provided human annotators on Amazon Mechanical Turk (AMT) [amt] with sentences representing complete verb frames and asked them to give a plausibility rating for each of them on a scale from 1 (most implausible) to 5 (most plausible) (see the figure for examples). We collected 5 annotations for each complete verb frame in our dataset.
In our experiments, for each predicate we split the verb frames into positive, ambiguous, and negative subsets using a plausibility threshold $\tau$. Namely, a complete verb frame is assigned to a subset by comparing its average plausibility rating $\bar{r}$ against $\tau$: frames rated sufficiently high form the positive subset, frames rated sufficiently low form the negative subset, and the remaining frames form the ambiguous subset. We then randomly pick one frame from the positive subset and $k-1$ frames from the negative subset (we restrict each test scenario to one correct answer for the convenience of evaluation), keeping the predicate and one of the arguments the same and varying the other argument. In this way, we can construct a test scenario with $k$ candidates, where exactly one of them is plausible based on the human annotation.
In our experiments, we vary the plausibility threshold $\tau$ and the number of candidates $k$ to create test scenarios of different difficulty. A larger $\tau$ brings more ambiguous frames (whose plausibility even humans are unsure about) into the positive and negative subsets. A larger $k$ introduces more candidates in a single scenario. In both cases, the test scenarios become more challenging.
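The scenario construction described above can be sketched as follows. The exact threshold rule is not recoverable from the text, so the rule used here (positive if the mean rating is at least `5 - tau`, negative if it is at most `1 + tau`) is an assumption chosen to match the stated behavior that a larger threshold moves ambiguous frames into the positive and negative subsets. The ratings, function names, and parameter names are all hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Optional, Tuple


def split_by_plausibility(ratings: Dict[str, List[int]], tau: float
                          ) -> Tuple[List[str], List[str], List[str]]:
    """Split candidate fillers into positive, ambiguous, and negative subsets."""
    positive, ambiguous, negative = [], [], []
    for filler, scores in ratings.items():
        average = mean(scores)
        if average >= 5 - tau:      # ASSUMED rule for the positive subset
            positive.append(filler)
        elif average <= 1 + tau:    # ASSUMED rule for the negative subset
            negative.append(filler)
        else:
            ambiguous.append(filler)
    return positive, ambiguous, negative


def build_scenario(ratings: Dict[str, List[int]], tau: float, k: int,
                   rng: random.Random) -> Optional[Tuple[str, List[str]]]:
    """Pick one positive and k-1 negative fillers; None if there are too few."""
    positive, _, negative = split_by_plausibility(ratings, tau)
    if not positive or len(negative) < k - 1:
        return None
    truth = rng.choice(positive)
    candidates = [truth] + rng.sample(negative, k - 1)
    rng.shuffle(candidates)
    return truth, candidates


# Hypothetical ratings for the Destination of (pour, Theme: water, Destination: ?).
ratings = {
    "cup": [5, 5, 4, 5, 5],
    "bowl": [4, 4, 3, 4, 4],
    "plate": [3, 2, 3, 3, 2],
    "scissors": [1, 1, 1, 2, 1],
    "knife": [1, 1, 1, 1, 1],
    "towel": [1, 2, 1, 1, 2],
}
subsets = split_by_plausibility(ratings, tau=1.0)
scenario = build_scenario(ratings, tau=1.0, k=3, rng=random.Random(0))
print(subsets)
```

Fixing the random seed keeps the sampled scenarios reproducible across runs.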
Comparison with Other Methods
Based on the collected human judgment dataset, we compare the proposed LMCR approach (using the sentence-based LM trained on both the YouCook2 and Now You’re Cooking datasets) with the baseline methods Co-occur, Word2Vec, and ConceptNet, described above, as well as with a uniform random choice (Random). Each method defines a scoring function $S(f)$ for a complete verb frame $f$. Given a test scenario containing $k$ candidate verb frames, a successful prediction gives the highest score to the ground truth, namely the one sampled from the positive subset. We consider the 11 verbs listed in the leftmost column of Table I, vary the plausibility threshold $\tau$ and the number of objects $k$ in the list to create scenarios of various difficulties, and report the accuracy (success rate). Table I gives the overall and per-predicate accuracy for a fixed $k$ and plausibility threshold $\tau$, and Fig. 4A and Fig. 4B show the results when varying these two parameters. LMCR performs consistently better than the other methods when considering all actions (predicates), for all variations of $\tau$ and $k$, although some methods show better performance on specific individual predicates. The results suggest that, overall, LMCR better encodes the type of commonsense reasoning we are addressing in this work.
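The accuracy metric used in this comparison can be sketched as below. The scenario format (ground-truth sentence plus a candidate list) and the hand-written score table are assumptions for illustration; any of the compared scoring functions could be passed in as `score`.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a scenario is (ground-truth sentence, list of candidate sentences).
Scenario = Tuple[str, Sequence[str]]


def accuracy(scenarios: List[Scenario], score: Callable[[str], float]) -> float:
    """Fraction of scenarios in which the top-scoring candidate is the ground truth."""
    if not scenarios:
        raise ValueError("no scenarios to evaluate")
    hits = sum(1 for truth, candidates in scenarios
               if max(candidates, key=score) == truth)
    return hits / len(scenarios)


# Hypothetical scores standing in for a method's scoring function.
TOY_SCORES = {
    "pour water to the cup": 0.9,
    "pour water to the plate": 0.3,
    "pour water to the scissors": 0.01,
    "sprinkle salt on the beef": 0.2,
    "sprinkle salt on the cup": 0.4,
}

scenarios = [
    ("pour water to the cup",
     ["pour water to the scissors", "pour water to the cup", "pour water to the plate"]),
    ("sprinkle salt on the beef",
     ["sprinkle salt on the cup", "sprinkle salt on the beef"]),
]
result = accuracy(scenarios, TOY_SCORES.__getitem__)
print(result)
```

In this toy example the first scenario is answered correctly and the second is not, so the accuracy is one half.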
Comparison under Different Training Settings
We also compare several different ways of training the language model (LM) used by the Language Model Reasoning module of LMCR. To do so we train the LM with two different linearization strategies, namely, frame-based (Frame) and sentence-based (Sent.) linearization. We also train the LM on different combinations of training corpora: YouCook2 data only (YouCook2), Now You’re Cooking data only (NYC), and the combination of the two (All data). Results for these comparisons are shown in Fig. 4C and Fig. 4D, which vary the two test-scenario parameters (the plausibility threshold and the number of candidates). As shown in the results, the sentence-based LM generally performs better than the frame-based LM. We suspect this is because the former is trained end-to-end while the latter requires generating training data from an upstream predicate-argument parser, whose errors may propagate to the training process of the frame-based LM. Also, the parser cannot generate a frame for predicates that are not in its verb vocabulary, even though these relatively rare predicates can be beneficial when learning others. For example, “scatter some salt on the beef” would help with the learning for “spread” and “sprinkle” as they can be synonyms. While the sentence-based LM can take advantage of this, the frame-based schema loses this information in the training data, since “scatter” is not among the 11 verbs we consider. For the sentence-based LM, the models trained on NYC and on all data (YouCook2+NYC) perform similarly, and both are better than the model trained on YouCook2 alone. For the frame-based LM, training on all data yields the best performance. Although NYC is noisier than YouCook2, it still benefits language model training, since it is much larger. This suggests that the language model is robust to noise in the training data on the commonsense reasoning task considered in this paper.
These results demonstrate that a human annotated dataset, such as YouCook2, is not necessarily better than a recipe-based dataset, although it may still be helpful. Based on the above analyses, we use the sentence-based LM trained on all data when comparing with the other commonsense reasoning approaches in the previous section.
Real Robot Experiment
We deploy LMCR on a Baxter robot [rethink2013baxter], a research robotics platform with two 7 degree of freedom arms, and demonstrate its ability to successfully accomplish intended tasks given incomplete spoken instructions in different scenarios. A video of the LMCR-enabled robot in action is provided in Supplementary Materials.
In this work, we presented a robot that can detect when a human instruction is incomplete and automatically resolve it by observing the environment and making inferences based on commonsense world knowledge. The use of a neural language model in capturing the commonsense knowledge allows us to leverage online textual corpora and train the model with little manual intervention. We demonstrate the effectiveness of our algorithm both by measuring the alignment with human judgments and on a physical robot. In future work, we plan to investigate the robustness of the entire system against error in each module, consider verb frames with a varying number of missing arguments, and use dialogue when LMCR cannot make a confident decision about filling in missing information.