Word2vec to behavior: morphology facilitates the grounding of language in machines.
Enabling machines to respond appropriately to natural language commands could greatly expand the number of people to whom they could be of service. Recently, advances in neural network-trained word embeddings have empowered non-embodied text-processing algorithms, and suggest they could be of similar utility for embodied machines. Here we introduce a method that does so by training robots to act similarly to semantically-similar word2vec encoded commands. We show that this enables them to act appropriately, after training, to previously-unheard commands. Finally, we show that inducing such an alignment between motoric and linguistic similarities can be facilitated or hindered by the mechanical structure of the robot. This points to future, large scale methods that find and exploit relationships between action, language, and robot structure.
Using natural language to interact with machines has long been a goal in AI research. Recently, word embeddings such as word2vec have yielded significant advances in this direction [1, 2]. These embeddings generate vector spaces which accurately preserve semantic relationships between words, and can then be used to address text classification [3], sentiment analysis [4, 5], and other natural language-based problems.
However, these approaches tend to disregard the role that action and the body of an agent may play in generating and understanding natural language. The link between action and natural language has long been hypothesized in cognitive science [6] and linguistics [7]. However, it was only recently that neuroscience studies have provided data suggesting that such a link exists [8]. For instance, Pulvermüller et al. have shown that if stories are read to immobile subjects being scanned by fMRI, their motor and sensory cortices exhibit heightened activity [9].
In robotics, there is a long literature on helping robots ground language in action [10]. For example, Steels et al. reported a series of experiments in which robots collectively construct their own syntax and grammar [11], while Schulz et al. report on robots that construct a language to describe spatial [12] and temporal [13] concepts. Matuszek et al. [14] trained a parser on pairs of English commands and corresponding control language expressions.
The word embedding approach is attractive as a data-driven, rather than hypothesis-driven, method for enabling machines to link natural language and action. Indeed, recent such attempts have been reported. For example, a visual word2vec corpus has been trained which captures semantic relationships between images rather than text [15], and sound-word2vec similarly discovers semantic structure among the sounds associated with words [16]. Jang et al. demonstrated “grasp2vec”: a method for enabling robots to autonomously learn object-centric representations that enable recognition and grasping of objects without recourse to a pre-defined feature set [17].
However, none of these methods attempt to align embodied embeddings with word embeddings. In order to realize robots that can respond appropriately to previously unheard natural language, it would be useful if the robot’s learned sensorimotor structure mapped on to natural language semantic structure, and vice versa.
If so, robots should act similarly and appropriately when they hear two similar words, even if they have not previously heard one of those words.
We demonstrate a method that forges such an alignment here. Briefly, we assign a unique objective function to sets of similar action words, and then train robots to maximize these functions while they “hear” the embeddings associated with those words, along with their own sensor data. Evidence that robots can successfully learn to align human language semantic structure with the structure of their own felt experience is demonstrated by the fact that robots so trained act appropriately when issued previously unheard words.
Finally, we have found that the mechanical design of the robot can facilitate or obstruct the training algorithm’s ability to forge this alignment: we found the method performed worse or better for robots with different body plans. This adds to a growing body of work that demonstrates that an appropriate robot body plan choice can facilitate other aspects of behavior generation in robots [18, 19, 20, 21, 22].
II-A The task.
Robots were optimized in simulation using Pyrosim (a Python interface for building robots and their neural controllers in Open Dynamics Engine: github.com/mec-lab/pyrosim) to perform three behaviors (move forward, move backward, and stop movement) according to the embeddings of six different input commands: ‘forward’, ‘backward’, ‘stop’, ‘cease’, ‘suspend’, and ‘halt’. Prior to optimization, one of the last four commands (i.e., one of the synonyms of ‘stop’) was randomly selected for testing and held out of the training set. Robots were optimized according to their performance summed across all five training commands.
The performance of a robot under the ‘forward’ and ‘backward’ commands was measured by its displacement in the positive and negative direction, respectively, along a fixed axis of the simulator, at the end of an evaluation period of 500 time steps (with step size 0.05; i.e., 25 seconds of behavior). For the remaining commands, performance was proportional to the negative Euclidean distance from the origin (the robot’s starting point) at the end of simulation, thus rewarding robots that move less.
Because robots were tested under an unheard synonym of ‘stop’, test error was measured as the final displacement of the robot.
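The scoring rules above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the rewarded axis is assumed to be the second planar coordinate, distance is assumed to be measured in the plane, and the constant of proportionality for the ‘stop’ reward is assumed to be 1.

```python
import math

# Sign of the rewarded displacement for the two movement commands.
MOVE_COMMANDS = {"forward": +1.0, "backward": -1.0}


def performance(command, final_position):
    """Score one evaluation from the robot's final (x, y) position.

    The robot is assumed to start at the origin, as in the text.
    """
    x, y = final_position
    if command in MOVE_COMMANDS:
        # Signed displacement along the rewarded axis (assumed here to be y).
        return MOVE_COMMANDS[command] * y
    # Every other command is treated as a 'stop' synonym: reward staying put.
    return -math.hypot(x, y)


def training_score(final_positions):
    """Sum performance over all training commands.

    `final_positions` maps each command word to the final (x, y) position
    reached when the robot was evaluated under that command.
    """
    return sum(performance(cmd, pos) for cmd, pos in final_positions.items())
```

Under this sketch, the test error for a held-out ‘stop’ synonym is simply the distance `math.hypot(x, y)` at the end of the evaluation.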
II-B The controller.
The robots are controlled by recurrent neural networks with three layers: a sensor layer, which is fully connected to a self- and recurrently-connected hidden layer consisting of five neurons, which are fully connected to a motor layer (Fig. 1). The numbers of motor neurons and sensor neurons vary with the morphology of each robot.
The sensor layer includes an auditory neuron that initializes the controller’s hidden state as follows. Before sending a robot to the simulator for evaluation, the target command vector is fed serially through the auditory neuron and into the recurrent hidden layer, one element after another, each time updating the hidden neurons’ values.
After initializing the hidden neurons, their incoming synapses from the auditory input neuron were removed, and the sensor and motor neurons were attached. The robot was then sent to the simulator with its initialized network (Fig. 3).
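The initialization step can be illustrated with a small sketch. It covers only the serial loading of the command vector into the hidden layer; the later attachment of sensor and motor neurons is omitted. The tanh activation, the zero initial hidden state, the weight ranges, and all names are assumptions made for illustration, since the text does not specify them.

```python
import math
import random

N_HIDDEN = 5  # the text specifies five hidden neurons


def init_hidden_state(command_vector, w_audio, w_recurrent):
    """Feed a command vector through one auditory neuron, element by element.

    w_audio[j]        : weight from the auditory neuron to hidden neuron j
    w_recurrent[i][j] : weight from hidden neuron i to hidden neuron j
    Returns the hidden activations after the last element has been heard.
    """
    hidden = [0.0] * N_HIDDEN  # assumed starting state
    for element in command_vector:
        # Each element updates every hidden neuron, using the previous
        # hidden state through the recurrent (and self) connections.
        hidden = [
            math.tanh(
                element * w_audio[j]
                + sum(hidden[i] * w_recurrent[i][j] for i in range(N_HIDDEN))
            )
            for j in range(N_HIDDEN)
        ]
    return hidden


# Toy usage with random weights and a made-up three-element "embedding".
rng = random.Random(0)
w_audio = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
w_recurrent = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_HIDDEN)]
state = init_hidden_state([0.1, -0.3, 0.25], w_audio, w_recurrent)
```

Because the whole vector is compressed into five hidden activations before behavior begins, two commands are distinguishable during behavior only to the extent that they leave different hidden states.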
II-C The robots.
II-C1 The quadruped
The quadruped consists of a rectangular abdomen, attached to which are four legs, each composed of an upper and lower cylindrical object. The knee and the hip of each leg each contain a 1-DOF rotational hinge joint which can flex inward or extend outward by up to 45 degrees away from its initial angle (Fig. 2C). Inside each lower leg is a touch sensor neuron, which at every time step detects contact with the floor: its value is binary, indicating either no contact or contact.
II-C2 The minimal robot
The minimal robot (Fig. 2A) consists simply of two cylinders joined end-to-end by a rotational hinge joint. Like a single leg of the quadruped, minimal robots have a single degree-of-freedom hinge joint. However, unlike the quadruped’s legs, the minimal robot has two touch sensors (one in each cylinder) as well as a proprioceptive sensor that measures the angle of its joint.
II-C3 The spherical robots
The spherical robots (Fig. 2B) consist of a pendulum attached to a sphere’s top interior wall. Some spherical robots have a pendulum which can only swing through a single vertical plane. These are referred to as 1DOF spherical robots. Other spherical robots have two orthogonal joints that let the pendulum rotate in two orthogonal vertical planes. This version is referred to as a 2DOF spherical robot. Some spherical robots have proprioception (denoted as “spherical robots with sensors”) and others do not (denoted as “without sensors”).
|Quadruped||1.84 (0.07)||1.90 (0.49)|
|Minimal||5.30 (0.06)||5.43 (0.07)|
|with sensors||10.74 (0.08)||11.22 (0.08)|
|without sensors||11.48 (0.08)||11.20 (0.09)|
|with sensors||10.55 (0.08)||10.84 (0.08)|
|without sensors||10.61 (0.09)||10.33 (0.07)|
II-D The optimization algorithm.
Controllers were optimized using a standard evolutionary algorithm: AFPO (Age-Fitness Pareto Optimization [23]). AFPO is a multi-objective optimization method that trains populations of candidate solutions to maximize the objective function for the desired behavior, while simultaneously minimizing ‘age’, a variable which roughly corresponds to the amount of search time spent in a particular area of design space. This latter objective aids in the prevention of premature convergence.
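The two-objective comparison at the heart of this scheme can be sketched as a Pareto dominance check. This is an illustrative fragment only: the dictionary layout and function names are hypothetical, and the rest of the algorithm (age bookkeeping, selection, population refill) is omitted.

```python
def dominates(a, b):
    """True if candidate `a` Pareto-dominates candidate `b`.

    `a` dominates `b` when it is no older and no less fit, and strictly
    better on at least one of the two objectives.
    """
    no_worse = a["age"] <= b["age"] and a["fitness"] >= b["fitness"]
    strictly_better = a["age"] < b["age"] or a["fitness"] > b["fitness"]
    return no_worse and strictly_better


def pareto_front(population):
    """Return the candidates that no other candidate dominates."""
    return [p for p in population if not any(dominates(q, p) for q in population)]


# Toy population: the third candidate is as old as the second but less fit.
population = [
    {"age": 1, "fitness": 1.0},
    {"age": 5, "fitness": 3.0},
    {"age": 5, "fitness": 2.0},
]
front = pareto_front(population)
```

A young, low-fitness candidate survives alongside an old, high-fitness one, which is how the age objective keeps fresh lineages in the population.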
Each independent evolutionary run started with a different random seed, and consisted of a population of 50 robots, optimized for 6000 generations. At each generation, a modified copy is made of each robot in the population by randomly selecting a single synapse and replacing its weight with a sample from a normal distribution whose mean is the current weight and whose standard deviation is the absolute value of the current weight.
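The mutation operator described above is small enough to sketch directly. The flat list representation of synapse weights and the function name are assumptions for illustration.

```python
import random


def mutate(weights, rng=random):
    """Return a copy of `weights` with one randomly chosen synapse perturbed.

    The new weight is drawn from a normal distribution centered on the
    current weight, with standard deviation equal to its absolute value.
    """
    child = list(weights)           # leave the parent untouched
    i = rng.randrange(len(child))   # pick a single synapse
    w = child[i]
    child[i] = rng.gauss(w, abs(w))
    return child


parent = [0.5, -1.0, 2.0]
child = mutate(parent, random.Random(1))
```

One consequence of this rule, as sketched, is that the step size scales with the weight: large weights take large steps, and a weight of exactly zero has zero standard deviation and therefore cannot change.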
II-E The experimental treatment.
Prior to optimization, the vectors corresponding to each command were obtained from the word2vec embedding (code.google.com/archive/p/word2vec). During optimization, a command vector was uploaded to the robot (as described in §II-B); then, the robot behaved and was assigned a performance score (as described in §II-A). The cosine similarities between pairs of the command vectors are presented in Table I.
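For reference, the cosine similarity between two command vectors can be computed as follows. The vectors in the usage lines are toy values, not actual word2vec embeddings, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: parallel vectors score 1, orthogonal vectors score 0.
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The measure depends only on direction, not magnitude, so two embeddings that point the same way are treated as maximally similar regardless of their lengths.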
II-F The control treatment.
There is a possibility for overfitting in our method due to the unbalanced nature of the training set. Since the majority of the training commands (three out of five) require the robot to remain stationary, control policies could evolve that keep the robot immobile by default, yet memorize a movement response for the ‘forward’ command and another for the ‘backward’ command. In this way, even if we observe that the robot stays immobile when presented with the held-out, fourth ‘stop’ synonym, the control policy causing this behavior may have ignored the latent structure in the command embeddings.
In order to assess whether such overfitting occurs, we use the following control. At the beginning of each evolutionary run, the vectors corresponding to each command were obtained from the word embeddings vector space. Each vector was then randomly permuted so that the distribution of values in each new vector does not change, but their ordering does (Table II). The resulting five permuted embeddings are held constant over the course of that evolutionary run. If the optimization method tends to yield overfit control policies, they should similarly keep the robot immobile when presented with the sixth, held-out ‘stop’ synonym, regardless of the permutation.
If however the control policies exploit the latent structure in the embeddings and that structure is disrupted by permutation, we should expect to see the control treatment policies generate more movement in the robot, compared to the experimental treatment policies, when both are presented with the held-out ‘stop’ synonym. In other words, the control treatment policies should generalize worse than the experimental treatment policies when presented with the test command.
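A minimal sketch of the permutation control follows. It assumes that each command vector receives its own independent shuffle and that the shuffles are fixed by a per-run seed; the text does not state either detail, and the names and toy vectors are hypothetical.

```python
import random


def permute_embeddings(embeddings, seed):
    """Shuffle the elements of each command vector independently.

    The multiset of values in every vector is preserved, but the
    orderings, and hence the relationships between vectors, are disrupted.
    """
    rng = random.Random(seed)  # one seed per evolutionary run (assumed)
    permuted = {}
    for word, vector in embeddings.items():
        shuffled = list(vector)
        rng.shuffle(shuffled)
        permuted[word] = shuffled
    return permuted


# Toy four-element vectors standing in for real embeddings.
embeddings = {"stop": [0.1, 0.2, 0.3, 0.4], "halt": [0.1, 0.25, 0.3, 0.35]}
control = permute_embeddings(embeddings, seed=42)
```

Because each vector keeps its own values, any difference in generalization between treatments can be attributed to the disrupted ordering rather than to a change in the scale or spread of the inputs.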
II-G The hypothesis tests and correction.
For hypothesis testing, we use the Mann-Whitney U test [24], a rank-based test of whether one of two random variables is stochastically larger than the other.
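To make the statistic concrete, the sketch below computes U by direct pairwise comparison. It yields only the statistic, not a p-value; in practice a library routine such as `scipy.stats.mannwhitneyu` would be used for the full test. The function name and sample values are illustrative.

```python
def mann_whitney_u(xs, ys):
    """U statistic for sample `xs` against sample `ys`.

    Counts the (x, y) pairs in which x exceeds y, with ties counting
    one half. A value near len(xs) * len(ys) indicates that xs tends
    to be larger; a value near 0 indicates that ys tends to be larger.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Toy displacements: every value in the first sample exceeds the second.
u_stat = mann_whitney_u([3.0, 4.0, 5.0], [1.0, 2.0])
```

Since the statistic depends only on the ordering of the observations, it makes no assumption that displacements are normally distributed.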
We make a total of 56 pairwise comparisons in this paper. With each comparison, the likelihood of incorrectly rejecting a null hypothesis (i.e., making a Type I error) increases. Thus, to control the family-wise error rate (the probability of one or more false rejections of true hypotheses) we conservatively adjust the rejection criteria of each individual hypothesis test using the Holm-Bonferroni (step-down) procedure [25].
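The step-down correction can be sketched as follows. The significance level of 0.05 is an assumed default, as the text does not state the level used, and the function name and p-values are illustrative.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    P-values are visited from smallest to largest. The k-th smallest
    (k = 0, 1, ...) is compared against alpha / (m - k), where m is the
    number of tests; the procedure stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: all remaining (larger) p-values also fail
    return reject


# Toy example with four tests.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

With 56 comparisons, the smallest p-value would be compared against alpha / 56, the next against alpha / 55, and so on, which is less strict than applying alpha / 56 to every test.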
Twelve hundred independent evolutionary trials were performed in total: 100 for each of the experimental and control treatments, for each of the six robot morphologies (Table III).
The 1200 run champions—the best robot from each trial—are extracted in order to test for statistical differences between the treatments, commands and morphologies.
Fig. 3 traces the behaviors of three exemplar run champions that are representative of typical behaviors found by the optimizer in the control and experimental treatments. Under the control treatment, the optimizer yielded specialized robots that were unsuccessful at one or more of the training behaviors and failed to “understand” the meaning of the unheard synonym: they have high test error. Under the experimental treatment, the optimizer yielded robots with correct behavior on all three training commands and that understood the meaning of the unheard synonym: they have low test error.
Fig. 4 compares the average displacement of the run champions under the different experimental conditions tested here. Overall, within each morphology and treatment, the optimizer found controllers that behaved correctly, during training, under both the ‘move’ and ‘stop’ commands: Robots moved significantly more when commanded to do so than when commanded to ‘stop’ (green bars are significantly higher than orange bars).
There was no significant difference between control and experimental treatments, in any of the tested morphologies, in terms of the final displacement of optimized robots commanded to move (pairs of green bars in each panel are of equal height). This implies that training performance of the robots is not due to inherent properties of the word2vec embedding; rather, it is due to the evolutionary algorithm.
In both treatments and in five of the six morphologies (but not the quadruped) there is a significant difference between the displacement of robots given the training and testing ‘stop’ commands (blue bars are usually higher than the orange bar to their left). Thus, the robots did not completely understand the meaning of the command ‘stop’. This was somewhat expected given the distance between the variants of ‘stop’ in the word2vec space (Table II). However, for four of the six morphologies, robots optimized with the experimental treatment moved less under the testing ‘stop’ command than those optimized with the control treatment (red significance brackets in Fig. 4). Thus, morphology affects the grounding of the ‘stop’ commands.
However, the training set is unbalanced—there are three commands for ‘stop’ and only two for ‘move’—so it is possible that robots are overfitting to the stop commands and thus display little motion during testing without learning the semantic meaning of the commands.
To control for this, we retrained the quadruped from scratch on a new, balanced set of commands (Fig. 5), where each task (‘stop’, ‘forward’, and ‘backward’) was trained using two commands, yielding a training set size of six. The two commands were chosen for each task such that the cosine similarity between them was similar to that of the stop synonyms previously used. We chose to use ‘forward’ and the misspelled ‘foward’ for the ‘forward’ task; and ‘backward’ and ‘backwards’ for the ‘backward’ task. For the ‘stop’ task, we removed the ‘halt’ command, leaving ‘stop’, ‘suspend’ and ‘cease’, one of which was randomly held-out at the beginning of each evolutionary trial for testing, and the others were used for training. (Also, the reward function paired with the ‘stop’ commands was changed to be inversely proportional to the robot’s total movement, thus protecting against the perverse instantiation of oscillating around the origin.)
This alternate training set is balanced on a per-task level; however, it is unbalanced on another level: there are four commands for ‘move’ and only two for ‘stop’.
As seen in Fig. 5, under the per-task balanced training sets, the optimizer still found controllers that generated correct behavior during training. There was no statistically significant difference between the control and experimental treatments in terms of training performance.
Under the control treatment, the quadruped moved significantly more for the test ‘stop’ command than for the training ‘stop’ commands. In fact, it moved almost as much during the test ‘stop’ command as under its training ‘move’ commands. Thus the control treatment, with the per-task balanced training set, yielded controllers that were overfit to the ‘move’ commands.
Under the experimental treatment, however, there was no significant difference between the movement of the quadruped under the training and test ‘stop’ commands. This suggests that, despite the higher prevalence of ‘move’ commands in the per-task balanced training data, controllers learned latent structure of the embedding, and used this understanding to correctly generalize to the unheard ‘stop’ synonym.
One limitation of this work is that the command vector is loaded serially into the controller, prior to behavior, and is potentially overwritten by proprioception and touch sensor data during behavior. Ideally, robots would be able to hear the commands throughout their evaluation periods, therefore allowing them to modulate their interpretation of the command based on action. Further, this would allow dynamic communication with the robot.
One way to achieve this is with wider controller architectures: each element in the vector commands could have its own input synapse. Then, the entire vector could influence action at each time step, and be updated during behavior. Moreover, the controllers should also be deeper such that more complex (nonlinear and hierarchical) latent structure of the embedding can be learned.
Additionally, unlike other end-to-end methods that are mostly automated, the method presented here still requires much manual intervention: the investigator must create an objective function for each grouping of action synonyms. To minimize such intervention, in future work we wish to investigate whether a small set of semantically and motorically orthogonal objective functions can be created that enables the robot to generalize not just to unheard synonyms of training commands, but also to novel sequences of commands.
Finally, these experiments were conducted with simulated robots. In future work, we would like to investigate how well this technique extends to physical robots. To this end, we have already performed some initial work to investigate how well the use of vector spaces for training robots works on physical systems.
IV-A The physical robot.
Our physical robot system (Fig. 6) is a 12-DOF quadruped powered by a Raspberry Pi 3B+ and twelve 14-gram micro servos. The main body is laser cut out of wood and holds the Raspberry Pi, an I2C PWM driver, a 9-DOF IMU, and a DC buck converter for power regulation. We provide power via an umbilical cord. Each of the four legs is constructed from 3D printed parts and contains three joints: a hip, a knee, and an ankle.
The robot is controlled by the on-board computer with programs written in Python. The Python code interfaces with the robot sensors and motor driver over I2C to actuate the motors. The sensor data can be fed into the artificial recurrent neural network, which is simulated on the Raspberry Pi. An SSH connection over WiFi is used to perform maintenance, and to configure and start the robot.
We created a new simulated quadruped without sensors and with a slightly modified morphology to more closely match the morphology of the physical robot. Controllers were optimized in simulation under the same conditions as the original three robots. We then transferred six optimized controllers from each treatment to the physical robot and recorded the motion using computer vision. The movement patterns of these controllers are shown in Fig. 7.
Overall, simulated behaviors did not transfer adequately to reality. However, for some of the controllers (e.g., Fig. 7b,g,j), the physical robot moved in different ways for the ‘forward’, ‘backward’, and ‘stop’ commands, thus exhibiting the rudiments of successful sim2real transfer. Future work will more thoroughly investigate the use of vector spaces and existing sim2real methods [26, 27, 28, 29] for training physical systems to ground language.
In this work we have presented a method for inducing an alignment between similarities among sensor data generated by robot movements and the semantic similarities between the word2vec-encoded commands that induced those actions. This method yields control policies that cause robots to move appropriately to previously-unheard natural language commands. Further, we have found that this method can be facilitated or frustrated by the particular mechanical structure of the robot employed. In future work we plan to evolve robot body plans, searching for those that make it even easier to induce such alignments. This work thus suggests not just that relationships between action, human language, and embodiment can be created in machines, but provides an empirical method for exploring and strengthening these relationships to yield robots that could be commanded by non-expert human handlers.
The authors would like to thank Eve Wight and Ryan Joseph for their help in creating the physical robot. This work was supported by NSF award EFRI-1830870 and DARPA contract HR0011-18-2-0022. Computation was provided by the Vermont Advanced Computing Core.
- [1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
- [2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
- [3] J. Lilleberg, Y. Zhu, and Y. Zhang, “Support vector machines and word2vec for text classification with semantic features,” in 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE, 2015, pp. 136–140.
- [4] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, “Learning sentiment-specific word embedding for twitter sentiment classification,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.
- [5] B. Dickinson and W. Hu, “Sentiment analysis of investor opinions on twitter,” Social Networking, vol. 4, no. 03, p. 62, 2015.
- [6] A. Clark, “Language, embodiment, and the cognitive niche,” Trends in cognitive sciences, vol. 10, no. 8, pp. 370–374, 2006.
- [7] G. Lakoff and M. Johnson, Metaphors we live by. University of Chicago press, 2008.
- [8] V. Gallese and G. Lakoff, “The brain’s concepts: The role of the sensory-motor system in conceptual knowledge,” Cognitive neuropsychology, vol. 22, no. 3-4, pp. 455–479, 2005.
- [9] F. Pulvermüller and L. Fadiga, “Active perception: sensorimotor circuits as a cortical basis for language,” Nature reviews neuroscience, vol. 11, no. 5, p. 351, 2010.
- [10] M. Selfridge and W. Vannoy, “A natural language interface to a robot assembly system,” IEEE Journal on Robotics and Automation, vol. 2, no. 3, pp. 167–171, 1986.
- [11] L. Steels, “Evolving grounded communication for robots,” Trends in cognitive sciences, vol. 7, no. 7, pp. 308–312, 2003.
- [12] R. Schulz, A. Glover, M. J. Milford, G. Wyeth, and J. Wiles, “Lingodroids: Studies in spatial cognition and language,” in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 178–183.
- [13] S. Heath, R. Schulz, D. Ball, and J. Wiles, “Lingodroids: Learning terms for time,” in 2012 IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 1862–1867.
- [14] C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox, “Learning to parse natural language commands to a robot control system,” in Experimental Robotics. Springer, 2013, pp. 403–415.
- [15] S. Kottur, R. Vedantam, J. M. Moura, and D. Parikh, “Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4985–4994.
- [16] A. K. Vijayakumar, R. Vedantam, and D. Parikh, “Sound-word2vec: Learning word representations grounded in sounds,” arXiv preprint arXiv:1703.01720, 2017.
- [17] E. Jang, C. Devin, V. Vanhoucke, and S. Levine, “Grasp2vec: Learning object representations from self-supervised grasping,” arXiv preprint arXiv:1811.06964, 2018.
- [18] J. Bongard, “Morphological change in machines accelerates the evolution of robust behavior,” Proceedings of the National Academy of Sciences, vol. 108, no. 4, pp. 1234–1239, 2011.
- [19] J. C. Bongard, A. Bernatskiy, K. Livingston, N. Livingston, J. Long, and M. Smith, “Evolving robot morphology facilitates the evolution of neural modularity and evolvability,” in Proceedings of the 2015 annual conference on genetic and evolutionary computation. ACM, 2015, pp. 129–136.
- [20] Z. Mahoor, J. Felag, and J. Bongard, “Morphology dictates a robot’s ability to ground crowd-proposed language,” arXiv preprint arXiv:1712.05881, 2017.
- [21] S. Kriegman, N. Cheney, and J. Bongard, “How morphological development can guide evolution,” Scientific reports, vol. 8, no. 1, p. 13934, 2018.
- [22] S. Kriegman, S. Walker, D. Shah, M. Levin, R. Kramer-Bottiglio, and J. Bongard, “Automated shapeshifting for function recovery in damaged robots,” in Proceedings of Robotics: Science and Systems, 2019.
- [23] M. Schmidt and H. Lipson, “Age-fitness pareto optimization,” in Genetic Programming Theory and Practice VIII. Springer, 2011, pp. 129–146.
- [24] H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The annals of mathematical statistics, pp. 50–60, 1947.
- [25] S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian journal of statistics, pp. 65–70, 1979.
- [26] J. Bongard, V. Zykov, and H. Lipson, “Resilient machines through continuous self-modeling,” Science, vol. 314, no. 5802, pp. 1118–1121, 2006.
- [27] J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for visual control,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1148–1155, 2019.
- [28] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019.
- [29] R. Kwiatkowski and H. Lipson, “Task-agnostic self-modeling machines,” Science Robotics, vol. 4, no. 26, 2019.