The aim of our study was to develop a method by which a social robot can greet passersby and get their attention without causing them discomfort. A number of customer services have recently come to be provided by social robots rather than people, including serving as receptionists, guides, and exhibitors. Robot exhibitors, for example, can explain products being promoted by the robot owners. However, a sudden greeting by a robot can startle passersby and cause them discomfort. Social robots should thus adapt their mannerisms to the situation they face regarding passersby. We developed a method for meeting this requirement on the basis of the results of related work. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby and get their attention without causing them to suffer discomfort (p<0.01). The results of an experiment in the field, an office entrance, demonstrated that our method meets this requirement.
Can User-Centered Reinforcement Learning Allow a Robot to Attract Passersby without Causing Discomfort?*
I Introduction
The working population in many developed countries is decreasing in proportion to the total population due to population aging, and this problem is expected to affect developing countries as well [UN]. One approach to addressing this problem is to use social robots rather than people to provide customer services. Such robots are starting to be used as receptionists, guides, and exhibitors. Robot exhibitors, for example, provide exhibition services such as explaining products being promoted by the robot owners. While robots can increase the chance of being able to provide a service by simply greeting passersby [ChaoShi], passersby can suffer discomfort if they are suddenly greeted by a robot [Ozaki]. The robot may thus face a dilemma: whether to behave in a manner that benefits the owner or to behave in a manner that does not cause passersby discomfort.
Our goal was to develop a method that solves the robot dilemma described above, that is, a method by which a robot can greet passersby and get their attention without causing them discomfort. We call our proposed method user-centered reinforcement learning.
In the next section, we define the problem and describe how we found an approach to solving it by studying related work. In the “Proposed Method” section, we explain the method we developed for solving the problem. In the “Experiment” section, we explain the experiment we conducted in the field to test two working hypotheses created from the original hypothesis. The results show that our method can solve the problem. In the “Discussion” section, we examine the results from the standpoints of physiology, psychology, and user experience. In the “Conclusion” section, we conclude that, by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort. We also mention future work to enhance the proposed method.
I-A Related Works
Several researchers have addressed problems that are similar to the problem we addressed. These problems can be categorized in terms of the problem setting, the solution, and the goal.
In terms of the problem setting, the problem we addressed is similar to the problem of human-robot engagement, which is a complex problem. In accordance with human-robot interface studies [Sidner, Sun], we can interpret human-robot engagement as the process by which a robot interacts with people, from initial contact to the end of the interaction. Several researchers have analyzed human-robot engagement [SidnerAna, Rich] and have developed a method for maintaining human-robot engagement during the interaction [BohusRobo]. We did not tackle the human-robot engagement problem directly; instead, we tackled the problem that precedes it, which is illustrated in Figure 2.
In terms of the solution, our approach draws on machine learning, especially reinforcement learning. Reinforcement learning in robotics is a technique for finding a policy [Kober]. It is used mainly for robotic control tasks and not much for interaction tasks. For control, reinforcement learning has been applied to the learning of several complex aerobatic control tasks for radio-controlled helicopters [Abbeel] and to the learning of door-opening tasks for robot arms [Google]. There has been less research on interaction tasks. Mitsunaga et al. showed that a social robot can adapt its behavior to humans for human-robot interaction by using reinforcement learning [Mitsunaga] if human-robot engagement has been established. Papaioannou et al. used reinforcement learning to extend the engagement time and enhance the dialogue quality [Papaioannou].
The applicability of these methods to the situation before human-robot engagement is established is unclear. As shown in Figure 2, the problem we addressed occurs before engagement is established.
In terms of the goal, the problem we addressed is similar to increasing the number of human-robot engagements. Macharet et al. showed that, in a simulation environment, Gaussian process regression based on reinforcement learning can be used to increase the number of engagements[Macharet]. Going further, we focused on increasing the number of engagements in a field environment.
I-B Problem Statement
To define the problem, we use a framework commonly used for reinforcement learning in robotics, the partially observable Markov decision process (POMDP) [Kober]. The robot is the agent, and the environment is the problem. The robot can observe the environment partially by using sensors.
We choose an exhibition service area in the entrance of a company as the environment. We assume that the entrance consists of one automated exhibition system, one aisle, and other space. In addition, the entrance is expressed as a Euclidean space. Passersby can move freely around the exhibition system.
The automated exhibition system consists of a tablet, a computer, a robot, and a sensor system. The sensor system captures color image data and depth image data, which we call the observation. The sensor system can also extract part of the passerby's action from the observation. The passerby's action consists of the passerby's position and head angle. We define the time at which the passerby enters the entrance and the time at which the passerby leaves the entrance, and we call the interval between them an episode. Within an episode, we track the passerby's position and head angle.
The proposed method determines the robot's own action from these passerby actions.
We count the number of people who used the service and the number of people who suffered discomfort. We can then state the problem as finding a robot policy that meets conditions on these two numbers.
I-C Our Approach
We solve this problem by controlling the robot on the basis of reinforcement learning, specifically ordinary Q-learning except for the design of the reward function. The reward function is created by focusing on the user experience of stakeholders. We call reinforcement learning with this reward function “user-centered reinforcement learning.” We do not use deep reinforcement learning because it is currently difficult to collect the huge amount of data needed for learning.
The contributions of this work are as follows:
We show that robots can learn abstract actions from a person’s non-verbal responses.
We present a method for increasing the number of human-robot engagements in the field without causing people discomfort.
II Proposed Method
The proposed method, user-centered reinforcement learning, is based on reinforcement learning. In this paper, we use Q-learning, a type of reinforcement learning, as the base algorithm because Q-learning makes it easy to explain why the robot chose its past actions. We call this algorithm “User-Centered Q-Learning” (UCQL). UCQL differs from the original Q-learning [Watkins] in its action set, state set, Q-function, and reward function. UCQL consists of three functions:
Select an action by a policy
Update the policy based on the user’s actions
Design a reward function and a Q-function as initial conditions
II-1 Selecting an action by a policy
Generally speaking, a robot senses an observation and takes an action, which may be to wait. The algorithm works with the time at which the robot acts, the time at which the robot runs the algorithm, the predicted user state at that time, and the robot’s action at that time. In UCQL, the robot chooses its action by Algorithm II-1.
[Algorithm II-1: Select an action by a policy. Listing not reproduced.]
II-2 Update the policy based on user’s actions
In UCQL, the robot updates its policy by Algorithm II-2.
[Algorithm II-2: Update the policy based on the user’s actions. The update steps run inside a conditional that checks for completion. Listing not reproduced.]
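As a point of reference, the update that UCQL inherits from ordinary Q-learning [Watkins] can be sketched in Python as follows. This is a minimal illustration of the standard tabular update, not the listing of Algorithm II-2; the function name `q_update`, the table layout, and the values of the learning rate `alpha` and discount factor `gamma` are assumptions made for the example.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning update in place.

    q is a list of rows, one row of action values per state.
    alpha and gamma are illustrative values, not those of the experiment.
    """
    best_next = max(q[next_state])          # value of the best next action
    td_target = reward + gamma * best_next  # one-step bootstrapped target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state, two-action table used only for illustration.
q_table = [[0.0, 0.0], [0.0, 1.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0, next_state=1)
```

Only the entry for the visited state-action pair changes, which is what makes it possible to trace a learned action back to the episodes that produced it.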
II-3 Designing a reward function
In UCQL, the robot is given a reward function by Algorithm II-3. Algorithm II-3 divides motivation into extrinsic and intrinsic motivation, inspired by “Intrinsically Motivated Reinforcement Learning” [Chentanez]. We call the proposed method “user-centered” because we design the extrinsic motivation from user states related to user experience.
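[Algorithm II-3: Design a reward function. The listing initializes the reward and then applies three conditional adjustments: one if the action is not wait (intrinsic motivation), one if the state causes users more discomfort than the state it is compared with (extrinsic motivation), and one if the state is better for achieving the goal than the state it is compared with. It then returns the reward. Symbols and values not reproduced.]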
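The three-branch structure of the reward design can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' reward function: the signs and magnitudes (`action_cost`, `discomfort_penalty`, `progress_bonus`), the comparison of the next state against the previous state, and the two lookup tables `discomfort` and `progress` are all hypothetical.

```python
WAIT = "wait"


def reward(prev_state, action, next_state, discomfort, progress,
           action_cost=-1.0, discomfort_penalty=-5.0, progress_bonus=10.0):
    """Sketch of a reward with intrinsic and extrinsic terms.

    discomfort and progress map each state to a hypothetical score.
    All numeric values are assumptions chosen for illustration.
    """
    r = 0.0
    if action != WAIT:
        r += action_cost                 # intrinsic motivation (assumed cost)
    if discomfort[next_state] > discomfort[prev_state]:
        r += discomfort_penalty          # extrinsic motivation (user state)
    if progress[next_state] > progress[prev_state]:
        r += progress_bonus              # closer to achieving the goal
    return r


# Hypothetical scores for three of the passerby states.
discomfort = {"Passing By": 0, "Approaching": 0, "Leaving": 1}
progress = {"Passing By": 0, "Approaching": 1, "Leaving": 0}

r_good = reward("Passing By", "greet", "Approaching", discomfort, progress)
r_bad = reward("Passing By", "greet", "Leaving", discomfort, progress)
r_wait = reward("Passing By", WAIT, "Passing By", discomfort, progress)
```

Keeping the three terms separate makes it easy to see which motivation contributed to a given reward.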
Any policy can be used, such as greedy or ε-greedy.
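For illustration, an ε-greedy policy over one row of a tabular Q-function can be sketched as follows. The function name `epsilon_greedy` and the example values are assumptions; with `epsilon` set to zero the same function behaves as the greedy policy.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit


# Hypothetical action values for a single state.
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
```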
The Q-function may be initialized with a uniform distribution. However, if the initial Q-function is designed to suit the task, learning is faster than with uniform initialization.
The Q-function may be approximated with a function such as a Deep Q-Network [Mnih]. However, learning is then much slower than with the designed function.
III Experiment
In this section, we aim to support the hypothesis that “by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort”.
III-A Concrete Goal
First, we convert the hypothesis into a working hypothesis by operationalization because we cannot evaluate the original hypothesis quantitatively.
In the Introduction, we defined this problem as finding a robot policy that meets conditions on the number of people who used the service and the number of people who suffered discomfort. We give concrete shape to these two numbers for this experiment. We draw on Ozaki’s study [Ozaki], whose findings include two important points. First, a passerby does not suffer a negative effect from the robot’s call if the passerby does not use the robot service. Second, a passerby suffers a negative effect from the robot’s call if the passerby uses the robot service. Thus, this is a binary classification problem of whether a passerby who is called by the robot uses the robot service or not, and we can define a confusion matrix for evaluating the method. We infer how the two numbers relate to the entries of the confusion matrix: positive correlations with TP and TN, a positive correlation with FP, and, on the other hand, a negative correlation with TN. Therefore, we can use accuracy as an index for evaluation because it is another representation of the problem conditions.
From the above discussion, we define the working hypothesis as “The accuracy after learning by UCQL is better than the accuracy before learning by UCQL”.
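The accuracy used in the working hypothesis is the usual ratio of correct classifications to all classifications in a confusion matrix. A minimal Python sketch follows; the counts in the example are hypothetical and are not the experimental data.

```python
def accuracy(tp, fp, fn, tn):
    """Return (TP + TN) / (TP + FP + FN + TN) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts, for illustration only.
example_accuracy = accuracy(tp=30, fp=5, fn=5, tn=10)
```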
In this experiment, we test the working hypothesis in order to show that the original hypothesis is sound.
III-B Method
In this section, we explain how we conducted the experiment in a field environment. The method for this experiment can be divided into five steps.
Create the experimental equipment
Construct an experimental environment
Define an experimental procedure
Evaluate the working hypothesis by statistical hypothesis testing
Visualize the effect of UCQL
III-B1 Create the experimental equipment
First, we create the equipment, which includes UCQL. The equipment can be described in terms of its physical structure and its logical structure.
Figure 3 is a diagram of the physical structure of the equipment. As shown in Figure 3, the experimental equipment consists of a table, a sensor, a robot, a tablet PC, a router, and servers. The components are connected by Ethernet cable or wireless LAN. We use Sota (https://sota.vstone.co.jp/home/), a palm-sized social humanoid robot, as the robot. Sota has a speaker to output voices, an LED to represent lip motions, an SoC to control the elements, and so on. In this experiment, those elements of Sota are used to interact with a participant. An iPad Air 2 is used as the tablet PC, which plays the movie on its display. The Intel RealSense Depth Camera D435 (https://click.intel.com/intelr-realsensetm-depth-camera-d435.html) is used as the RGB-D sensor device to measure passersby’s actions.
Figure 4 is a diagram of the logical structure of the equipment. The structure consists of Sensor, Motion Capture, State Estimator, Action Selector, Action Decoder, Effector, and Policy Updater. We use Nuitrack (https://nuitrack.com/) as Motion Capture, and we use ROS (http://wiki.ros.org/) as the infrastructure of the equipment for communicating variables among functions.
[Algorithm: one processing cycle of the equipment. If the system is not initialized, it is initialized first. The cycle then computes intermediate values, pushes three of them into containers, and returns. Listing not reproduced.]
We use Table I as the action set and Table II as the state set. Table II is a double Markov model created from the state set of Ozaki’s decision-making predictor [Ozaki]. Ozaki’s decision-making predictor classifies a passerby’s state into seven states: Not Found, Passing By, Look At, Hesitating, Approaching, Established, and Leaving.
In addition, we set two learning parameters. We use soft-max selection as the policy because we want the robot both to take actions that have a high value and to discover actions that have a higher value. Soft-max selection is often used with Q-learning. Equation 3 gives the probability of selecting each action under the policy. We use Equation 2 as the policy parameter, the temperature, which is defined by the number of times the values have been updated in a given state. The temperature depends on the state because some states occur many times.
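Soft-max selection can be sketched in Python as follows, using the standard Boltzmann form in which the probability of an action is proportional to the exponential of its Q-value divided by the temperature. The function names are assumptions, and the temperature is passed in as a plain number here because the state-dependent schedule of Equation 2 is not reproduced in this sketch.

```python
import math
import random


def softmax_probabilities(q_row, temperature):
    """Return Boltzmann selection probabilities for one row of Q-values."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    highest = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp((q - highest) / temperature) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_select(q_row, temperature, rng=random):
    """Sample an action index according to the soft-max probabilities."""
    probs = softmax_probabilities(q_row, temperature)
    return rng.choices(range(len(q_row)), weights=probs, k=1)[0]


# Hypothetical Q-values: a high temperature flattens the distribution,
# a low temperature concentrates it on the best action.
warm = softmax_probabilities([0.0, 1.0, 2.0], temperature=10.0)
cold = softmax_probabilities([0.0, 1.0, 2.0], temperature=0.1)
```

A temperature that decreases as a state is visited more often moves the policy from exploration toward exploitation in exactly those states that occur frequently.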
TABLE I: Action set
|The robot waits for 5 s until somebody comes.|
|The robot calls a passerby with a greeting.|
|The robot looks at a passerby.|
|The robot expresses joy through its motion.|
|The robot blinks its eyes.|
|The robot says “I’m sorry.” in Japanese.|
|The robot says “Excuse me.” in Japanese.|
|The robot says “It’s rainy today.” in Japanese.|
|The robot explains how to start its service.|
|The robot says goodbye.|
TABLE II: State set
|The passerby’s state changes from “Not Found” to “Not Found”.|
|The passerby’s state changes from “Not Found” to “Passing By”.|
|The passerby’s state changes from “Leaving” to “Established”.|
|The passerby’s state changes from “Leaving” to “Leaving”.|
[Algorithm: initialization, which takes no input. The listing creates a zero-filled 2D array, assigns entries in one nested loop and two further loops, and returns the result. Listing not reproduced.]
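The state set built from pairs of predictor states, together with a zero-filled table of action values, can be sketched in Python as follows. This sketch assumes that the state set contains every ordered pair of the seven predictor states and that the action set has the ten actions listed in Table I; it shows only the zero initialization, not the designed initial values.

```python
from itertools import product

BASE_STATES = [
    "Not Found", "Passing By", "Look At", "Hesitating",
    "Approaching", "Established", "Leaving",
]

# Each state is a transition (previous state, current state).
# Assumption: all 7 x 7 ordered pairs are included.
STATES = list(product(BASE_STATES, repeat=2))
STATE_INDEX = {pair: i for i, pair in enumerate(STATES)}

NUM_ACTIONS = 10  # assumed from the ten rows listed in Table I

# Zero-filled 2D array with one row per state and one column per action.
q_table = [[0.0] * NUM_ACTIONS for _ in STATES]
```

For example, `STATE_INDEX[("Not Found", "Passing By")]` gives the row of the table for a passerby who has just entered the detection area.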
III-B2 Construct an experimental environment
First, we define how to construct the environment for the experiment. Figure 5 shows an overhead view of the environment. The environment consists of an exhibition space, a wall, a seat space, and a passage to a W.C. in a building owned by an actual company. Hundreds of employees work in the building, and dozens of visitors come to it. Visitors often sit in the seat space for tens of minutes while waiting for employees. Some visitors and employees look at the exhibition space to learn about the company’s newer technologies. Some visitors go to the W.C. while they are waiting for employees.
III-B3 Define an experimental procedure
We suppose two main scenarios. The first scenario is as follows:
A visitor is sitting on a seat in the seat space.
Then, the visitor gets up from the seat because the visitor wants to go to the W.C.
Thus, the visitor moves from the seat space to the W.C. across the exhibition space.
The second scenario is as follows:
A visitor is sitting on a seat in the seat space.
Then, the visitor gets up from the seat because the visitor is bored of waiting.
Thus, the visitor moves from the seat space to the exhibition space in order to watch the robot in the equipment.
We mainly want to attract the passersby in the second scenario. We do not want to attract the passersby in the first scenario because the visitor wants to go to the W.C. Therefore, we let the robot learn these rules in the environment by UCQL for several days. We thereby obtain the learned Q-function.
After the learning, we let the robot attract passersby under two conditions, Before Learning and After Learning, in order to test the hypothesis. The robot does not learn during the test.
We collect the data for the evaluation with rosbag (http://wiki.ros.org/rosbag), which can record all values in ROS during the procedure.
III-B4 Evaluate by statistical hypothesis testing
We evaluate the working hypothesis by statistical hypothesis testing. We calculate the accuracy before the learning and the accuracy after the learning in order to test the working hypothesis. Finally, we use the one-sided test of proportions because we want to evaluate the statistical difference between the two accuracies.
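One common form of a one-sided test of proportions is the pooled two-proportion z-test, sketched below in Python with only the standard library. Whether the experiment used this exact variant (for example, with or without a continuity correction) is not stated, so the sketch is an assumption, and the counts in the example are hypothetical.

```python
import math


def one_sided_proportion_test(successes_a, n_a, successes_b, n_b):
    """Test H1: proportion of group b > proportion of group a.

    Returns (z, p_value) for a pooled two-proportion z-test
    without continuity correction.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_b - p_a) / math.sqrt(variance)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p_value


# Hypothetical counts of correct classifications before and after learning.
z_stat, p_value = one_sided_proportion_test(30, 100, 80, 100)
```

With accuracy as the proportion, group a corresponds to the before-learning condition and group b to the after-learning condition.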
III-B5 Visualize the effect of UCQL
We visualize the Q-function before the learning and the Q-function after the learning as heat maps in order to analyze the effect of UCQL. UCQL changes the robot’s actions by updating the Q-function. Therefore, visualizing the Q-function shows how the robot has learned its actions. Figure 6 is an example Q-function that illustrates the visualization used in this paper.
IV Results
We constructed the experimental environment described in the Method section at the entrance of our building. Figure 1 shows a picture of the equipment in the environment. The experimenter was the corresponding author. The participants were employees and visitors of our company. The learning period was three days. As a result, we collected a large amount of data. We cleaned the data by the following steps because data from the field contain a lot of noise, such as detection errors by Motion Capture.
We drop episodes whose duration is less than 1 s because it takes 3 s to walk across the detection area of Motion Capture.
We drop episodes that consist only of the “Not Found” state because nobody was in the detection area of Motion Capture.
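The two cleaning rules can be sketched as a simple filter in Python. The `Episode` record, its field names, and the reading of the second rule as "every recorded state is Not Found" are assumptions made for illustration; the 1 s threshold is the one stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    start: float        # time the passerby entered, in seconds
    end: float          # time the passerby left, in seconds
    states: List[str]   # estimated passerby states during the episode


def clean(episodes, min_duration=1.0, empty_state="Not Found"):
    """Keep only episodes that pass both cleaning rules."""
    kept = []
    for episode in episodes:
        if episode.end - episode.start < min_duration:
            continue  # too short to be a real crossing of the detection area
        if all(state == empty_state for state in episode.states):
            continue  # nobody was detected (also drops empty state lists)
        kept.append(episode)
    return kept


# Hypothetical episodes, for illustration only.
raw = [
    Episode(0.0, 0.4, ["Passing By"]),
    Episode(5.0, 9.0, ["Not Found", "Not Found"]),
    Episode(10.0, 14.0, ["Not Found", "Passing By", "Leaving"]),
]
cleaned = clean(raw)
```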
After the data cleansing, we obtained 209 episodes in total. Table III shows the number of episodes and the time for each condition. We calculated the accuracy from the confusion matrix for each condition. The accuracies of the baseline (before learning) and the proposed method (after learning) were 0.322 and 0.811, respectively. Using the one-sided test of proportions, we found a significant difference in accuracy between the before and after conditions (p < 0.01).
V Discussion
We discuss the original hypothesis, “The robot can attract passersby without users’ discomfort by User-Centered Reinforcement Learning,” from the following viewpoints.
Can we accept the original hypothesis?
Why can the robot attract passersby without discomfort by the proposed method?
What are the limitations of the method and the experiment?
V-A Can we accept the original hypothesis?
We explain why we can accept the original hypothesis by using the result of the experiment and another study.
First, we show that we can accept the working hypothesis, “The accuracy after learning by UCQL is better than the accuracy before learning by UCQL”. As reported in Section IV, we found a significant difference in accuracy between the before and after conditions. Thus, we accept the working hypothesis and can regard it as true.
Through the above discussion, the result of the experiment supports the original hypothesis because the working hypothesis is true. Therefore, we can accept the original hypothesis.
V-B Why can the robot attract passersby without discomfort by the proposed method?
We can explain why the robot attracts passersby without discomfort from the viewpoint of the learning process, using Figure 8.
Why does the robot reduce FN by UCQL? We compare a row of the Q-function in Figure 8(a) with the corresponding row in Figure 8(b). Before learning, the robot selected its action according to the Q-values in Figure 8(a); after learning, it selected its action according to the Q-values in Figure 8(b). This means that the robot does not call a passerby if the passerby will not use the robot service. Therefore, the robot reduces FN by UCQL.
V-C What are the limitations of the method and the experiment?
In this experiment, we assumed that a passerby does not walk with others. In other words, we did not consider groups of passersby. Thus, we need to extend the method to handle groups.
The data in this study were sampled from a biased population. We need to conduct further experiments in other environments to strengthen the support for the working hypothesis.
In this experiment, we created the reward function on the basis of other studies. However, it is hard to create a reward function for each case. Therefore, we need an easy method for designing the reward function and the Q-function.
VI Conclusion
We investigated the hypothesis that “by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort.” We proposed a method based on reinforcement learning in robotics and focused on the reward function and the Q-function because we wanted the robot to perform actions in view of user experience. To investigate our hypothesis, we made a working hypothesis and tested it experimentally. From the results, we accepted the working hypothesis and the original hypothesis.
Future work includes generalizing the method for creating the reward function to make it applicable to different tasks and developing a distributed reinforcement learning method that enhances time-efficiency.