Can User-Centered Reinforcement Learning Allow a Robot to Attract Passersby without Causing Discomfort?

  • 2020-01-02 03:52:52
  • Yasunori Ozaki, Tatsuya Ishihara, Narimune Matsumura, Tadashi Nunobiki

Abstract

The aim of our study was to develop a method by which a social robot can greet passersby and get their attention without causing them to suffer discomfort. A number of customer services have recently come to be provided by social robots rather than people, including serving as receptionists, guides, and exhibitors. Robot exhibitors, for example, can explain products being promoted by the robot owners. However, a sudden greeting by a robot can startle passersby and cause discomfort to passersby. Social robots should thus adapt their mannerisms to the situation they face regarding passersby. We developed a method for meeting this requirement on the basis of the results of related work. Our proposed method, user-centered reinforcement learning, enables robots to greet passersby and get their attention without causing them to suffer discomfort (p<0.01). The results of an experiment in the field, an office entrance, demonstrated that our method meets this requirement.

 


Yasunori Ozaki¹, Tatsuya Ishihara², Narimune Matsumura¹ and Tadashi Nunobiki¹

*This work is supported by NTT Corporation.
¹Yasunori Ozaki, Narimune Matsumura and Tadashi Nunobiki are with Service Evolution Lab., NTT Corporation, Yokosuka, Japan.
²Tatsuya Ishihara is with the R&D Center, NTT West Corporation, Osaka, Japan.

I Introduction

The working population in many developed countries is decreasing in proportion to the total population due to population aging, and this problem is expected to affect developing countries as well[UN]. One approach to addressing this problem is to use social robots rather than people to provide customer services. Such robots, for example, are starting to be used as receptionists, guides, and exhibitors. Robot exhibitors are being used to provide, for example, exhibition services, such as explaining products being promoted by the robot owners. While robots can increase the chance of being able to provide a service by simply greeting passersby[ChaoShi], passersby can suffer discomfort if they are suddenly greeted by a robot[Ozaki]. The robot may thus face a dilemma: whether to behave in a manner that benefits the owner or to behave in a manner that does not discomfort passersby.

Fig. 1: Photograph illustrating the problem addressed. The robot on the left uses gestures to explain the movie to passersby. The function of the robot in the middle is unrelated to the work being reported. The robot on the right calls out to passersby to get their attention and is the focus here.

Our goal was to develop a method that solves the robot dilemma described above, that is, a method by which a robot can greet passersby and get their attention without causing them to suffer discomfort. We call our proposed method user-centered reinforcement learning.

In the next section, we define the problem and describe how we found an approach to solving it by studying related work. In the "Proposed Method" section, we explain the method we developed for solving the problem. In the "Experiment" section, we explain the experiment we conducted in the field to test a working hypothesis derived from the original hypothesis. The results show that our method can solve the problem. In the "Discussion" section, we examine the results from the standpoints of physiology, psychology, and user experience. In the "Conclusion" section, we conclude that, by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort. We also mention future work to enhance the proposed method.

I-A Related Works

Several researchers have addressed problems that are similar to the problem we addressed. These problems can be categorized in terms of the problem setting, the solution, and the goal.

In terms of the problem setting, the problem we addressed is similar to the problem of human-robot engagement, which is a complex problem. In accordance with human-robot interface studies [Sidner, Sun], we can interpret human-robot engagement as the process by which a robot interacts with people, from initial contact to the end of the interaction. Several researchers have analyzed human-robot engagement [SidnerAna, Rich] and have developed a method for maintaining human-robot engagement during the interaction [BohusRobo]. We did not tackle the human-robot engagement problem directly; instead, we tackled the problem that precedes it, which is illustrated in Figure 2.

Fig. 2: Relationship between problem of robot greeting passersby and getting their attention without causing them to suffer discomfort and human-robot engagement.

In terms of the solution, the problem we addressed is related to machine learning, especially reinforcement learning. Reinforcement learning in robotics is a technique used to find a policy $\pi: O \to A$ [Kober] and is used for robotic control tasks. It is not used much for interaction tasks. Reinforcement learning has been applied to the learning of several complex aerobatic control tasks for radio-controlled helicopters [Abbeel] and to the learning of door opening tasks for robot arms [Google]. The research on interaction tasks is less remarkable. Mitsunaga et al. showed that a social robot can adapt its behavior to humans for human-robot interaction by using reinforcement learning [Mitsunaga] if human-robot engagement has been established. Papaioannou et al. used reinforcement learning to extend the engagement time and enhance the dialogue quality [Papaioannou].

The applicability of these methods to the situation before human-robot engagement is established is unclear. As shown in Figure 2, the problem we addressed occurs before engagement is established.

In terms of the goal, the problem we addressed is similar to increasing the number of human-robot engagements. Macharet et al. showed that, in a simulation environment, Gaussian process regression based on reinforcement learning can be used to increase the number of engagements[Macharet]. Going further, we focused on increasing the number of engagements in a field environment.

I-B Problem Statement

We use a problem framework commonly used for reinforcement learning in robotics, the partially observable Markov decision process (POMDP), to define the problem [Kober]. The robot is the agent, and the environment is the problem. The robot can observe the environment partially by using sensors.

We choose an exhibition service area in the entrance of a company as the environment. We assume the entrance consists of one automated exhibition system, one aisle, and other space. In addition, the entrance is expressed as the Euclidean space $\mathbb{R}^3$. Passersby can move freely around the exhibition system.

The automated exhibition system consists of a tablet, a computer, a robot, and a sensor system. The sensor system captures color image data $I_t$ and depth image data $D_t$. We call these data the observation $O_t$. The sensor system can also extract part of the passerby's action from $O_t$. The passerby's action consists of the passerby's position $\boldsymbol{p}_t = (x_t, y_t, z_t)$ and head angle $\boldsymbol{\theta}_t = (\theta_t^{\mathrm{yaw}}, \theta_t^{\mathrm{roll}}, \theta_t^{\mathrm{pitch}})$. We define $t = 0$ as the time when the passerby enters the entrance and $t = T_{\mathrm{end}}$ as the time when the passerby leaves the entrance. We call the interval between $t = 0$ and $t = T_{\mathrm{end}}$ an episode. Let $P = (\boldsymbol{p}_0, \ldots, \boldsymbol{p}_{T_{\mathrm{end}}})$ be the passerby's positions in an episode, and let $\Theta = (\boldsymbol{\theta}_0, \ldots, \boldsymbol{\theta}_{T_{\mathrm{end}}})$ be the passerby's head angles in the episode.

The proposed method selects the robot's own action on the basis of these passerby actions.

Let $N_u$ be the number of people who used the service. Let $N_d$ be the number of people who suffered discomfort. Then, we can state the problem as "Find a robot policy $\pi: O \to A$ that maximizes $N_u$ and minimizes $N_d$."

I-C Our Approach

We solve this problem by controlling the robot on the basis of reinforcement learning, specifically ordinary Q-learning except for the design of the reward function. The reward function is designed by focusing on the user experience of stakeholders. We call reinforcement learning that uses this reward function "user-centered reinforcement learning." We do not use deep reinforcement learning because it is currently difficult to collect the huge amount of data needed for learning.

I-D Contributions

The contributions of this work are as follows:

  1. We show that robots can learn abstract actions from a person's non-verbal responses.

  2. We present a method for increasing the number of human-robot engagements in the field without causing passersby to suffer discomfort.

II Proposed Method

The proposed method, user-centered reinforcement learning, is based on reinforcement learning. In this paper, we use Q-learning, a type of reinforcement learning, as the base algorithm because with Q-learning it is easy to explain why the robot chose its past actions. We call this algorithm "User-Centered Q-Learning" (UCQL). UCQL differs from the original Q-learning [Watkins] in its action set $A$, state set $S$, Q-function $Q(s,a)$, and reward function $r(s_t, a_t, s_{t+1})$. UCQL consists of three functions:

  1. Select an action by a policy.

  2. Update the policy based on the user's actions.

  3. Design a reward function and a Q-function as initial conditions.

II-1 Selecting an action by a policy

Generally speaking, a robot senses an observation and takes an action, which may be to wait. Let $t_a$ [sec] be the time when the robot acted. Let $t_c$ [sec] be the time when the robot runs the algorithm. Let $s_t \in S$ be the predicted user's state at time $t$. Let $a_t \in A$ be the robot's action at time $t$. In UCQL, the robot chooses the action by Algorithm II-1.

Algorithm II-1: Select an action by UCQL (Action Selector)
  Require: $t_c$, $s_{t_c}$, $Q(s,a)$, $\pi(s,A,Q)$
  Ensure: $a_t$, $t_a$
  $a_t \leftarrow \pi(s_{t_c}, A, Q)$
  $t_a \leftarrow t_c$
  return $a_t$, $t_a$

II-2 Update the policy based on user’s actions

In UCQL, the robot updates the policy by Algorithm II-2.

Algorithm II-2: Update the policy by UCQL (Policy Updater)
  Require: $s_{t_a}$, $a_{t_a}$, $s_{t_c}$, $A$, $Q(s,a)$
  Ensure: $Q(s,a)$
  if $a_{t_a}$ is finished then
    $R \leftarrow r(s_{t_a}, a_{t_a}, s_{t_c})$
    $Q_{\mathrm{old}} \leftarrow Q(s_{t_a}, a_{t_a})$
    $Q(s_{t_a}, a_{t_a}) \leftarrow (1-\alpha) Q_{\mathrm{old}} + \alpha \left( R + \gamma \max_{a} Q(s_{t_c}, a) \right)$
  end if
  return $Q(s,a)$
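The update rule in Algorithm II-2 is the standard tabular Q-learning update, and a minimal Python sketch of it is given here. The dictionary-based table, the function name `update_q`, the state labels, and the reward value in the usage lines are illustrative assumptions; only α=0.5 and γ=0.999 come from the experiment section of the paper.

```python
from collections import defaultdict

ALPHA = 0.5    # learning rate reported in the experiment section
GAMMA = 0.999  # discount factor reported in the experiment section


def update_q(q, s_ta, a_ta, s_tc, reward, actions, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular Q-learning update for the finished action a_ta.

    q maps (state, action) pairs to values; missing entries count as 0.
    """
    q_old = q[(s_ta, a_ta)]
    best_next = max(q[(s_tc, a)] for a in actions)
    q[(s_ta, a_ta)] = (1 - alpha) * q_old + alpha * (reward + gamma * best_next)
    return q


# Hypothetical usage: the labels follow the s_ij / a_k naming of the paper,
# but the reward and the pre-set Q value are made up for illustration.
actions = [f"a{i}" for i in range(10)]
q = defaultdict(float)
q[("s11", "a2")] = 2.0
update_q(q, "s01", "a1", "s11", reward=1.0, actions=actions)
print(q[("s01", "a1")])
```

With α=0.5, half of the old estimate is retained at every update, so a single surprising episode cannot overwrite a designed initial value entirely.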

II-3 Designing a reward function

In UCQL, the robot is given a reward function defined by Algorithm II-3. Algorithm II-3 divides motivation into extrinsic and intrinsic motivation, inspired by "Intrinsically Motivated Reinforcement Learning" [Chentanez]. We call the proposed method "user-centered" because we design the extrinsic motivation from the user's states related to user experience.

Algorithm II-3: Reward function by UCQL ($r$)
  Require: $s_{t_a}$, $s_{t_c}$, $a_{t_c}$
  Ensure: $r$
  $r \leftarrow 0$
  if $a_{t_c}$ is not wait then
    $r \leftarrow r + V_a(a_{t_c})$  (intrinsic motivation)
  end if
  if $s_{t_c}$ is more uncomfortable for users than $s_{t_a}$ then
    $r \leftarrow r - V_s(s_{t_c}, s_{t_a})$  (extrinsic motivation)
  end if
  if $s_{t_c}$ is better than $s_{t_a}$ for achieving the goal then
    $r \leftarrow r + V_g(s_{t_c}, s_{t_a})$
  end if
  return $r$
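The structure of Algorithm II-3 can be sketched in Python as follows. The paper does not give the concrete forms of $V_a$, $V_s$, and $V_g$, so this sketch assumes a constant action bonus and penalties or bonuses proportional to the change in hypothetical per-state "discomfort" and "progress" scores; all numeric values and state labels are made up for illustration.

```python
WAIT_ACTION = "a0"  # a0 is the wait action in the action set of the paper


def ucql_reward(s_ta, s_tc, a_tc, discomfort, progress,
                v_a=0.5, v_s=1.0, v_g=2.0):
    """Reward with one intrinsic term and two user-centered extrinsic terms.

    discomfort and progress are hypothetical dicts mapping a state to a score.
    """
    r = 0.0
    if a_tc != WAIT_ACTION:
        r += v_a  # intrinsic motivation: acting is preferred to waiting
    if discomfort[s_tc] > discomfort[s_ta]:
        # extrinsic motivation: penalize a move toward user discomfort
        r -= v_s * (discomfort[s_tc] - discomfort[s_ta])
    if progress[s_tc] > progress[s_ta]:
        # extrinsic motivation: reward a move toward the service goal
        r += v_g * (progress[s_tc] - progress[s_ta])
    return r


# Hypothetical scores for three of the passerby states.
discomfort = {"look_at": 0, "approaching": 0, "leaving": 1}
progress = {"look_at": 1, "approaching": 2, "leaving": 0}

print(ucql_reward("look_at", "approaching", "a1", discomfort, progress))
print(ucql_reward("look_at", "leaving", "a1", discomfort, progress))
```

Keeping the three terms separate makes it possible to inspect which motivation drove a given reward when explaining the robot's past actions.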

II-4 Miscellaneous

  • We can choose any policy π, such as greedy or ϵ-greedy.

  • The Q-function may be initialized with a uniform distribution. However, if the Q-function is designed to suit the task, learning is faster than with a uniform initialization.

  • The Q-function may be approximated with a function approximator such as a Deep Q-Network [Mnih]. However, learning is then much slower than with the designed function.

III Experiment

In this section, we test the hypothesis that "by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort."

III-A Concrete Goal

First, we convert the hypothesis into a working hypothesis by operationalization because we cannot evaluate the original hypothesis quantitatively.

In the Introduction, we defined the problem as "Find a robot policy $\pi: O \to A$ that maximizes $N_u$ and minimizes $N_d$." We make $N_u$ and $N_d$ concrete for this experiment. Ozaki's study [Ozaki] provides two important findings. First, a passerby does not suffer a negative effect from the robot's call if the passerby does not use the robot service. Second, a passerby suffers a negative effect from the robot's call if the passerby uses the robot service. Thus, this is a binary classification problem: a passerby who is called by the robot either uses the robot service or does not use it. We can therefore define a confusion matrix for evaluating the method. We infer that $N_u$ is positively correlated with TP and TN. We also infer that $N_d$ is positively correlated with FP. On the other hand, we infer that $N_d$ is negatively correlated with TN. Therefore, we can use $\mathrm{Accuracy} = (TP + TN) / (TP + FP + TN + FN)$ as an evaluation index because maximizing the accuracy is another representation of "maximize $N_u$ and minimize $N_d$."

From the above discussion, we define the working hypothesis WH as "The accuracy after learning by UCQL is better than the accuracy before learning by UCQL."

In this experiment, we test WH in order to show that the original hypothesis is sound.

III-B Method

In this section, we explain how we conducted the experiment in a field environment. The method for this experiment can be divided into five steps.

  1. Create the experimental equipment

  2. Construct an experimental environment

  3. Define an experimental procedure

  4. Evaluate the working hypothesis by statistical hypothesis testing

  5. Visualize the effect of UCQL

III-B1 Create the experimental equipment

First, we create the equipment that includes UCQL. The equipment can be explained in terms of its physical structure and its logical structure.

Figure 3 is a diagram of the physical structure of the equipment. As shown in Figure 3, the experimental equipment consists of a table, a sensor, a robot, a tablet PC, a router, and servers. The components are connected by Ethernet cable or wireless LAN. We use Sota (https://sota.vstone.co.jp/home/), a palm-sized social humanoid robot, as the robot. Sota has a speaker to output speech, an LED to represent lip motions, an SoC to control these elements, and so on. In this experiment, those elements of Sota are used to interact with a participant. An iPad Air 2 is used as the tablet PC, on whose display the movie is played. The Intel RealSense Depth Camera D435 (https://click.intel.com/intelr-realsensetm-depth-camera-d435.html) is used as the RGB-D sensor device to measure passersby's actions.

Fig. 3: The physical structure of the experimental equipment (solid line: wired, dashed line: wireless)

Figure 4 is a diagram of the logical structure of the equipment. The structure consists of Sensor, Motion Capture, State Estimator, Action Selector, Action Decoder, Effector, and Policy Updater. We use Nuitrack (https://nuitrack.com/) as the Motion Capture. We use ROS (http://wiki.ros.org/) as the infrastructure of the equipment to communicate variables among functions.

Fig. 4: The logical structure of the experimental equipment

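As shown in Figures 3 and 4, the equipment operates according to Algorithm 4.

Algorithm 4: Select an action by the experimental system including UCQL
  Require: $t$, $O_t$
  Ensure: $E_t$
  if the system is NOT initialized then
    $Q \leftarrow Q_0$
    $\Theta \leftarrow$ an empty list
    $P \leftarrow$ an empty list
  end if
  $I_t, D_t \leftarrow \mathrm{sense}(O_t)$
  $\boldsymbol{\theta}_t, \boldsymbol{p}_t \leftarrow \mathrm{extract}(I_t, D_t)$
  Push $\boldsymbol{\theta}_t$ into $\Theta$.
  Push $\boldsymbol{p}_t$ into $P$.
  $s_t \leftarrow \mathrm{estimate}(\Theta, P)$
  $a_t \leftarrow \mathrm{selectAction}(t, s_t, Q, \pi)$
  Push $(t, a_t, s_t)$ into $X$.
  $E_t \leftarrow \mathrm{decode}(a_t, \boldsymbol{\theta}_t, \boldsymbol{p}_t)$
  return $E_t$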

We use Table I as the action set $A$ and Table II as the state set $S$. The state set in Table II is a double Markov model created from the state set of Ozaki's decision-making predictor [Ozaki]. Ozaki's decision-making predictor classifies a passerby's state into seven states: Not Found ($s_0$), Passing By ($s_1$), Look At ($s_2$), Hesitating ($s_3$), Approaching ($s_4$), Established ($s_5$), and Leaving ($s_6$).

In addition, we use $\alpha = 0.5$ and $\gamma = 0.999$ as learning parameters. We use soft-max selection as the policy because we want the robot both to take actions that have a high value and to find actions that have a higher value. Soft-max selection is often used for Q-learning. Equation 3 gives the probability of selecting each action under the policy. We use Equation 2 to update the policy parameter. $T_n(s)$ denotes the temperature after it has been updated $n$ times in state $s$. $T_n(s)$ depends on the state because $s_{00}$ occurs many times. We use $k_T = 0.98$ and $T_{\min} = 0.01$ as learning parameters.

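$$T_0(s) = 1 \tag{1}$$

$$T_{n+1}(s) = \begin{cases} T_n(s) & (T_n(s) < T_{\min}) \\ k_T \times T_n(s) & (\text{otherwise}) \end{cases} \tag{2}$$

$$p(s,a) = \frac{\exp\left(Q(s,a)/T_n(s)\right)}{\sum_{a_i \in A} \exp\left(Q(s,a_i)/T_n(s)\right)} \tag{3}$$
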
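The temperature schedule and soft-max selection of Equations 1 to 3 can be sketched in Python as follows. The function names and the list-based representation of one row of the Q-function are assumptions for illustration; the constants $k_T = 0.98$ and $T_{\min} = 0.01$ are the values stated in the text. Subtracting the maximum Q value before exponentiation is a standard numerical-stability step that leaves the probabilities of Equation 3 unchanged.

```python
import math
import random

K_T = 0.98    # temperature decay factor from the text
T_MIN = 0.01  # temperature floor from the text


def next_temperature(t_n, k_t=K_T, t_min=T_MIN):
    """Equation 2: decay the temperature until it falls below t_min."""
    return t_n if t_n < t_min else k_t * t_n


def softmax_probabilities(q_row, temperature):
    """Equation 3: selection probability of each action in one state."""
    q_max = max(q_row)  # shift by the maximum to avoid overflow in exp
    weights = [math.exp((q - q_max) / temperature) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_row, temperature, rng=random):
    """Sample an action index according to the soft-max probabilities."""
    probs = softmax_probabilities(q_row, temperature)
    return rng.choices(range(len(q_row)), weights=probs, k=1)[0]


# Hypothetical Q values for one state with three actions.
q_row = [0.0, 1.0, 5.0]
temperature = 1.0  # Equation 1: T_0(s) = 1
print(softmax_probabilities(q_row, temperature))
print(select_action(q_row, temperature))
temperature = next_temperature(temperature)
```

Because the temperature is kept per state, a frequently visited state such as $s_{00}$ becomes nearly greedy early, while rarely visited states keep exploring.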
TABLE I: Action set in this experiment

| Symbol | Detail |
|---|---|
| a0 | Robot waits for 5 secs until somebody comes. |
| a1 | Robot calls a passerby with a greeting. |
| a2 | Robot looks at a passerby. |
| a3 | Robot represents joy by the robot's motion. |
| a4 | Robot blinks the robot's eyes. |
| a5 | Robot says "I'm sorry." in Japanese. |
| a6 | Robot says "Excuse me." in Japanese. |
| a7 | Robot says "It's rainy today." in Japanese. |
| a8 | Robot says how to start their own service. |
| a9 | Robot says goodbye. |

TABLE II: State set in this experiment

| Symbol | Detail |
|---|---|
| s00 | The passerby's state changes from "Not Found" to "Not Found". |
| s10 | The passerby's state changes from "Not Found" to "Passing By". |
| s56 | The passerby's state changes from "Leaving" to "Established". |
| s66 | The passerby's state changes from "Leaving" to "Leaving". |

Algorithm: Create initial Q-function for the experiment ($Q_B$)
  Require: (none)
  Ensure: $Q(s,a)$
  $q_C \leftarrow 1$
  $q_H \leftarrow 5$
  $Q \leftarrow$ a $|S| \times |A|$ zero 2D array
  for $i = 0$ to $6$ do
    for $j = 0$ to $6$ do
      $Q(s_{ij}, a_0) \leftarrow 0$
    end for
  end for
  for $j = 1$ to $5$ do
    $Q(s_{0j}, a_1) \leftarrow q_C$
  end for
  for $i = 1$ to $4$ do
    $Q(s_{i5}, a_8) \leftarrow q_H$
  end for
  $Q(s_{56}, a_9) \leftarrow q_H$
  $Q(s_{50}, a_9) \leftarrow q_H$
  return $Q(s,a)$
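The construction of the initial Q-function can be sketched in Python as follows. The sketch assumes that the 49 composite states $s_{ij}$ are stored row by row with the hypothetical helper `state_index(i, j) = 7i + j`; this layout is only an indexing convention chosen for illustration and is not specified in the paper. The values $q_C = 1$ and $q_H = 5$ and the entries that receive them follow the algorithm as written.

```python
N_BASE_STATES = 7  # s0 (Not Found) ... s6 (Leaving)
N_ACTIONS = 10     # a0 ... a9
Q_CALL = 1.0       # q_C in the algorithm
Q_HIGH = 5.0       # q_H in the algorithm


def state_index(i, j):
    """Hypothetical row index of composite state s_ij in the Q table."""
    return i * N_BASE_STATES + j


def create_initial_q():
    """Build the designed initial Q table as a |S| x |A| nested list."""
    q = [[0.0] * N_ACTIONS for _ in range(N_BASE_STATES ** 2)]
    for j in range(1, 6):  # s01 ... s05: greeting (a1)
        q[state_index(0, j)][1] = Q_CALL
    for i in range(1, 5):  # s15 ... s45: explain how to start (a8)
        q[state_index(i, 5)][8] = Q_HIGH
    q[state_index(5, 6)][9] = Q_HIGH  # s56: say goodbye (a9)
    q[state_index(5, 0)][9] = Q_HIGH  # s50: say goodbye (a9)
    return q


q_b = create_initial_q()
print(len(q_b), len(q_b[0]))
print(q_b[state_index(0, 1)][1], q_b[state_index(4, 5)][8])
```

Only eleven entries are non-zero, so the designed prior nudges the robot toward a few plausible actions and leaves everything else to be learned.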

III-B2 Construct an experimental environment

First, we define how to construct the environment for the experiment. Figure 5 shows an overhead view of the environment. The environment consists of an exhibition space, a wall, a seating space, and a passage to a restroom (W.C.) in a building owned by an actual company. Hundreds of employees work in the building, and dozens of visitors come to the building. Visitors often sit in the seating space for tens of minutes while waiting for employees. Some visitors and employees look at the exhibition space to learn about newer technologies of the company. Some visitors go to the W.C. while they are waiting for employees.

Fig. 5: The overhead view of the experimental environment.

III-B3 Define an experimental procedure

We suppose two main scenarios. The first scenario is as follows:

  1. A visitor is sitting on a seat in the seating space.

  2. Then, the visitor gets up from the seat because the visitor wants to go to the W.C.

  3. Thus, the visitor moves from the seating space to the W.C. across the exhibition space.

The second scenario is as follows:

  1. A visitor is sitting on a seat in the seating space.

  2. Then, the visitor gets up from the seat because the visitor is bored with waiting.

  3. Thus, the visitor moves from the seating space to the exhibition space in order to watch the robots in the equipment.

We mainly want to attract the passersby in the second scenario. We do not want to attract the passersby in the first scenario because the visitor wants to go to the W.C. Therefore, because we want the robot to learn these rules, we let the robot learn them in the environment by UCQL for several days. We then obtain the learned Q-function $Q_A(s,a)$.

After the learning, we let the robot attract passersby under two conditions, Before Learning and After Learning, because we want to test the hypothesis. The robot does not learn during the test.

We collect data for the evaluation with rosbag (http://wiki.ros.org/rosbag). All data are recorded by rosbag, which can record all of the values in ROS during the procedure.

III-B4 Evaluate by statistical hypothesis testing

We evaluate the working hypothesis WH by statistical hypothesis testing. We calculate the accuracy before the learning and the accuracy after the learning in order to test WH. Finally, we use the one-sided test of proportions because we want to evaluate the statistical difference between the accuracy before the learning and the accuracy after the learning.

III-B5 Visualize the effect of UCQL

We visualize the Q-function before the learning and the Q-function after the learning as heat maps in order to analyze the effect of UCQL. UCQL changes the robot's actions by updating the Q-function. Therefore, we can see how the robot learned its actions by visualizing the Q-function. Figure 6 is an example Q-function that illustrates the visualization used in this paper.

Fig. 6: An example Q-function represented as a heat map. The columns correspond to the state symbols of the agent and the rows correspond to the action symbols of the agent. For example, Q(s01,a1) is 0, which means that the robot obtains no value by calling a passerby who is passing by it.

IV Result

We constructed the experimental environment described in the Method section in the entrance of our building. Figure 1 shows a picture of the equipment in the environment. The experimenter was the corresponding author. The participants were employees and visitors of our company. The learning period was three days. As a result, we collected a large amount of data. We cleaned the data by the following steps because field data contain a lot of noise, such as detection errors by the Motion Capture.

  • We drop episodes whose duration is less than 1 [sec] because it takes 3 [sec] to walk across the detection area of the Motion Capture.

  • We drop episodes that consist only of transitions from s00 to s00 because nobody was in the detection area of the Motion Capture.
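The two cleaning rules above can be sketched as a simple filter. The episode record format (a dict with a `duration` in seconds and a list of `states`) is a hypothetical structure chosen for illustration; the paper records its data with rosbag and does not describe the in-memory format.

```python
def clean_episodes(episodes, min_duration=1.0, empty_state="s00"):
    """Keep episodes that are long enough and contain a detected passerby."""
    kept = []
    for episode in episodes:
        if episode["duration"] < min_duration:
            continue  # too short to be a real walk through the detection area
        if all(state == empty_state for state in episode["states"]):
            continue  # only s00: nobody was in the detection area
        kept.append(episode)
    return kept


# Hypothetical episodes for illustration.
episodes = [
    {"duration": 0.4, "states": ["s00", "s01"]},          # dropped: too short
    {"duration": 6.0, "states": ["s00", "s00", "s00"]},   # dropped: only s00
    {"duration": 4.2, "states": ["s00", "s01", "s11"]},   # kept
]
print(len(clean_episodes(episodes)))
```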

We obtained 209 episodes in total after the data cleansing. Table III shows the number of episodes and the time for each condition. We calculated the accuracy from the confusion matrix for each condition. The confusion matrices for the before condition and the after condition were (TP, FP, FN, TN) = (11, 59, 0, 17) and (TP, FP, FN, TN) = (7, 23, 0, 92), respectively. Therefore, the accuracies of the baseline and proposed methods were 0.322 and 0.811, respectively. In testing WH by the one-sided test of proportions, we found a significant difference in accuracy between the before and after conditions ($p = 4.46 \times 10^{-13} < 0.01$).
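The accuracies above follow directly from the reported confusion matrices, and the comparison can be sketched in Python as follows. The paper does not state which variant of the test of proportions was used, so this sketch assumes a pooled two-proportion z-test without continuity correction; the function names are illustrative.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + fn + tn)


def one_sided_proportion_test(correct_a, n_a, correct_b, n_b):
    """Pooled z-test of H1: proportion b > proportion a (assumed variant)."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


before = (11, 59, 0, 17)  # (TP, FP, FN, TN) before learning
after = (7, 23, 0, 92)    # (TP, FP, FN, TN) after learning

print(round(accuracy(*before), 3))  # 0.322
print(round(accuracy(*after), 3))   # 0.811

z, p_value = one_sided_proportion_test(11 + 17, sum(before), 7 + 92, sum(after))
print(z, p_value)
```

Under this assumption the p-value is far below the 0.01 threshold, in line with the significance reported in the text.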

TABLE III: Items of the result after the data cleansing.

| Item | Before | After | Total |
|---|---|---|---|
| Episodes | 87 | 122 | 209 |
| Time [h] | 13.7 | 26.7 | 40.4 |
| Days [d] | 3 | 6 | 9 |

Fig. 7: The accuracy of the experiment in each condition (**: p<0.01)

Fig. 8: The change in the Q-function by UCQL. (a) Part of the Q-function before the learning (QB). (b) Part of the Q-function after the learning (QA).

V Discussion

We discuss the original hypothesis, "The robot can attract passersby without causing users discomfort by user-centered reinforcement learning," from the following viewpoints.

  1. Can we accept the original hypothesis?

  2. Why does the robot attract passersby without discomfort with the proposed method?

  3. What are the limitations of the method and the experiment?

V-A Can we accept the original hypothesis?

We explain why we can accept the original hypothesis by using the result of the experiment and another study.

First, we show that we can accept WH, "The accuracy after learning by UCQL is better than the accuracy before learning by UCQL." As reported in Section IV, we found a significant difference in accuracy between the before and after conditions. Thus, we accept WH and can regard it as true.

Through the above discussion, the result of the experiment supports the original hypothesis because the working hypothesis is true. Therefore, we can accept the original hypothesis.

V-B Why does the robot attract passersby without discomfort with the proposed method?

We can explain why the robot attracts passersby without discomfort in terms of the learning process shown in Figure 8.

Why does the robot reduce FP by UCQL? We compare the row of s01 in Figure 8(a) with the row of s01 in Figure 8(b). The robot before learning selected action a4 because $\arg\max_a Q_B(s_{01}, a) = a_4$. The robot after learning selected action a0 because $\arg\max_a Q_A(s_{01}, a) = a_0$. This means that the robot does not call a passerby who does not use the robot service. Therefore, the robot reduces FP by UCQL.

V-C What are the limitations of the method and the experiment?

In this experiment, we assumed that a passerby does not walk with others. In other words, we did not consider groups of passersby. Thus, we need to extend the method in order to handle groups.

The data in this study were sampled from a biased population. We need to conduct further experiments in other environments if we want stronger support for the working hypothesis.

In this experiment, we created the reward function on the basis of other studies. However, it is hard to create a reward function for each case. Therefore, we need an easy method for designing the reward function and the Q-function.

VI Conclusion

We investigated the hypothesis that "by using user-centered Q-learning, a robot can increase the chance of being able to provide a service to a passerby without causing the passerby discomfort." We proposed a method based on reinforcement learning in robotics and focused on the reward function and the Q-function because we wanted the robot to perform actions in view of user experience. To investigate our hypothesis, we made a working hypothesis and tested it experimentally. From the results, we accepted the working hypothesis and the original hypothesis.

Future work includes generalizing the method for creating the reward function to make it applicable to different tasks and developing a distributed reinforcement learning method that enhances time-efficiency.

References