# Learning Socially Appropriate Robot Approaching Behavior Toward Groups using Deep Reinforcement Learning

###### Abstract

Deep reinforcement learning has recently been widely applied in robotics to study tasks such as locomotion and grasping, but its application to social human-robot interaction (HRI) remains a challenge. In this paper, we present a deep learning scheme that acquires a prior model of robot approaching behavior in simulation and applies it to real-world interaction with a physical robot approaching groups of humans. The scheme, which we refer to as Staged Social Behavior Learning (SSBL), considers different stages of learning in social scenarios. We learn robot approaching behaviors towards small groups in simulation and evaluate the performance of the model using objective and subjective measures in a perceptual study and a HRI user study with human participants. Results show that our model generates more socially appropriate behavior compared to a state-of-the-art model.

## I Introduction

Deep reinforcement learning (DRL) algorithms provide a framework for automatic robot perception and control [sunderhauf2018limits] [chernova2014robot]. In recent years, methods based on DRL have achieved great performance in different control tasks such as grasping and locomotion [levine2016end]. However, the question of how to make robots learn appropriate social behaviors under modern frameworks remains underexplored, partly due to the lack of cross-disciplinary synergies in human-robot interaction (HRI) studies. As a consequence, the interaction scenarios studied in previous research have been limited to simplified cases and the algorithms studied to relatively simple ones [ferreira2015reinforcement].

A promising but underexplored approach to robot learning in social HRI scenarios is to learn a prior model in simulation first and then refine the learned policy using model-based reinforcement learning (RL) in the real world. Learning a prior model in a simulated environment has several potential benefits. First, it can save a significant amount of real-world interaction. Several works [hanna2017grounded] [bousmalis2017using] have shown that learning a model for physical interactions can help robots learn faster in the real world. Second, in social interactions, humans have little tolerance for random behaviors [abubshait2017you] and lose interest quickly if the model deviates too much from social norms. Additionally, the mathematical modeling of social interactions in a simulated setting allows researchers to control factors more rigorously, which can help with the issue of replicability. However, unlike simulating physical interactions [jonschkowski2014state], simulating social HRI poses a different set of challenges. One of the main challenges is that it is hard for the simulator to accurately model the relevant human behavior. Simulators of physical interactions are based on physical laws which are well understood, while human behavior is less predictable. Nevertheless, two ways have been considered to simulate social feedback based on real-world signals. The first is to use computational models [jan2007dynamic] that have been studied in experiments [pedica2010avatars], and the second is to use machine learning methods [yuan2018human].

In this paper, we propose a deep learning scheme, called Staged Social Behavior Learning (SSBL), for learning appropriate social robot behavior with continuous actions in a simulated environment, and we apply it to real-world interaction with a physical Pepper robot (https://www.softbankrobotics.com/emea/en/pepper) interacting with humans. Specifically, we consider a task in which the robot moves toward a small group, positioned in an F-formation [kendon1990conducting], based on its simulated social feedback, using the Social Force Field Model (SFFM) [pedica2008social]. The task is learned in an end-to-end fashion, i.e., from vision to social behaviors in a virtual environment. SSBL involves a pipeline for simulated social robot learning that deconstructs a social task into three steps. In the first step, the robot learns a compressed representation of the world from vision or other modalities. This step is important because it significantly reduces the complexity of the DRL problem. After the compressed information is learned, the algorithm learns a dynamical model from a prior model which is built upon social forces [jan2007dynamic] [pedica2008social] in the environment. The last step is to make sure that the learned behavior follows social standards by using simulated social norms as a realistic reward.

In this study, we focus on the first two steps of the SSBL framework: we learn robot approaching behaviors towards small groups in simulation and evaluate the performance of the model using objective and subjective measures in a perceptual study and an HRI user study with human participants. Figure 1 shows our physical robot experiment setup. The code of this project is publicly available on GitHub (https://github.com/gaoyuankidult/PepperSocial/tree/master).

## II Literature Review

### II-A Deep Reinforcement Learning in HRI

Reinforcement learning (RL) has been used since the early days of HRI. One of the first works that considered using social feedback as accumulative rewards was conducted by Bozinovski [bozinovski1996emotion] [bozinovski1982learning]. After that, many papers in HRI started to investigate the effect of RL algorithms such as Exp3 [yuan2018when] or Q-learning [mataric1997learning] in social robotics settings. However, since such algorithms lack the ability to capture important features from high-dimensional signals [mnih2015human], their applicability to solve HRI problems remains limited. After the era of deep learning started in 2006 [lecun2015deep], many different algorithms were proposed to understand different modalities in HRI, for example, ResNet [he2016deep] for image processing and Long Short-Term Memory [hochreiter1997long] (LSTM)-based solutions for text processing. As a consequence, some HRI researchers started to investigate deep learning’s role in the area of HRI. A pioneering work was conducted by Qureshi in 2017 [qureshi2016robot]. In this work, a Deep Q-Network (DQN) [mnih2015human] was used to learn a mapping from visual input to one of several predefined actions for greeting people. Another work was conducted by Madson [clark2018deep], where a DQN was used for learning generalized, high-level representations from both visual and auditory signals.

### II-B Learning Representations

RL based solely on visual observations has been used to solve complex tasks such as playing ATARI games [mnih2013playing], driving simulated cars [tan2018autonomous] and navigating mazes [mirowski2016learning]. However, learning policies directly from high-dimensional data such as images requires a large number of samples, which makes it intractable in social robot learning [bohmer2015autonomous]. One solution is to use low-dimensional hand-crafted features as the state, but this would reduce learning autonomy.

Prior works have utilized deep autoencoders (AEs) to learn a state representation, including Lange et al. [lange2012autonomous]. Several variants of AEs have been applied as well, including attempts by Böhmer et al. [bohmer2015autonomous] to learn the dynamics of the environment by constructing an AE predicting the next image, and Finn et al. [finn2015deep], who adopted a spatial AE (SAE) to learn an intermediate representation consisting of image coordinates of relevant features. The latter suggested that this intermediate representation is particularly well suited for high-dimensional continuous control.

### II-C Modelling Groups and Robot Approaching Behavior

Numerous works have studied group dynamic behaviors. Particle-based methods [heigeas2010physically] [treuille2006continuum] simulate global collective behaviors of large-scale groups or crowds. For modeling small-scale groups, agent-based methods [musse2001hierarchical] [reynolds1999steering] are adopted to simulate the behavior of each individual based on rules of behavior. Specifically, for small multi-party conversation groups, Kendon [kendon1990conducting] proposed the F-formation system to define the positions and orientations of individuals within a group, which characterizes dynamic group behaviors. Several studies have been carried out that concern robot approaching behaviors towards small groups, i.e., behaviors in which an agent moves towards a group in an attempt to join an ongoing task or conversation. Ramírez et al. [ramirez2016robots] adopted inverse reinforcement learning, involving several participants demonstrating approaching behaviors for a robot to learn. Pedica et al. [pedica2012lifelike] integrated behavior trees in their reactive method to simulate lifelike social behaviors, including robot approaching behavior towards groups. Both approaching and leaving behaviors are considered in [yang2017expressive], where a finite state machine is utilized in the transitions between different social behaviors. Jan et al. [jan2007dynamic] presented an algorithm for simulating the movement of agents, such as an agent joining a conversation. The agents dynamically move to new locations, but without proper orientations. More recently, Samarakoon et al. [samarakoon2018replicating] designed a method to replicate the natural approaching behaviors of humans. Meanwhile, a fuzzy inference system was proposed in [bhagya2018proxemics] to decide the approaching proxemics based on the behaviors of the user.

## III Methodology

In the following sections, we introduce the fundamental concepts needed to train a prior model for robot approaching behaviors in accordance with SSBL. Section III-A introduces some basic concepts and details on how the environment was set up. Sections III-B to III-D describe the three stages of SSBL training. Specifically, Section III-B pertains to the state representation and its training procedure. It also details the various architectures used to evaluate this step. Section III-C shows how one can formulate the training of a dynamical model within an RL framework. Section III-D shows how social norms can be acquired by utilizing concepts from the SFFM [pedica2008social].

In the original SFFM work [pedica2008social], SFFM was also used to generate social agent behaviors, which will be used as a baseline for evaluating our learned policy.

### III-A Environment Setup

In order to simulate robot approaching behaviors, we first build a simulator using the Unity 3D game engine (https://unity3d.com/). The environment consists of a square floor surrounded by four walls. A conversation group which contains two Simulated Human Agents (SHAs) is spawned at a random position within this domain. The robot agent is spawned outside the group and performs approaching behaviors. The virtual agents (agents in the group and robot agent) are pre-defined assets which resemble the SoftBank Pepper robot. Figure 2 shows one example of the environment’s top-down and first-person views. The blue and green agents are SHAs and the gray one at the top right of the top-down view is the robot agent. The first-person view (Figure 2, right) is from the robot agent’s perspective.

In this paper, we are mainly concerned with the task of learning a prior model in the simulator for a robot’s approaching behaviors towards small groups of individuals. The task can be formulated as an RL problem. Let us consider ${\mathbf{s}}_{t}$ and ${\mathbf{a}}_{t}$ as the state and action of the robot agent at time $t$, respectively. Learning the dynamic behavior for approaching a group can be viewed as maximizing the expected cumulative reward ${E}_{\tau \sim \pi}[\mathcal{R}(\tau )]$ over trajectories $\tau =\{{\mathbf{s}}_{1},{\mathbf{a}}_{1},\mathrm{\dots},{\mathbf{s}}_{T},{\mathbf{a}}_{T}\}$, where $\mathcal{R}(\tau )={\sum}_{t=1}^{T}\mathcal{R}({\mathbf{s}}_{t},{\mathbf{a}}_{t})$ is the cumulative reward over $\tau $. The expectation is under distribution $p(\tau )=p({\mathbf{s}}_{1}){\prod}_{t=1}^{T}p({\mathbf{s}}_{t+1}|{\mathbf{s}}_{t},{\mathbf{a}}_{t})p({\mathbf{a}}_{t}|{\mathbf{s}}_{t})$, where $\pi ({\mathbf{s}}_{t})=p({\mathbf{a}}_{t}|{\mathbf{s}}_{t})$ is the policy we would like to train and $p({\mathbf{s}}_{t+1}|{\mathbf{s}}_{t},{\mathbf{a}}_{t})$ is the forward model determined by the environment.
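The objective above can be illustrated with a minimal Monte Carlo sketch: sample trajectories by alternating the policy and the forward model, sum the per-step rewards, and average over episodes. The toy one-dimensional environment, the function names, and the horizon are assumptions made purely for illustration and are not part of the simulator described in this paper.

```python
import random

def cumulative_reward(trajectory, reward_fn):
    """R(tau) = sum_t R(s_t, a_t) for a list of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def rollout(policy, step, initial_state, horizon, rng):
    """Sample tau = {(s_1, a_1), ..., (s_T, a_T)} from the policy and forward model."""
    state, trajectory = initial_state, []
    for _ in range(horizon):
        action = policy(state, rng)        # a_t ~ p(a_t | s_t)
        trajectory.append((state, action))
        state = step(state, action, rng)   # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return trajectory

def expected_return(policy, step, reward_fn, initial_state, horizon, episodes, seed=0):
    """Monte Carlo estimate of E_{tau ~ pi}[R(tau)]."""
    rng = random.Random(seed)
    returns = [
        cumulative_reward(rollout(policy, step, initial_state, horizon, rng), reward_fn)
        for _ in range(episodes)
    ]
    return sum(returns) / len(returns)

# Hypothetical toy problem: the state is the distance to a goal on a line.
def toy_policy(state, rng):
    return -1.0 if state > 0 else 0.0      # step toward the goal, then stay

def toy_step(state, action, rng):
    return state + action                   # deterministic forward model

def toy_reward(state, action):
    return -abs(state)                      # penalize distance to the goal

print(expected_return(toy_policy, toy_step, toy_reward, 3.0, horizon=5, episodes=4))
```

Because the toy policy and forward model are deterministic, every episode yields the same return; with a stochastic policy the average would be an estimate that improves with the number of episodes.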

### III-B State Representations

In our experiments, we try three modes of representing the environment state to the robot agent: *Vector*, *CameraOnly* and *CameraSpeed*. The first mode is a vector-based representation, consisting of the positions and velocities of all the agents, together with the positions of the walls. This representation is ideal for learning, so it serves as an upper bound on the performance of this task.

The second and third modes are designed to resemble two common robotic settings: one where the robot is equipped with a camera, and one where the robot has both a camera and the ability to estimate its speed. In these modes, the full states are given as ${\mathbf{s}}_{t}={\mathbf{I}}_{t}$ and ${\mathbf{s}}_{t}=({\mathbf{I}}_{t},{\mathbf{v}}_{t})$ respectively. Here ${\mathbf{I}}_{t}$ is the visual information from the robot’s first-person view rendered by the Unity engine, and ${\mathbf{v}}_{t}$ the velocity of the robot.

The method employed in this work is to learn a mapping from input images to simplified low-dimensional state representations, thus circumventing some of the problems associated with RL from high-dimensional input [mnih2013playing]. To do this, we utilize an autoencoder (AE) [goodfellow2016deep], a neural net $\varphi $ that maps its input to itself, such that $\varphi (x)\approx x$. An AE can be decomposed into an encoder and a decoder, $\varphi \equiv {\varphi}_{dec}\circ {\varphi}_{enc}$. By choosing the intermediate representation ${\varphi}_{enc}(\cdot )$ to be comparatively low-dimensional, ${\varphi}_{enc}({\mathbf{I}}_{t})$ or $({\varphi}_{enc}({\mathbf{I}}_{t}),{\mathbf{v}}_{t})$ can serve as a simplified but sufficient representation of the state, facilitating accelerated learning. Figure 3 shows a schematic illustration of the architectures.

We implemented and evaluated two different AEs. The first one is a regular convolutional AE. It uses the following encoder and decoder:

$$\varphi_{enc}^{conv} \equiv D_1 \circ C_3 \circ C_2 \circ C_1 \tag{1}$$

$$\varphi_{dec}^{conv} \equiv C_6 \circ C_5 \circ C_4 \circ D_2 \tag{2}$$

where the ${C}_{i}$ are convolutional layers and the ${D}_{i}$ are fully connected layers.

The second AE is based on the deep SAE described in [finn2015deep], but with some significant variations. In the following sections, we refer to it as the Spatial Auto-encoder Variant (SAEV). The SAEV uses the encoder

$$\varphi_{enc}^{saev} \equiv S \circ C_3 \circ C_2 \circ C_1 \tag{3}$$

where ${C}_{i}$ are convolutional layers. ${C}_{1}$ and ${C}_{2}$ use exponential linear unit ($ELU$) activations [clevert2015fast], while ${C}_{3}$ uses a spatial softmax activation:

$$\mathrm{softmax}(z)_{i,j,c} = \frac{e^{z_{i,j,c}}}{\sum_{w=0}^{W}\sum_{h=0}^{H} e^{z_{w,h,c}}} \tag{4}$$

The mapping $S$ takes a number of feature maps, which it treats as bivariate probability distributions. For each, a feature location is estimated by the expectation values:

$$x_c = \mathbb{E}_{(i,j)\sim P_c}\left[i\right], \qquad y_c = \mathbb{E}_{(i,j)\sim P_c}\left[j\right] \tag{5}$$

where ${P}_{c}(i,j)$ is the value at coordinate $(i,j)$ of the $c^{\text{th}}$ feature map of the input. The *presence* of a feature is defined as the weighted sum

$$\rho_c = \sum_{i=0}^{W}\sum_{j=0}^{H} P_c(i,j)\cdot \mathcal{N}\left(i,j \mid \boldsymbol{\mu}=(x_c,y_c),\ \mathbf{\Sigma}=k\cdot \mathbf{I}\right) \tag{6}$$

Intuitively, a feature map which is highly localized around the estimated position has a presence near $1$, whereas one that is very spread out will have presence close to $0$. The output from $S$ is the concatenation of the $({x}_{c},{y}_{c},{\rho}_{c})$ of each feature map. In other words, the intermediate representation contains actual image-coordinates of the features.
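As a rough illustration of Equations 4 to 6, the following pure-Python sketch computes the spatial softmax of a single feature map, its expected coordinates, and its presence. The grid size, the value of $k$, and the use of a normalized Gaussian density are assumptions for this example; the paper does not state the value of $k$ that was used.

```python
import math

def spatial_softmax(z):
    """Normalize one feature map z[i][j] into a distribution P_c(i, j) (Eq. 4)."""
    z_max = max(max(row) for row in z)  # subtract the max for numerical stability
    exp_z = [[math.exp(v - z_max) for v in row] for row in z]
    total = sum(sum(row) for row in exp_z)
    return [[v / total for v in row] for row in exp_z]

def feature_point(p):
    """Expected coordinates (x_c, y_c) under P_c (Eq. 5)."""
    cells = [(i, j) for i in range(len(p)) for j in range(len(p[0]))]
    x = sum(i * p[i][j] for i, j in cells)
    y = sum(j * p[i][j] for i, j in cells)
    return x, y

def presence(p, x, y, k=1.0):
    """Presence rho_c: P_c weighted by a Gaussian density with covariance k*I (Eq. 6)."""
    norm = 1.0 / (2.0 * math.pi * k)
    return sum(
        p[i][j] * norm * math.exp(-((i - x) ** 2 + (j - y) ** 2) / (2.0 * k))
        for i in range(len(p))
        for j in range(len(p[0]))
    )

# A sharply peaked 5x5 activation map versus a completely flat one.
peaked = [[0.0] * 5 for _ in range(5)]
peaked[1][3] = 50.0
flat = [[0.0] * 5 for _ in range(5)]

p_peaked, p_flat = spatial_softmax(peaked), spatial_softmax(flat)
xp, yp = feature_point(p_peaked)
xf, yf = feature_point(p_flat)
print("peaked:", (round(xp, 3), round(yp, 3)), presence(p_peaked, xp, yp))
print("flat:  ", (round(xf, 3), round(yf, 3)), presence(p_flat, xf, yf))
```

The peaked map yields a feature point at its activation peak and a larger presence than the flat map, whose feature point falls at the grid center. The absolute scale of the presence depends on $k$ and on how the density is normalized.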

The main difference between our SAEV architecture and the SAE described in [finn2015deep] is the decoder. The decoder we use is

$$\varphi_{dec}^{saev} \equiv B \circ C_6 \circ C_5 \circ C_4 \circ \Delta \tag{7}$$

where ${C}_{i}$ are convolutional layers; ${C}_{4}$ and ${C}_{5}$ use ELU activations, while ${C}_{6}$ uses a sigmoid activation. $\mathrm{\Delta}:{\mathbb{R}}^{N\times 3}\to {\mathbb{R}}^{W\times H\times N}$ is a transformation that takes the $N$ $({x}_{c},{y}_{c},{\rho}_{c})$-tuples and maps each to a feature map:

$$\Delta(x_1,\dots,x_N,\,y_1,\dots,y_N,\,\rho_1,\dots,\rho_N)_{i,j,c} = ELU\left(\rho_c - \left\|(i,j)-(x_c,y_c)\right\|_2\right) \tag{8}$$

This creates $N$ feature maps, with peaks at $(i,j)=({x}_{c},{y}_{c})$ that decrease radially outwards according to the ELU [clevert2015fast]. To the output of $\mathrm{\Delta}$, three convolutional layers are applied, followed by an addition operation with a trainable constant to complete the decoder. The constant addition operation frees up the prior stages of the architecture to focus on learning positions of things that are not always in the same place.
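The transformation $\mathrm{\Delta}$ in Equation 8 can be sketched as follows. This is a minimal illustration that assumes the standard ELU with $\alpha = 1$ and stores the result as `maps[c][i][j]` rather than in the $W\times H\times N$ layout used in the text; the function names are hypothetical.

```python
import math

def elu(v, alpha=1.0):
    """Exponential linear unit; alpha = 1 is an assumed default."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)

def delta(points, width, height):
    """Map each (x_c, y_c, rho_c) tuple to a width x height feature map (Eq. 8).

    Returns maps[c][i][j] = ELU(rho_c - ||(i, j) - (x_c, y_c)||_2).
    """
    return [
        [[elu(rho - math.hypot(i - x, j - y)) for j in range(height)]
         for i in range(width)]
        for (x, y, rho) in points
    ]

# One feature located at (1, 2) with presence 0.8 on a 4 x 5 grid.
maps = delta([(1.0, 2.0, 0.8)], width=4, height=5)
peak = max(max(row) for row in maps[0])
print(peak, maps[0][1][2], maps[0][1][3])
```

Each map peaks at its feature location with value $\rho_c$ and decays with distance; once the argument becomes negative, the ELU saturates towards $-\alpha$ instead of decreasing without bound.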

All models are trained using the Adam optimizer [kingma2014adam] on a loss function consisting of three components: the reconstruction error $L_{err} = \|\varphi(\mathbf{s}_t) - \mathbf{s}_t\|_2$, a presence-based loss $L_{pre} = 1 - \rho(\mathbf{s}_t)$ that encourages localized features, and the smoothness loss $L_{smooth} = (\varphi_{enc}(\mathbf{s}_{t+1}) - \varphi_{enc}(\mathbf{s}_t)) - (\varphi_{enc}(\mathbf{s}_t) - \varphi_{enc}(\mathbf{s}_{t-1}))$ defined in [finn2015deep]. For the convolutional AE, the presence loss is ill-defined, so that term was omitted. The intermediate representation $\varphi_{enc}(\mathbf{s}_t)$ can now be used as input to the RL framework, or to visualize the corresponding image coordinates, as shown in Figure 4.
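The three loss components can be sketched on plain Python lists as below. How the components are weighted and how the smoothness term is reduced to a scalar are not specified in the text, so the unit weights and the Euclidean norm used in `total_loss` are assumptions for illustration only.

```python
import math

def reconstruction_error(reconstruction, target):
    """L_err = ||phi(s_t) - s_t||_2 on flattened inputs."""
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(reconstruction, target)))

def presence_loss(rho):
    """L_pre = 1 - rho(s_t); small when features are well localized."""
    return 1.0 - rho

def smoothness_term(z_prev, z_cur, z_next):
    """Second difference of consecutive encodings: (z_{t+1} - z_t) - (z_t - z_{t-1})."""
    return [(n - c) - (c - p) for p, c, n in zip(z_prev, z_cur, z_next)]

def total_loss(reconstruction, target, rho, z_prev, z_cur, z_next,
               w_err=1.0, w_pre=1.0, w_smooth=1.0):
    """Weighted sum of the three components (weights and norm are assumed)."""
    smooth = math.sqrt(sum(v * v for v in smoothness_term(z_prev, z_cur, z_next)))
    return (w_err * reconstruction_error(reconstruction, target)
            + w_pre * presence_loss(rho)
            + w_smooth * smooth)

# Encodings that move at constant velocity have a zero smoothness term.
print(smoothness_term([0.0, 0.0], [1.0, 1.0], [2.0, 2.0]))
print(total_loss([1.0, 2.0], [1.0, 0.0], 0.75, [0.0, 0.0], [1.0, 1.0], [3.0, 1.0]))
```

For the convolutional AE, the same sketch would apply with `w_pre=0.0`, matching the omission of the presence term described above.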

### III-C Modeling Group Behavior

In a realistic multi-party conversation group, the individuals within it stand in appropriate positions with respect to others. This positional and orientational relationship has been defined as an F-formation, as proposed by Kendon [kendon1990conducting]. It characterizes a group of two or more individuals, typically in conversation, who share information and interact with each other. Most importantly, it defines the o-space, a common focused space within the group toward which all individuals look inward and which is exclusive to those outside the group. When conditions change, such as when a new individual joins the group, the group members should change position or orientation in order to form a new group including the newcomer. Jan et al. [jan2007dynamic] proposed a group model which simulates these behaviors with a social force field. In this paper, we use an extended SFFM which maintains the F-formation through repositioning and reorienting by means of a conversation force field. This force field is produced and updated by three forces: a repulsion force, an equality force, and a cohesion force. The details of social force fields are described in [pedica2008social]. In order to better model conversation groups, Hall’s proxemics theory [hall1968proxemics] is adopted when generating social force fields, i.e. the repulsion, equality and cohesion forces occur in personal, social and public spaces, respectively.

The repulsion force keeps agents from staying inside one another’s personal space by pushing them apart. Let ${N}_{p}$ be the number of other agents inside the personal space of the agent being evaluated, and let ${\mathbf{p}}_{i}$ be the position of the $i$-th such agent. The repulsion force is given in Equation 9.

$$\mathbf{F}_r = -(d_p - d_{min})^2 \frac{\mathbf{p}_r}{\|\mathbf{p}_r\|} \tag{9}$$

where ${\mathbf{p}}_{r}={\sum}_{i}^{{N}_{p}}({\mathbf{p}}_{i}-\mathbf{p})$, $\mathbf{p}$ is the position of the agent currently being evaluated, ${d}_{p}$ is the radius of its personal space, and ${d}_{min}$ is the distance to its closest agent inside the personal space.
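A minimal two-dimensional sketch of Equation 9 is given below. Returning a zero force when no agent is inside the personal space, or when the summed offsets cancel exactly, is an assumption added to keep the example well defined; the function name and the example positions are hypothetical.

```python
import math

def repulsion_force(p, others, d_p):
    """Repulsion force on the agent at p from agents inside its personal space (Eq. 9).

    p, others[i]: (x, y) positions; d_p: personal-space radius.
    """
    inside = [q for q in others if math.dist(p, q) < d_p]
    if not inside:
        return (0.0, 0.0)  # assumed: no force without intruders
    d_min = min(math.dist(p, q) for q in inside)
    prx = sum(q[0] - p[0] for q in inside)
    pry = sum(q[1] - p[1] for q in inside)
    norm = math.hypot(prx, pry)
    if norm == 0.0:
        return (0.0, 0.0)  # assumed: offsets cancel, direction undefined
    scale = -((d_p - d_min) ** 2) / norm
    return (scale * prx, scale * pry)

# One neighbor 0.5 units to the right, inside a personal space of radius 1.
print(repulsion_force((0.0, 0.0), [(0.5, 0.0), (3.0, 0.0)], d_p=1.0))
```

The resulting force points away from the intruding neighbor, and its magnitude grows quadratically as the closest neighbor gets nearer.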

The equality force keeps the o-space shared among all group members by generating an attraction or a repulsion force towards a point in the o-space. In addition, an orientation force towards the o-space is generated to change body orientation. Let ${N}_{s}$ be the number of other agents inside the social space. The equality force ${\mathbf{F}}_{e}$ and equality orientation ${\mathbf{d}}_{e}$ are given in Equation 10.

$$\begin{aligned} \mathbf{F}_e &= \left(1 - \frac{m}{\|\mathbf{c}-\mathbf{p}\|}\right)(\mathbf{c}-\mathbf{p}) \\ \mathbf{d}_e &= \sum_{i}^{N_s} (\mathbf{p}_i - \mathbf{p}) \end{aligned} \tag{10}$$

where $\mathbf{c}$ is the centroid, i.e. $\mathbf{c}=(\mathbf{p}+{\sum}_{i}^{{N}_{s}}{\mathbf{p}}_{i})/({N}_{s}+1)$, and $m$ is the mean distance of the members from the centroid.
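Equation 10 can be sketched in two dimensions as follows. The sketch assumes that the list `others` already contains only the agents inside the social space, that the mean distance $m$ is taken over all $N_s + 1$ members including the evaluated agent, and that the force is zero when the agent sits exactly at the centroid; these choices are illustrative and not taken from the paper.

```python
import math

def equality_force(p, others):
    """Equality force F_e and orientation d_e for the agent at p (Eq. 10).

    others: positions of the N_s agents inside the social space (assumed pre-filtered).
    """
    members = [p] + list(others)
    cx = sum(q[0] for q in members) / len(members)   # centroid c
    cy = sum(q[1] for q in members) / len(members)
    m = sum(math.dist(q, (cx, cy)) for q in members) / len(members)
    dist_to_c = math.dist(p, (cx, cy))
    if dist_to_c == 0.0:
        force = (0.0, 0.0)  # assumed: no force exactly at the centroid
    else:
        factor = 1.0 - m / dist_to_c
        force = (factor * (cx - p[0]), factor * (cy - p[1]))
    orientation = (sum(q[0] - p[0] for q in others),
                   sum(q[1] - p[1] for q in others))
    return force, orientation

# An agent farther from the centroid than the group average is pulled inwards.
print(equality_force((0.0, 0.0), [(2.0, 0.0), (4.0, 0.0)]))
# Two agents at equal distance from their centroid experience no equality force.
print(equality_force((0.0, 0.0), [(2.0, 0.0)]))
```

The sign of the factor $1 - m/\|\mathbf{c}-\mathbf{p}\|$ decides whether the agent is attracted to the centroid (farther than average) or pushed away from it (closer than average).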

The cohesion force prevents an agent from being isolated from a group and keeps agents close to each other by generating an attraction force. Let ${N}_{a}$ be the number of other agents inside the public area, $\mathbf{o}$ the conversation center and $s$ the radius of the o-space. The cohesion force ${\mathbf{F}}_{c}$ and cohesion orientation ${\mathbf{d}}_{c}$ are given in Equation 11.

$$\begin{aligned} \mathbf{F}_c &= \alpha \left(1 - \frac{s}{\|\mathbf{o}-\mathbf{p}\|}\right)(\mathbf{o}-\mathbf{p}) \\ \mathbf{d}_c &= \sum_{i}^{N_a} (\mathbf{p}_i - \mathbf{p}) \end{aligned} \tag{11}$$

where $\alpha ={N}_{a}/({N}_{s}+1)$, which is the scaling factor for the cohesion force used to reduce the magnitude of the cohesion force if the agent is surrounded by other agents in its social area.
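The cohesion force of Equation 11 can be sketched in the same style. Here the caller supplies the conversation center, the o-space radius, the agents in the public area, and the count $N_s$ of agents in the social area; the zero force at the conversation center and all example values are assumptions for illustration.

```python
import math

def cohesion_force(p, o, s, public_others, n_social):
    """Cohesion force F_c and orientation d_c for the agent at p (Eq. 11).

    o: conversation center; s: o-space radius;
    public_others: positions of the N_a agents in the public area;
    n_social: N_s, the number of agents in the social area.
    """
    alpha = len(public_others) / (n_social + 1)  # scaling factor N_a / (N_s + 1)
    dist_to_o = math.dist(p, o)
    if dist_to_o == 0.0:
        force = (0.0, 0.0)  # assumed: no force exactly at the conversation center
    else:
        factor = alpha * (1.0 - s / dist_to_o)
        force = (factor * (o[0] - p[0]), factor * (o[1] - p[1]))
    orientation = (sum(q[0] - p[0] for q in public_others),
                   sum(q[1] - p[1] for q in public_others))
    return force, orientation

group = [(4.0, 1.0), (4.0, -1.0)]
print(cohesion_force((0.0, 0.0), (4.0, 0.0), 1.0, group, n_social=0))
print(cohesion_force((0.0, 0.0), (4.0, 0.0), 1.0, group, n_social=1))
```

Increasing `n_social` lowers $\alpha$ and therefore weakens the attraction, which matches the role of the scaling factor described above.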

To drive the robot to approach the group, we include a component in the reward function that incorporates the extended SFFM described previously. We consider the line integral of ${r}_{1}$ over a path $L$ in the aforementioned force fields, namely the force fields in the personal, social and public spaces, to be the group forming reward. Mathematically, the group forming reward for the robot agent is defined as follows

$$R_1 = \int_{L} r_1(\mathbf{u}) \cdot d\mathbf{u} \tag{12}$$

where $\mathbf{u}$ is the position of the robot along the trajectory $L$, and ${r}_{1}(\mathbf{u})={\sum}_{i\in \{r,e,c\}}{\mathbf{F}}_{i}(\mathbf{u})$ is the combined force on the robot agent. Note that the force fields ${\mathbf{F}}_{i}$ depend on the positioning of all agents, including the SHAs, but for notational simplicity, this is not made explicit in the formulae.
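In a simulator with discrete time steps, the line integral in Equation 12 has to be approximated along the sampled robot positions. The sketch below uses a midpoint rule over a polyline; the discretization scheme and the toy force fields are assumptions made for illustration and do not reproduce the actual SFFM forces.

```python
def line_integral(force_field, path):
    """Approximate the integral of F(u) . du along a polyline with the midpoint rule.

    force_field: function mapping a position (x, y) to a force (fx, fy).
    path: list of successive (x, y) positions of the robot.
    """
    total = 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        fx, fy = force_field(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
        total += fx * (x1 - x0) + fy * (y1 - y0)
    return total

# Hypothetical attractive field pulling towards a group center at the origin.
def attract_to_origin(u):
    return (-u[0], -u[1])

# Moving along the force yields a positive reward, moving against it a negative one.
print(line_integral(attract_to_origin, [(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)]))
print(line_integral(attract_to_origin, [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]))
```

Under this reading, a trajectory that follows the combined force towards the group accumulates positive group forming reward, while moving against the force is penalized.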

Together with the group forming reward, another reward function called the non-increasing reward is added to ensure that the energy in the force field is non-increasing. Mathematically, it is defined as

$$R_2 = \int_{t_0}^{t_1} \mathbb{1}_{\mathcal{A}}(\mathbf{u}(t))\, dt \tag{13}$$

where $\mathbb{1}$ is the indicator function and $\mathcal{A}$ is the set of points along the robot’s trajectory where $d{r}_{1}(\mathbf{u}(t))/dt\ge 0$. These two reward functions help the robot agent to approach the group center. To add further incentive to complete the task, two other reward components are added: a time penalty ${R}_{3}=-{\int}_{{t}_{0}}^{{t}_{1}}dt$ (where ${t}_{0}$ and ${t}_{1}$ are the times at which an episode starts and ends), together with a bonus reward ${R}_{4}$ for successful approaching behavior within the required number of time steps.

### III-D Following Social Norms

In order to make the robot adhere to social norms when it is approaching the group, simulated feedback from other agents is taken into consideration. The robot agent therefore considers the impact of its own behavior on others, which is important in generating appropriate real-world robot approaching behaviors. Here, we define ${R}_{5}$ as the negative sum of the line integrals of the SHAs’ paths in the force fields,

$$R_5 = -\sum_{j=0}^{N_p} \int_{L_j} \sum_{i\in\{r,e,c\}} \mathbf{F}_{ij} \cdot d\mathbf{u}_j \tag{14}$$

where ${N}_{p}$ denotes the total number of SHAs.

The final reward is a combination of all five rewards. Each is associated with a weight ${w}_{i}\ge 0$ to indicate the importance of that reward category. On top of the weights considered for each category of rewards, two other weights are used to influence the behavior of the robot. The first, the egoism weight ${w}_{e}$, decides how much the robot agent considers achieving its own goal of approaching the group center. The second, the altruism weight ${w}_{a}$, decides how much it cares about other agents, i.e., how much it avoids pushing the SHAs around. The final reward is defined as follows:

$$R = w_e \cdot (w_1 \cdot R_1 + w_2 \cdot R_2 + w_3 \cdot R_3 + w_4 \cdot R_4) + w_a \cdot w_5 \cdot R_5 \tag{15}$$

By balancing the different weights, we produce a realistic reward function that captures important notions from human social interaction, such as respecting the private space of others.
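The weighted combination in Equation 15 is straightforward to express in code. The numeric reward values and weights below are hypothetical placeholders chosen only to show how the egoism and altruism weights shift the balance; they are not the values used in our experiments.

```python
def combined_reward(rewards, weights, w_e, w_a):
    """Final reward R from Eq. 15.

    rewards: [R_1, R_2, R_3, R_4, R_5]; weights: [w_1, ..., w_5], all >= 0.
    w_e: egoism weight; w_a: altruism weight.
    """
    if len(rewards) != 5 or len(weights) != 5:
        raise ValueError("expected five rewards and five weights")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    own_goal = sum(w * r for w, r in zip(weights[:4], rewards[:4]))
    social = weights[4] * rewards[4]
    return w_e * own_goal + w_a * social

# Hypothetical per-episode reward components R_1..R_5.
episode = [1.0, 0.5, -2.0, 3.0, -4.0]
unit = [1.0] * 5

print(combined_reward(episode, unit, w_e=1.0, w_a=1.0))  # balanced agent
print(combined_reward(episode, unit, w_e=1.0, w_a=0.0))  # purely egoistic agent
print(combined_reward(episode, unit, w_e=0.5, w_a=2.0))  # strongly altruistic agent
```

With `w_a=0.0` the penalty for displacing the SHAs disappears entirely, which illustrates why a non-zero altruism weight is needed for the agent to respect the space of others.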

## IV Results

We used a DRL algorithm called Proximal Policy Optimization (PPO) [schulman2017proximal] to learn an appropriate behavior for the robot agent. We selected PPO due to its stability advantages [henderson2017deep] over DQN-based RL algorithms. We used the ML-Agents Toolkit (https://github.com/Unity-Technologies/ml-agents) [juliani2018unity] to carry out our experiments.

### IV-A Model Configurations

To determine which state representation and which type of network structure for the value and policy networks are most suitable for robot approaching behavior, we evaluate combinations of state representations and network architectures. For the state representations containing visual information, we evaluate both AEs (*conv* and *SAEV* from Section III-B). The network structures considered are Feed-Forward (FF) networks and LSTM networks. Table I shows the model configurations and their corresponding performance.

Model | Reward | Percentage |
---|---|---|
Vector + LSTM (Baseline) | -0.256 | 100.00% |
CameraOnly + SAEV + FF | -0.869 | 57.06% |
CameraOnly + SAEV + LSTM | -0.804 | 61.63% |
CameraOnly + conv + FF | -0.810 | 61.18% |
CameraOnly + conv + LSTM | -1.091 | 41.51% |
CameraSpeed + SAEV + LSTM | -0.544 | 79.80% |
CameraSpeed + conv + LSTM | -0.709 | 68.22% |
Random policy | -1.684 | 0.00% |

Performance is measured both as cumulative reward, as described in Section III-A and smoothed with an exponentially weighted running average, and as a percentage. Percentages express relative performance, such that $100\%$ corresponds to the baseline performance and $0\%$ to the mean performance of a uniformly random agent. Figure 5 shows the learning curve of the best model, which uses the image and the robot’s speed as input, the output of the SAEV as the learned state representation, and an LSTM as the policy network.

### IV-B Approaching Behavior: Perceptual Study

We compare the robot approaching behavior learned by our model with the one generated by SFFM [pedica2008social]. In a study conducted by Pedica et al. [pedica2010avatars], it was shown that SFFM increased the believability of static group formation. A major drawback of SFFM is that it is directly controlled by the social forces and therefore does not act according to the current situation of the environment. We hypothesize that a learned robot agent that is able to accelerate and decelerate based on the simulated social feedback in an RL framework can introduce more believability and social appropriateness. In order to compare the behaviors generated according to the SFFM with those generated with our proposed model, we implemented a version of SFFM and compared it with a model learned with the reward function defined in Section III-D. Figure 6 shows paths sampled from our trained model and paths sampled from the SFFM with the same initial positions. We note that, although it was not done in this study, a smoothing algorithm could also be applied to the learned policy to further improve the approaching behavior.

In order to evaluate the behavior of our learned model compared to the behavior generated by SFFM, we conducted a perceptual study to evaluate the approaching behaviors using subjective measures. In this study we are interested in three dimensions of social appropriateness, namely polite, sociable and rude, as in [okal2016learning].

We created six videos of approaching behaviors in the simulated environment from a top-down view. The videos show six different approaching behaviors of the robot towards groups from three starting locations by both our model and the SFFM (Figure 6). Twenty participants (engineering students with a mixed cultural background; average age: 28.25 years) were asked to watch the videos and answer four questions for each video. Specifically, participants were asked to rate how much they thought the behaviors were polite, sociable, rude and human-like, using a 1-7 Likert scale, where 1 means "not at all" and 7 means "very". The videos and their corresponding questions were given to the participants in a random order.

Figure 7 shows participants’ ratings of approaching behaviors generated by the two models. We found that people consider the behavior generated by our model to be significantly more polite ($$), less rude ($$) and more sociable ($$). However, we did not find the approaching behavior generated by our model to be significantly more human-like than the one generated by SFFM ($t(19)=1.01,p>.05$). This might be related to the fact that human-likeness is hard to measure when more than one factor is involved, e.g. the agent’s appearance [macdorman2006subjective] in addition to its movement.

### IV-C Approaching Behavior: Pilot User Study with Physical Robot

We implemented robot approaching behaviors learnt with our model in a physical Pepper robot and conducted a user study with human participants to evaluate the model’s performance in a real environment. In the study, a Pepper robot approaches a group of two people facing each other. Each group consists of a participant and an experimenter. We used the same questionnaire as in Section IV-B to evaluate whether we obtain similar results to the perceptual study.

Twelve participants (mostly computer science students with a mixed cultural background; average age: 31.1) were asked to evaluate two conditions in a within-subject design, namely the robot approaching the group using the SFFM (condition one) and the robot approaching the group according to our proposed model (condition two). For each condition, they were asked to first experience the robot’s approaching behavior from one of two positions in the group (e.g., position A in Figure 1) and then to switch position with the experimenter and experience the robot’s approaching behavior from this position (e.g., position B in Figure 1). During the study, participants interacted with the two conditions in a random order. After each condition, they were asked to fill in the questionnaire. We found that the robot’s approaching behavior generated according to our model was perceived as significantly more polite ($$), less rude ($$) and more sociable ($$) than the one generated according to SFFM, but we did not find any significant difference for human-likeness ($t(11)=0.7361,p>.05$). This is in line with the results from the perceptual study. Figure 8 shows more detailed results.

## V Discussion

Several things should be considered when using this approach to build a prior model. One of the main ones is the necessity of using simulation. While forming a reward function with current RL technology does require a well-established model such as SFFM, the behavior generated using RL is much richer. Also, when more advanced techniques are used, e.g. self-play or learning from a sparse reward, the agent may no longer need established models. Another question is whether the state representation is really needed. In this study, we specifically used an architecture similar to the spatial AE [finn2015deep]. This architecture is easily transferable to the real world. Additionally, it is important to see whether these learned features can yield results similar to those obtained using the positions of the SHAs. Using a camera is a normal setup in real-world HRI scenarios.

## VI Conclusion and Future Work

In this work, we proposed a deep learning scheme (SSBL) that can be considered a general framework for social robot learning. As a demonstrator, we implemented a robot approaching behavior task based on this scheme. We designed a reward function combining concepts from SFFM and Hall’s proxemics theory to enable the robot agent to learn a dynamical model which takes social norms into account. We found that SAEV outperforms the vanilla convolutional AE on this task, both with video input alone and with video and speed information given together as input. Moreover, results from a perceptual study and an HRI study with a physical robot show that our model can generate more socially appropriate approaching behavior than SFFM.

Future work will include a larger-scale study where human participants are asked to qualitatively assess the behavior of our learned model compared to the behavior generated by SFFM in real-world situations. Regarding the model configuration experiments, we will also investigate how to utilize more subtle real-world human feedback such as user engagement to refine our learned model using model-based RL algorithms. The expectation is that, by taking user affective and social behavior into account, robots will exhibit more socially appropriate approaching behavior. The next step in this process is to conduct policy refinement experiments through learning from subsequent real-world interaction with a physical robot interacting with humans.

## Acknowledgement

This work was supported by the COIN project (RIT15-0133) funded by the Swedish Foundation for Strategic Research and by the Swedish Research Council (grant n. 2015-04378).