Behavior Regularized Offline Reinforcement Learning
Abstract
In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However, in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a number of remedies to these issues. In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks. Surprisingly, we find that many of the technical complexities introduced in recent methods are unnecessary to achieve strong performance. Additional ablations provide insights into which design choices matter most in the offline RL setting. Code is available at https://github.com/google-research/google-research/tree/master/behavior_regularized_offline_rl.
Yifan Wu (Carnegie Mellon University; work performed while an intern at Google Brain)
George Tucker (Google Research)
Ofir Nachum (Google Research)
1 Introduction
Offline reinforcement learning (RL) describes the setting in which a learner has access to only a fixed dataset of experience. In contrast to online RL, additional interactions with the environment during learning are not permitted. This setting is of particular interest for applications in which deploying a policy is costly or there is a safety concern with updating the policy online (Li et al., 2015). For example, for recommendation systems (Li et al., 2011; Covington et al., 2016) or health applications (Murphy et al., 2001), deploying a new policy may only be done at a low frequency after extensive testing and evaluation. In these cases, the offline dataset is often very large, potentially encompassing years of logged experience. Nevertheless, the inability to interact with the environment directly poses a challenge to modern RL algorithms.
Issues with RL algorithms in the offline setting typically arise in cases where state and action spaces are large or continuous, necessitating the use of function approximation. While off-policy (deep) RL algorithms such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015), and SAC (Haarnoja et al., 2018) may be run directly on offline datasets to learn a policy, the performance of these algorithms has been shown to be sensitive to the experience dataset distribution, even in the online setting when using a replay buffer (Van Hasselt et al., 2018; Fu et al., 2019). Moreover, Fujimoto et al. (2018a) and Kumar et al. (2019) empirically confirm that in the offline setting, DDPG fails to learn a good policy, even when the dataset is collected by a single behavior policy, with or without noise added to the behavior policy. These failure cases are hypothesized to be caused by erroneous generalization of the state-action value function (Q-value function) learned with function approximators, as suggested by Sutton (1995), Baird (1995), Tsitsiklis and Van Roy (1997), and Van Hasselt et al. (2018). To remedy this issue, two types of approaches have been proposed recently: (1) Agarwal et al. (2019) propose applying a random ensemble of Q-value targets to stabilize the learned Q-function; (2) Fujimoto et al. (2018a), Kumar et al. (2019), Jaques et al. (2019), and Laroche and Trichelair (2017) propose regularizing the learned policy towards the behavior policy, based on the intuition that unseen state-action pairs are more likely to receive overestimated Q-values. These proposed remedies have been shown to improve upon DQN or DDPG at performing policy improvement based on offline data. Still, each proposal makes several modifications to the building components of baseline off-policy RL algorithms, and each modification may be implemented in various ways. A natural question is therefore: which of the design choices in these offline RL algorithms are necessary to achieve good performance?
For example, to estimate the target Q-value when minimizing the Bellman error, Fujimoto et al. (2018a) use a soft combination of two target Q-values, which differs from TD3 (Fujimoto et al., 2018b), where the minimum of two target Q-values is used. Kumar et al. (2019) retain this soft combination while increasing the number of Q-networks from two to four. As another example, when regularizing towards the behavior policy, Jaques et al. (2019) use Kullback-Leibler (KL) divergence with a fixed regularization weight, while Kumar et al. (2019) propose Maximum Mean Discrepancy (MMD) with an adaptively trained regularization weight. Are these design choices crucial to success in offline settings? Or are they simply the result of multiple, human-directed iterations of research?
In this work, we aim to evaluate the importance of different algorithmic building components and to compare different design choices in offline RL approaches. We focus on behavior regularized approaches applied to continuous action domains, encompassing many of the recently demonstrated successes (Fujimoto et al., 2018a; Kumar et al., 2019). We introduce behavior regularized actor critic (BRAC), a general algorithmic framework which covers existing approaches while enabling us to compare the performance of different variants in a modular way. We find that many simple variants of the behavior regularized approach can yield good performance, while previously suggested sophisticated techniques such as weighted Q-ensembles and adaptive regularization weights are not crucial. Experimental ablations reveal further insights into how different design choices affect the performance and robustness of the behavior regularized approach in the offline RL setting.
2 Background
2.1 Markov Decision Processes
We consider the standard fully-observed Markov Decision Process (MDP) setting (Puterman, 1990). An MDP can be represented as $\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,\gamma )$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(\cdot |s,a)$ is the transition probability distribution function, $R(s,a)$ is the reward function and $\gamma $ is the discount factor. The goal is to find a policy $\pi (\cdot |s)$ that maximizes the cumulative discounted reward starting from any state $s\in \mathcal{S}$. Let ${P}^{\pi}(\cdot |s)$ denote the induced transition distribution for policy $\pi $. For later convenience, we also introduce the notion of multi-step transition distributions as ${P}_{t}^{\pi}$, where ${P}_{t}^{\pi}(\cdot |s)$ denotes the distribution over the state space after rolling out ${P}^{\pi}$ for $t$ steps starting from state $s$. For example, ${P}_{0}^{\pi}(\cdot |s)$ is the Dirac delta function at $s$ and ${P}_{1}^{\pi}(\cdot |s)={P}^{\pi}(\cdot |s)$. We use ${R}^{\pi}(s)$ to denote the expected reward at state $s$ when following policy $\pi $, i.e. ${R}^{\pi}(s)={\mathbb{E}}_{a\sim \pi (\cdot |s)}\left[R(s,a)\right]$. The state value function (a.k.a. value function) is defined by ${V}^{\pi}(s)={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{\mathbb{E}}_{{s}_{t}\sim {P}_{t}^{\pi}(\cdot |s)}\left[{R}^{\pi}({s}_{t})\right]$. The action-value function (a.k.a. Q-function) can be written as ${Q}^{\pi}(s,a)=R(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}\sim P(\cdot |s,a)}\left[{V}^{\pi}({s}^{\prime})\right]$. The optimal policy is defined as the policy ${\pi}^{*}$ that maximizes ${V}^{{\pi}^{*}}(s)$ at all states $s\in \mathcal{S}$.
In the commonly used actor critic paradigm, one optimizes a policy ${\pi}_{\theta}(\cdot |s)$ by alternately updating a critic and an actor. First, a Q-value function ${Q}_{\psi}$ is learned to minimize Bellman errors over single-step transitions $(s,a,r,{s}^{\prime})$, ${\mathbb{E}}_{{a}^{\prime}\sim {\pi}_{\theta}(\cdot |{s}^{\prime})}\left[{\left(r+\gamma \overline{Q}({s}^{\prime},{a}^{\prime})-{Q}_{\psi}(s,a)\right)}^{2}\right]$, where $\overline{Q}$ denotes a target Q function; e.g., it is common to use a slowly-updated target parameter set ${\psi}^{\prime}$ to determine the target Q function as ${Q}_{{\psi}^{\prime}}({s}^{\prime},{a}^{\prime})$. Then, the policy is updated to maximize the Q-values, ${\mathbb{E}}_{a\sim {\pi}_{\theta}(\cdot |s)}\left[{Q}_{\psi}(s,a)\right]$.
2.2 Offline Reinforcement Learning
Offline RL (also known as batch RL (Lange et al., 2012)) considers the problem of learning a policy $\pi $ from a fixed dataset $\mathcal{D}$ consisting of single-step transitions $(s,a,r,{s}^{\prime})$. Slightly abusing the notion of “behavior”, we define the behavior policy ${\pi}_{b}(a|s)$ as the conditional distribution $p(a|s)$ observed in the dataset distribution $\mathcal{D}$. Under this definition, such a behavior policy ${\pi}_{b}$ is always well-defined even if the dataset was collected by multiple, distinct behavior policies. Because we do not assume direct access to ${\pi}_{b}$, it is common in previous work to approximate this behavior policy with max-likelihood over $\mathcal{D}$:
$${\widehat{\pi}}_{b}:=\underset{\widehat{\pi}}{\operatorname{argmax}}\,{\mathbb{E}}_{(s,a,r,{s}^{\prime})\sim \mathcal{D}}\left[\mathrm{log}\,\widehat{\pi}(a|s)\right].$$  (1) 
We denote the learned policy as ${\widehat{\pi}}_{b}$ and refer to it as the “cloned policy” to distinguish it from the true behavior policy.
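As a minimal sketch of Equation 1, the example below fits a cloned policy by maximum likelihood under the assumption of scalar states and actions and a linear-Gaussian policy class $a\sim \mathcal{N}(ws+b,{\sigma}^{2})$. The policy class, the function names, and the toy data are illustrative assumptions; continuous-control implementations would typically fit a neural network policy instead.

```python
import math


def fit_linear_gaussian_policy(states, actions):
    """Maximum-likelihood fit of a ~ N(w * s + b, sigma^2) for scalar s and a."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov_sa = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var_s = sum((s - mean_s) ** 2 for s in states)
    w = cov_sa / var_s
    b = mean_a - w * mean_s
    # The MLE of the variance is the mean squared residual (no n - 1 correction).
    residual_sq = sum((a - (w * s + b)) ** 2 for s, a in zip(states, actions))
    sigma = math.sqrt(residual_sq / n)
    return w, b, sigma


def log_prob(policy, s, a):
    """log pi_hat(a | s) under the fitted linear-Gaussian policy."""
    w, b, sigma = policy
    z = (a - (w * s + b)) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def average_log_likelihood(policy, states, actions):
    """Empirical version of the objective in Equation 1."""
    total = sum(log_prob(policy, s, a) for s, a in zip(states, actions))
    return total / len(states)


# Toy logged data (made up): only the (s, a) part of each transition is needed.
states = [0.0, 1.0, 2.0, 3.0]
actions = [0.1, 0.9, 2.1, 2.9]
cloned_policy = fit_linear_gaussian_policy(states, actions)
```

Because the fit is the exact maximizer within this policy class, perturbing any of the three parameters cannot increase `average_log_likelihood` on the same data.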
In this work, we focus on the offline RL problem for complex continuous domains. We briefly review two recently proposed approaches, BEAR (Kumar et al., 2019) and BCQ (Fujimoto et al., 2018a).
BEAR
Motivated by the hypothesis that deep RL algorithms generalize poorly to actions outside the support of the behavior policy, Kumar et al. (2019) propose BEAR, which learns a policy to maximize Q-values while penalizing divergence from the support of the behavior policy. BEAR measures divergence from the behavior policy using kernel MMD (Gretton et al., 2007):
${\mathrm{MMD}}_{k}^{2}(\pi (\cdot |s),{\pi}_{b}(\cdot |s))=\underset{x,{x}^{\prime}\sim \pi (\cdot |s)}{\mathbb{E}}\left[K(x,{x}^{\prime})\right]-2\underset{\begin{array}{c}x\sim \pi (\cdot |s)\\ y\sim {\pi}_{b}(\cdot |s)\end{array}}{\mathbb{E}}\left[K(x,y)\right]+\underset{y,{y}^{\prime}\sim {\pi}_{b}(\cdot |s)}{\mathbb{E}}\left[K(y,{y}^{\prime})\right],$  (2) 
where $K$ is a kernel function. Furthermore, to avoid overestimation in the Q-values, the target Q-value function $\overline{Q}$ is calculated as,
$$\overline{Q}({s}^{\prime},{a}^{\prime}):=0.75\cdot \underset{j=1,\mathrm{\dots},k}{\mathrm{min}}{Q}_{{\psi}_{j}^{\prime}}({s}^{\prime},{a}^{\prime})+0.25\cdot \underset{j=1,\mathrm{\dots},k}{\mathrm{max}}{Q}_{{\psi}_{j}^{\prime}}({s}^{\prime},{a}^{\prime}),$$  (3) 
where ${\psi}_{j}^{\prime}$, $j=1,\mathrm{\dots},k$, denote the parameters of a soft-updated ensemble of target Q functions. In BEAR’s implementation, this ensemble is of size $k=4$. BEAR also penalizes target Q-values by an ensemble variance term. However, their empirical results show no clear benefit from doing so, so we omit this term.
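The weighted min-max target in Equation 3 can be sketched in a few lines. The function name `mixed_ensemble_target` and the example values are illustrative assumptions; in practice the inputs would be the outputs of the $k$ target Q-networks at $({s}^{\prime},{a}^{\prime})$.

```python
def mixed_ensemble_target(target_q_values, min_weight=0.75):
    """Convex combination of the ensemble minimum and maximum (Equation 3)."""
    lowest = min(target_q_values)
    highest = max(target_q_values)
    return min_weight * lowest + (1.0 - min_weight) * highest


# Hypothetical target-network outputs Q_{psi'_j}(s', a') for an ensemble of size k = 4.
ensemble_estimates = [1.0, 3.0, 2.0, 2.5]
target = mixed_ensemble_target(ensemble_estimates)

# Setting min_weight=1.0 recovers the plain minimum over the ensemble.
min_only_target = mixed_ensemble_target(ensemble_estimates, min_weight=1.0)
```

Exposing the weight as a parameter makes the comparison between the mixture and the plain minimum a one-argument change.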
BCQ
BCQ enforces $\pi $ to be close to ${\pi}_{b}$ with a specific parameterization of $\pi $:
$${\pi}_{\theta}(a|s):=\underset{{a}_{i}+{\xi}_{\theta}(s,{a}_{i})}{\operatorname{argmax}}\,{Q}_{\psi}(s,{a}_{i}+{\xi}_{\theta}(s,{a}_{i}))\quad \text{for }{a}_{i}\sim {\pi}_{b}(a|s),\ i=1,\mathrm{\dots},N,$$  (4) 
where ${\xi}_{\theta}$ is a function approximator with bounded output in $[-\mathrm{\Phi},\mathrm{\Phi}]$, and $\mathrm{\Phi}$ is a hyperparameter. $N$ is an additional hyperparameter used during evaluation to compute ${\pi}_{\theta}$ and during training for Q-value updates. The target Q-value function $\overline{Q}$ is calculated as in Equation 3 but with $k=2$.
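The action selection rule in Equation 4 can be illustrated with the following sketch for scalar actions. All callables (`sample_cloned_action`, `perturbation`, `q_function`) are hypothetical stand-ins for learned models, and clipping is used here only as a simple substitute for a perturbation network whose output is already bounded in $[-\mathrm{\Phi},\mathrm{\Phi}]$.

```python
def bcq_select_action(state, sample_cloned_action, perturbation, q_function,
                      num_candidates, max_perturbation):
    """Pick the perturbed candidate with the highest Q-value (Equation 4)."""
    best_action, best_q = None, float("-inf")
    for _ in range(num_candidates):
        candidate = sample_cloned_action(state)  # a_i drawn from the cloned policy
        raw = perturbation(state, candidate)
        # Clipping stands in for a model whose output is bounded in [-Phi, Phi].
        xi = max(-max_perturbation, min(max_perturbation, raw))
        perturbed = candidate + xi
        q_value = q_function(state, perturbed)
        if q_value > best_q:
            best_action, best_q = perturbed, q_value
    return best_action


# Deterministic toy stand-ins (all hypothetical) so the behavior is easy to follow.
_toy_samples = iter([0.0, 0.5, 1.0])
chosen = bcq_select_action(
    state=0.0,
    sample_cloned_action=lambda s: next(_toy_samples),
    perturbation=lambda s, a: 1.0,             # always pushes upward; clipped to Phi
    q_function=lambda s, a: -(a - 0.52) ** 2,  # toy critic peaked at a = 0.52
    num_candidates=3,
    max_perturbation=0.05,
)
```

Because every candidate starts from a sample of the cloned policy and moves by at most $\mathrm{\Phi}$, the selected action stays close to actions seen in the data, which is how this parameterization keeps $\pi $ near ${\pi}_{b}$.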
3 Behavior Regularized Actor Critic
Encouraging the learned policy to be close to the behavior policy is a common theme in previous approaches to offline RL. To evaluate the effect of different behavior policy regularizers, we introduce behavior regularized actor critic (BRAC), an algorithmic framework which generalizes existing approaches while providing more implementation options.
There are two common ways to incorporate regularization to a specific policy: through a penalty in the value function or as a penalty solely on the policy. We begin by introducing the former, value penalty (vp). Similar to SAC (Haarnoja et al., 2018), which adds an entropy term to the target Q-value calculation, we add a term to the target Q-value calculation that regularizes the learned policy $\pi $ towards the behavior policy ${\pi}_{b}$. Specifically, we define the penalized value function as
${V}_{D}^{\pi}(s)={\displaystyle {\sum}_{t=0}^{\mathrm{\infty}}}{\gamma}^{t}{\mathbb{E}}_{{s}_{t}\sim {P}_{t}^{\pi}(\cdot |s)}\left[{R}^{\pi}({s}_{t})-\alpha D(\pi (\cdot |{s}_{t}),{\pi}_{b}(\cdot |{s}_{t}))\right],$  (5) 
where $D$ is a divergence function between distributions over actions (e.g., MMD or KL divergence). Following the typical actor critic framework, the Q-value objective is given by,
$\underset{{Q}_{\psi}}{\mathrm{min}}{\mathbb{E}}_{\begin{array}{c}(s,a,r,{s}^{\prime})\sim \mathcal{D}\\ {a}^{\prime}\sim {\pi}_{\theta}(\cdot |{s}^{\prime})\end{array}}\left[{\left(r+\gamma \left(\overline{Q}({s}^{\prime},{a}^{\prime})-\alpha \widehat{D}({\pi}_{\theta}(\cdot |{s}^{\prime}),{\pi}_{b}(\cdot |{s}^{\prime}))\right)-{Q}_{\psi}(s,a)\right)}^{2}\right],$  (6) 
where $\overline{Q}$ again denotes a target Q function and $\widehat{D}$ denotes a sample-based estimate of the divergence function $D$. The policy learning objective can be written as,
$\underset{{\pi}_{\theta}}{\mathrm{max}}{\mathbb{E}}_{(s,a,r,{s}^{\prime})\sim \mathcal{D}}\left[{\mathbb{E}}_{{a}^{\prime \prime}\sim {\pi}_{\theta}(\cdot |s)}\left[{Q}_{\psi}(s,{a}^{\prime \prime})\right]-\alpha \widehat{D}({\pi}_{\theta}(\cdot |s),{\pi}_{b}(\cdot |s))\right].$  (7) 
Accordingly, one performs alternating gradient updates based on (6) and (7). This algorithm is equivalent to SAC when using a single-sample estimate of the entropy for $\widehat{D}$; i.e., $\widehat{D}({\pi}_{\theta}(\cdot |{s}^{\prime}),{\pi}_{b}(\cdot |{s}^{\prime})):=\mathrm{log}\,\pi ({a}^{\prime}|{s}^{\prime})$ for ${a}^{\prime}\sim \pi (\cdot |{s}^{\prime})$.
The second way to add the regularizer is to only regularize the policy during policy optimization. That is, we use the same objectives in Equations 6 and 7, but use $\alpha =0$ in the Q update while using a nonzero $\alpha $ in the policy update. We call this variant policy regularization (pr). This proposal is similar to the regularization employed in A3C (Mnih et al., 2016), if one uses the entropy of ${\pi}_{\theta}$ to compute $\widehat{D}$.
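The difference between the two variants comes down to whether the divergence term enters the critic's regression target. The sketch below makes this explicit for a single transition; the function names and the numeric inputs are illustrative assumptions, and the divergence estimate is taken as given.

```python
def critic_target(reward, discount, target_q_next, divergence_next, alpha,
                  value_penalty):
    """Regression target for Q(s, a) in the Q-value objective.

    With value_penalty=True the divergence at s' is subtracted inside the
    target (vp); with value_penalty=False the critic uses alpha = 0 (pr).
    """
    penalty = alpha * divergence_next if value_penalty else 0.0
    return reward + discount * (target_q_next - penalty)


def actor_objective(q_value, divergence, alpha):
    """Per-state quantity maximized by the policy; identical for vp and pr."""
    return q_value - alpha * divergence


# One hypothetical transition with made-up numbers.
vp_target = critic_target(1.0, 0.5, 4.0, 2.0, alpha=0.5, value_penalty=True)
pr_target = critic_target(1.0, 0.5, 4.0, 2.0, alpha=0.5, value_penalty=False)
policy_term = actor_objective(q_value=4.0, divergence=2.0, alpha=0.5)
```

Under value penalty, the critic itself learns lower values for states where the policy drifts from the behavior policy, so the penalty propagates through bootstrapping; under policy regularization it only affects the actor update.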
In addition to the choice of value penalty or policy regularization, the choice of $D$ and how to perform sample estimation of $\widehat{D}$ are key design choices of BRAC:
Kernel MMD
We can compute a sample-based estimate of kernel MMD (Equation 2) by drawing samples from both ${\pi}_{\theta}$ and ${\pi}_{b}$. Because we do not have access to multiple samples from ${\pi}_{b}$ for a given state, this requires a pre-estimated cloned policy ${\widehat{\pi}}_{b}$.
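A sample-based estimate of Equation 2 can be sketched as below. The Gaussian kernel, its bandwidth, and the use of the simple biased estimator (which keeps the diagonal terms) are assumptions made for illustration; the toy samples stand in for actions drawn from ${\pi}_{\theta}$ and from the cloned policy at a single state.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for action vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(policy_actions, behavior_actions, kernel=gaussian_kernel):
    """Biased sample estimate of squared kernel MMD between two action sets."""
    def mean_kernel(xs, ys):
        return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

    return (mean_kernel(policy_actions, policy_actions)
            - 2.0 * mean_kernel(policy_actions, behavior_actions)
            + mean_kernel(behavior_actions, behavior_actions))


# Toy 1-D action samples at one state (made up for illustration).
rng = random.Random(0)
behavior = [[rng.gauss(0.0, 1.0)] for _ in range(50)]
near = [[rng.gauss(0.1, 1.0)] for _ in range(50)]
far = [[rng.gauss(3.0, 1.0)] for _ in range(50)]

mmd_near = mmd_squared(near, behavior)
mmd_far = mmd_squared(far, behavior)
```

The estimate only needs samples, not densities, which is why a cloned policy that can be sampled from is sufficient for this regularizer.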
KL Divergence
With KL Divergence, the behavior regularizer can be written as
${D}_{\mathrm{KL}}({\pi}_{\theta}(\cdot |s),{\pi}_{b}(\cdot |s))={\mathbb{E}}_{a\sim {\pi}_{\theta}(\cdot |s)}\left[\mathrm{log}\,{\pi}_{\theta}(a|s)-\mathrm{log}\,{\pi}_{b}(a|s)\right].$ 
Directly estimating ${D}_{\mathrm{KL}}$ via samples requires having access to the density of both ${\pi}_{\theta}$ and ${\pi}_{b}$; as in MMD, the cloned ${\widehat{\pi}}_{b}$ can be used in place of ${\pi}_{b}$. Alternatively, we can avoid estimating ${\pi}_{b}$ explicitly by using the dual form of the KL-divergence. Specifically, any $f$-divergence (Csiszár, 1964) has a dual form (Nowozin et al., 2016) given by,
${D}_{f}(p,q)={\mathbb{E}}_{x\sim p}\left[f\left(q(x)/p(x)\right)\right]=\underset{g:\mathcal{X}\mapsto \mathrm{dom}({f}^{*})}{\mathrm{max}}{\mathbb{E}}_{x\sim q}\left[g(x)\right]-{\mathbb{E}}_{x\sim p}\left[{f}^{*}(g(x))\right],$ 
where ${f}^{*}$ is the Fenchel dual of $f$. In this case, one no longer needs to estimate a cloned policy ${\widehat{\pi}}_{b}$ but instead needs to learn a discriminator function $g$ with minimax optimization as in Nowozin et al. (2016). This sample-based dual estimation can be applied to any $f$-divergence. In the case of a KL-divergence, $f(x)=-\mathrm{log}\,x$ and ${f}^{*}(t)=-\mathrm{log}(-t)-1$.
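For the primal form of the KL regularizer (the expectation over $a\sim {\pi}_{\theta}$ of the log-density difference), a Monte Carlo estimate can be sketched as follows. The one-dimensional Gaussian policies and their parameters are assumptions chosen so that the estimate can be compared against the known closed-form value; the helper names are illustrative.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_sample_estimate(sample_policy_action, log_prob_policy, log_prob_behavior,
                       num_samples):
    """Average of log pi_theta(a|s) - log pi_b(a|s) over a ~ pi_theta(.|s)."""
    total = 0.0
    for _ in range(num_samples):
        a = sample_policy_action()
        total += log_prob_policy(a) - log_prob_behavior(a)
    return total / num_samples


# Toy setting at a single state: pi_theta = N(0.5, 1), cloned pi_b = N(0, 1).
rng = random.Random(0)
estimate = kl_sample_estimate(
    sample_policy_action=lambda: rng.gauss(0.5, 1.0),
    log_prob_policy=lambda a: gaussian_log_prob(a, 0.5, 1.0),
    log_prob_behavior=lambda a: gaussian_log_prob(a, 0.0, 1.0),
    num_samples=20000,
)
# Closed form for equal unit variances: (difference of means)^2 / 2.
exact = 0.5 * 0.5 ** 2
```

Unlike the MMD estimate, this estimator needs log-densities of both policies, so it relies on a cloned policy with a tractable density.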
Wasserstein Distance
One may also use the Wasserstein distance as the divergence $D$. For sample-based estimation, one may use its dual form,
$W(p,q)=\underset{g:{\Vert g\Vert}_{L}\le 1}{\mathrm{sup}}{\mathbb{E}}_{x\sim p}\left[g(x)\right]-{\mathbb{E}}_{x\sim q}\left[g(x)\right]$ 
and maintain a discriminator $g$ as in Gulrajani et al. (2017).
Now we discuss how existing approaches can be instantiated under the framework of BRAC.
BEAR
To recreate BEAR with BRAC, one uses policy regularization with the sample-based kernel MMD for $\widehat{D}$ and uses a min-max ensemble estimate for $\overline{Q}$ (Equation 3). Furthermore, BEAR adaptively trains the regularization weight $\alpha $ as a Lagrangian multiplier: it sets a threshold $\epsilon >0$ for the kernel MMD distance, increases $\alpha $ if the current average divergence is above the threshold, and decreases $\alpha $ if it is below the threshold.
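The adaptive rule described above (increase $\alpha $ when the average divergence exceeds the threshold, decrease it otherwise) can be sketched as a single update step. Parameterizing the multiplier through $\mathrm{log}\,\alpha $ to keep it positive, the step size, and the numeric values are assumptions for illustration and are not taken from BEAR's implementation.

```python
import math


def update_log_alpha(log_alpha, average_divergence, threshold, step_size=0.01):
    """One sign-respecting dual step: alpha grows when the constraint is violated."""
    return log_alpha + step_size * (average_divergence - threshold)


# If the divergence stays above the threshold, alpha increases at every step.
log_alpha = 0.0
for _ in range(100):
    log_alpha = update_log_alpha(log_alpha, average_divergence=0.2, threshold=0.05)
alpha = math.exp(log_alpha)
```

When the constraint is never satisfied during training, this rule simply keeps increasing $\alpha $, so the procedure behaves like regularization with a large weight rather than like constrained optimization.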
BCQ
KL-Control
There has been a rich set of work which investigates regularizing the learned policy through KL-divergence with respect to another policy, e.g. Abdolmaleki et al. (2018); Kakade (2002); Peters et al. (2010); Schulman et al. (2015); Nachum et al. (2017). Notably, Jaques et al. (2019) apply this idea to offline RL in discrete action domains by introducing a KL value penalty in the Q-value definition. It is clear that BRAC can realize this algorithm as well.
To summarize, one can instantiate the behavior regularized actor critic framework with different design choices, including how to estimate the target Q value, which divergence to use, whether to learn $\alpha $ adaptively, whether to use a value penalty in the Q function objective (6) or just use policy regularization in (7) and so on. In the next section, we empirically evaluate a set of these different design choices to provide insights into what actually matters when approaching the offline RL problem.
4 Experiments
The BRAC framework encompasses several previously proposed methods depending on specific design choices (e.g., whether to use value penalty or policy regularization, how to compute the target Q-value, and how to impose the behavior regularization). For a practitioner, key questions are: How should these design choices be made? Which variations among these different algorithms actually matter? To answer these questions, we perform a systematic evaluation of BRAC under different design choices.
Following Kumar et al. (2019), we evaluate performance on four MuJoCo (Todorov et al., 2012) continuous control environments in OpenAI Gym (Brockman et al., 2016): Ant-v2, HalfCheetah-v2, Hopper-v2, and Walker2d-v2. In many real-world applications of RL, one has logged data from suboptimal policies (e.g., robotic control and recommendation systems). To simulate this scenario, we collect the offline dataset with a suboptimal policy perturbed by additional noise. To obtain a partially trained policy, we train a policy with SAC and online interactions until its performance reaches a threshold ($1000,4000,1000,1000$ for Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2, respectively, similar to the protocol established by Kumar et al. (2019)). Then, we perturb the partially trained policy with noise (Gaussian noise or $\epsilon $-greedy at different levels) to simulate different exploration strategies, resulting in five noisy behavior policies. We collect 1 million transitions according to each behavior policy, resulting in five datasets for each environment (see Appendix for implementation details). We evaluate offline RL algorithms by training on these fixed datasets and evaluating the learned policies on the real environments.
In preliminary experiments, we found that policy learning rate and regularization strength have a significant effect on performance. As a result, for each variant of BRAC and each environment, we do a grid search over policy learning rate and regularization strength. For policy learning rate, we search over six values, ranging from $3\cdot {10}^{-6}$ to $0.001$. The regularization strength is controlled differently in different algorithms. In the simplest case, the regularization weight $\alpha $ is fixed; in BEAR the regularization weight is adaptively trained with dual gradient ascent based on a divergence constraint $\epsilon $ that is tuned as a hyperparameter; in BCQ the corresponding tuning is for the perturbation range $\mathrm{\Phi}$. For each of these options, we search over five values (see Appendix for details). For existing algorithms such as BEAR and BCQ, the hyperparameters reported in their papers (Kumar et al., 2019; Fujimoto et al., 2018a) are included in this search range. We select the best hyperparameters according to the average performance over all five datasets.
Currently, BEAR (Kumar et al., 2019) provides state-of-the-art performance on these tasks, so to understand the effect of variations under our BRAC framework, we start by implementing BEAR in BRAC and run a series of comparisons by varying different design choices: adaptive vs. fixed regularization, different ensembles for estimating target Q-values, value penalty vs. policy regularization, and divergence choice for the regularizer. We then evaluate BCQ, which has a different design in the BRAC framework, and compare it to other BRAC variants as well as several baseline algorithms.
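The two kinds of perturbation mentioned above can be sketched for a scalar action as follows. The uniform random action for the $\epsilon $-greedy branch, the clipping to action bounds, and all numeric values are assumptions for illustration; the noise levels actually used are given in the Appendix and are not reproduced here.

```python
import random


def epsilon_greedy_action(policy_action, low, high, epsilon, rng):
    """With probability epsilon, replace the policy action by a uniform random one."""
    if rng.random() < epsilon:
        return rng.uniform(low, high)
    return policy_action


def gaussian_noise_action(policy_action, std, low, high, rng):
    """Add zero-mean Gaussian noise and clip to the valid action range."""
    noisy = policy_action + rng.gauss(0.0, std)
    return max(low, min(high, noisy))


rng = random.Random(0)
greedy = epsilon_greedy_action(0.3, -1.0, 1.0, epsilon=0.0, rng=rng)
explored = epsilon_greedy_action(0.3, -1.0, 1.0, epsilon=1.0, rng=rng)
noisy = gaussian_noise_action(0.9, std=0.5, low=-1.0, high=1.0, rng=rng)
```

Varying `epsilon` or `std` yields behavior policies with different amounts of exploration around the same partially trained policy.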
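The selection rule described above (grid search, then pick the configuration with the best average over datasets) can be sketched as follows. The grid values and the `toy_evaluate` function are placeholders invented for illustration; they are not the grids or results used in the experiments.

```python
from itertools import product


def select_best_hyperparameters(learning_rates, regularization_strengths, evaluate):
    """Return the (learning rate, strength) pair with the best average return."""
    best_config, best_score = None, float("-inf")
    for lr, reg in product(learning_rates, regularization_strengths):
        per_dataset_returns = evaluate(lr, reg)  # one return per dataset
        average = sum(per_dataset_returns) / len(per_dataset_returns)
        if average > best_score:
            best_config, best_score = (lr, reg), average
    return best_config, best_score


def toy_evaluate(lr, reg):
    """Hypothetical stand-in for training and evaluating on five datasets."""
    base = -abs(lr - 1e-4) * 1e4 - abs(reg - 1.0)
    return [base + offset for offset in (0.0, 1.0, 2.0, 3.0, 4.0)]


config, score = select_best_hyperparameters(
    learning_rates=[1e-5, 1e-4, 1e-3],
    regularization_strengths=[0.1, 1.0, 10.0],
    evaluate=toy_evaluate,
)
```

Averaging over all datasets before comparing configurations avoids picking hyperparameters that only work for one particular behavior policy.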
4.1 Fixed vs. adaptive regularization weights
In BEAR, regularization is controlled by a threshold $\epsilon $, which is used for adaptively training the Lagrangian multiplier $\alpha $, whereas typically (e.g., in KL-control) one uses a fixed $\alpha $. In our initial experiments with BEAR, we found that when using the recommended value of $\epsilon $, the learned value of $\alpha $ consistently increased during training, implying that the MMD constraint between ${\pi}_{\theta}$ and ${\pi}_{b}$ was almost never satisfied. This suggests that BEAR is effectively performing policy regularization with a large $\alpha $ rather than constrained optimization. This led us to question whether adaptively training $\alpha $ is better than using a fixed $\alpha $. To investigate this question, we evaluate the performance of both approaches (with appropriate hyperparameter tuning for each, over either $\alpha $ or $\epsilon $) in Figure 1. On most datasets, both approaches learn a policy that is much better than the partially trained policy (the policy used to collect data, without injected noise; the true behavior policy and behavior cloning will usually perform worse because of the noise injected during data collection), although we do observe a consistent modest advantage when using a fixed $\alpha $. Because using a fixed $\alpha $ is simpler and performs better than adaptive training, we use this approach in subsequent experiments.
4.2 Ensemble for target Q-values
Another important design choice in BRAC is how to compute the target Q-value, and specifically, whether one should use the sophisticated ensemble strategies employed by BEAR and BCQ. Both BEAR and BCQ use a weighted mixture of the minimum and maximum among multiple learned Q-functions (compared to TD3, which simply uses the minimum of two). BEAR further increases the number of Q-functions from 2 to 4. To investigate these design choices, we first experiment with different numbers of Q-functions, $k\in \{1,2,4\}$. Results are shown in Figure 2. Fujimoto et al. (2018b) show that using two Q-functions provides significant improvements in online RL; similarly, we find that using $k=1$ sometimes fails to learn a good policy (e.g., in Walker2d) in the offline setting. Using $k=4$ has a small advantage compared to $k=2$ except in Hopper. Both $k=2$ and $k=4$ significantly improve over the partially trained policy baseline. In general, increasing the ensemble size $k$ leads to more stable or better performance but requires more computation. On these domains we found that $k=4$ only gives marginal improvement over $k=2$, so we use $k=2$ in our remaining experiments.
Regarding whether to use a weighted mixture of Q-values or the minimum, we compare these two options under $k=2$. Results are shown in Figure 3. We find that taking the minimum performs slightly better than taking a mixture except in Hopper, and both successfully outperform the partially trained policy in all cases. Due to the simplicity and strong performance of taking the minimum of two Q-functions, we use this approach in subsequent experiments.
4.3 Value penalty or policy regularization
So far, we have evaluated variations in regularization weights and ensembles of Q-values. We found that the technical complexity introduced in recent works is not always necessary to achieve state-of-the-art performance. With these simplifications, we now evaluate a major design choice in BRAC: using value penalty or policy regularization. We follow our simplified version of BEAR: MMD policy regularization, fixed $\alpha $, and computation of target Q-values based on the minimum of a $k=2$ ensemble. We compare this instantiation of BRAC to its value penalty version, with results shown in Figure 4. While both variants outperform the partially trained policy, we find that value penalty performs slightly better than policy regularization in most cases. We consistently observed this advantage with other divergence choices (see Appendix Figure 8 for a full comparison).
4.4 Divergences for regularization
We evaluated four choices of divergences used as the regularizer $D$: (a) MMD (as in BEAR), (b) KL in the primal form with estimated behavior policy (as in KL-control), and (c) KL and (d) Wasserstein in their dual forms without estimating a behavior policy. As shown in Figure 5, we do not find any specific divergence performing consistently better or worse than the others. All variants are able to learn a policy that significantly improves over the behavior policy in all cases.
In contrast, Kumar et al. (2019) argue that sampled MMD is superior to KL based on the idea that it is better to regularize the support of the learned policy distribution to be within the support of the behavior policy rather than forcing the two distributions to be similar. While conceptually reasonable, we do not find support for that argument in our experiments: (i) we find that KL and Wasserstein can perform similarly well to MMD even though they are not designed for support matching; (ii) we briefly tried divergences that are explicitly designed for support matching (the relaxed KL and relaxed Wasserstein distances proposed by Wu et al. (2019)), but did not observe a clear benefit to the additional complexity. We conjecture that this is because even if one uses noisy or multiple behavior policies to collect data, the noise is reflected more in the diversity of states than in the diversity of actions at a single state (due to the nature of environment dynamics). However, we expect this support matching vs. distribution matching distinction may matter in other scenarios such as smaller state spaces or contextual bandits, which is a potential direction for future work.
4.5 Comparison to BCQ and other baselines
We now compare one of our best performing algorithms so far, kl_vp (value penalty with KL divergence in the primal form), to BCQ, BEAR, and two other baselines: vanilla SAC (which uses adaptive entropy regularization) and behavior cloning. Figure 6 shows the comparison. We find that vanilla SAC only works in the HalfCheetah environment and fails in the other three environments. Behavior cloning never learns a better policy than the partially trained policy used to collect the data. Although BCQ consistently learns a policy that is better than the partially trained policy, its performance is always clearly worse than kl_vp (and other variants whose performance is similar to kl_vp, according to our previous experiments). We conclude that BCQ is less favorable than explicitly using a divergence for behavior regularization (BEAR and kl_vp), although tuning additional hyperparameters beyond $\mathrm{\Phi}$ for BCQ may improve its performance.
4.6 Hyperparameter Sensitivity
In our experiments, we find that many simple algorithmic designs achieve good performance under the framework of BRAC. For example, all four divergences we tried perform similarly well when used for regularization. In these experiments, we allowed for appropriate hyperparameter tuning over policy learning rate and regularization weight, as we initially found that not doing so can lead to premature and incorrect conclusions. (For example, taking the optimal hyperparameters from one design choice and applying them to a different design choice, e.g., MMD vs. KL divergence, can lead to the incorrect conclusion that KL is worse than MMD, only because the hyperparameters tuned for MMD were transferred to KL.) However, some design choices may be more robust to hyperparameters than others. To investigate this, we also analyzed the sensitivity to hyperparameters for all algorithmic variants (Appendix Figures 9 and 11). To summarize, we found that (i) MMD and KL divergence are similar in terms of sensitivity to hyperparameters, (ii) using the dual form of divergences (e.g., KL dual, Wasserstein) appears to be more sensitive to hyperparameters, possibly because of the more complex training procedure (optimizing a minimax objective), and (iii) value penalty is slightly more sensitive to hyperparameters than policy regularization despite its more favorable performance under the best hyperparameters.
Although we utilized hyperparameter searches in our results, in pure offline RL settings, testing on the real environment is infeasible. Thus, a natural question is how to select the best hyperparameter or the best learned policy among many without direct testing. As a preliminary attempt, we evaluated whether the Q-values learned during training can be used as a proxy for hyperparameter selection. Specifically, we look at the correlation between the average learned Q-values (in minibatches) and the true performance. Figure 7 shows sampled visualizations of these Q-values. We find that the learned Q-values are not a good indicator of the performance, even when they are within a reasonable range (i.e., not diverging during training). A more formal direction for hyperparameter selection is off-policy evaluation. However, off-policy evaluation is an open research problem with limited success on complex continuous control tasks (see Liu et al. (2018); Nachum et al. (2019); Irpan et al. (2019) for recent attempts), so we leave hyperparameter selection as future work and encourage more researchers to investigate this direction.
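One way to quantify how well a proxy such as the average learned Q-value orders candidate policies is a rank correlation with the true returns. The sketch below computes Spearman's rank correlation; the choice of this statistic, the helper names, and the numbers are assumptions for illustration (the numbers are made up and are not results from the experiments).

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman_correlation(xs, ys):
    """Pearson correlation of the ranks; both rank lists share mean and variance."""
    rank_x, rank_y = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2.0
    covariance = sum((a - mean) * (b - mean) for a, b in zip(rank_x, rank_y))
    variance = sum((a - mean) ** 2 for a in rank_x)
    return covariance / variance


# Made-up values for four hypothetical hyperparameter settings.
average_q_values = [120.0, 95.0, 300.0, 80.0]
true_returns = [2000.0, 3500.0, 1500.0, 3000.0]
rho = spearman_correlation(average_q_values, true_returns)
```

A value near 1 would mean the proxy orders policies like the true return does; values near 0 or below indicate that the proxy is uninformative or misleading for selection.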
5 Conclusion
In this work, we introduced behavior regularized actor critic (BRAC), an algorithmic framework that generalizes existing approaches to the offline RL problem by regularizing to the behavior policy. In our experiments, we showed that many sophisticated training techniques, such as weighted target Q-value ensembles and adaptive regularization coefficients, are not necessary to achieve state-of-the-art performance. We found that the use of value penalty is slightly better than policy regularization, while many possible divergences (KL, MMD, Wasserstein) can achieve similar performance. Perhaps the most important differentiator in these offline settings is whether proper hyperparameters are used. Although some variants of BRAC are more robust to hyperparameters than others, every variant relies on a suitable set of hyperparameters to train well. Off-policy evaluation without interacting with the environment is a challenging open problem. While previous off-policy evaluation work focuses on reducing the mean squared error to the expected return, in our problem we only require a ranking of policies. This relaxation may allow novel solutions, and we encourage more researchers to investigate this direction in the pursuit of truly offline RL. Another potential direction is to look at situations where the dataset is much smaller. Our preliminary observation on smaller datasets is that it is hard to find a hyperparameter setting that works consistently well across multiple runs with different random seeds. We therefore conjecture that smaller datasets may need either a more careful hyperparameter search or (more interestingly) a better algorithm. We leave an extensive study of this setting to future work.
Acknowledgements
We thank Aviral Kumar, Ilya Kostrikov, Yinlam Chow, and others at Google Research for helpful thoughts and discussions.
References
Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543.
Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37.
OpenAI Gym. arXiv preprint arXiv:1606.01540.
Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198.
Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyer Tud. Akad. Mat. Kutato Int. Koezl. 8, pp. 85–108.
Diagnosing bottlenecks in deep Q-learning algorithms. arXiv preprint arXiv:1902.10250.
Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900.
Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
A kernel approach to comparing distributions. In Proceedings of the National Conference on Artificial Intelligence, Vol. 22, pp. 1637.
Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Off-policy evaluation via off-policy classification. arXiv preprint arXiv:1906.01624.
Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538.
Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949.
Batch reinforcement learning. In Reinforcement Learning, pp. 45–73.
Safe policy improvement with baseline bootstrapping. arXiv preprint arXiv:1712.06924.
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 297–306.
Toward minimax off-policy value estimation.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Breaking the curse of horizon: infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366.
Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96 (456), pp. 1410–1423.
DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733.
Trust-PCL: an off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891.
f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279.
Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
Markov decision processes. Handbooks in Operations Research and Management Science 2, pp. 331–434.
Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, pp. 85.
MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1075–1081.
Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.
Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning, pp. 6872–6881.
Appendix A Additional Experiment Results
A.1 Additional experiment details
Dataset collection
For each environment, we collect five datasets: {nonoise, eps0.1, eps0.3, gauss0.1, gauss0.3} using a partially trained policy $\pi $. Each dataset contains 1 million transitions. Different datasets are collected with different injected noise, corresponding to different levels and strategies of exploration. The specific noise configurations are shown below:

• nonoise: The dataset is collected by purely executing the partially trained policy $\pi $ without adding noise.

• eps0.1: We construct an epsilon-greedy policy ${\pi}^{\prime}$ with probability 0.1. That is, at each step, ${\pi}^{\prime}$ takes a uniformly random action with probability 0.1 and otherwise takes the action sampled from $\pi $. The final dataset is a mixture of three parts: 40% of the transitions are collected by ${\pi}^{\prime}$, 40% are collected by purely executing $\pi $, and the remaining 20% are collected by a random walk policy that takes a uniformly random action at every step. This mixture is motivated by the observation that, when deploying a policy, one may want to perform exploration in only a portion of the episodes.

• eps0.3: ${\pi}^{\prime}$ is an epsilon-greedy policy that takes a random action with probability 0.3. We use the same mixture as in eps0.1.

• gauss0.1: ${\pi}^{\prime}$ adds independent $\mathcal{N}(0,{0.1}^{2})$ Gaussian noise to each action sampled from $\pi $. We use the same mixture as in eps0.1.

• gauss0.3: ${\pi}^{\prime}$ adds independent $\mathcal{N}(0,{0.3}^{2})$ Gaussian noise to each action sampled from $\pi $. We use the same mixture as in eps0.1.
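The noise configurations above can be sketched roughly as follows. This is an illustrative sketch rather than the collection code used for the experiments: the function names are hypothetical, actions are assumed to lie in $[-1,1]$, noisy actions are not clipped, and the 40%/40%/20% mixture is assumed to be split by number of transitions.

```python
import random


def make_behavior_policy(policy, kind, level, action_dim, rng):
    """Wrap `policy` (state -> list of floats) with injected noise.

    kind="eps":   with probability `level`, take a uniformly random action.
    kind="gauss": add independent N(0, level^2) noise to each action dimension.
    Assumption: actions lie in [-1, 1]; noisy actions are not clipped here.
    """
    def behavior(state):
        if kind == "eps":
            if rng.random() < level:
                return [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
            return policy(state)
        if kind == "gauss":
            return [a + rng.gauss(0.0, level) for a in policy(state)]
        raise ValueError("unknown noise kind: %r" % kind)
    return behavior


def mixture_counts(total_transitions):
    """Split a dataset into the 40% / 40% / 20% mixture described above."""
    noisy = total_transitions * 40 // 100   # collected by the noisy policy
    pure = total_transitions * 40 // 100    # collected by the unmodified policy
    random_walk = total_transitions - noisy - pure  # uniformly random actions
    return {"noisy": noisy, "pure": pure, "random_walk": random_walk}


# Toy usage with a stand-in deterministic policy (hypothetical).
rng = random.Random(0)
toy_policy = lambda state: [0.0, 0.0]
gauss_policy = make_behavior_policy(toy_policy, "gauss", 0.1, 2, rng)
noisy_action = gauss_policy(None)
counts = mixture_counts(1_000_000)
```

For a dataset of 1 million transitions, `counts` assigns 400,000 transitions to the noisy policy, 400,000 to the unmodified policy, and 200,000 to the random walk policy.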
Hyperparameter search
As mentioned in the main text, for each variant of BRAC and each environment, we do a grid search over policy learning rate and regularization strength. For policy learning rate, we search over six values: $\{3\cdot {10}^{-6},1\cdot {10}^{-5},3\cdot {10}^{-5},0.0001,0.0003,0.001\}$. The regularization strength is controlled differently in different algorithms:

• In BCQ, we search over the perturbation range $\mathrm{\Phi}\in \{0.005,0.015,0.05,0.15,0.5\}$; $0.05$ is the value reported in the original paper (Fujimoto et al., 2018a).

• In BEAR, the regularization weight $\alpha $ is adaptively trained with dual gradient ascent based on a divergence constraint $\epsilon $ that is tuned as a hyperparameter. We search over $\epsilon \in \{0.015,0.05,0.15,0.5,1.5\}$; $0.05$ is the value reported in the original paper (Kumar et al., 2019).

• When MMD is used with a fixed $\alpha $, we search over $\alpha \in \{3,10,30,100,300\}$.

• When KL divergence is used with a fixed $\alpha $ (both KL and KL_dual), we search over $\alpha \in \{0.1,0.3,1.0,3.0,10.0\}$.

• When Wasserstein distance is used with a fixed $\alpha $, we search over $\alpha \in \{0.3,1.0,3.0,10.0,30.0\}$.
In summary, the regularization weight $\alpha $ is fixed in the simplest case, is adaptively trained under a divergence constraint $\epsilon $ in BEAR, and is replaced by the perturbation range $\mathrm{\Phi}$ in BCQ. For each of these options, we search over five values, as listed above. For existing algorithms such as BEAR and BCQ, the hyperparameters reported in their papers (Kumar et al., 2019; Fujimoto et al., 2018a) are included in the search range. We select the best hyperparameters according to the average performance over all five datasets.
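The selection rule described above (a grid over policy learning rate and regularization strength, scored by the average return over the five datasets) can be sketched as follows. The function `train_and_evaluate` is a hypothetical placeholder for training an agent and measuring its return in the environment; the dictionary keys and the toy evaluator are likewise assumptions for illustration.

```python
import itertools
import statistics

POLICY_LEARNING_RATES = [3e-6, 1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
REGULARIZATION_GRID = {
    "bcq_phi": [0.005, 0.015, 0.05, 0.15, 0.5],
    "bear_epsilon": [0.015, 0.05, 0.15, 0.5, 1.5],
    "mmd_alpha": [3, 10, 30, 100, 300],
    "kl_alpha": [0.1, 0.3, 1.0, 3.0, 10.0],
    "wasserstein_alpha": [0.3, 1.0, 3.0, 10.0, 30.0],
}
DATASETS = ["nonoise", "eps0.1", "eps0.3", "gauss0.1", "gauss0.3"]


def select_best(train_and_evaluate, learning_rates, strengths, datasets):
    """Return (mean_return, lr, strength) with the best average over datasets."""
    best = None
    for lr, strength in itertools.product(learning_rates, strengths):
        mean_return = statistics.mean(
            train_and_evaluate(lr, strength, dataset) for dataset in datasets
        )
        if best is None or mean_return > best[0]:
            best = (mean_return, lr, strength)
    return best


def toy_evaluate(lr, strength, dataset):
    """Stand-in evaluator (hypothetical): one setting is best on every dataset."""
    return 100.0 if (lr == 1e-4 and strength == 30) else 0.0


best = select_best(
    toy_evaluate, POLICY_LEARNING_RATES, REGULARIZATION_GRID["mmd_alpha"], DATASETS
)
```

Each variant is thus trained on 6 x 5 = 30 hyperparameter settings per environment and dataset. Note that this procedure relies on evaluating each trained policy in the environment, which is exactly what is unavailable in a purely offline setting.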
Implementation details
All experiments are implemented with TensorFlow and executed on CPUs. For all function approximators, we use fully connected neural networks with ReLU activations. For policy networks, we use $\mathrm{tanh}(\mathrm{Gaussian})$ on outputs following BEAR (Kumar et al., 2019), except for BCQ, where we follow its open-source implementation. For BEAR and BCQ, we use the network sizes given in their papers. For other variants of BRAC, we shrink the policy networks from $(400,300)$ to $(200,200)$ and the Q-networks from $(400,300)$ to $(300,300)$ to save computation time without losing performance. The Q-function learning rate is always 0.001. As in other deep RL algorithms, we maintain source and target Q-functions with an update rate of 0.005 per iteration. For MMD, we use Laplacian kernels with the bandwidth reported by Fujimoto et al. (2018a). For divergences in the dual form (both KL_dual and Wasserstein), we train a $(300,300)$ fully connected network as the critic in the minimax objective. A gradient penalty (the one-sided version of the penalty in Gulrajani et al. (2017), with coefficient 5.0) is applied to both KL and Wasserstein dual training. In each training iteration, the dual critic is updated for 3 steps (which we find better than only 1 step) with learning rate 0.0001. We use Adam for all optimizers. Each agent is trained for 0.5 million steps with batch size 256 (except for BCQ, where we use 100, following its open-source implementation). At test time, we follow Kumar et al. (2019) and Fujimoto et al. (2018a) by sampling 10 actions from ${\pi}_{\theta}$ at each step and taking the one with the highest learned Q-value.
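Two of the details above lend themselves to a short sketch: the target Q-function update and the test-time action selection. The sketch below assumes that the "update rate 0.005" refers to the usual soft (Polyak averaging) update of target weights; this reading, the function names, and the toy policy and Q-function are assumptions and are not taken from the released implementation.

```python
import random


def soft_update(source_weights, target_weights, tau=0.005):
    """Move each target weight a fraction `tau` toward its source weight."""
    return [
        (1.0 - tau) * target + tau * source
        for source, target in zip(source_weights, target_weights)
    ]


def select_action(state, sample_action, q_value, num_samples=10):
    """Sample `num_samples` actions and return the one with the highest Q-value."""
    candidates = [sample_action(state) for _ in range(num_samples)]
    return max(candidates, key=lambda action: q_value(state, action))


# Toy usage with a stand-in policy and Q-function (both hypothetical).
rng = random.Random(0)
toy_sample = lambda state: rng.uniform(-1.0, 1.0)
toy_q = lambda state, action: -(action - 0.5) ** 2
chosen = select_action(None, toy_sample, toy_q)
new_target = soft_update([1.0, 2.0], [0.0, 0.0])
```

With `tau=0.005`, the target weights track the source weights slowly, and `select_action` returns whichever of the 10 sampled candidates the learned Q-function scores highest.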
A.2 Value penalty vs. policy regularization
A.3 Full performance results under different hyperparameters
A.4 Additional training curves
A.5 Full performance results under the best hyperparameters
Environment: Ant-v2  Partially trained policy: 1241
dataset  nonoise  eps0.1  eps0.3  gauss0.1  gauss0.3 
SAC  0  1109  911  1071  1498 
BC  1235  1300  1278  1203  1240 
BCQ  1921  1864  1504  1731  1887 
BEAR  2100  1897  2008  2054  2018 
MMD_vp  2839  2672  2602  2667  2640 
KL_vp  2514  2530  2484  2615  2661 
KL_dual_vp  2626  2334  2256  2404  2433 
W_vp  2646  2417  2409  2474  2487 
MMD_pr  2583  2280  2285  2477  2435 
KL_pr  2241  2247  2181  2263  2233 
KL_dual_pr  2218  1984  2144  2215  2201 
W_pr  2241  2186  2284  2365  2344 
Environment: HalfCheetah-v2  Partially trained policy: 4206
dataset  nonoise  eps0.1  eps0.3  gauss0.1  gauss0.3 
SAC  5093  6174  5978  6082  6090 
BC  4465  3206  3751  4084  4033 
BCQ  5064  5693  5588  5614  5837 
BEAR  5325  5435  5149  5394  5329 
MMD_vp  6207  6307  6263  6323  6400 
KL_vp  6104  6212  6104  6219  6206 
KL_dual_vp  6209  6087  6359  5972  6340 
W_vp  5957  6014  6001  5939  6025 
MMD_pr  5936  6242  6166  6200  6294 
KL_pr  6032  6116  6035  5969  6219 
KL_dual_pr  5944  6183  6207  5789  6050 
W_pr  5897  5923  5970  5894  6031 
Environment: Hopper-v2  Partially trained policy: 1202
dataset  nonoise  eps0.1  eps0.3  gauss0.1  gauss0.3 
SAC  0.2655  661.7  701  311.2  592.6 
BC  1330  129.4  828.3  221.1  284.6 
BCQ  1543  1652  1632  1599  1590 
BEAR  0  1620  2213  1825  1720 
MMD_vp  2291  2282  1892  2255  1458 
KL_vp  2774  2360  2892  1851  2066 
KL_dual_vp  1735  2121  2043  1770  1872 
W_vp  2292  2187  2178  1390  1739 
MMD_pr  2334  1688  1725  1666  2097 
KL_pr  2574  1925  2064  1688  1947 
KL_dual_pr  2053  1985  1719  1641  1551 
W_pr  2080  2089  2015  1635  2097 
Environment: Walker-v2  Partially trained policy: 1439
dataset  nonoise  eps0.1  eps0.3  gauss0.1  gauss0.3 
SAC  131.7  213.5  127.1  119.3  109.3 
BC  1334  1092  1263  1199  1137 
BCQ  2095  1921  1953  2094  1734 
BEAR  2646  2695  2608  2539  2194 
MMD_vp  2694  3241  3255  2893  3368 
KL_vp  2907  3175  2942  3193  3261 
KL_dual_vp  2575  3490  3236  3103  3333 
W_vp  2635  2863  2758  2856  2862 
MMD_pr  2670  2957  2897  2759  3004 
KL_pr  2744  2990  2747  2837  2981 
KL_dual_pr  2682  3109  3080  2357  3155 
W_pr  2667  3140  2928  1804  2907 