### Abstract

Online learning has traditionally focused on the expected rewards. In this paper, a risk-averse online learning problem under the performance measure of the mean-variance of the rewards is studied. Both the bandit and full information settings are considered. The performance of several existing policies is analyzed, and new fundamental limitations on risk-averse learning are established. In particular, it is shown that although a logarithmic distribution-dependent regret in time $T$ is achievable (similar to the risk-neutral problem), the worst-case (i.e., minimax) regret is lower bounded by $\Omega(T)$ (in contrast to the $\Omega(\sqrt{T})$ lower bound in the risk-neutral problem). This sharp difference from the risk-neutral counterpart is caused by the variance in the player's decisions, which, while absent in the regret under the expected reward criterion, contributes to excess mean-variance due to the non-linearity of this risk measure. The role of the decision variance in regret performance reflects a risk-averse player's desire for robust decisions and outcomes.
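
The effect of decision variance can be illustrated with a minimal sketch. It assumes a common form of the mean-variance measure, namely the variance of the rewards minus a risk-tolerance factor `rho` times their mean (lower is better); the abstract itself does not state the definition, and the two deterministic arms, the function name, and the value of `rho` are hypothetical choices made only for illustration.

```python
from statistics import fmean, pvariance


def empirical_mean_variance(rewards, rho=1.0):
    """Assumed measure: population variance minus rho times mean (lower is better)."""
    return pvariance(rewards) - rho * fmean(rewards)


# Two hypothetical arms with deterministic rewards, so each arm alone has zero variance.
arm_a = [0.0] * 100
arm_b = [1.0] * 100

# A player who alternates between the two arms.
mixed = [0.0, 1.0] * 50

mv_a = empirical_mean_variance(arm_a)
mv_b = empirical_mean_variance(arm_b)
mv_mixed = empirical_mean_variance(mixed)

# Under an expected-reward criterion, alternating would simply average the two arms.
# Under mean-variance, the switching itself adds variance, so the two values differ.
average_of_arms = 0.5 * (mv_a + mv_b)
decision_variance_gap = mv_mixed - average_of_arms

print(mv_a, mv_b, mv_mixed, decision_variance_gap)
```

Because the measure is non-linear in the reward sequence, the alternating sequence scores worse than the average of the two arms even though neither arm is random on its own; the gap comes entirely from the variability of the player's choices.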
