### Abstract

Multi-objective reinforcement learning (MORL) is a relatively new field which builds on conventional reinforcement learning (RL) to solve multi-objective problems. One common approach extends scalar-valued Q-learning by using vector Q-values in combination with a utility function, which captures the user's preferences for action selection. This study builds on prior work and focuses on the factors that influence how frequently value-based MORL Q-learning algorithms learn the optimal policy for an environment with stochastic state transitions, in scenarios where the goal is to maximise the Scalarised Expected Return (SER), that is, to maximise the average outcome over multiple runs rather than the outcome within each individual episode. The interaction between stochastic environments and MORL Q-learning algorithms is analysed on a simple multi-objective Markov decision process (MOMDP), the Space Traders problem, and several variants of it. The empirical evaluations show that a well-designed reward signal can improve the performance of the original baseline algorithm; however, it is still not enough to address more general environments. A variant of MORL Q-learning incorporating global statistics is shown to outperform the baseline method on the original Space Traders problem, but remains below 100 percent effectiveness in finding the desired SER-optimal policy at the end of training. On the other hand, option learning is guaranteed to converge to the desired SER-optimal policy, but it does not scale to more complex real-world problems. The main contribution of this thesis is to identify the extent to which noisy Q-value estimates impair the ability to learn optimal policies under the combination of stochastic environments, non-linear utility and a constant learning rate.
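As a rough illustration of the vector-valued Q-learning described above, the following is a minimal sketch, not the thesis's implementation. It assumes two objectives and a thresholded lexicographic utility (the first objective is capped at a threshold, and ties are broken on the second). The function names `tlo_key`, `greedy_action` and `q_update`, the table layout, and the threshold values are all hypothetical and chosen only for the example. The sketch also omits any conditioning on reward accumulated earlier in the episode, which value-based MORL methods often use with non-linear utilities.

```python
import numpy as np


def tlo_key(q_vec, threshold):
    """Sort key for an assumed thresholded lexicographic utility.

    The first objective only counts up to `threshold`; among vectors that
    tie on the capped first objective, the second objective decides.
    """
    return (min(q_vec[0], threshold), q_vec[1])


def greedy_action(q_values, threshold):
    """Pick the action whose Q-vector ranks highest under `tlo_key`.

    `q_values` has shape (n_actions, 2): one Q-vector per action.
    """
    return max(range(len(q_values)),
               key=lambda a: tlo_key(q_values[a], threshold))


def q_update(q_table, state, action, reward_vec, next_state,
             alpha, gamma, threshold, terminal):
    """One vector-valued Q-learning update with a constant learning rate.

    `q_table` has shape (n_states, n_actions, 2). The bootstrap target uses
    the Q-vector of the action that is greedy under the utility, rather
    than a scalar maximum.
    """
    reward_vec = np.asarray(reward_vec, dtype=float)
    if terminal:
        target = reward_vec
    else:
        best = greedy_action(q_table[next_state], threshold)
        target = reward_vec + gamma * q_table[next_state, best]
    q_table[state, action] += alpha * (target - q_table[state, action])
```

Because the utility is applied to the estimated Q-vectors themselves, noise in those estimates can flip which action ranks highest, which is the kind of sensitivity examined here for stochastic environments with a constant learning rate.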