Training deep reinforcement learning agents on environments with multiplelevels / scenes from the same task, has become essential for many applicationsaiming to achieve generalization and domain transfer from simulation to thereal world. While such a strategy is helpful with generalization, the use ofmultiple scenes significantly increases the variance of samples collected forpolicy gradient computations. Current methods, effectively continue to viewthis collection of scenes as a single Markov decision process (MDP), and thuslearn a scene-generic value function V(s). However, we argue that the samplevariance for a multi-scene environment is best minimized by treating each sceneas a distinct MDP, and then learning a joint value function V(s,M) dependent onboth state s and MDP M. We further demonstrate that the true joint valuefunction for a multi-scene environment, follows a multi-modal distributionwhich is not captured by traditional CNN / LSTM based critic networks. To thisend, we propose a dynamic value estimation (DVE) technique, which approximatesthe true joint value function through a sparse attention mechanism overmultiple value function hypothesis / modes. The resulting agent not only showssignificant improvements in the final reward score across a range of OpenAIProcGen environments, but also exhibits enhanced navigation efficiency andprovides an implicit mechanism for unsupervised state-space skilldecomposition.