In this paper, we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recording environment is assumed to be unknown, so the noise spectro-temporal modeling remains unsupervised and is based on non-negative matrix factorization (NMF). We develop a Monte Carlo expectation-maximization algorithm and we experimentally show that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF.
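
To make the unsupervised noise model more concrete, the following is a minimal sketch of NMF applied to a non-negative power spectrogram. It is illustrative only and is not the algorithm developed in this paper: it assumes the Itakura-Saito divergence with standard multiplicative updates (a common choice for power spectrograms, which the text above does not specify), works on a single channel, and uses synthetic data. The names `is_nmf`, `is_divergence`, `rank`, and `n_iter` are hypothetical.

```python
import numpy as np


def is_divergence(V, V_hat):
    """Itakura-Saito divergence between V and its approximation V_hat."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))


def is_nmf(V, rank, n_iter=100, eps=1e-10, seed=0):
    """Factorize a non-negative matrix V (freq x frames) as W @ H.

    Assumption: Itakura-Saito cost with multiplicative updates.
    W holds spectral templates, H holds their temporal activations.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # Update the spectral templates, keeping them non-negative.
        W *= ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T)
        V_hat = W @ H + eps
        # Update the temporal activations, keeping them non-negative.
        H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat))
    return W, H


# Toy usage on a synthetic stand-in for a noise power spectrogram.
rng = np.random.default_rng(1)
V = rng.random((64, 100)) ** 2 + 1e-3
W, H = is_nmf(V, rank=4)
print(W.shape, H.shape, is_divergence(V, W @ H + 1e-10))
```

In a semi-supervised setting such as the one described above, factors like `W` and `H` would be estimated from the noisy mixture itself rather than from a training set, which is what keeps the noise model unsupervised.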