In neural decoding research, one of the most intriguing topics is the reconstruction of perceived natural images based on fMRI signals. Previous studies have succeeded in re-creating different aspects of the visuals, such as low-level properties (shape, texture, layout) or high-level features (category of objects, descriptive semantics of scenes), but have typically failed to reconstruct these properties together for complex scene images. Generative AI has recently made a leap forward with latent diffusion models capable of generating high-complexity images. Here, we investigate how to take advantage of this innovative technology for brain decoding. We present a two-stage scene reconstruction framework called ``Brain-Diffuser''. In the first stage, starting from fMRI signals, we reconstruct images that capture low-level properties and overall layout using a VDVAE (Very Deep Variational Autoencoder) model. In the second stage, we use the image-to-image framework of a latent diffusion model (Versatile Diffusion), conditioned on predicted multimodal (text and visual) features, to generate final reconstructed images. On the publicly available Natural Scenes Dataset benchmark, our method outperforms previous models both qualitatively and quantitatively. When applied to synthetic fMRI patterns generated from individual ROI (region-of-interest) masks, our trained model creates compelling ``ROI-optimal'' scenes consistent with neuroscientific knowledge. Thus, the proposed methodology can have an impact on both applied (e.g., brain-computer interface) and fundamental neuroscience.
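
Both stages rely on features predicted from fMRI signals: a latent code for the first-stage image, and text and visual features that condition the second stage. The abstract does not say how these features are predicted, so the following is only a minimal sketch that assumes a linear ridge-regression mapping fitted on synthetic data. The helper names (`fit_ridge`, `predict_features`), the feature names, and all dimensions are hypothetical, and the generative models themselves (VDVAE, Versatile Diffusion) are not invoked.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def predict_features(fmri, weights):
    """Map fMRI patterns (trials x voxels) to predicted feature vectors."""
    return fmri @ weights


rng = np.random.default_rng(0)
n_train, n_test, n_voxels = 200, 10, 50

# Hypothetical feature spaces: one latent for the first stage and two
# conditioning feature sets for the second stage (sizes are illustrative).
feature_dims = {"vdvae_latent": 16, "text_feature": 8, "vision_feature": 8}

# Synthetic stand-ins for measured fMRI responses.
fmri_train = rng.standard_normal((n_train, n_voxels))
fmri_test = rng.standard_normal((n_test, n_voxels))

predicted = {}
for name, dim in feature_dims.items():
    # Synthetic targets drawn from a random linear map plus small noise.
    true_map = rng.standard_normal((n_voxels, dim))
    noise = 0.01 * rng.standard_normal((n_train, dim))
    targets = fmri_train @ true_map + noise

    weights = fit_ridge(fmri_train, targets, alpha=1e-3)
    predicted[name] = predict_features(fmri_test, weights)
```

In a pipeline of this shape, `predicted["vdvae_latent"]` would be decoded into a coarse first-stage image, and the predicted text and vision features would then condition the image-to-image diffusion step that refines it.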