Capturing and labeling camera images in the real world is expensive, whereas synthesizing labeled images in a simulation environment makes it easy to collect large-scale image data. However, learning from only synthetic images may not achieve the desired performance in the real world due to a gap between synthetic and real images. We propose a method that transfers learned detection of an object position from a simulation environment to the real world. This method uses only a significantly limited dataset of real images while leveraging a large dataset of synthetic images using variational autoencoders. Additionally, the proposed method consistently performed well in different lighting conditions, in the presence of other distractor objects, and on different backgrounds. Experimental results showed that it achieved an accuracy of 1.5 mm to 3.5 mm on average. Furthermore, we showed how the method can be used in a real-world scenario such as a "pick-and-place" robotic task.