We present a system for learning the motion of independently moving objects from stereo videos. The only human annotations used in our system are 2D object bounding boxes, which introduce the notion of objects to our system. Unlike prior learning-based work, which has focused on predicting a dense pixel-wise optical flow field and/or a depth map for each image, we propose to predict object-instance-specific 3D scene flow maps and instance masks, from which we are able to derive the motion direction and speed of each object instance. Our network takes the 3D geometry of the problem into account, which allows it to correlate the input images. We present experiments evaluating the accuracy of our 3D flow vectors, as well as depth maps and projected 2D optical flow, where our jointly learned system outperforms earlier approaches trained for each task independently.