We present a reinforcement learning approach for detecting objects within an image. Our approach performs a step-wise deformation of a bounding box with the goal of tightly framing the object. It uses a hierarchical, tree-like representation of predefined region candidates, which the agent can zoom in on. This reduces the number of region candidates that must be evaluated, so the agent can afford to compute new feature maps before each step to enhance detection quality. We compare an approach based purely on zoom actions with one that adds a second refinement stage to fine-tune the bounding box after each zoom step. We also improve the fitting ability by allowing for different aspect ratios of the bounding box. Finally, we propose different reward functions that better guide the agent along its search trajectories. Experiments indicate that each of these extensions leads to more correct detections. The best-performing approach comprises a zoom stage and a refinement stage, uses aspect-ratio-modifying actions, and is trained using a combination of three different reward metrics.
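To make the hierarchical zoom idea concrete, the following is a minimal sketch and not the implementation described here. It rests on several assumptions that the abstract does not state: each region has five predefined children (four corners plus the center), each child spans 75% of its parent's width and height, and a greedy oracle that knows the ground-truth box stands in for the learned agent. The names `zoom_candidates`, `iou`, and `greedy_zoom` are hypothetical.

```python
from typing import List, Tuple

# Box layout is an assumption: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def zoom_candidates(box: Box, scale: float = 0.75) -> List[Box]:
    """Return five child regions of `box` (assumed layout: corners + center)."""
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) * scale, (y1 - y0) * scale
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return [
        (x0, y0, x0 + w, y0 + h),                          # top-left
        (x1 - w, y0, x1, y0 + h),                          # top-right
        (x0, y1 - h, x0 + w, y1),                          # bottom-left
        (x1 - w, y1 - h, x1, y1),                          # bottom-right
        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2),  # center
    ]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_zoom(start: Box, target: Box, max_steps: int = 5) -> List[Box]:
    """Oracle stand-in for a learned policy: zoom while the IoU keeps improving."""
    path = [start]
    for _ in range(max_steps):
        best = max(zoom_candidates(path[-1]), key=lambda c: iou(c, target))
        if iou(best, target) <= iou(path[-1], target):
            break  # no child improves the fit, so stop zooming
        path.append(best)
    return path


if __name__ == "__main__":
    image_box = (0.0, 0.0, 100.0, 100.0)
    object_box = (60.0, 60.0, 100.0, 100.0)
    for step, b in enumerate(greedy_zoom(image_box, object_box)):
        print(step, b, round(iou(b, object_box), 3))
```

Because only five children are scored per step, the number of evaluated regions grows linearly with the depth of the search rather than with the total number of candidates in the tree, which is the property that makes recomputing features at each step affordable. A trained agent would replace the oracle in `greedy_zoom` with a policy that has no access to the ground-truth box at test time.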