Abstract
With the advent of deep learning, object detection drifted from a bottom-up to a top-down recognition problem. State of the art algorithms enumerate a near-exhaustive list of object locations and classify each into: object or not. In this paper, we show that bottom-up approaches still perform competitively. We detect four extreme points (top-most, left-most, bottom-most, right-most) and one center point of objects using a standard keypoint estimation network. We group the five keypoints into a bounding box if they are geometrically aligned. Object detection is then a purely appearance-based keypoint estimation problem, without region classification or implicit feature learning. The proposed method performs on-par with the state-of-the-art region based detection methods, with a bounding box AP of 43.2% on COCO test-dev. In addition, our estimated extreme points directly span a coarse octagonal mask, with a COCO Mask AP of 18.9%, much better than the Mask AP of vanilla bounding boxes. Extreme point guided segmentation further improves this to 34.6% Mask AP.
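
The grouping step can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function name `group_extreme_points`, the `center_score` callable (a stand-in for a lookup into a predicted center heatmap), and the `threshold` default are all assumptions made for illustration. The sketch enumerates combinations of candidate extreme points, discards geometrically impossible ones, and keeps a box only when the score at its geometric center is high enough.

```python
from itertools import product


def group_extreme_points(tops, lefts, bottoms, rights, center_score, threshold=0.1):
    """Brute-force grouping of extreme points into boxes (illustrative only).

    Points are (x, y) tuples in image coordinates, with y growing downward.
    `center_score` is a hypothetical callable (x, y) -> float standing in for
    a center-heatmap lookup; `threshold` is an assumed, illustrative value.
    """
    boxes = []
    for t, l, b, r in product(tops, lefts, bottoms, rights):
        # The left and right points must lie vertically between top and bottom.
        if not (t[1] <= l[1] <= b[1] and t[1] <= r[1] <= b[1]):
            continue
        # The top and bottom points must lie horizontally between left and right.
        if not (l[0] <= t[0] <= r[0] and l[0] <= b[0] <= r[0]):
            continue
        # Geometric center of the box spanned by the four extreme points.
        cx = (l[0] + r[0]) / 2
        cy = (t[1] + b[1]) / 2
        # Keep the box only if the center response supports it.
        if center_score(cx, cy) >= threshold:
            boxes.append((l[0], t[1], r[0], b[1]))  # (x_min, y_min, x_max, y_max)
    return boxes
```

For example, calling the function with one candidate per side, such as `tops=[(5, 0)]`, `lefts=[(0, 4)]`, `bottoms=[(6, 10)]`, `rights=[(10, 5)]`, yields a single box `(0, 0, 10, 10)` provided the center response at `(5.0, 5.0)` passes the threshold.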