Abstract
Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of their final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that lead to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, so that models can be benchmarked in the context of their downstream application and integrated more readily into application workflows.