Interactions between government officials and civilians affect public wellbeing and the state legitimacy that is necessary for the functioning of democratic society. Police officers, the most visible and most frequently contacted agents of the state, interact with the public more than 20 million times a year during traffic stops. Today, these interactions are regularly recorded by body-worn cameras (BWCs), which are lauded as a means to enhance police accountability and improve police-public interactions. However, the timely analysis of these recordings is hampered by a lack of reliable automated tools for analyzing these complex and contested police-public interactions. This article proposes an approach to developing new multi-perspective, multimodal machine learning (ML) tools to analyze the audio, video, and transcript information from this BWC footage. Our approach begins by identifying the aspects of communication most salient to different stakeholders, including both community members and police officers. We move away from modeling approaches built around the existence of a single ground truth and instead utilize new advances in soft labeling to incorporate variation in how different observers perceive the same interactions. We argue that this inclusive approach to the conceptualization and design of new ML tools is broadly applicable to the study of communication and the development of analytic tools across domains of human interaction, including education, medicine, and the workplace.