Semi-Supervised Classification of Social Media Posts: Identifying Sex-Industry Posts to Enable Better Support for Those Experiencing Sex-Trafficking

Abstract

Social media is both helpful and harmful to the work against sex trafficking. On one hand, social workers carefully use social media to support people experiencing sex trafficking. On the other hand, traffickers use social media to groom and recruit people into trafficking situations. There is an opportunity to use social media data to better provide support for people experiencing trafficking.

While AI and Machine Learning (ML) have been used in work against sex trafficking, these efforts predominantly focus on detecting Child Sexual Abuse Material. Work using social media data has not been done with the intention of providing community-level support to people of all ages experiencing trafficking. Within this context, this thesis explores the use of semi-supervised classification to identify social media posts that are a part of the sex industry.

Several ML methods were explored. However, the primary method used was semi-supervised learning, which has the benefit of providing automated classification with a limited set of labelled data. Social media posts were embedded into low-dimensional vectors using FastText and Doc2Vec models. The data were then clustered using k-means clustering, and cross-validation was used to determine label propagation accuracy.

The results of the semi-supervised algorithm were encouraging. The FastText CBOW model achieved 98.6% accuracy on over 12,000 posts in clusters where label propagation was applied. The results of this thesis suggest that further semi-supervised learning, in conjunction with manual labeling, may allow the entire dataset of over 50,000 posts to be accurately labeled.

A fully labeled dataset could be used to develop a tool that provides an overview of where and when social media is used within the sex industry. This could be used to help determine better ways to provide support to people experiencing trafficking.
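The cluster-then-propagate idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions: synthetic 2-D points stand in for the FastText or Doc2Vec embeddings, the label names `sex_industry` and `other` are hypothetical, k-means is seeded with a simple deterministic farthest-point rule, and the `min_purity` threshold for deciding when a cluster is clean enough to propagate a label is invented for illustration.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's k-means with farthest-point seeding (an assumption of this sketch)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        for i, v in enumerate(vectors):
            assignments[i] = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


def propagate_labels(assignments, known_labels, min_purity=0.9):
    """Give unlabelled posts the majority label of their cluster, if pure enough."""
    by_cluster = {}
    for idx, label in known_labels.items():
        by_cluster.setdefault(assignments[idx], []).append(label)
    propagated = {}
    for idx, cluster in enumerate(assignments):
        if idx in known_labels or cluster not in by_cluster:
            continue
        label, count = Counter(by_cluster[cluster]).most_common(1)[0]
        if count / len(by_cluster[cluster]) >= min_purity:
            propagated[idx] = label
    return propagated


def cross_validated_accuracy(assignments, known_labels, folds=5, min_purity=0.9):
    """Hide one fold of known labels at a time and check what propagation predicts."""
    indices = sorted(known_labels)
    correct = total = 0
    for f in range(folds):
        held_out = set(indices[f::folds])
        training = {i: known_labels[i] for i in indices if i not in held_out}
        predicted = propagate_labels(assignments, training, min_purity)
        for i in held_out:
            if i in predicted:
                total += 1
                correct += predicted[i] == known_labels[i]
    return correct / total if total else None


# Synthetic stand-in for post embeddings: two well-separated 2-D blobs.
rng = random.Random(0)
vectors = [[c + rng.uniform(-1, 1), c + rng.uniform(-1, 1)]
           for c in [0.0] * 20 + [10.0] * 20]
truth = ["sex_industry"] * 20 + ["other"] * 20

# Only ten posts carry manual labels; the rest are unlabelled.
known_labels = {i: truth[i] for i in list(range(5)) + list(range(20, 25))}

assignments = kmeans(vectors, k=2)
propagated = propagate_labels(assignments, known_labels)
accuracy = cross_validated_accuracy(assignments, known_labels)
print(len(propagated), "posts labelled by propagation; held-out accuracy:", accuracy)
```

The purity threshold reflects one plausible design choice: clusters whose known labels disagree are left for manual labeling rather than labelled automatically, which matches the abstract's suggestion that propagation works in conjunction with manual labeling.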
