Machine Learning (ML) is increasingly applied in real-life scenarios, raising concerns about bias in automatic decision making. We focus on bias as a notion of opinion exclusion that stems from the direct application of traditional ML pipelines to infer subjective properties. We argue that such ML systems should be evaluated with subjectivity and bias in mind. Given the current lack of standards for creating evaluation benchmarks, we propose an initial list of specifications to define prior to creating evaluation datasets, so that biases can later be evaluated accurately. With the example of a sentence toxicity inference system, we illustrate how the specifications support the analysis of biases related to subjectivity. We highlight difficulties in instantiating these specifications and list future work for the crowdsourcing community to support the creation of appropriate evaluation datasets.