Classification experiments in machine learning (ML) are composed of two important parts: the data and the algorithm. Because both are fundamental to the problem, both must be considered when evaluating a model's performance against a benchmark. The best classifiers need robust benchmarks to be properly evaluated, and gold-standard benchmarks such as OpenML-CC18 are used for this purpose. However, data complexity is commonly not considered along with the model during a performance evaluation. Recent studies employ Item Response Theory (IRT) as a new approach to evaluating datasets and algorithms, capable of evaluating both simultaneously. This work presents a new evaluation methodology based on IRT and Glicko-2, together with the decodIRT tool developed to guide the estimation of IRT in ML. It explores IRT as a tool to evaluate the OpenML-CC18 benchmark for its algorithmic evaluation capability and checks whether there is a subset of datasets more efficient than the original benchmark. Several classifiers, from classic algorithms to ensembles, are also evaluated using the IRT models. The Glicko-2 rating system was applied together with IRT to summarize the innate ability and performance of the classifiers. It was noted that not all OpenML-CC18 datasets are really useful for evaluating algorithms: only 10% were rated as really difficult. Furthermore, the existence of a more efficient subset containing only 50% of the original size was verified, and Random Forest was singled out as the algorithm with the best innate ability.
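To make concrete the idea of evaluating data and algorithm simultaneously, the following is a minimal sketch of an IRT item characteristic curve. It assumes the three-parameter logistic (3PL) form, a common IRT model; the abstract does not state which model is used. It also assumes the usual mapping in which test instances play the role of items and classifiers play the role of respondents. The function name `irt_3pl` and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def irt_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here).

    theta: ability of the respondent (here, a classifier)
    a: discrimination of the item (here, a test instance)
    b: difficulty of the item
    c: guessing parameter, the lower asymptote of the curve
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, not taken from the paper.
easy_item = {"a": 1.0, "b": -2.0, "c": 0.2}
hard_item = {"a": 1.0, "b": 2.0, "c": 0.2}

# Compare how classifiers of different ability fare on each item.
for theta in (-1.0, 0.0, 1.0):
    p_easy = irt_3pl(theta, **easy_item)
    p_hard = irt_3pl(theta, **hard_item)
    print(f"theta={theta:+.1f}  easy={p_easy:.3f}  hard={p_hard:.3f}")
```

Under this form, a classifier of a given ability has a higher probability of success on the easy item than on the hard one, which is how difficulty on the data side and ability on the algorithm side end up on a shared scale.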