CsFEVER and CTKFacts: Czech Datasets for Fact Verification

Abstract

In this paper we present two Czech datasets intended for training automated fact-checking machine learning models. Specifically, we deal with the task of assessing the veracity of a textual claim with respect to a (presumably) verified corpus. The output of the system is the claim classification SUPPORTS or REFUTES, complemented with evidence documents, or NEI (Not Enough Info) alone. First, we publish CsFEVER, a dataset of approximately 112k claims, which is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset. We took a hybrid approach of machine translation and language alignment; the same method (and the tools we provide) can be easily applied to other languages. The second dataset, CTKFacts, comprises 3,097 claims and is built on a corpus of approximately two million Czech News Agency news reports. We present an extended methodology based on the FEVER approach. Most notably, we describe a method to automatically generate wider claim contexts (dictionaries) for non-hyperlinked corpora. The datasets are analyzed for spurious cues, which are annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline.
