The Cambridge Law Corpus: A Corpus for Legal AI Research

Abstract

We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.
