Abstract
This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as the generator and a multilingual sentence transformer as embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages like Italian.