CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization

Abstract

The design flow of processors, particularly in hardware description languages (HDLs) like Verilog and Chisel, is complex and costly. While recent advances in large language models (LLMs) have significantly improved coding tasks in software languages such as Python, their application to HDL generation remains limited due to the scarcity of high-quality HDL data. Traditional methods of adapting LLMs for hardware design rely on synthetic HDL datasets, which often suffer from low quality because even advanced LLMs like GPT perform poorly in the HDL domain. Moreover, these methods focus solely on chat tasks and the Verilog language, limiting their application scenarios. In this paper, we observe that: (1) HDL code collected from the real world is of higher quality than code generated by LLMs; (2) LLMs like GPT-3.5 are better at summarizing HDL code than at generating it; and (3) an explicit language tag can help LLMs adapt to the target language when data is insufficient. Based on these observations, we propose an efficient LLM fine-tuning pipeline for HDL generation that integrates a multi-level summarization data synthesis process with a novel Chat-FIM-Tag supervised fine-tuning method. The pipeline enhances the generation of HDL code from natural language descriptions and enables the handling of various tasks such as chat and infilling incomplete code. Utilizing this pipeline, we introduce CodeV, a series of HDL generation LLMs. Among them, CodeV-All not only covers a more diverse range of languages (Verilog and Chisel) and a broader scope of tasks (chat and fill-in-middle, FIM), but also achieves performance on VerilogEval that is comparable to or even surpasses that of CodeV-Verilog, which is fine-tuned on Verilog only. This makes CodeV the first series of open-source LLMs designed for multi-scenario HDL generation.
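To make the Chat-FIM-Tag idea more concrete, the following is a minimal sketch of how training samples of this kind might be assembled. It is not the authors' implementation: the abstract gives no sample format, so the sentinel strings, the bracketed tag format, and the helper names `make_chat_sample` and `make_fim_sample` are all assumptions chosen for illustration. The natural-language description is assumed to come from an LLM summarizing real-world HDL code, a step that is not implemented here.

```python
import random

# Hypothetical sentinel strings; the abstract does not specify the actual tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def make_chat_sample(language, description, code):
    """Pair a description with real-world code, prefixed by a language tag.

    The description is assumed to be produced by summarizing the code
    (that summarization step is outside this sketch).
    """
    tag = f"[{language}]"  # assumed tag format
    return {"prompt": f"{tag} {description}", "completion": code}


def make_fim_sample(language, code, rng):
    """Cut code into prefix/middle/suffix; the target is the missing middle."""
    # Two distinct cut points, so the middle span is never empty.
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    tag = f"[{language}]"  # same assumed tag format as the chat sample
    prompt = f"{tag} {FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    return {"prompt": prompt, "completion": middle}


if __name__ == "__main__":
    rng = random.Random(0)
    verilog = "module inv(input a, output y); assign y = ~a; endmodule"
    print(make_chat_sample("Verilog", "An inverter with input a and output y.", verilog))
    print(make_fim_sample("Verilog", verilog, rng))
```

In this sketch the language tag is the only signal that separates Verilog samples from Chisel ones, which mirrors observation (3) above; a real pipeline would use the tokenizer's own special tokens rather than plain strings.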
