HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks

Abstract

The rapid scaling of large language model (LLM) training and inference has driven their adoption in semiconductor design across academia and industry. While most prior work evaluates LLMs on hardware description language (HDL) tasks, particularly Verilog, designers are increasingly using high-level synthesis (HLS) to build domain-specific accelerators and complex hardware systems. However, benchmarks and tooling to comprehensively evaluate LLMs for HLS design tasks remain scarce. To address this, we introduce HLS-Eval, the first complete benchmark and evaluation framework for LLM-driven HLS design. HLS-Eval targets two core tasks: (1) generating HLS code from natural language descriptions, and (2) performing HLS-specific code edits to optimize performance and hardware efficiency. The benchmark includes 94 unique designs drawn from standard HLS benchmarks and novel sources. Each case is prepared via a semi-automated flow that produces a natural language description and a paired testbench for C-simulation and synthesis validation, ensuring each task is "LLM-ready." Beyond the benchmark, HLS-Eval offers a modular Python framework for automated, parallel evaluation of both local and hosted LLMs. It includes a parallel evaluation engine, direct HLS tool integration, and abstractions to support different LLM interaction paradigms, enabling rapid prototyping of new benchmarks, tasks, and LLM methods. We demonstrate HLS-Eval through baseline evaluations of open-source LLMs on Vitis HLS, measuring outputs across four key metrics (parseability, compilability, runnability, and synthesizability) that reflect the iterative HLS design cycle. We also report pass@k metrics, establishing clear baselines and reusable infrastructure for the broader LLM-for-hardware community. All benchmarks, framework code, and results are open-sourced at https://github.com/stefanpie/hls-eval.
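
The abstract reports pass@k across four stages but does not say how pass@k is computed. As a rough illustration only, the sketch below assumes the commonly used unbiased estimator, pass@k = 1 - C(n - c, k) / C(n, k), where n is the number of samples generated for a design and c is the number that pass a given stage. The function names and the sample counts are hypothetical and are not taken from the HLS-Eval codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed.

    Assumes the standard unbiased estimator 1 - C(n - c, k) / C(n, k);
    the abstract does not confirm that HLS-Eval uses this exact form.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, any draw of k includes a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def stage_pass_at_k(n: int, passed: dict[str, int], k: int) -> dict[str, float]:
    """Apply pass_at_k to each evaluation stage independently."""
    return {stage: pass_at_k(n, c, k) for stage, c in passed.items()}


# Hypothetical counts for one design: 10 samples, fewer passing at each
# later stage of the parse -> compile -> run -> synthesize cycle.
example_counts = {
    "parseability": 9,
    "compilability": 6,
    "runnability": 4,
    "synthesizability": 3,
}

if __name__ == "__main__":
    for stage, score in stage_pass_at_k(10, example_counts, k=1).items():
        print(f"{stage}: pass@1 = {score:.2f}")
```

For k = 1 the estimator reduces to the plain pass rate c / n, so the later stages in this made-up example score lower simply because fewer samples survive each step.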
