2026-06-19

FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware

Year: 2024
Authors: Minwoo Kang, Mingjie Liu, Ghaith Bany Hamad, Syed Suhaib, Haoxing Ren
DOI: 10.48550/arXiv.2410.23299

Persistent Notes

In-text annotations

“in tasks pertaining to FV." Page 1

What should be my focus? Is it on FV as a whole or just generating assertions from a RTL design? Page 1

“The benchmark consists of three sub-tasks that measure LLM capabilities at different levels—from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without additional human input." Page 1

“1 Introduction” Page 1

“FV workflows suffer from steep engineering costs of manually crafting formal testbenches and collateral by human experts." Page 1

“While prior work has demonstrated that current LLMs have the potential to generate hardware assertions, their evaluation, however, has been limited to a hand-select set of test instances with typically less than 10 or 20 problems. Due to the dearth of publicly available repositories of reliable designs and formal testbenches, it has been challenging to curate a dataset for evaluation that encompasses a variety of designs. Furthermore, prior evaluation of LLMs has considered a limited set of tasks which renders a holistic evaluation of model capabilities unviable. This altogether raises the question: how should we comprehensively understand and quantify current LLMs capabilities in performing tasks in hardware formal verification?" Page 2

Motivation Page 2

“diverse tasks” Page 2

I agree that it is on diverse tasks but how diverse are the tasks themselves? Page 2

“As test problem instances, we either collect human-written designs and testbenches or present methodologies to scalably generate synthetic yet realistic test cases motivated 2 by real-world designs” Page 2

To combat the lack of datasets they generate new stuff. I don’t think they are generated on the fly. Their repository has scripts to generate them and they recommended some knobs to generate a small set of them. I think this is distinct from a truly dynamic test where each run an entirely new test is generated without a template. Are they different FSM examples that different? They different in scale and may have different operators but they fundamentally represent the same thing. Page 2

“2 Preliminaries” Page 3

“3 Overview of the FVEval Benchmark Framework” Page 3

“3.1 Motivation” Page 3

“2) writing testbenches that consist of assertions checking the design against the specification” Page 3

They do write testbenches for RV but Idk if I would all them testbenches in the context of FV Page 3

“Do LLMs have the capability to generate SVA assertions, given real-world, human-written testbenches and high-level specifications of design functionality?" Page 3

““Can LLMs flexibly handle diverse NL specifications of formal properties and accurately formulate the same formal logic in SVA syntax?”" Page 4

“In Design2SVA, we ask: “Can LLMs craft relevant SVA assertions directly from design RTL and without human guidance?” Compared to previous benchmarks, this task demands a higher-level of understanding RTL code and the semantics of the hardware module described by the RTL. Models successful in this benchmark could have potential to realize further advanced usages of language models in hardware FV, as artificial agents assisting human engineers." Page 4

“3.2 NL2SVA-Human: Assertion Generation with Real-World Testbenches” Page 4

“NL2SVA-Human Inputs Input Context: Testbench SystemVerilog [arbiter_rr.sv] Question: Create a SVA assertion that checks: whether starvation occurs, i.e. check that each request from client is eventually granted. Expected Output (Reference Assertion) assert property (@(posedge clk) disable iff (tb_reset) (!busy && |tb_req && (tb_gnt == ’d0)) !== 1’b1 ); LLM Generated Output Examples: GPT-4: assert property (@(posedge clk) disable iff (tb_reset) (tb_req && !busy) |-> tb_gnt ); Llama-3-70B-Chat: assert property (@(posedge clk) disable iff (tb_reset) |tb_req && !busy |=> ##[1:$] (|tb_gnt)) );" Page 4

“Our method utilizes a custom-implemented Jasper function that formally proves logical equivalence, and further, can also evaluate whether there is an implication relationship between the two assertions. We also report a relaxed metric of functional accuracy that includes such a case of partial equivalence and thereby capture the performance of LLMs at a finer granularity than only considering exact equivalence." Page 5

The assertion generated by the LLM just has to have the same functionality rather than be verbatim.

How do they prove equivalence? Page 5

“3.3 NL2SVA-Machine: Synthetic Benchmark for Stress-Testing Formal Assertion Generation” Page 5

"(1) random SVA assertion generation, based on random sampling of SVA operators and symbolic signal names; (2) LLM generation of NL descriptions for each random assertion; (3) LLM as a critic to assess whether the NL description accurately reflects the temporal logic of the assertion—if this fails, re-try description generation; and (4) Human inspection to finalize the appropriateness of generated descriptions." Page 5

“3.4 Design2SVA: Direct Generation of Formal Assertions from Design RTL Alone” Page 6

“Our objective in formulating the Design2SVA benchmark is to present a set of test cases consisting of design RTL that are: (1) sufficiently correct, as in there exist formal properties that can be proven to be true; (2) relevant to real-world FV use-cases; (3) suitable as test instances for language models; and (4) varied in terms of complexity. While it is ideal to collect and curate existing RTL examples, we find that openly available SystemVerilog/Verilog repositories contain limited numbers of verified RTL designs. Of the cases that satisfy the first and second criteria, such as module instances from OpenTitan [33], these cases fail to be suitable for language model evaluation, as each module is a part of a large System-on-a-Chip (SoC) system and the context needed to resolve all sub-module dependency information is prohibitively large." Page 6

Reasons why not to use Existing RTL design such as opencores Page 6

“scalably generate complex and parameterized synthetic test instances that are derived from common design patterns encountered in industrial FV workflows." Page 6

“arithmetic pipelines that resemble scenarios where we check for data integrity and forward propagation across datapaths; and (2) finite-state machines (FSMs) that commonly appear in control logic implementations, such as cache controllers, memory interfaces, etc." Page 6

"(1) Syntax—we measure the LLM generated assertion is first syntactically correct; (2) Functionality—we use the results of formal proofs, i.e. whether the assertions are proven with model checkers and other formal engines in industrial tools, as an indication of functional correctness." Page 6

“4 Results and Analysis” Page 8

“4.1 Experiment Setup” Page 8

“4.2 NL2SVA-Human Results” Page 8

“LLMs can generate syntactically correct SVA code but are still prone to hallucinations." Page 9

“Partial functional equivalence as a correctness metric reveals further understanding of model capabilities." Page 10

“Sampling multiple response candidates improves accuracy." Page 11

“4.3 NL2SVA-Machine Results” Page 11

“4.4 Design2SVA Results” Page 12

“5 Related Work” Page 14

“6 Limitations and Future Work” Page 15

“besides the arithmetic pipeline and FSMs considered in this work." Page 15

“7 Conclusion” Page 15

“Acknowledgements” Page 15

%% Import Date: 2026-06-19T18:45:50.714-04:00 %%