ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs
- Year
- 2026
- Authors
- Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang
- DOI
- 10.1145/3770743.3804146
Related
Persistent Notes
Quick Summary
- Single pass is the way to go when performing assertion generation
- Introduce a multi-agent approach
- ChatSVA is a combination of GPT-4o and 4 LLama-3.1 finetuned models.
- AgentBridge is a framework to generate the data on which the above models will be trained on.
- Evaluated theirs on FIXME and found it to outperformed GPT-4o and Deepseek R1
Criticisms
Using older models
- GPT-4o was released in May 2024
- DeepSeek R1 was released in January 2025
- DAC ‘26 deadline was March 6, 2026
- The models aren’t SOTA by any measure.
- o1, 4.5, 4.1, o3, 5, 5.1, 5.2, 5.3 were all released before the DAC deadline
- Running FIXME on these could have easily been done
FIXME
- FIXME isn’t available publicly we don’t know how it actually is
Agent
- Their use of the word agent is misleading. As far as I know, agent implies a hierarchy of models possibly working in parallel. Their architecture seems like a mixture of different fine tuned models that work in series. No model controls another one and there are no loops.
Math
- We don’t know where $D_{gold}$ comes from. We also don’t know anything about the prompts that they use to generate these. It is very vague.
- Functional coverage is not defined anywhere
- Equation 4 seems too trivial to be included. $100%$ is just 1. Then the equation essentially describes taking the average. I don’t understand why they needed to call this General Pass Rate and even give it an equation.
- Section 5 has a bunch of math-y formula but we don’t have any explanation of $\mathcal{G}$.
- We don’t see the bug types that they are attempting to hunt
Confused about
- They say that RTL induces bugs into the assertions. While this is technically true, we heard from brad that it is standard practice to derive assertions from implementations. I cannot remember the reference for this off the top of my head but even proven assertions about faulty behavior can help the designer catch the bug. For example, if $G(A)$ is extracted and proven then it implies a stuck at. This immediately draws the attention of a VE compared to perusing the code base.
- Do we need separate agents1-4? Does fine-tuning a model with all the 4 kinds of prompts decrease performance? An LLM usually is fine-tuned on so many different kinds of prompts that adding 4 more is not an issue. I’m not sure if agent1-4 are a single model or multiple. I’m assuming multiple because they gave them different numbers.
Agree
- A single pass approach limits our ability to scale and doesn’t take full advantage of the model. We are limited by the no-shot capabilities of the model in this approach.
- We need something more than that. But their approach is just splitting the “cognitive” load into separate chat sessions. While this certain helps, I was imagining a loop of some sorts when I heard saw the word “agent”.
In-text annotations
“multi-agent framework” Page 1
“Evaluated on 24 RTL designs, ChatSVA achieves 98.66% syntax and 96.12% functional pass rates, generating 139.5 SVAs per design with 82.50% function coverage. This represents a 33.3 percentage point improvement in functional correctness and an over 11× enhancement in function coverage compared to the previous state-of-the-art (SOTA)." Page 1
“Current research has rapidly converged on two main strategies: (1) augmenting general-purpose models with prompt engineering or RAG [3, 16, 22, 23, 36], and (2) fine-tuning smaller models with domain-specific data [2, 20, 26, 28]." Page 2
“Most approaches treat SVA generation as a single-step translation from Spec to code” Page 2
“Furthermore, this methodological weakness is compounded by a risky practice: using the RTL code itself as an input to guide generation [3, 16, 22]. This introduces “verification contamination”, where the model may learn and replicate existing design flaws, undermining the very purpose of verification” Page 2
“This single-step approach forces models to prioritize syntactic correctness over functional efficacy, as it lacks the intermediate steps to reason about complex design intent. This forces a single, unguided inferential leap, making LLMs prone to “functional hallucinations”, generating SVAs that are syntactically valid but functionally incorrect” Page 2
“S1 directly targets C2 and C1 by decomposing the monolithic task into a pipeline of modular sub-tasks. This provides the intermediate reasoning steps C2 identifies as missing, shifting the focus from syntactic correctness to functional intent." Page 3
“A complete Spec should encompass functional descriptions, input and output ports, internal register mappings, and the temporal behavior of internal signals." Page 3
“addresses the data scarcity challenge in SVA generation." Page 3
“AgentBridge is built upon three foundational principles that guide its data synthesis process." Page 3
“three principles." Page 4
“Directional Information Constraint." Page 4
“G” Page 4
“Ground Truth Provenance” Page 4
“Output Verifiability” Page 4
“We evaluate ChatSVA against general-purpose models (GPT-4o, DeepSeek-R1) and AssertLLM [36]. AssertLLM is selected as the primary SOTA baseline as it is also a multi-agent framework that generates SVAs from Spec, making it a direct counterpart to our approach” Page 5
“All methods are evaluated on the FIXME benchmark [31], a comprehensive public test set featuring diverse industrial designs (e.g., CPUs, GPUs, IPs)." Page 5
“Function Coverage evaluates the bugdetection capability of functionally correct SVAs against predefined bug types. (Protocol Violations, Illegal Branch, etc.)" Page 5
%% Import Date: 2026-06-30T18:15:29.474-04:00 %%