../

ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs


Year
2026
Authors
Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang
DOI
10.1145/3770743.3804146

Related

Persistent Notes

Quick Summary

  • Single pass is the way to go when performing assertion generation
  • Introduce a multi-agent approach
    • ChatSVA is a combination of GPT-4o and 4 LLama-3.1 finetuned models.
    • AgentBridge is a framework to generate the data on which the above models will be trained on.
  • Evaluated theirs on FIXME and found it to outperformed GPT-4o and Deepseek R1

Criticisms

Using older models

  • GPT-4o was released in May 2024
  • DeepSeek R1 was released in January 2025
  • DAC ‘26 deadline was March 6, 2026
  • The models aren’t SOTA by any measure.
  • o1, 4.5, 4.1, o3, 5, 5.1, 5.2, 5.3 were all released before the DAC deadline
  • Running FIXME on these could have easily been done

FIXME

  • FIXME isn’t available publicly we don’t know how it actually is

Agent

  • Their use of the word agent is misleading. As far as I know, agent implies a hierarchy of models possibly working in parallel. Their architecture seems like a mixture of different fine tuned models that work in series. No model controls another one and there are no loops.

Math

  • We don’t know where $D_{gold}$ comes from. We also don’t know anything about the prompts that they use to generate these. It is very vague.
  • Functional coverage is not defined anywhere
  • Equation 4 seems too trivial to be included. $100%$ is just 1. Then the equation essentially describes taking the average. I don’t understand why they needed to call this General Pass Rate and even give it an equation.
  • Section 5 has a bunch of math-y formula but we don’t have any explanation of $\mathcal{G}$.
  • We don’t see the bug types that they are attempting to hunt

Confused about

  • They say that RTL induces bugs into the assertions. While this is technically true, we heard from brad that it is standard practice to derive assertions from implementations. I cannot remember the reference for this off the top of my head but even proven assertions about faulty behavior can help the designer catch the bug. For example, if $G(A)$ is extracted and proven then it implies a stuck at. This immediately draws the attention of a VE compared to perusing the code base.
  • Do we need separate agents1-4? Does fine-tuning a model with all the 4 kinds of prompts decrease performance? An LLM usually is fine-tuned on so many different kinds of prompts that adding 4 more is not an issue. I’m not sure if agent1-4 are a single model or multiple. I’m assuming multiple because they gave them different numbers.

Agree

  • A single pass approach limits our ability to scale and doesn’t take full advantage of the model. We are limited by the no-shot capabilities of the model in this approach.
  • We need something more than that. But their approach is just splitting the “cognitive” load into separate chat sessions. While this certain helps, I was imagining a loop of some sorts when I heard saw the word “agent”.

In-text annotations

“multi-agent framework” Page 1

“Evaluated on 24 RTL designs, ChatSVA achieves 98.66% syntax and 96.12% functional pass rates, generating 139.5 SVAs per design with 82.50% function coverage. This represents a 33.3 percentage point improvement in functional correctness and an over 11× enhancement in function coverage compared to the previous state-of-the-art (SOTA)." Page 1

“Current research has rapidly converged on two main strategies: (1) augmenting general-purpose models with prompt engineering or RAG [3, 16, 22, 23, 36], and (2) fine-tuning smaller models with domain-specific data [2, 20, 26, 28]." Page 2

“Most approaches treat SVA generation as a single-step translation from Spec to code” Page 2

“Furthermore, this methodological weakness is compounded by a risky practice: using the RTL code itself as an input to guide generation [3, 16, 22]. This introduces “verification contamination”, where the model may learn and replicate existing design flaws, undermining the very purpose of verification” Page 2

“This single-step approach forces models to prioritize syntactic correctness over functional efficacy, as it lacks the intermediate steps to reason about complex design intent. This forces a single, unguided inferential leap, making LLMs prone to “functional hallucinations”, generating SVAs that are syntactically valid but functionally incorrect” Page 2

“S1 directly targets C2 and C1 by decomposing the monolithic task into a pipeline of modular sub-tasks. This provides the intermediate reasoning steps C2 identifies as missing, shifting the focus from syntactic correctness to functional intent." Page 3

“A complete Spec should encompass functional descriptions, input and output ports, internal register mappings, and the temporal behavior of internal signals." Page 3

“addresses the data scarcity challenge in SVA generation." Page 3

“AgentBridge is built upon three foundational principles that guide its data synthesis process." Page 3

“three principles." Page 4

“Directional Information Constraint." Page 4

“G” Page 4

“Ground Truth Provenance” Page 4

“Output Verifiability” Page 4

“We evaluate ChatSVA against general-purpose models (GPT-4o, DeepSeek-R1) and AssertLLM [36]. AssertLLM is selected as the primary SOTA baseline as it is also a multi-agent framework that generates SVAs from Spec, making it a direct counterpart to our approach” Page 5

“All methods are evaluated on the FIXME benchmark [31], a comprehensive public test set featuring diverse industrial designs (e.g., CPUs, GPUs, IPs)." Page 5

“Function Coverage evaluates the bugdetection capability of functionally correct SVAs against predefined bug types. (Protocol Violations, Illegal Branch, etc.)" Page 5

%% Import Date: 2026-06-30T18:15:29.474-04:00 %%