2026-06-18

Meeting with Tommy

Prep

I was thinking of completing the benchmark. I am not convinced that AssertionBench is a good benchmark. It is merely a collection of RTL files. I want to read some literature on how to construct an LLM benchmark. This in and of itself is a paper, evidenced by AssertionBench.
This allows me have valid comparisons across models, agentic loops etc. The companies themselves boast these benchmarks when they release a model.
This allows us to deal with no determinism too. I can do what these benchmarks do
I was thinking about having benchmarks of different flavors
- SVABENCH-template
- SVABENCH-icl
- SVABENCH-one_shot
This is kind of like the last evaluation metric. As we have learned it is not possible to be comprehensive and understandable at the same time i.e small benchmarks that are easy to understand are not comprehensive. So we need a smaller set of benchmarks that we can iterate on.
AssertionBench had ICL examples from HARM but we know that those kinds of templates isn’t exactly what we would want the LLMs to extract
There was a paper by Michael Ernst which talked about using static and dynamic mining in conjunction instead of thinking them as opposite poles.

How do they do it internally? Are there even iterations or do they just hope for the best?
What contributes to the improvement of these models? Is it just scale or is there some theoretical underpinning that we can optimize on?

To essentially be a Verification engineer. As we learnt over the course of writing Periscope, VE taken in a design and write assertions. They don’t write extra assertions or assertions that fail intentionally. They compress the RTL into a LTL space.
VE write assertions that are meant to fail when they are given NL documents.