2026-06-25

Meeting with Tommy

What I did

Read thangellamudiBridgingRTLAssertion2026 and found some things I am confused about
Benchmarks that are resilient to test contamination essentially have a mechanism to construct tests from data created after the training cutoff date
- SWE-bench just uses pull requests from after the cutoff date
- This is the easiest way
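
A minimal sketch of the date-filter idea, assuming each candidate item carries a merge date. The cutoff value, the record fields, and the `after_cutoff` helper are all made up for illustration and are not taken from SWE-bench's actual tooling.

```python
from datetime import date

# Hypothetical training cutoff; substitute the model's documented cutoff.
TRAINING_CUTOFF = date(2025, 1, 1)

# Made-up records standing in for pull requests scraped from a repository.
pull_requests = [
    {"id": 1, "merged": date(2024, 11, 3)},
    {"id": 2, "merged": date(2025, 2, 14)},
    {"id": 3, "merged": date(2025, 6, 30)},
]

def after_cutoff(items, cutoff):
    """Keep only items merged strictly after the cutoff date."""
    return [item for item in items if item["merged"] > cutoff]

test_candidates = after_cutoff(pull_requests, TRAINING_CUTOFF)
print([pr["id"] for pr in test_candidates])  # prints [2, 3]
```

Anything merged on or before the cutoff could have been seen in training, so only the later items are safe to use as test cases.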

They claim that training should include the RTL together with the assertions instead of the assertions alone
- Meaning instead of NL-assertion pairs they train on NL-RTL+assertion pairs
But their massive dataset raises some red flags for me
The paper is very vague about where they gathered the data
So I did some digging and found that the authors of RTLLM, the benchmark that they used, also published a dataset
That dataset also has 27k rows. What are the chances?
I suspect that RTLLM is also inside the dataset, so the reported performance may just be the model regurgitating its training data.
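
One way to test this suspicion is to measure token n-gram overlap between each benchmark item and the training rows. The sketch below is my own assumption about how to do that, not a method from the paper or from RTLLM. The `ngrams` and `overlap` helpers and the toy Verilog strings are hypothetical.

```python
import re

def ngrams(text, n=5):
    """Return the set of n-token windows, ignoring case and whitespace layout."""
    tokens = re.findall(r"[A-Za-z_0-9]+|[^\sA-Za-z_0-9]", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(benchmark_item, training_item, n=5):
    """Fraction of the benchmark item's n-grams that also occur in the training item."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_item, n)) / len(bench)

# Toy strings for illustration only; real inputs would be benchmark and dataset rows.
bench = "module adder(input a, input b, output sum); assign sum = a ^ b; endmodule"
train_same = "module adder(input a,\n  input b, output sum);\n  assign sum = a ^ b;\nendmodule"
train_other = "module counter(input clk, output reg q); always @(posedge clk) q <= ~q; endmodule"

print(overlap(bench, train_same))   # reformatted copy of the benchmark item
print(overlap(bench, train_other))  # unrelated design
```

A benchmark item whose overlap with some training row is close to 1.0 is a strong hint that it leaked into the training set. The threshold for "close" would need tuning on real data.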

I made a list of all the benchmarks used by these companies.
I’ll try to see if there are common patterns across these.
I saw that these companies selectively leave out benchmarks on which their model underperforms
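
A rough sketch of how the benchmark list could be checked for patterns, assuming it is stored as a mapping from company to the set of benchmarks it reports. The company and benchmark names are placeholders, not real data, and `missing_benchmarks` is a hypothetical helper.

```python
# Placeholder names only; replace with the collected list.
reported = {
    "company_a": {"bench_x", "bench_y", "bench_z"},
    "company_b": {"bench_x", "bench_z"},
    "company_c": {"bench_x", "bench_y"},
}

def missing_benchmarks(reported):
    """For each company, list benchmarks that others report but it omits."""
    all_benchmarks = set().union(*reported.values())
    return {company: sorted(all_benchmarks - benches)
            for company, benches in reported.items()}

# Benchmarks that every company reports.
common = set.intersection(*reported.values())
missing = missing_benchmarks(reported)

print(sorted(common))
for company, gaps in missing.items():
    print(company, gaps)
```

The gaps per company are the candidates for selective omission. Whether an omission lines up with underperformance still has to be checked against scores from other sources.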