Meeting with Tommy
What I did
- Read thangellamudiBridgingRTLAssertion2026 and found some things I am confused about
- Benchmarks that are resilient to test contamination essentially have a mechanism to construct tests after the model's training cutoff date
- SWE-bench just uses pull requests created after the cutoff date
- This is the easiest way
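A minimal sketch of the cutoff-date idea above, assuming each candidate test item carries a creation date. The record layout, the IDs, and the cutoff value are all hypothetical and only illustrate the filtering step, not how any specific benchmark implements it.

```python
from datetime import date

# Hypothetical records: each candidate test item carries the date it was created.
candidates = [
    {"id": "pr-101", "created": date(2023, 3, 14)},
    {"id": "pr-202", "created": date(2024, 6, 2)},
    {"id": "pr-303", "created": date(2024, 11, 20)},
]


def after_cutoff(items, cutoff):
    """Keep only items created strictly after the model's training cutoff."""
    return [item for item in items if item["created"] > cutoff]


cutoff = date(2024, 1, 1)  # assumed cutoff, for illustration only
fresh = after_cutoff(candidates, cutoff)
print([item["id"] for item in fresh])
```

The filter is only as trustworthy as the timestamps: an item created after the cutoff can still copy older material, so this reduces contamination risk rather than ruling it out.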
Questions about the paper
- They claim that training should include RTL with assertions instead of just assertions
- Meaning instead of NL-assertion pairs, they train on NL-RTL+assertion pairs
- But their massive dataset raises some red flags for me
- The paper is very vague about where they gathered the data
- So I did some digging and found that the authors of RTLLM, the benchmark that they used, also published a dataset
- It also has 27k rows. What are the chances?
- I think that RTLLM is also inside the dataset, so the reported performance may just be the model regurgitating its training data.
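One way to test the suspicion above would be a word-level n-gram overlap check between the benchmark items and the training rows. This is a rough sketch, not the paper's method: the function names, the toy strings, and the `n` and `threshold` values are assumptions, and the real RTLLM and dataset contents would have to be loaded separately.

```python
import re


def ngrams(text, n=8):
    """Return the set of word-level n-grams in text, after lowercasing."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_row, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the training row."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_row, n)) / len(bench)


def flag_contaminated(benchmark, training, n=8, threshold=0.5):
    """Return indices of benchmark items whose best overlap with any training row meets the threshold."""
    flagged = []
    for i, item in enumerate(benchmark):
        best = max((overlap_ratio(item, row, n) for row in training), default=0.0)
        if best >= threshold:
            flagged.append(i)
    return flagged


# Toy strings standing in for benchmark items and training rows (not real data).
benchmark = [
    "module adder input a input b output sum assign sum = a + b",
    "module counter with synchronous reset and enable",
]
training = [
    "// example\nmodule adder input a input b output sum assign sum = a + b endmodule",
    "a fifo with depth sixteen",
]
print(flag_contaminated(benchmark, training, n=3))
```

Exact n-gram matching misses renamed signals and reformatted code, so a clean result here would not prove the absence of leakage; a high overlap, on the other hand, would be strong evidence.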
General LLM Benchmarks
- I made a list of all the benchmarks used by these companies.
- I’ll try to see if there are common patterns across these.
- I saw that these companies selectively omit benchmarks on which their models underperform
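A small sketch of how the benchmark list could be tallied to look for patterns and gaps. The vendor and benchmark names below are placeholders, not real data, and the `min_vendors` threshold is an arbitrary assumption.

```python
from collections import Counter

# Placeholder data only: vendor and benchmark names are made up for illustration.
reported = {
    "vendor_a": {"bench_1", "bench_2", "bench_3"},
    "vendor_b": {"bench_1", "bench_3"},
    "vendor_c": {"bench_1", "bench_2"},
}


def benchmark_counts(reported):
    """Count how many vendors report each benchmark."""
    return Counter(b for benches in reported.values() for b in benches)


def omitted(reported, min_vendors=2):
    """For each vendor, list benchmarks that at least min_vendors other vendors report but it does not."""
    result = {}
    for vendor, benches in reported.items():
        others = Counter(
            b for v, bs in reported.items() if v != vendor for b in bs
        )
        result[vendor] = sorted(
            b for b, count in others.items()
            if count >= min_vendors and b not in benches
        )
    return result


print(benchmark_counts(reported))
print(omitted(reported))
```

An omission flagged this way is only a lead to follow up on: a vendor may skip a benchmark for reasons unrelated to poor results.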