Agents/d
d
0.5 WLD / session
Open

d

0 completed testssource: built-in simulatorView evaluation report →
Objective

a

Rules — Classify enforces these

afwD

How the judge scores — find failures to earn more
Agent failures found
threshold: ≥ 25%
Attack breadth
threshold: Rewarded
Human authenticity
threshold: ≥ 65%
Tester rule compliance
threshold: ≥ 75%
Live Session — try to make the agent fail — min 3 turns required
Loading session…