AI evaluation platform

Test agents, review outputs, catch failures before launch.

Classify gives teams one place to evaluate both full AI agents and individual AI outputs. Real humans generate the signal, the judge scores the work, and the results come back as evidence the company can act on.

agent: live conversation testing
output: single-response review
judge: scored evaluation layer
Human review flows · Multi-model judgment · Hallucination flags · Report-ready evidence

Platform loop

Human evaluation, organized for shipping teams.

Judge-backed
01 · Company

Publish what needs testing

List a full AI agent for live conversations or post a single AI output for structured review.

02 · Tester

Generate real evaluation signal

Humans either stress-test the agent in chat or review the output directly against the task criteria.

03 · Classify Judge

Score and organize the result

The platform decides what is valid, flags failures, and turns the interaction into usable evidence for the team.

What Classify Covers

One product, two evaluation surfaces.

The core idea is simple: some teams need full conversational testing, while others just need feedback on a single output. Classify supports both without changing the reporting and judgment layer; a sketch of that shared layer follows the cards below.

Agent Testing

Companies publish an AI agent and testers interact with it through a built-in chat interface.

Output Review

Teams can also post a single AI response, summary, or code snippet for direct human evaluation.

Judge Engine

Classify scores relevance, rule compliance, authenticity, and failure patterns before treating work as high-signal.

Reports

Results roll up into structured findings companies can use before shipping to real users.
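
To make that shared layer concrete, here is a minimal TypeScript sketch of how an agent listing and an output listing could feed the same judged result and report row. Every type, field, and function name below is hypothetical, an illustration of the idea rather than Classify's actual schema or API.

```ts
// Hypothetical sketch: two listing shapes, one judgment and reporting shape.
// Names and fields are illustrative only, not Classify's real data model.

type AgentListing = {
  kind: "agent";
  objective: string;     // what testers should try to verify or break
  rules: string[];       // constraints every live session must respect
  chatEndpoint: string;  // where the built-in chat reaches the agent
};

type OutputListing = {
  kind: "output";
  task: string;          // what the AI was originally asked to do
  output: string;        // the single response, summary, or code snippet
  criteria: string[];    // what reviewers rate the output against
};

type Listing = AgentListing | OutputListing;

type JudgedResult = {
  listingId: string;
  valid: boolean;        // did the submission pass the judge?
  flags: string[];       // hallucinations, contradictions, rule breaks
  score: number;         // 0..1 overall quality signal
};

// Both surfaces collapse into the same reportable row, which is why the
// reporting and judgment layer does not change between the two.
function toReportRow(result: JudgedResult): string {
  const status = result.valid ? "valid" : "rejected";
  return [result.listingId, status, result.flags.join("; ")].join("\t");
}
```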

How Judgment Works

Every submission is scored for quality, not just volume.

The platform is built to reward useful evaluation signal. Whether the work is a live session or a one-off output review, Classify focuses on relevance, compliance, authenticity, and whether the work exposed something meaningful. A sketch of how these checks could fit together follows the pipeline list below.

01 · >= 70%

Human Authenticity

Is the work coming from a real human rather than low-effort automation or templated spam?

02 · >= 80%

Rule Compliance

Did the tester stay inside the rules the company defined for the task or session?

03 · >= 60%

Objective Relevance

Was the work actually aimed at the stated objective, rather than filler or noise?

04 · logged

Failure Capture

Did the session or review reveal unsupported claims, contradictions, or hallucination patterns worth logging?

Judge pipeline
Pre-checks: structure and obvious abuse filters
Primary judge: main scoring pass
Secondary judge: consistency check
Result: valid work becomes reportable signal
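
As a rough sketch of how that pipeline could gate a submission, the TypeScript below mirrors the stages above and uses the thresholds listed for each criterion. The type names, the pre-check regex, and the 0.2 consistency tolerance between the two judge passes are assumptions for illustration, not Classify's actual implementation.

```ts
// Hypothetical judge gate: pre-checks, primary scoring pass, secondary
// consistency pass, then a reportable result. Illustrative only.

type SubmissionScores = {
  authenticity: number;    // 0..1: real human effort vs templated spam
  compliance: number;      // 0..1: stayed inside the company's rules
  relevance: number;       // 0..1: aimed at the stated objective
  failuresFound: string[]; // unsupported claims, contradictions, hallucinations
};

const THRESHOLDS = { authenticity: 0.7, compliance: 0.8, relevance: 0.6 };

// Pre-checks: cheap structural and abuse filters before any scoring runs.
function passesPreChecks(raw: string): boolean {
  return raw.trim().length > 0 && !/(.)\1{50,}/.test(raw); // reject empty or key-mash input
}

// Primary and secondary passes must meet the thresholds and roughly agree
// before the submission counts as reportable signal. Failures found are
// logged as evidence separately; they are not gated here.
function isReportable(raw: string, primary: SubmissionScores, secondary: SubmissionScores): boolean {
  if (!passesPreChecks(raw)) return false;

  const meetsThresholds =
    primary.authenticity >= THRESHOLDS.authenticity &&
    primary.compliance >= THRESHOLDS.compliance &&
    primary.relevance >= THRESHOLDS.relevance;

  const consistent =
    Math.abs(primary.relevance - secondary.relevance) <= 0.2 &&
    Math.abs(primary.compliance - secondary.compliance) <= 0.2;

  return meetsThresholds && consistent;
}
```

Anything that fails this gate never becomes report-ready evidence; anything that passes rolls up into the findings companies see.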
For Companies

Get evaluation evidence before users find the problem.

Run live adversarial testing against full AI agents
Collect ratings and written feedback on single outputs
Capture transcripts, criteria scores, and flagged failures
Give product teams evidence they can actually act on
Publish your first listing
For Testers

Earn by doing useful review work.

Choose either live agent sessions or output-review tasks
Do real evaluation work instead of fake engagement loops
Earn when the submission is relevant, useful, and valid
Build reputation by surfacing signal, not noise
Browse live work
Go There Now

The key surfaces are already live in the app.

If you want to understand the product quickly, start with the agent marketplace. If you want a simpler review loop, the task sandbox is still available for single-output evaluation.

Browse Agents

Open the main marketplace for live AI agent testing.

Open agents

Publish An Agent

Create a listing with objective, rules, persona, and payout.

Create listing

Use The Task Sandbox

Review single AI outputs with ratings and written feedback.

Open tasks
Ready

Catch the failure before it reaches production.

Publish an agent, post an output, or start testing what is already live in the marketplace.