AI evaluation platform

Test agents, review outputs, catch failures before launch.

Classify gives teams one place to evaluate both full AI agents and individual AI outputs. Real humans generate the signal, the judge scores the work, and the results come back as evidence the company can act on.

agent: live conversation testing
output: single-response review
judge: scored evaluation layer
Human review flows · Multi-model judgment · Hallucination flags · Report-ready evidence

Platform loop

Human evaluation, organized for shipping teams.

Judge-backed
01 · Company

Publish what needs testing

List a full AI agent for live conversations or post a single AI output for structured review.

02 · Tester

Generate real evaluation signal

Humans either stress-test the agent in chat or review the output directly against the task criteria.

03 · Classify Judge

Score and organize the result

The platform decides what is valid, flags failures, and turns the interaction into usable evidence for the team.

What Classify Covers

One product, two evaluation surfaces.

The core idea is simple: some teams need full conversational testing, while others just need feedback on a single output. Classify supports both without changing the reporting and judgment layer; a sketch of that shared layer follows the cards below.

Agent Testing

Companies publish an AI agent and testers interact with it through a built-in chat interface.

Output Review

Teams can also post a single AI response, summary, or code snippet for direct human evaluation.

Judge Engine

Classify scores relevance, rule compliance, authenticity, and failure patterns before treating work as high-signal.

Reports

Results roll up into structured findings companies can use before shipping to real users.
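
To make that shared layer concrete, here is a minimal TypeScript sketch of how an agent listing and an output listing could feed the same judged result and report row. Every type, field, and function name below is hypothetical, an illustration of the idea rather than Classify's actual schema or API.

```ts
// Hypothetical sketch: two listing shapes, one judgment and reporting shape.
// Names and fields are illustrative only, not Classify's real data model.

type AgentListing = {
  kind: "agent";
  objective: string;     // what testers should try to verify or break
  rules: string[];       // constraints every live session must respect
  chatEndpoint: string;  // where the built-in chat reaches the agent
};

type OutputListing = {
  kind: "output";
  task: string;          // what the AI was originally asked to do
  output: string;        // the single response, summary, or code snippet
  criteria: string[];    // what reviewers rate the output against
};

type Listing = AgentListing | OutputListing;

type JudgedResult = {
  listingId: string;
  valid: boolean;        // did the submission pass the judge?
  flags: string[];       // hallucinations, contradictions, rule breaks
  score: number;         // 0..1 overall quality signal
};

// Both surfaces collapse into the same reportable row, which is why the
// reporting and judgment layer does not change between the two.
function toReportRow(result: JudgedResult): string {
  const status = result.valid ? "valid" : "rejected";
  return [result.listingId, status, result.flags.join("; ")].join("\t");
}
```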

How Judgment Works

Every submission is scored for quality, not just volume.

The platform is built to reward useful evaluation signal. Whether the work is a live session or a one-off output review, Classify focuses on relevance, compliance, authenticity, and whether the work exposed something meaningful. A sketch of how these checks could fit together follows the pipeline list below.

01 · >= 70%

Human Authenticity

Is the work coming from a real human rather than low-effort automation or templated spam?

02 · >= 80%

Rule Compliance

Did the tester stay inside the rules the company defined for the task or session?

03 · >= 60%

Objective Relevance

Was the work actually aimed at the stated objective, rather than filler or noise?

04 · logged

Failure Capture

Did the session or review reveal unsupported claims, contradictions, or hallucination patterns worth logging?

Judge pipeline
Pre-checks: structure and obvious abuse filters
Primary judge: main scoring pass
Secondary judge: consistency check
Result: valid work becomes reportable signal
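
As a rough sketch of how that pipeline could gate a submission, the TypeScript below mirrors the stages above and uses the thresholds listed for each criterion. The type names, the pre-check regex, and the 0.2 consistency tolerance between the two judge passes are assumptions for illustration, not Classify's actual implementation.

```ts
// Hypothetical judge gate: pre-checks, primary scoring pass, secondary
// consistency pass, then a reportable result. Illustrative only.

type SubmissionScores = {
  authenticity: number;    // 0..1: real human effort vs templated spam
  compliance: number;      // 0..1: stayed inside the company's rules
  relevance: number;       // 0..1: aimed at the stated objective
  failuresFound: string[]; // unsupported claims, contradictions, hallucinations
};

const THRESHOLDS = { authenticity: 0.7, compliance: 0.8, relevance: 0.6 };

// Pre-checks: cheap structural and abuse filters before any scoring runs.
function passesPreChecks(raw: string): boolean {
  return raw.trim().length > 0 && !/(.)\1{50,}/.test(raw); // reject empty or key-mash input
}

// Primary and secondary passes must meet the thresholds and roughly agree
// before the submission counts as reportable signal. Failures found are
// logged as evidence separately; they are not gated here.
function isReportable(raw: string, primary: SubmissionScores, secondary: SubmissionScores): boolean {
  if (!passesPreChecks(raw)) return false;

  const meetsThresholds =
    primary.authenticity >= THRESHOLDS.authenticity &&
    primary.compliance >= THRESHOLDS.compliance &&
    primary.relevance >= THRESHOLDS.relevance;

  const consistent =
    Math.abs(primary.relevance - secondary.relevance) <= 0.2 &&
    Math.abs(primary.compliance - secondary.compliance) <= 0.2;

  return meetsThresholds && consistent;
}
```

Anything that fails this gate never becomes report-ready evidence; anything that passes rolls up into the findings companies see.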
For Companies

Get evaluation evidence before users find the problem.

Run live adversarial testing against full AI agents
Collect ratings and written feedback on single outputs
Capture transcripts, criteria scores, and flagged failures
Give product teams evidence they can actually act on
Publish your first listing
For Testers

Earn by doing useful review work.

Choose either live agent sessions or output-review tasks
Do real evaluation work instead of fake engagement loops
Earn when the submission is relevant, useful, and valid
Build reputation by surfacing signal, not noise
Browse live work
Go There Now

The key surfaces are already live in the app.

If you want to understand the product quickly, start with the agent marketplace. If you want a simpler review loop, the task sandbox is still available for single-output evaluation.

Browse Agents

Open the main marketplace for live AI agent testing.

Open agents

Publish An Agent

Create a listing with objective, rules, persona, and payout.

Create listing

Use The Task Sandbox

Review single AI outputs with ratings and written feedback.

Open tasks
Ready

Catch the failure before it reaches production.

Publish an agent, post an output, or start testing what is already live in the marketplace.