Dashboard · cohesive-llm-benchmark

Model

Single-turn pass

—

Multi-turn pass (turns)

—

Conv. full-pass

—

silent_no_op rate

—

The harness-corrected rate excludes failures where the model added a biologically sensible extra step (host depletion, kmerfinder, …) that the validator did not know how to parametrise. Those cases are labelled harness-param-gap in the Examples table and are not counted as model errors.
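As a rough illustration of the correction described above, the sketch below computes a raw and a harness-corrected pass rate from rows shaped like the Examples table. The `outcome` and `error` field names come from that table's header and `harness-param-gap` from the note; the outcome values `"pass"` and `"fail"`, the function name `pass_rates`, and the choice to drop harness-gap rows from the denominator (rather than count them as passes) are assumptions, not the dashboard's confirmed logic.

```python
from typing import Iterable, Mapping, Optional, Tuple

# Error label taken from the dashboard note; everything else here is hypothetical.
HARNESS_GAP = "harness-param-gap"


def pass_rates(rows: Iterable[Mapping[str, str]]) -> Tuple[Optional[float], Optional[float]]:
    """Return (raw, harness_corrected) pass rates for a set of example rows.

    Assumption: a row passes when row["outcome"] == "pass".
    Assumption: harness-gap failures are removed from the denominator.
    """
    rows = list(rows)
    passed = sum(1 for r in rows if r["outcome"] == "pass")
    # Failures attributed to the validator, not to the model.
    gaps = sum(
        1 for r in rows
        if r["outcome"] != "pass" and r.get("error") == HARNESS_GAP
    )
    raw = passed / len(rows) if rows else None
    corrected_total = len(rows) - gaps
    corrected = passed / corrected_total if corrected_total else None
    return raw, corrected


# Made-up rows for illustration only; these are not benchmark results.
example_rows = [
    {"outcome": "pass", "error": ""},
    {"outcome": "pass", "error": ""},
    {"outcome": "fail", "error": "harness-param-gap"},
    {"outcome": "fail", "error": "silent_no_op"},
]
raw_rate, corrected_rate = pass_rates(example_rows)
```

With these made-up rows the harness-gap failure leaves the denominator, so the corrected rate is higher than the raw rate while the silent_no_op failure still counts against the model.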

Pass rate — all models

Error category —

Multi-turn by modification kind

silent_no_op rate · species × model (multi-turn corpus — darker = worse)

Examples

Filters: Corpus, Outcome, Error, Species, Kind, Tag, Search

id	kind/cat	species	outcome	error	tags