
cohesive-llm-benchmark · dashboard

Cross-model KPI dashboard with drill-down by species, modification kind, error category and verdict tag.
Single-turn pass
Multi-turn pass (turns)
Conv. full-pass
silent_no_op rate
The harness-corrected figure excludes failures where the model added a biologically sensible extra step (host depletion, kmerfinder, …) that the validator didn't know how to parametrise. These cases are tagged harness-param-gap in the Examples table and are harness limitations, not model errors.
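As a rough illustration of how a harness-corrected figure differs from the raw pass rate, here is a minimal sketch. The `Result` record, its `passed` and `tags` fields, and both function names are hypothetical and not taken from the benchmark code. It also assumes that "excludes" means the harness-param-gap failures are dropped from the denominator rather than re-counted as passes. The sample data is made up for the example and is not a benchmark result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical per-example record; field names are assumptions."""
    passed: bool
    tags: frozenset = frozenset()


def pass_rate(results):
    """Raw pass rate: passes divided by all results (None if empty)."""
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def harness_corrected_pass_rate(results, gap_tag="harness-param-gap"):
    """Pass rate after dropping failures tagged as harness parameter gaps.

    Assumption: excluded failures leave the denominator entirely.
    """
    kept = [r for r in results if r.passed or gap_tag not in r.tags]
    return pass_rate(kept)


# Made-up illustrative data: two passes, one harness gap, one model error.
results = [
    Result(True),
    Result(True),
    Result(False, frozenset({"harness-param-gap"})),
    Result(False, frozenset({"silent_no_op"})),
]

print(pass_rate(results))                    # 0.5
print(harness_corrected_pass_rate(results))  # 2 passes out of 3 kept results
```

Under this reading the corrected rate can only be equal to or higher than the raw rate, since only failures are removed from the denominator.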

Pass rate — all models

Error category —

Multi-turn by modification kind

silent_no_op rate · species × model (multi-turn corpus — darker = worse)

Examples

id kind/cat species outcome error tags