cohesive-llm-benchmark · explorer
Click any row to see the prompt, the ground-truth .nf, and the LLM-generated .nf side by side.
Filters
corpus: single-turn LLM eval (200), multi-turn LLM eval (159 convs · 330 turns), multi-sample workflows LLM eval (5), single-turn LLM eval (50), multi-turn LLM eval (34 turns)
category: all
error category: all
outcome: all, PASS only, FAIL only
tag: all, literal-match, extras-best-practice, extras-irrelevant, missing-steps, hallucinated, upstream-rate-limited
Columns: id, category, outcome, tags, n_proc, error, steps used