What it does
The bench takes a natural-language prompt asking for a bioinformatics pipeline, feeds it to your LLM, captures the generated Nextflow .nf, and validates the result end-to-end against the real cohesive-ngsmanager framework: the workflow must parse (DSL2), build a valid DAG, and schedule at least the expected number of step processes when run with nextflow -stub-run.
Failures are auto-categorised across 13 axes (arity, wrong emit, missing param, silent no-op, hallucinated step, …) so the bench surfaces why the model is wrong, not just that it is.
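The stub-run check described above can be sketched roughly as follows. This is a minimal illustration, not the bench's actual validator: `stub_run_command`, `scheduled_processes` and `validate` are hypothetical names, and the sketch assumes the Nextflow trace report is tab-separated with a `name` column whose values look like `PROCESS_NAME (tag)`.

```python
import re
import subprocess
from pathlib import Path


def stub_run_command(nf_file: str, trace_file: str) -> list[str]:
    """Build a `nextflow run` call that stub-runs the workflow and writes a trace report."""
    return ["nextflow", "run", nf_file, "-stub-run", "-with-trace", trace_file]


def scheduled_processes(trace_text: str) -> set[str]:
    """Return the distinct process names found in a tab-separated trace report.

    Assumption: the report has a `name` column with values like `FASTP (sample1)`.
    """
    lines = [ln for ln in trace_text.splitlines() if ln.strip()]
    if not lines:
        return set()
    name_idx = lines[0].split("\t").index("name")
    names = set()
    for row in lines[1:]:
        cells = row.split("\t")
        # Drop the trailing " (tag)" that is appended to task names.
        names.add(re.sub(r"\s*\(.*\)$", "", cells[name_idx]))
    return names


def validate(nf_file: str, expected_steps: int, trace_file: str = "trace.txt") -> bool:
    """Hypothetical pass/fail check: the run must succeed and schedule enough processes."""
    proc = subprocess.run(
        stub_run_command(nf_file, trace_file), capture_output=True, text=True
    )
    if proc.returncode != 0:
        return False  # parse error, invalid DAG, or failure during the stub run
    found = scheduled_processes(Path(trace_file).read_text())
    return len(found) >= expected_steps
```

Counting distinct process names rather than task rows keeps the check independent of how many samples the stub run fans out over.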
CHATTING
reply, so the LLM has no opportunity to ask "do you also want a trimming
step before assembly?" as it would in a real interactive session. Steps that
the model adds spontaneously (e.g. fastp upstream of an assembler) are flagged
with the extras-best-practice tag rather than counted as a failure: they're
bioinformatically sound, just beyond the literal prompt. See the
explorer > What the LLM said for the full free-text replies behind each
verdict.
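One way to picture the tagging rule is a set comparison between the steps the prompt asks for and the steps the generated workflow contains. The function `judge_steps` and its return shape are hypothetical, and the sketch treats every extra step as benign; the real bench also distinguishes hallucinated steps, which this does not attempt.

```python
def judge_steps(expected: set[str], generated: set[str]) -> dict:
    """Compare expected and generated step names (hypothetical verdict shape)."""
    missing = sorted(expected - generated)  # requested but absent: a failure
    extras = sorted(generated - expected)   # added spontaneously: tagged, not failed
    tags = ["extras-best-practice"] if extras else []
    return {
        "passed": not missing,
        "missing": missing,
        "extras": extras,
        "tags": tags,
    }


# Example: the prompt asked only for assembly, the model also added trimming.
verdict = judge_steps({"spades"}, {"fastp", "spades"})
```

Here `verdict["passed"]` stays true while the extra `fastp` step is still visible through the tag.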
Dataset overview
Every entry passes nextflow -stub-run validation against the framework. See
dataset_200.jsonl and dataset_modifications_full.jsonl.
The curated 50-prompt + 17-conversation subsets are kept as
dataset_50.jsonl and dataset_modifications.jsonl.
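Because the datasets are JSONL files, they can be read one record per line with the standard library alone. The sketch below makes no assumption about field names; `iter_jsonl` is a hypothetical helper, and the file path in the commented usage assumes you run from the directory that holds the dataset.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Usage (path is an assumption about where the file sits locally):
# with open("dataset_50.jsonl", encoding="utf-8") as fh:
#     entries = list(iter_jsonl(fh))
#     print(len(entries), "entries; fields:", sorted(entries[0]))
```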
Reference results — izs-llm vs the curated subset
Headline numbers from the LLM run captured in
results/example_run_mistral/ and results/example_run_mistral_multi_turn/.
The LLM was evaluated against the curated 50 + 17 subset; the full 200 + 159 corpus is for training / future LLM runs.
Single-turn — LLM eval (full 200)
Multi-turn — LLM eval (159 conversations · 330 turns)
Error category breakdown
| Category | Single-turn | Multi-turn |
|---|---|---|
Run history
Every CI-triggered run lands as one entry. Newest first. Click a run id to open its results/ folder on GitHub.
| date (UTC) | run id | LLM | framework | single-turn | multi-turn (turns) | multi-turn (convs) |
|---|---|---|---|---|---|---|
Programmatic access
If you want to consume the bench from a script or paper, a single
well-typed JSON manifest aggregates everything (datasets, runs, summary
stats, tag distribution, methodology pointers):
docs/data/benchmark.json.
The schema version is 1.0; the same file is also reachable raw at
raw.githubusercontent.com/genpat-it/cohesive-llm-benchmark/main/docs/data/benchmark.json.
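A script could load the manifest with the standard library along these lines. This is a sketch under assumptions: `load_manifest` and `top_level_keys` are hypothetical helpers, the `https://` scheme is prepended to the raw address given above, and no key names are assumed, so the example only lists whatever top-level keys the file contains.

```python
import json
from urllib.request import urlopen

# Assumption: the raw address above is served over HTTPS.
MANIFEST_URL = (
    "https://raw.githubusercontent.com/genpat-it/"
    "cohesive-llm-benchmark/main/docs/data/benchmark.json"
)


def load_manifest(source: str = MANIFEST_URL) -> dict:
    """Load the manifest from a URL or from a local path such as docs/data/benchmark.json."""
    if source.startswith(("http://", "https://")):
        with urlopen(source, timeout=30) as resp:
            return json.load(resp)
    with open(source, encoding="utf-8") as fh:
        return json.load(fh)


def top_level_keys(manifest: dict) -> list[str]:
    """Return the manifest's top-level keys, sorted, as a first look at its layout."""
    return sorted(manifest)


# Usage:
# manifest = load_manifest()                       # fetch the raw file
# manifest = load_manifest("docs/data/benchmark.json")  # or read a local clone
# print(top_level_keys(manifest))
```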
How to use it
    git clone https://github.com/genpat-it/cohesive-llm-benchmark
    cd cohesive-llm-benchmark
    pip install -r requirements.txt

    export NGSMANAGER_DIR=/path/to/cohesive-ngsmanager
    export LLM_API_URL=http://localhost:8765
    export BENCH_RUNS_DIR=$(pwd)/results/my_run

    python eval/run_llm.py                   # single-turn (~10 min for 50, ~40 min for 200)
    python eval/validate_llm.py              # (~20 min for 50, ~90 min for 200)
    python eval/emit_report.py               # TSV/CSV/MD reports

    python eval/run_llm_multi_turn.py        # multi-turn (~10 min for 17, ~50 min for 159)
    python eval/validate_llm_multi_turn.py   # (~15 min for 17, ~120 min for 159)
    python eval/emit_report.py               # picks up multi-turn too
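Since a full run takes tens of minutes, it can help to confirm the three environment variables exported above are set before starting. The helper below is a hypothetical pre-flight check, not part of the repository, and it assumes all three variables are needed by the eval scripts.

```python
import os
from typing import Mapping, Optional

# The variables exported in the commands above.
REQUIRED_VARS = ("NGSMANAGER_DIR", "LLM_API_URL", "BENCH_RUNS_DIR")


def missing_settings(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Usage:
# missing = missing_settings()
# if missing:
#     raise SystemExit("Set these before running the eval scripts: " + ", ".join(missing))
```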
Explore the results
The interactive explorer lets you filter every example by category, pass/fail and error type, and compare the LLM-generated .nf with the ground truth side-by-side.