cohesive-llm-benchmark

What it does

The bench takes a natural-language prompt asking for a bioinformatics pipeline, feeds it to your LLM, captures the generated Nextflow .nf, and validates the result end-to-end against the real cohesive-ngsmanager framework: the workflow must parse (DSL2), build a valid DAG, and schedule at least the expected number of step processes when run with nextflow -stub-run.
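
The last check above (at least the expected number of step processes scheduled) can be sketched as follows. This is a minimal illustration, not the harness's actual code: it assumes the stub run was launched with Nextflow's -with-trace option so that a tab-separated trace file is available, and it assumes "number of step processes" means distinct process names. The function names and sample data are hypothetical.

```python
import csv
import io


def scheduled_processes(trace_tsv: str) -> set[str]:
    """Return the distinct process names listed in a Nextflow trace file."""
    reader = csv.DictReader(io.StringIO(trace_tsv), delimiter="\t")
    names = set()
    for row in reader:
        # Task names look like "FASTP (sample1)"; drop the tag in parentheses.
        task_name = (row.get("name") or "").strip()
        if task_name:
            names.add(task_name.split(" (", 1)[0])
    return names


def meets_expected(trace_tsv: str, expected_min: int) -> bool:
    """True when at least expected_min distinct processes were scheduled."""
    return len(scheduled_processes(trace_tsv)) >= expected_min


# Hypothetical trace content, e.g. from a command such as:
#   nextflow run generated.nf -stub-run -with-trace trace.tsv
sample = (
    "task_id\tname\tstatus\n"
    "1\tFASTP (sample1)\tCOMPLETED\n"
    "2\tFASTP (sample2)\tCOMPLETED\n"
    "3\tSPADES (sample1)\tCOMPLETED\n"
)
print(sorted(scheduled_processes(sample)))
print(meets_expected(sample, 2))
```

Counting distinct names rather than tasks keeps the check independent of how many samples the stub run happens to process.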

Failures are auto-categorised across 13 axes (arity, wrong emit, missing param, silent no-op, hallucinated step, …) so the bench surfaces why the model is wrong, not just that it is.

Methodology note · single-shot, no clarification dialogue. The harness sends each prompt once and auto-approves any CHATTING reply, so the LLM has no opportunity to ask "do you also want a trimming step before assembly?" as it could in a real interactive session. Steps that the model adds spontaneously (e.g. fastp upstream of an assembler) are flagged with the extras-best-practice tag rather than counted as failures: they are bioinformatically sound, just beyond the literal prompt. See the explorer > What the LLM said for the full free-text replies behind each verdict.
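
To make the extras-best-practice idea concrete, here is a rough sketch of how unrequested steps might be separated from missing ones. The text above does not describe the bench's actual rule, so the allow-list of "best practice" steps, the function name, and the returned fields are all assumptions made for illustration.

```python
def classify_steps(expected: list[str], generated: list[str],
                   best_practice: set[str]) -> dict:
    """Compare expected and generated step names and tag benign extras."""
    expected_set = set(expected)
    generated_set = set(generated)
    missing = sorted(expected_set - generated_set)
    extras = sorted(generated_set - expected_set)
    tags = []
    # Tag only when every unrequested step is on the (hypothetical) allow-list.
    if extras and all(step in best_practice for step in extras):
        tags.append("extras-best-practice")
    return {"missing": missing, "extras": extras, "tags": tags}


# The prompt asked only for an assembler; the model also added trimming.
result = classify_steps(
    expected=["spades"],
    generated=["fastp", "spades"],
    best_practice={"fastp"},  # hypothetical allow-list
)
print(result)
```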

Dataset overview

Every entry passes nextflow -stub-run validation against the framework. The full corpus (200 single-turn prompts and 159 multi-turn conversations) is in dataset_200.jsonl and dataset_modifications_full.jsonl; the curated subsets (50 prompts and 17 conversations) are kept as dataset_50.jsonl and dataset_modifications.jsonl.
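
The dataset files are JSON Lines, so they can be read with the standard library alone. The sketch below makes no assumption about the fields inside each entry; the helper name is hypothetical, and the path in the usage comment should be adjusted to wherever the file sits in your checkout.

```python
import json
from pathlib import Path


def load_jsonl(path) -> list:
    """Read a JSON Lines file into a list, skipping blank lines."""
    entries = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


# Usage (path is an assumption about the repository layout):
#   entries = load_jsonl("dataset_50.jsonl")
#   print(len(entries), "entries")
```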

Reference results — izs-llm vs the curated subset

Headline numbers from the LLM run captured in results/example_run_mistral/ and results/example_run_mistral_multi_turn/. The LLM was evaluated against the curated 50 + 17 subset; the full 200 + 159 corpus is for training / future LLM runs.

Single-turn — LLM eval (full 200)


Multi-turn — LLM eval (159 conversations · 330 turns)


Error category breakdown

Category	Single-turn	Multi-turn

Run history

Every CI-triggered run is recorded as one entry, newest first. Click a run id to open its results/ folder on GitHub.

date (UTC)	run id	LLM	framework	single-turn	multi-turn (turns)	multi-turn (convs)

Programmatic access

If you want to consume the bench from a script or paper, a single well-typed JSON manifest aggregates everything (datasets, runs, summary stats, tag distribution, methodology pointers): docs/data/benchmark.json. Schema version 1.0; the same file is also reachable raw at raw.githubusercontent.com/genpat-it/cohesive-llm-benchmark/main/docs/data/benchmark.json.
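
As a starting point for scripted access, the manifest can be fetched and inspected with the standard library. The sketch deliberately assumes nothing about the key names inside benchmark.json: it only lists the top-level keys and their JSON types so you can discover the layout before relying on it. The helper names are hypothetical, and the https:// scheme is added to the raw URL given above.

```python
import json
from urllib.request import urlopen

MANIFEST_URL = (
    "https://raw.githubusercontent.com/genpat-it/"
    "cohesive-llm-benchmark/main/docs/data/benchmark.json"
)


def fetch_manifest(url: str = MANIFEST_URL, timeout: float = 30.0) -> dict:
    """Download and decode the benchmark manifest (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def top_level_summary(manifest: dict) -> dict[str, str]:
    """Map each top-level key to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in manifest.items()}


# Usage:
#   for key, kind in top_level_summary(fetch_manifest()).items():
#       print(f"{key}: {kind}")
```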

How to use it

git clone https://github.com/genpat-it/cohesive-llm-benchmark
cd cohesive-llm-benchmark
pip install -r requirements.txt

export NGSMANAGER_DIR=/path/to/cohesive-ngsmanager
export LLM_API_URL=http://localhost:8765
export BENCH_RUNS_DIR=$(pwd)/results/my_run

python eval/run_llm.py             # single-turn  (~10 min for 50, ~40 min for 200)
python eval/validate_llm.py        # (~20 min for 50, ~90 min for 200)
python eval/emit_report.py         # TSV/CSV/MD reports

python eval/run_llm_multi_turn.py        # multi-turn  (~10 min for 17, ~50 min for 159)
python eval/validate_llm_multi_turn.py   # (~15 min for 17, ~120 min for 159)
python eval/emit_report.py               # picks up multi-turn too
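
Since the longer runs take well over an hour, it can help to confirm the environment before launching them. The following pre-flight check is a small sketch: it assumes the eval scripts need all three variables exported above, which the commands suggest but do not state. The helper name is hypothetical.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("NGSMANAGER_DIR", "LLM_API_URL", "BENCH_RUNS_DIR")


def missing_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vars()
if missing:
    print("Set these before running the eval scripts:", ", ".join(missing))
else:
    print("Environment looks complete.")
```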

Explore the results

The interactive explorer lets you filter every example by category, pass/fail and error type, and compare the LLM-generated .nf with the ground truth side-by-side.