Datagen Factory

Rubrics

Binary, weighted criteria produce a stable training signal

Most teams arrive at Datagen Factory having tried LLM-as-judge and been disappointed: scores from a naive LLM-judge implementation are too variable to train against.

A rubric is a list of binary criteria, each with a weight, that together produce a stable reward signal. The LLM Data Company uses Datagen to produce rubrics that are properly tuned to the task.

What makes a rubric training-grade

Replacing one holistic score with a list of binary criteria fixes the drift caused by a single non-deterministic LLM call. Each criterion resolves to pass or fail and carries a weight; the rubric's total is a stable, additive signal you can train against.

Writing training-grade rubrics by hand takes slow, careful work. Our authoring system enforces the full property set automatically, so you can generate rubrics at scale that still carry a strong reward signal for your agent.

Criterion anatomy

A rubric is a flat list of criteria. Each criterion has three fields:

  • id — a stable identifier, referenced by the verifier and by any feedback you give us. Typically a short slug like regulatory_citation_present.
  • description — the claim being checked, written as a positive assertion, e.g. "The memo cites the correct section of the 2024 FDIC capital rule."
  • weight — a number. Positive weights reward correct behaviour; negative weights punish a specific active error. Weights are relative: they are normalised to the sum of positive weights at grading time, as sketched below.
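
To make the arithmetic concrete, here is a minimal sketch of how a criterion list reduces to one reward. It assumes a positive criterion earns its weight on pass and a negative criterion applies its weight only on fail (the "active error" case); the names are illustrative, not the interface of our rubric package.

from dataclasses import dataclass

@dataclass
class Criterion:
    id: str           # stable identifier, e.g. "regulatory_citation_present"
    description: str  # the positive assertion being checked
    weight: float     # positive rewards correct behaviour, negative punishes an error

def reward(criteria: list[Criterion], passed: dict[str, bool]) -> float:
    """Additive, weighted score normalised to the sum of positive weights."""
    positive_total = sum(c.weight for c in criteria if c.weight > 0)
    score = 0.0
    for c in criteria:
        if c.weight > 0 and passed[c.id]:
            score += c.weight   # earned a positive criterion
        elif c.weight < 0 and not passed[c.id]:
            score += c.weight   # committed the error this criterion guards against
    return score / positive_total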

A rendered rubric

A real rubric, ten criteria, for a task that asks the agent to produce a regulatory-compliance memo:

rubric.toml
[[criterion]]
id = "names_institution"
weight = 1.0
description = "The memo identifies the institution by its full legal name: ACME Corp"

[[criterion]]
id = "regulatory_citation_present"
weight = 2.0
description = "The memo cites 12 CFR Part 324 (FDIC capital rule) as the governing regulation."

[[criterion]]
id = "capital_ratio_correct"
weight = 3.0
description = "The memo reports the institution's Common Equity Tier 1 ratio at 11.4%, matching the balance sheet." 

[[criterion]]
id = "no_fabricated_sources"
weight = -3.0
description = "The memo does not cite any regulation, rule, or case that does not exist in the provided source documents."

[[criterion]]
id = "recommendation_stated"
weight = 2.0
description = "The memo states an explicit recommendation (approve, deny, or conditionally approve) in its own sentence."

[[criterion]]
id = "recommendation_correct"
weight = 4.0
description = "The recommendation is 'conditionally approve'"

[[criterion]]
id = "conditions_enumerated"
weight = 2.0
description = "If the recommendation is conditional, the memo enumerates at least three specific conditions the institution must meet."

[[criterion]]
id = "no_contradiction_with_brief"
weight = -2.0
description = "The memo does not assert any fact that contradicts the brief (e.g. wrong institution type, wrong fiscal quarter)."

[[criterion]]
id = "within_word_limit"
weight = 1.0
description = "The memo is between 400 and 800 words, inclusive."

[[criterion]]
id = "appendix_present"
weight = 1.0
description = "The memo includes an appendix listing every source document referenced, with the document id as provided in the brief."

Ten criteria, two negative-weighted. Total possible positive score: 16. Maximum negative exposure: 5. An agent that gets the recommendation right and cites accurately but runs long is scored down only on the word-limit criterion — a small, well-targeted correction.
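
Under the same normalisation (and the assumed convention above for negative weights), that scenario works out to (16 - 1) / 16 ≈ 0.94. If the same memo had also fabricated a citation, no_fabricated_sources would apply its -3 and pull the score down to (15 - 3) / 16 = 0.75.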

How the rubric compiles into the verifier

tests/test.sh calls the verifier, a thin shell over our open-source rubric package — it loads the rubric, calls .grade() on the agent's response, and writes the weighted reward.json Harbor reads.
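
The verifier itself stays small. A rough illustration of what that thin shell does, assuming a hypothetical module name, constructor, result shape, and reward.json schema; only the rubric file format and the .grade() call come from the description above.

#!/usr/bin/env python3
"""Illustrative verifier: load rubric.toml, grade the agent's response, write reward.json."""
import json
import sys
import tomllib

from rubric import Rubric  # hypothetical import; the real package may expose a different interface

def main() -> None:
    with open("rubric.toml", "rb") as f:
        criteria = tomllib.load(f)["criterion"]

    response = sys.stdin.read()                # the agent's memo, piped in by tests/test.sh
    result = Rubric(criteria).grade(response)  # per-criterion pass/fail plus a weighted total

    # Assumed reward.json shape: a single normalised score for Harbor to read.
    with open("reward.json", "w") as f:
        json.dump({"reward": result.score}, f)

if __name__ == "__main__":
    main()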

Iterating on a rubric

The rubric is the artifact you give feedback on. If a criterion is too loose, too strict, overlaps with another, or grades the wrong thing, tell us in the feedback step. The verifier is regenerated from the rubric on the next iteration; you do not review shell scripts.

For where the rubric sits in the delivered task directory, see task.