Datagen Factory

Reviewing your preview

What to look at in a three-sample preview, and how to give feedback that produces a better preview.

A preview is three Harbor task directories — the same format the final taskset will deliver.

How to read a preview

datagen datasets preview <dataset_id>

Returns three samples, each containing the four files below (a sketch of the directory layout follows the list):

  1. instruction.md — the natural-language task. Check that it's specific enough for a careful human to answer unambiguously, and that the scope is single-task (one request, one response) rather than sprawling.
  2. tests/test.sh — the verifier, where the rubric runs. Read each criterion. Confirm it's binary (pass/fail, no middle ground), atomic (one thing each), and self-contained (gradable without scrolling back).
  3. solution/solve.sh — the Oracle reference solution. Step through it against the instruction. Confirm the solution plausibly satisfies every rubric criterion. The Oracle is a sanity check that the task is solvable.
  4. task.toml — the metadata. [metadata] should describe this specific task, not the taskset in general. [environment] should list the right base image and timeouts for the work the instruction describes.
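
Concretely, one sample is a directory laid out roughly like this (task_0001 is a placeholder name; your samples may carry additional resource files):

task_0001/
  instruction.md
  tests/
    test.sh
  solution/
    solve.sh
  task.toml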

What a good preview looks like

  • Instructions are concrete. "Summarize the risk factors in Section 1A of the attached 10-K and identify which risks recur across fiscal years" — specific, scoped, answerable.
  • Rubric criteria are specific. A criterion reads "The response quotes at least one specific risk factor verbatim from Section 1A" — not "The response is accurate." (A sketch of such a check follows this list.)
  • Criteria are independent. You can grade criterion 12 without thinking about criterion 11. If two criteria move together, they should be one.
  • The Oracle solution is realistic. It reads like something a careful analyst would write, not a toy answer.
  • Tasks feel different from each other. Three preview samples should probe three facets of your Brief, not three near-duplicates.
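
To make "binary and atomic" concrete, here is a minimal sketch of how the Section 1A criterion above could run as one independent pass/fail check inside tests/test.sh. Everything in it is assumed for illustration — the response path, the quotes file, and the string-match check — and a real verifier may instead hand the criterion and the response to an LLM judge:

#!/usr/bin/env bash
# Hypothetical single criterion: "The response quotes at least one specific
# risk factor verbatim from Section 1A." Pass/fail only, no partial credit.
set -euo pipefail

RESPONSE="response.md"                 # assumed location of the agent's answer
QUOTES="section_1a_risk_factors.txt"   # assumed file of verbatim Section 1A excerpts

if grep -qFf "$QUOTES" "$RESPONSE"; then
  echo "PASS: verbatim Section 1A risk factor found"
else
  echo "FAIL: no verbatim Section 1A risk factor"
  exit 1
fi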

What a bad preview looks like

  • Vague instructions. "Analyze the filing and tell me about the risks." This is too broad to grade consistently.
  • Criteria that can't fail. "The response is thoughtful." An LLM judge can't reliably tell thoughtful from not-thoughtful without more context. Ask for measurable criteria instead.
  • Criteria that can't pass. "The response cites every risk factor Moody's would flag." This criterion is unverifiable from the filing alone.
  • Oracle that doesn't match. The instruction asks for a cross-section comparison, but the solution dumps a single section's contents. The task is misaligned.
  • Three near-duplicate samples. Prefer samples that probe different dimensions of your Brief so the task distribution stays diverse.

Approve

If all three samples look right:

datagen datasets approve <dataset_id>

A confirmation prompt appears. Pass --yes to skip it in a script. Full generation starts immediately.
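
In a script, the same approval with the prompt skipped (a sketch; flag placement after the ID is assumed):

datagen datasets approve <dataset_id> --yes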

Send feedback

Don't be afraid to iterate via feedback.

datagen datasets feedback <dataset_id> "<what should change>"

Or open the preview in the UI and use the feedback field.

Example feedback that will move the needle:

  • "The rubric's accuracy criteria should cite the 10-K section each fact came from, not assert facts abstractly."
  • "Task 2's instruction asks about 'recent' filings — make it specific, either 'the last three fiscal years' or 'since 2020'."
  • "The Oracle for task 3 skips the comparison step; the instruction requires it."

Example feedback that is difficult to act on:

  • "Make it better."
  • "The rubrics are weak."
  • "I don't like the tone."

Point at a file, a criterion, a behavior. Explain what it should do instead. You can bundle multiple things in one feedback message.

One round of feedback produces a new three-sample preview at the same dataset_id; rerun preview to see it.
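
Put together, one revision round from the command line might look like this sketch. The dataset ID and feedback text are placeholders; the message bundles two concrete asks, and watch is optional:

datagen datasets feedback <dataset_id> "Task 2's instruction should say 'the last three fiscal years', not 'recent'. The Oracle for task 3 skips the comparison step the instruction requires."
datagen datasets watch <dataset_id>      # optional: follow the regeneration live
datagen datasets preview <dataset_id>    # read the revised three samples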

What changes between rounds

The revised preview is generated against the same Brief and the same Resources. What typically gets adjusted:

  • Rubric criteria. Added, removed, reworded, or re-weighted.
  • Task difficulty and shape. Longer or shorter, more or fewer files involved.
  • Instruction specificity. Tighter wording, clearer constraints.
  • Reference solutions. Revised to match the new rubric or instruction.

Changes that warrant a new taskset:

  • Registered Resources. Adding or removing a database or file corpus.
  • The Brief's core ask. Switching focus from coding to customer support.

How many rounds and how long

Most tasksets settle in one or two feedback rounds. Three is a signal something structural may be off (likely the Brief). Each round may take an hour or two.

Run datagen datasets watch <dataset_id> to follow the progress live, or come back later and run datagen datasets preview <dataset_id> when you're ready.


Next:

  • Resources and sandboxes — when your agent needs more than the Brief to do its work.
  • Rubrics — a longer look at what makes a rubric criterion work.