Reviewing your preview
What to look at in a three-sample preview, and how to give feedback that produces a better preview.
A preview is three Harbor task directories — the same format the final taskset will deliver.
How to read a preview
```shell
datagen datasets preview <dataset_id>
```

Returns three samples, each containing:
- `instruction.md` — the natural-language task. Check that it's specific enough for a careful human to answer unambiguously, and that the scope is single-task (one request, one response) rather than sprawling.
- `tests/test.sh` — the verifier, where the rubric runs. Read each criterion. Confirm it's binary (pass/fail, no middle ground), atomic (one thing each), and self-contained (gradable without scrolling back).
- `solution/solve.sh` — the Oracle reference solution. Step through it against the instruction. Confirm the solution plausibly satisfies every rubric criterion. The Oracle is a sanity check that the task is solvable.
- `task.toml` — the metadata. `[metadata]` should describe this specific task, not the taskset in general. `[environment]` should list the right base image and timeouts for the work the instruction describes.
What a good preview looks like
- Instructions are concrete. "Summarize the risk factors in Section 1A of the attached 10-K and identify which risks recur across fiscal years" — specific, scoped, answerable.
- Rubric criteria are specific. A criterion reads "The response quotes at least one specific risk factor verbatim from Section 1A" — not "The response is accurate."
- Criteria are independent. You can grade criterion 12 without thinking about criterion 11. If two criteria move together, they should be one.
- The Oracle solution is realistic. It reads like something a careful analyst would write, not a toy answer.
- Tasks feel different from each other. Three preview samples should probe three facets of your Brief, not three near-duplicates.
What a bad preview looks like
- Vague instructions. "Analyze the filing and tell me about the risks." This is too broad to grade consistently.
- Criteria that can't fail. "The response is thoughtful." An LLM judge can't reliably distinguish thoughtful from not-thoughtful without more context. Ask for something measurable instead.
- Criteria that can't pass. "The response cites every risk factor Moody's would flag." This criterion is unverifiable from the filing alone.
- Oracle that doesn't match. The instruction asks for a cross-section comparison, the solution dumps a single section's contents. The task is misaligned.
- Three near-duplicate samples. Prefer samples that probe different dimensions of your Brief; aim for diversity in the task distribution.
Approve
If all three samples look right:
```shell
datagen datasets approve <dataset_id>
```

A confirmation prompt appears; pass `--yes` to skip it in a script. Full generation starts immediately.
Send feedback
Don't be afraid to iterate via feedback.
```shell
datagen datasets feedback <dataset_id> "<what should change>"
```

Or open the preview in the UI and use the feedback field.
Example feedback that will move the needle:
- "The rubric's accuracy criteria should cite the 10-K section each fact came from, not assert facts abstractly."
- "Task 2's instruction asks about 'recent' filings — make it specific, either 'the last three fiscal years' or 'since 2020'."
- "The Oracle for task 3 skips the comparison step; the instruction requires it."
Example feedback that is difficult to act on:
- "Make it better."
- "The rubrics are weak."
- "I don't like the tone."
Point at a file, a criterion, a behavior. Explain what it should do instead. You can bundle multiple things in one feedback message.
One round of feedback produces a new three-sample preview at the same dataset_id; rerun preview to see it.
What changes between rounds
The revised preview is generated against the same Brief and the same Resources. Things that get adjusted:
- Rubric criteria. Added, removed, reworded, or re-weighted.
- Task difficulty and shape. Longer or shorter, more or fewer files involved.
- Instruction specificity. Tighter wording, clearer constraints.
- Reference solutions. Revised to match the new rubric or instruction.
Changes that warrant a new taskset:
- Registered Resources. Adding or removing a database or file corpus.
- The brief's core ask. Switching focus from coding to customer support.
How many rounds and how long
Most tasksets settle in one or two feedback rounds. Three is a signal something structural may be off (likely the Brief). Each round may take an hour or two.
Run `datagen datasets watch <dataset_id>` to follow the progress live, or come back later and run `datagen datasets preview <dataset_id>` when you're ready.
Next:
- Resources and sandboxes — when your agent needs more than the Brief to do its work.
- Rubrics — a longer look at what makes a rubric criterion work.