Datagen Factory

Reviewing your preview

What to look at in a three-sample preview, and how to give feedback that produces a better preview.

A preview is three Harbor task directories — the same format the final taskset will deliver.

How to read a preview

datagen datasets preview <dataset_id>

Returns three samples, each containing the four files below (a sketch of the directory layout follows the list):

  1. instruction.md — the natural-language task. Check that it's specific enough for a careful human to answer unambiguously, and that the scope is single-task (one request, one response) rather than sprawling.
  2. tests/test.sh — the verifier, where the rubric runs. Read each criterion. Confirm it's binary (pass/fail, no middle ground), atomic (one thing each), and self-contained (gradable without scrolling back).
  3. solution/solve.sh — the Oracle reference solution. Step through it against the instruction. Confirm the solution plausibly satisfies every rubric criterion. The Oracle is a sanity check that the task is solvable.
  4. task.toml — the metadata. [metadata] should describe this specific task, not the taskset in general. [environment] should list the right base image and timeouts for the work the instruction describes.
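
Concretely, one sample is a directory laid out roughly like this (task_0001 is a placeholder name; your samples may carry additional resource files):

task_0001/
  instruction.md
  tests/
    test.sh
  solution/
    solve.sh
  task.toml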

What a good preview looks like

  • Instructions are concrete. "Summarize the risk factors in Section 1A of the attached 10-K and identify which risks recur across fiscal years" — specific, scoped, answerable.
  • Rubric criteria are specific. A criterion reads "The response quotes at least one specific risk factor verbatim from Section 1A" — not "The response is accurate." (A sketch of such a check follows this list.)
  • Criteria are independent. You can grade criterion 12 without thinking about criterion 11. If two criteria move together, they should be one.
  • The Oracle solution is realistic. It reads like something a careful analyst would write, not a toy answer.
  • Tasks feel different from each other. Three preview samples should probe three facets of your Brief, not three near-duplicates.
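
To make "binary and atomic" concrete, here is a minimal sketch of how the Section 1A criterion above could run as one independent pass/fail check inside tests/test.sh. Everything in it is assumed for illustration — the response path, the quotes file, and the string-match check — and a real verifier may instead hand the criterion and the response to an LLM judge:

#!/usr/bin/env bash
# Hypothetical single criterion: "The response quotes at least one specific
# risk factor verbatim from Section 1A." Pass/fail only, no partial credit.
set -euo pipefail

RESPONSE="response.md"                 # assumed location of the agent's answer
QUOTES="section_1a_risk_factors.txt"   # assumed file of verbatim Section 1A excerpts

if grep -qFf "$QUOTES" "$RESPONSE"; then
  echo "PASS: verbatim Section 1A risk factor found"
else
  echo "FAIL: no verbatim Section 1A risk factor"
  exit 1
fi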

What a bad preview looks like

  • Vague instructions. "Analyze the filing and tell me about the risks." This is too broad to grade consistently.
  • Criteria that can't fail. "The response is thoughtful." An LLM judge can't reliably tell thoughtful from not-thoughtful without more context. Ask for measurable criteria instead.
  • Criteria that can't pass. "The response cites every risk factor Moody's would flag." This criterion is unverifiable from the filing alone.
  • Oracle that doesn't match. The instruction asks for a cross-section comparison, but the solution dumps a single section's contents. The task is misaligned.
  • Three near-duplicate samples. Prefer samples that probe different dimensions of your Brief so the task distribution stays diverse.

Approve

If all three samples look right:

datagen datasets approve <dataset_id>

A confirmation prompt appears. Pass --yes to skip it in a script. Full generation starts immediately.
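
In a script, the same approval with the prompt skipped (a sketch; flag placement after the ID is assumed):

datagen datasets approve <dataset_id> --yes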

Send feedback

Don't be afraid to iterate via feedback.

datagen datasets feedback <dataset_id> "<what should change>"

Or open the preview in the UI and use the feedback field.

Example feedback that will move the needle:

  • "The rubric's accuracy criteria should cite the 10-K section each fact came from, not assert facts abstractly."
  • "Task 2's instruction asks about 'recent' filings — make it specific, either 'the last three fiscal years' or 'since 2020'."
  • "The Oracle for task 3 skips the comparison step; the instruction requires it."

Example feedback that is difficult to act on:

  • "Make it better."
  • "The rubrics are weak."
  • "I don't like the tone."

Point at a file, a criterion, a behavior. Explain what it should do instead. You can bundle multiple things in one feedback message.

One round of feedback produces a new three-sample preview at the same dataset_id; rerun preview to see it.
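
Put together, one revision round from the command line might look like this sketch. The dataset ID and feedback text are placeholders; the message bundles two concrete asks, and watch is optional:

datagen datasets feedback <dataset_id> "Task 2's instruction should say 'the last three fiscal years', not 'recent'. The Oracle for task 3 skips the comparison step the instruction requires."
datagen datasets watch <dataset_id>      # optional: follow the regeneration live
datagen datasets preview <dataset_id>    # read the revised three samples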

What changes between rounds

The revised preview is generated against the same Brief and the same Resources. What typically gets adjusted:

  • Rubric criteria. Added, removed, reworded, or re-weighted.
  • Task difficulty and shape. Longer or shorter, more or fewer files involved.
  • Instruction specificity. Tighter wording, clearer constraints.
  • Reference solutions. Revised to match the new rubric or instruction.

Changes that warrant a new taskset:

  • Registered Resources. Adding or removing a database or file corpus.
  • The Brief's core ask. Switching focus from coding to customer support.

How many rounds and how long

Most tasksets settle in one or two feedback rounds. Three is a signal something structural may be off (likely the Brief). Each round may take an hour or two.

Run datagen datasets watch <dataset_id> to follow the progress live, or come back later and run datagen datasets preview <dataset_id> when you're ready.


Next:

  • Resources and sandboxes — when your agent needs more than the Brief to do its work.
  • Rubrics — a longer look at what makes a rubric criterion work.