Datagen Factory

The authoring flow

How a brief becomes a dataset, end to end — what you give us, what you see while we work, where the back-and-forth happens, and where it doesn't.

Most teams come in expecting the authoring process to be a chat — describe what you want, we ask clarifying questions, you refine, we come back with more. That's not quite how it works, and the shape is worth getting straight before you submit your first brief.

The authoring flow boils down to two moves: you give us a brief once and we come back with three preview tasks; then you iterate against those previews with freeform feedback until they're right. There is no turn-by-turn chat during drafting. The iteration happens against real samples, not against imagined ones.

The full shape

Two loops, not one. The outer loop (brief → draft → preview) runs once per dataset. The inner loop (preview → feedback → revised preview) runs as many times as you need, against real rendered samples each time.

Phase A: the brief

You give us a plain-language description of what you want. One paragraph is enough; a longer brief helps but isn't required. The brief covers three things (you don't have to name them; we'll infer them):

  • Who your agent is — "an investment-banking analyst agent", "a code-reviewer for a Python repo", "a shopping assistant."
  • What kind of tasks should grade it — "complex restructuring analysis", "debugging real Python bugs with stack traces", "laptop recommendations under a budget."
  • How strict the rubric should be on specific dimensions — "every numeric claim must cite its source document", "the final recommendation must be a single explicit choice", "runtime under 30 seconds per rollout."

Plus, optionally:

  • Registered Resources — a database schema, a file corpus, anything your agent needs to read from. Register those with datagen resources register (or the Resources tab in the UI) before you submit the brief, and reference them by name in the brief; a sketch follows this list. See Resources and sandboxes.
  • A source URL — a doc, a spec, a one-pager — if the brief makes more sense with context you'd rather link than paste inline.
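
The order of operations, sketched (the exact arguments to datagen resources register may differ from what's shown here; the file name is illustrative):

# Register the corpus first, then reference it by name in the brief.
datagen resources register recovery-analysis.xlsx
datagen datasets create --brief "… reconcile the disclosure statement against recovery-analysis.xlsx …"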

That's it. No form. No dropdowns for size or difficulty. The brief itself is how you describe those; we pick them up.

Submitting the brief

CLI and UI do the same thing here. The CLI is one command; the UI is a textarea plus an optional source-URL field and an auto-saved draft. Both submit the same request to the same endpoint. Pick by preference.

CLI:

datagen datasets create --brief "Complex restructuring analysis tasks for my investment-banking agent. Each task should require reconciling the disclosure statement against the DIP budget and identifying at least one discrepancy. Rubrics should reward citing specific rows of recovery-analysis.xlsx."

Returns a dataset id. Use datagen datasets watch <id> to follow progress, or datagen datasets get <id> for a one-shot status check.
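
Both follow-up commands take the returned id (ds_123 below is a placeholder):

datagen datasets watch ds_123   # streams waypoint updates until the preview is ready
datagen datasets get ds_123     # prints the current status once and exits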

UI: navigate to New Dataset. You get a textarea labeled "Describe your dataset," an optional source-URL field, an "example brief" side panel with a few sample briefs you can borrow from, and a Cancel button. Your draft is auto-saved as you type, so you can walk away and come back. Submit sends you straight to the dataset detail page where the warm waypoints start ticking over.

What the UI has that the CLI doesn't (and vice versa)

Neither surface has a conversational editor. The UI has a few small affordances the CLI doesn't:

  • Example-brief panel. A handful of sample briefs you can pick apart to calibrate what "a good brief" looks like. Useful the first time; unnecessary once you've submitted a few.
  • Draft auto-save. Close the tab mid-brief, come back tomorrow, the text is still there.
  • A brief_source_url field. You can point us at a doc or spec that accompanies the brief. (The CLI accepts this through --brief-source-url if you want parity.)

The CLI has one thing the UI doesn't:

  • Scriptability. datagen datasets create --format json returns the dataset id on stdout. You can pipe it, poll it, drive it from your CI, or hand the whole flow to a coding agent. See Claude Code workflow for why this matters.
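
A minimal CI sketch, assuming --format json emits an object with an id key (the key name is an assumption; check your version's actual output):

# $BRIEF holds your brief text. jq pulls the id out of the JSON response.
id=$(datagen datasets create --brief "$BRIEF" --format json | jq -r '.id')
datagen datasets watch "$id"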

Everything else — the waypoint progress, the preview render, the feedback mechanism, the approval — is feature-identical between the two surfaces. The UI reference walks through each screen; the CLI reference walks through each command.

Phase B: the quiet part

After you submit, you wait. We don't ask follow-up questions; we don't send status emails; we don't prompt you for more context. The warm waypoints are the entire customer-facing signal:

  • "Understanding your ask" — the analyzer is parsing your brief and working out the shape: what kind of tasks, what kind of rubric, what Resources you referenced.
  • "Designing your dataset" — the authoring pass is running. Drafting task instructions, authoring a rubric, wiring the environment, generating a reference solution per sample.
  • "Checking quality" — live gates run the draft tasks end-to-end against real sample rollouts to check the rubric actually grades meaningfully. Each rubric property has a check here.
  • "Preview ready" — three preview task folders are rendered and ready for you to look at.

Typical elapsed time: 5–30 minutes. Simpler datasets finish faster than ones with a sandbox to provision.

If the automated path can't produce a clean preview — ambiguous brief, unusual agent setup, a Resource that has quirks we haven't seen — the waypoint quietly shifts to "we're still refining this" and a TLDC engineer steps in. You don't have to do anything; you'll hear from us when it's ready.

Why we don't ask clarifying questions during Phase A. Chat-shaped clarification is tempting — "should the rubric be strict about units?" — but it generates worse data than letting the analyzer make a reasonable choice and showing you the consequence. You'll spot "rubric wasn't strict enough about units" in a sample in five seconds; you wouldn't have known the question mattered in the abstract.

Phase C: the preview and feedback loop

This is where iteration actually happens, and it's the part of the flow that does its real work. The preview isn't a mock or a summary; it's three real Harbor task directories rendered exactly as they'll appear in the final dataset. Open one: you get a task.toml with metadata, an instruction.md with the full task statement, an environment/Dockerfile for the runtime, a tests/test.sh verifier implementing the rubric, and a solution/solve.sh Oracle reference answer. For the deep dive on what each file is, see task.
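
On disk, one preview task looks roughly like this (directory name illustrative):

preview-task-01/
├── task.toml          # metadata
├── instruction.md     # full task statement
├── environment/
│   └── Dockerfile     # runtime
├── tests/
│   └── test.sh        # verifier implementing the rubric
└── solution/
    └── solve.sh       # Oracle reference answer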

Look at all three. Ask yourself three questions:

  1. Does the instruction describe the work I actually want my agent doing? Specific enough. Right level of difficulty. Touches the Resources it needs to. Produces the artifact I care about (a memo, a JSON payload, a file).
  2. Does the rubric reward what I care about? Each criterion should feel like a thing a reviewer with the source material would score pass/fail without argument. Weights should feel right. Nothing missing; nothing double-counted.
  3. Does the Oracle solution seem right? It's not a ceiling — your agent can do better — but if the Oracle doesn't solve the task in a reasonable way, the task is probably under-specified.

Then approve or give feedback.

Approving

datagen datasets approve <id> in the CLI, or Approve in the UI on the dataset detail page.

Approval spins up the full run. The authoring configuration that produced the preview is fixed; the same shape gets instantiated across the full task count. The waypoint shifts to "generating full dataset" and, at the end, "complete." Downloadable.
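
From the CLI, approving and then following the full run is two commands:

datagen datasets approve <id>
datagen datasets watch <id>    # follows "generating full dataset" through "complete"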

Giving feedback

Freeform text. Describe what's off and what you'd rather see. The factory re-authors against your feedback and produces a revised preview with the adjustments applied.

CLI:

datagen datasets feedback <id> "The rubric criteria on numerical accuracy should cite specific cells in recovery-analysis.xlsx, not just check that a number exists. Also, the instruction is too short — each task should require at least two distinct documents to solve."

UI: Give feedback on the dataset detail page opens a freeform textarea dialog. Same shape, same endpoint.

Good feedback is specific and diagnostic. It describes what to change in which task, and why the current version misses it. The one-sentence version: tell us what the rubric should be rewarding that it isn't, or what the instruction should be pinning down that it isn't.

Useless feedback is vague ("make it better," "this isn't quite right"). We can't act on it, and the round burns a revision cycle.

How many feedback rounds is normal

Most datasets converge in one or two rounds. Three is common. By round four, if you're still reshaping the rubric or the basic task shape, the brief itself was probably under-specified — it's worth stepping back and starting a new dataset with a sharper brief rather than pushing further. The CLI and UI don't cap feedback rounds, but the return on round five is usually low and it's cheaper to restart.

Why this shape

The authoring flow is deliberately asymmetric — a one-shot brief followed by preview-driven iteration — because that's where the real quality work happens. You can refine a brief from memory, but refining a rubric from memory is almost impossible. You need to look at a rendered criterion against a rendered task against a rendered Oracle solution, and then your feedback gets specific.

The turn-by-turn clarification shape we don't implement would move iteration into Phase A (before you have anything to react to), make Phase A slower, and produce worse rubrics. Keeping Phase A one-shot and letting Phase C absorb all the iteration is a deliberate design call, not a missing feature.

Where to go next