Task
The atomic unit of a dataset
A task is a single piece of work an agent can attempt and be scored on. A dataset is a collection of tasks; training a model on the dataset means running your agent against each task and using each score as the reward signal.
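The loop described above can be sketched in a few lines. This is only an illustration: `collect_rewards`, the `Agent` and `Grader` callables, and the `(task_id, instruction)` pair shape are hypothetical names chosen for the example, not part of Datagen Factory or Harbor.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical signatures, for illustration only.
Agent = Callable[[str], str]          # instruction -> agent output
Grader = Callable[[str, str], float]  # (task_id, output) -> score


def collect_rewards(
    tasks: Iterable[Tuple[str, str]],
    agent: Agent,
    grader: Grader,
) -> Dict[str, float]:
    """Run the agent once per task and record each score as that task's reward."""
    rewards: Dict[str, float] = {}
    for task_id, instruction in tasks:
        output = agent(instruction)                 # the agent attempts the task
        rewards[task_id] = grader(task_id, output)  # the score becomes the reward
    return rewards
```

How the rewards feed into a training update depends on the training setup and is out of scope for this sketch.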
What's in a task
Every task is self-contained — everything needed to run it and grade it travels together:
- An instruction. The input prompt that tells the agent what to do. It names the artifact to produce, fixes the inputs, and closes off shortcuts.
- An environment. The runtime the agent works inside — a container, optional MCP servers, and optional Resources like a Postgres database or a file corpus.
- A rubric. Binary, weighted criteria the agent's output is graded against. See Rubrics.
A good task is grounded (everything needed to answer is accessible), gradable (a rubric can reliably score the output), and non-trivial (requires synthesis across information planes).
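As a rough mental model, the three components listed above can be pictured as a small data structure. The class and field names below are hypothetical and do not reflect the actual Harbor schema. The `score` function shows one plausible way to combine binary, weighted criteria (a weight-normalized sum); the real aggregation rule is an assumption here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Criterion:
    description: str  # what the grader checks; the result is pass/fail
    weight: float     # relative importance of this criterion


@dataclass
class Environment:
    image: str                                            # container the agent runs in
    mcp_servers: List[str] = field(default_factory=list)  # optional
    resources: List[str] = field(default_factory=list)    # optional, e.g. a database


@dataclass
class Task:
    instruction: str
    environment: Environment
    rubric: List[Criterion]


def score(rubric: List[Criterion], passed: List[bool]) -> float:
    """Return the weight-normalized share of criteria that passed (assumed rule)."""
    if len(rubric) != len(passed):
        raise ValueError("need exactly one pass/fail result per criterion")
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    earned = sum(c.weight for c, ok in zip(rubric, passed) if ok)
    return earned / total
```

For example, a rubric with weights 2, 1, and 1 where only the first and third criteria pass would score 3 / 4 under this rule.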
How tasks are delivered
Datagen Factory authors tasks in the Harbor format by default, with each task stored as its own directory.
For a deeper dive on task.toml, MCP wiring, environment limits, and more, see the Harbor task structure docs.
The same tasks can also be exported as JSONL or Parquet. See Dataset formats.
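JSONL stores one JSON object per line, so an export can be read and written with the standard library alone. The sketch below shows a generic round trip. The record keys used in any real export are defined by the Dataset formats page, so treat whatever keys you pass in as placeholders; `to_jsonl` and `from_jsonl` are hypothetical helper names.

```python
import json
from typing import Any, Dict, Iterable, List


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse JSON Lines text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Because each line is an independent JSON document, a large export can also be processed one line at a time instead of being loaded whole.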
Next
- Rubrics — how the grader inside each task is authored.
- Dataset formats — Harbor directories, JSONL, Parquet, and the HuggingFace mirror.
- Harbor task format docs — the upstream spec.