Dataset formats
A delivered dataset is a small bundle of artifact files. The same items are exported in two formats — JSONL for streaming and spot-checking, Parquet for columnar reads — alongside a quality report rendered as both Markdown (for humans) and JSON (for tooling). If you've connected a HuggingFace token, we also push a parquet-backed dataset repo you own.
What you receive
Every delivered dataset ships with four artifacts, described below.
The `<dataset_id>` prefix is the same identifier returned by `datagen datasets list` and embedded in each item under the `pack_id` field. The four files always travel together; there is no separate "summary" file or directory wrapper. `datagen datasets download` writes one of them at a time; pass `--file-format` to pick which.
<dataset_id>.jsonl
One JSON object per line; each line is a full dataset item:
```json
{
  "id": "item_01J8...",
  "pack_id": "ds_01J8F9Z3Q7K2M",
  "query": "Draft the regulatory-citation section of the memo...",
  "query_metadata": {
    "axes": {"jurisdiction": "us", "bank_size": "mid"},
    "domain": "regulatory-memo"
  },
  "golden": {"text": "...", "score": 0.94},
  "mid": {"text": "...", "score": 0.71},
  "poor": {"text": "...", "score": 0.32},
  "rubric": {
    "sections": [
      {"id": "citation_accuracy", "criteria": [
        {"id": "cites_correct_regulation", "weight": 3, "requirement": "..."},
        {"id": "cites_correct_section", "weight": 2, "requirement": "..."}
      ]}
    ],
    "criteria_count": 24,
    "p_plus": 0.78
  },
  "provenance": { /* template-shaped, see below */ }
}
```

Field roles:
- `id`: stable item identifier.
- `pack_id`: dataset identifier shared by every item in the same run.
- `query`: the prompt the item was scored against.
- `query_metadata.axes`: the axis values this item was generated under (jurisdiction, difficulty, document type, whatever your Brief specifies). `query_metadata.domain` is the dataset domain string.
- `golden` / `mid` / `poor`: three scored outputs spanning the quality range. Each carries the generated `text` and a rubric `score` in `[0, 1]`.
- `rubric`: the binary-criterion rubric scored against the three outputs, grouped into sections. `criteria_count` mirrors the total criterion count; `p_plus` is the empirical pass-plus rate when present.
- `provenance`: varies by item template. Axis-based items carry `seed_hash` and `generation_models`; document- and environment-grounded items carry `source_chunks`, `source_assets`, and `hydrated_at`; exemplar-based items carry `exemplar_id`. Hybrid items carry all of the above. Empty fields are omitted.
JSONL is the most permissive format — small, line-streamable, trivially filterable with jq.
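As a quick illustration, here is a minimal Python sketch of that kind of slice: stream the file and keep the items generated under one axis value. The file name and axis key are placeholders taken from the example item above.

```python
import json

# Keep only items generated under jurisdiction=us.
# The file name and axis key are illustrative placeholders.
with open("ds_01J8F9Z3Q7K2M.jsonl") as f:
    us_items = [
        item for item in map(json.loads, f)
        if item["query_metadata"]["axes"].get("jurisdiction") == "us"
    ]

print(f"kept {len(us_items)} items")
```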
<dataset_id>.parquet
The same items in columnar form, written with a fixed PyArrow schema and 1000-row batches. Identical field set to the JSONL, with one shape difference: query_metadata.axes is encoded as a list of [key, value] pairs instead of a map, so it round-trips through engines (DuckDB, Polars, BigQuery) that struggle with arbitrary-key map columns.
Reach for Parquet when you're doing columnar reads, joining against other tables, or loading into a training pipeline at scale.
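As a sketch, reading the file with PyArrow and rebuilding the axes mapping per item might look like this. The file name is a placeholder, and the `dict(...)` call assumes the pairs deserialize as two-element sequences; if your reader surfaces key/value structs instead, unpack `pair["key"]` and `pair["value"]`.

```python
import pyarrow.parquet as pq

# The file name is an illustrative placeholder.
table = pq.read_table("ds_01J8F9Z3Q7K2M.parquet")

rows = table.to_pylist()
for row in rows:
    # axes arrives as a list of [key, value] pairs rather than a map;
    # rebuild a plain dict per item before filtering on it.
    row["query_metadata"]["axes"] = dict(row["query_metadata"]["axes"])

mid_scores = [row["mid"]["score"] for row in rows]
```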
<dataset_id>_quality.md
The human-readable quality report. Plain Markdown — open it in any viewer. Sections:
- Summary — item count, dataset ID, generated-at timestamp, funnel conversion rate.
- Funnel — per-stage row counts (query → grounding → generation → scoring → final), so you can see how many candidates we generated to land the items you kept.
- Score Distributions — average / median / min / max scores for the golden, mid, and poor tracks.
- Rubric Compliance — fraction of items whose serialized rubric matches its declared criterion count.
- Axis Coverage — `<axis>:<value>` counts, sorted by frequency. Tells you which corners of the space are well-covered and which are thin.
- Cost Breakdown — per-stage and per-model spend, total spend, and cost per item (when cost telemetry is available for the run).
- Sample Walkthroughs — five representative items: highest and lowest golden score, and the items closest to the median on each track. Includes the query, three track scores, the rubric section IDs, and the reason that item was picked.
<dataset_id>_quality.json
The same report, machine-readable. A single JSON object with the fields above (`golden_avg_score`, `mid_avg_score`, `poor_avg_score`, `rubric_compliance_rate`, `funnel_conversion`, `score_distributions`, `axis_coverage`, `funnel_stages`, `cost_breakdown`, `cost_total_usd`, `cost_per_item_usd`, `sample_walkthroughs`, `item_count`, `pack_id`, `generated_at`). Use it when you want to gate a regenerate-with-feedback step on a specific metric without parsing the Markdown.
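For example, a minimal gate might look like the sketch below; the file name and threshold are illustrative, and the metric name comes from the field list above.

```python
import json
import sys

# File name and threshold are illustrative placeholders.
with open("ds_01J8F9Z3Q7K2M_quality.json") as f:
    report = json.load(f)

# Exit non-zero when rubric compliance dips below the floor, so the
# surrounding pipeline can trigger a regenerate-with-feedback pass.
if report["rubric_compliance_rate"] < 0.95:
    print(f"rubric compliance {report['rubric_compliance_rate']:.2f} is below 0.95")
    sys.exit(1)
```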
Downloading
datagen datasets download <dataset_id> writes one artifact at a time. Pass --file-format to pick which one:
| `--file-format` | What you get |
|---|---|
| `jsonl` | `<dataset_id>.jsonl` |
| `parquet` | `<dataset_id>.parquet` |
| `markdown` | `<dataset_id>_quality.md` |
| `json` | `<dataset_id>_quality.json` |
If you don't pass the flag, the CLI picks the first available artifact, which in practice is the JSONL. Run the command four times — or hit the underlying /datasets/{id}/download endpoint — when you want the full bundle.
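A minimal sketch of fetching the full bundle from a script, assuming the datagen CLI is on your PATH; the dataset ID is a placeholder, and the four format values come from the table above.

```python
import subprocess

dataset_id = "ds_01J8F9Z3Q7K2M"  # illustrative placeholder

# One invocation per artifact; the CLI writes one file at a time.
for file_format in ("jsonl", "parquet", "markdown", "json"):
    subprocess.run(
        ["datagen", "datasets", "download", dataset_id,
         "--file-format", file_format],
        check=True,  # raise if any single download fails
    )
```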
HuggingFace mirror
If your account has a HuggingFace token configured, every completed dataset is also pushed to a private dataset repo you own:
- Repo path: `datagen-factory/<your-org-slug>/<domain>`
- `data/train.parquet` — the same Parquet artifact described above
- `README.md` — an auto-generated dataset card (domain, item count, template type, generated-at, axis coverage table, licence note)
The upload is fire-and-forget after packaging completes — the dataset is delivered whether or not the HuggingFace push succeeds, and uploads retry on transient failures before giving up. Re-running the same dataset overwrites the files in the existing repo.
Configure your HuggingFace token by talking to support; we don't expose a CLI command for it yet.
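Once the mirror exists, loading it back with the datasets library might look like the sketch below. The repo path is a placeholder; substitute the repo created for your account, and note that the repo is private, so your stored HuggingFace token must have access.

```python
from datasets import load_dataset

# Placeholder repo path; substitute the repo created for your account
# (see the path pattern above). token=True reuses your stored
# HuggingFace credentials, which the private repo requires.
ds = load_dataset("datagen-factory/your-org-slug/your-domain", token=True)

# data/train.parquet surfaces as the "train" split.
print(ds["train"][0]["query"])
```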