Result Artifact Contract
AgentV writes each eval invocation as a portable run bundle. The bundle is the source of truth for Dashboard, reports, compare/trend tooling, CI gates, and external adapters.
The contract is run-centric:
- summary.json owns aggregate run facts.
- index.jsonl owns row-level discovery and filtering.
- Per-case sidecars own detailed payloads such as grading, metrics, transcripts, timing, generated files, and raw provider evidence.
- Dashboard, search, SQLite, HTML reports, and vendor exports are rebuildable projections over the bundle.
Directory Layout
The default local layout is:

    .agentv/results/
      <run_id>/
        summary.json
        index.jsonl
        tags.json                  # optional mutable Dashboard tags
        <case-or-allocation>/
          summary.json             # optional per-case aggregate, especially repeats
          test/                    # optional generated test bundle
            EVAL.yaml
            targets.yaml
            files/
            graders/
          run-1/
            result.json
            grading.json
            metrics.json
            timing.json
            transcript.json
            transcript-raw.jsonl
            outputs/
              answer.md
              file_changes.diff
          run-2/
            result.json
            grading.json
            metrics.json
            timing.json
            transcript.json
            transcript-raw.jsonl
            outputs/
              answer.md
              file_changes.diff
      .indexes/                    # reserved rebuildable/local indexes
      .cache/                      # reserved local cache

<run_id> is the only committed run-bundle path identity. It helps AgentV put
completed runs somewhere predictable, but readers must not infer semantic truth
from folder names. Use fields in summary.json and index.jsonl for
experiment, target, variant, attempt, eval path, case identity, timing, scores,
and artifact paths.
The run bundle does not add target, model, variant, or cases/ folders below
<run_id>. Per-result directories are allocated from row identity, usually with
a readable test-id or slug prefix plus a short hash suffix, and remain opaque to
consumers.
experiment is metadata: it is how users label a condition such as baseline,
candidate, with_skills, or without_skills. It is recorded in
summary.json and rows, not as a parent directory and not as a runtime-policy
object. If a bundle is copied, combined, published, or imported under a
different directory, its metadata still carries the facts consumers should
query.
Top-level dot-prefixed directories such as .indexes/ and .cache/ are
reserved for rebuildable local state and are skipped by run discovery.
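The skip rule can be sketched as a small filter over the entries listed under .agentv/results/. This is a minimal illustration, not AgentV's discovery code: the `DirEntry` shape and the `discoverRunIds` name are hypothetical, and only the "skip dot-prefixed directories" rule comes from the text above.

```typescript
// Hypothetical shape for one entry listed under .agentv/results/.
interface DirEntry {
  name: string;
  isDirectory: boolean;
}

// Return candidate run ids: directories whose names are not dot-prefixed.
// Dot-prefixed directories (.indexes/, .cache/) hold rebuildable local state.
export function discoverRunIds(entries: DirEntry[]): string[] {
  return entries
    .filter((entry) => entry.isDirectory && !entry.name.startsWith("."))
    .map((entry) => entry.name)
    .sort();
}
```

A real reader would presumably also confirm that each candidate contains summary.json and index.jsonl before treating it as a run.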
File Roles
| File or field | Owns | Use it for |
|---|---|---|
| summary.json | Aggregate run metadata and rollups: run id, experiment metadata, counts, pass rate, score summaries, duration, token/cost totals, and writer metadata. | Listing runs, CI summaries, quick dashboards, trend cards, and validating that a run is complete enough to inspect. |
| index.jsonl | Canonical row index: one row per result, attempt, or case-level aggregate, with identity fields, filter metadata, scores, status, and explicit run-relative paths to sidecars. | Filtering, compare/trend inputs, Dashboard detail routing, rerun/resume lookup, export adapters, and artifact discovery. |
| result.json | Compact per-attempt manifest for one attempt directory, including AgentV execution_status and verdict. | Loading one attempt without scanning the whole run index. |
| grading.json | Grader outputs, assertions, rubric evidence, execution-metric grader facts, and scoring provenance. | Explaining why a row passed or failed. |
| metrics.json | Derived executor behavior summary, such as tool calls, files touched, shell commands, errors, turns, and output sizes. | Dashboard behavior views, metric-style graders, adapter projections, and lightweight analysis. |
| outputs/file_changes.diff | Full unified diff of workspace file changes when file changes are captured. | Human review and external artifact inspection; LLM and script graders still receive the same full diff through file_changes. |
| timing.json | Duration, token usage, cost usage, and source labels such as provider_reported, token_estimated, aggregate, or unavailable. | Cost/latency reporting and provider-accounting audits. |
| transcript.json | AgentV-normalized transcript/timeline document with canonical tool_name values and transcript_summary. | Portable human review, transcript-aware graders, and tool-trajectory analysis. |
| transcript-raw.jsonl | Native provider or harness evidence when available. | Parser debugging, forensic review, and preserving source bytes without making provider schemas public AgentV fields. |
| test/ | Generated test bundle for the exact eval slice and target settings that produced a row. | Audit, external review, and rerun workflows that should not depend on a mutable source checkout. |
| artifact_pointers | Offload indirection for large detached payload bytes. | Finding payloads published outside the primary metadata/control-plane branch, such as transcript bytes on agentv/artifacts/v1. |

summary.json and index.jsonl are complementary, not redundant. A run list
should not scan every row just to show pass rate or total duration, and a row
reader should not parse aggregate summary structures to find one case’s grading
or transcript. Keep aggregate questions on summary.json; keep row and artifact
discovery on index.jsonl.
Row Contract
Each index.jsonl line is a JSON object. The exact field set grows as AgentV
adds providers and projections, but stable rows follow these rules:
- Field names are snake_case.
- Identity and filter fields live on the row, not only in directory names.
- Sidecar references are explicit path fields, relative to the run directory.
- Large detached payloads may also have artifact_pointers, but ordinary sidecars should still be discoverable through path fields.
- Unknown fields should be preserved by adapters when they rewrite or project rows.
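One way an adapter can honor the last rule is to spread the original row before applying its own changes, so fields it does not know about pass through untouched. The sketch below is illustrative only: `IndexRow`, `rewriteRow`, and `rewriteLine` are hypothetical names, and the handful of typed fields is a subset picked for the example, not the full row schema.

```typescript
// Loosely typed row: a few known fields plus an index signature so that
// fields added by newer writers or other adapters remain representable.
interface IndexRow {
  test_id?: string;
  target?: string;
  score?: number;
  [field: string]: unknown;
}

// Apply changes on top of the original row; every other field is carried over.
export function rewriteRow(row: IndexRow, changes: IndexRow): IndexRow {
  return { ...row, ...changes };
}

// Same idea for one raw index.jsonl line: parse, rewrite, serialize.
export function rewriteLine(line: string, changes: IndexRow): string {
  return JSON.stringify(rewriteRow(JSON.parse(line) as IndexRow, changes));
}
```

Because the spread copies all own enumerable properties, a field such as a vendor-specific annotation survives the rewrite even though the adapter never names it.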
Example row:
    {
      "timestamp": "2026-06-30T08:15:00.000Z",
      "run_id": "2026-06-30T08-15-00-000Z",
      "experiment": "with_skills",
      "tags": { "experiment": "with_skills", "team": "support" },
      "eval_path": "evals/support/refunds.eval.yaml",
      "test_id": "refund-eligibility",
      "target": "codex-gpt5",
      "variant": "skills-v2",
      "attempt": 1,
      "execution_status": "ok",
      "score": 0.92,
      "duration_ms": 184200,
      "result_dir": "refund-eligibility--4f9a7c2d1b6e",
      "summary_path": "refund-eligibility--4f9a7c2d1b6e/summary.json",
      "grading_path": "refund-eligibility--4f9a7c2d1b6e/run-1/grading.json",
      "metrics_path": "refund-eligibility--4f9a7c2d1b6e/run-1/metrics.json",
      "timing_path": "refund-eligibility--4f9a7c2d1b6e/run-1/timing.json",
      "transcript_path": "refund-eligibility--4f9a7c2d1b6e/run-1/transcript.json",
      "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/run-1/transcript-raw.jsonl",
      "transcript_summary": {
        "total_turns": 4,
        "tool_calls": { "file_read": 2, "shell": 1, "unknown": 0 },
        "files_read": ["src/refunds.ts"],
        "files_modified": ["src/refunds.ts"],
        "shell_commands": ["bun test refunds.test.ts"],
        "web_fetches": [],
        "errors": [],
        "thinking_blocks": 1
      },
      "output_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/answer.md",
      "answer_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/answer.md",
      "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/file_changes.diff",
      "test_dir": "refund-eligibility--4f9a7c2d1b6e/test"
    }

Rows can represent repeated attempts, multi-target runs, imported suites,
manual prepare/grade attempts, or imported provider sessions. That is why
experiment, eval_path, test_id, target, variant, attempt, and
source metadata belong in index.jsonl: tools can filter dynamically without
requiring every run to be pre-split into semantic folders.
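Dynamic filtering over those identity fields might look like the following. This is a sketch under assumptions: `matchesFilter`, `RowFilter`, and `FILTER_FIELDS` are hypothetical names, only string-valued identity fields are covered (the numeric attempt field is left out for brevity), and matching is exact equality.

```typescript
type Row = Record<string, unknown>;

// String-valued identity fields named in the row contract.
const FILTER_FIELDS = ["experiment", "eval_path", "test_id", "target", "variant"] as const;
type FilterField = (typeof FILTER_FIELDS)[number];
type RowFilter = Partial<Record<FilterField, string>>;

// A row matches when every field set on the filter equals the row's value.
// Fields left out of the filter are not constrained.
export function matchesFilter(row: Row, filter: RowFilter): boolean {
  return FILTER_FIELDS.every(
    (field) => filter[field] === undefined || row[field] === filter[field],
  );
}
```

For example, `matchesFilter(row, { experiment: "with_skills", target: "codex-gpt5" })` selects one condition and target from a mixed run without relying on any directory name.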
When a run resolves a promptfoo-shaped tags map (from suite tags, project
config tags, or --tag key=value), the resolved map is emitted as tags on
each row and as summary.json.metadata.tags. Its reserved experiment key
matches the row experiment field, so trend/compare views can group by
tags.experiment.
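Grouping by tags.experiment could be sketched as below. The function names are hypothetical, the fallback to the row-level experiment field relies on the statement above that the two match, and the "(untagged)" bucket label is an assumption made for the example.

```typescript
type Row = Record<string, unknown>;

// Read tags.experiment, falling back to the row-level experiment field.
export function experimentOf(row: Row): string {
  const tags = row.tags;
  if (typeof tags === "object" && tags !== null) {
    const tagged = (tags as Record<string, unknown>).experiment;
    if (typeof tagged === "string") return tagged;
  }
  return typeof row.experiment === "string" ? row.experiment : "(untagged)";
}

// Mean score per experiment; rows without a numeric score are skipped.
export function meanScoreByExperiment(rows: Row[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const row of rows) {
    if (typeof row.score !== "number") continue;
    const key = experimentOf(row);
    const entry = totals.get(key) ?? { sum: 0, count: 0 };
    entry.sum += row.score;
    entry.count += 1;
    totals.set(key, entry);
  }
  const means = new Map<string, number>();
  totals.forEach((entry, key) => means.set(key, entry.sum / entry.count));
  return means;
}
```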
The run-1/, run-2/, and later folders under a result directory are artifact
attempt/execution folders. Do not treat those folder names as the comparison
dimension. Repeated stochastic samples should be represented by explicit
metadata such as sample_index and sample_count; infrastructure retries
should use retry metadata such as retry_index, retry_count, and
retry_reason when available.
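A reader can label an execution from that metadata rather than from the run-N folder name. The sketch below is illustrative: `describeExecution` is a hypothetical helper, the metadata fields are treated as optional, and whether the indexes are zero- or one-based is not stated above, so the values are printed as recorded.

```typescript
type Row = Record<string, unknown>;

// Describe an execution using explicit row metadata only, never folder names.
export function describeExecution(row: Row): string {
  if (typeof row.retry_index === "number") {
    const reason = typeof row.retry_reason === "string" ? ` (${row.retry_reason})` : "";
    return `retry ${row.retry_index}${reason}`;
  }
  if (typeof row.sample_index === "number") {
    const total = typeof row.sample_count === "number" ? ` of ${row.sample_count}` : "";
    return `sample ${row.sample_index}${total}`;
  }
  return "single execution";
}
```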
Reader Rules
Consumers should read a bundle in this order:

1. Resolve the run directory from either a directory path or an index.jsonl path.
2. Load summary.json for aggregate metadata and run-level display.
3. Stream index.jsonl for row identity, filters, status, scores, and sidecar paths.
4. Resolve sidecar paths relative to the run directory.
5. Rebuild any local cache, search index, SQLite table, static report, or vendor projection from summary.json, index.jsonl, and sidecars.
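The first step, accepting either a run directory or an index.jsonl path, can be sketched in a few lines. `resolveRunDir` is a hypothetical helper name, and the check is purely on the file name; it does not touch the filesystem or verify that the directory is a real run.

```typescript
import * as path from "node:path";

// Accept either a run directory or a path to its index.jsonl and
// return the run directory in both cases.
export function resolveRunDir(input: string): string {
  return path.basename(input) === "index.jsonl" ? path.dirname(input) : input;
}
```

With the run directory in hand, the remaining steps read summary.json, stream index.jsonl, and resolve sidecars relative to that directory.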
Do not reconstruct paths from suite, name, test_id, target, or
directory names. result_dir is readable when possible, but it is still an
opaque run-local allocation that may be suffixed or otherwise changed to avoid
collisions.
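In code, that means joining the run directory with whatever path field the row carries and nothing else. A minimal sketch, assuming rows have already been parsed into plain objects; `resolveSidecar` is a hypothetical helper name.

```typescript
import * as path from "node:path";

type Row = Record<string, unknown>;

// Resolve a sidecar path field (for example "grading_path") against the run
// directory. Returns undefined when the row does not carry that field, rather
// than guessing a location from test_id, target, or result_dir.
export function resolveSidecar(runDir: string, row: Row, field: string): string | undefined {
  const relative = row[field];
  if (typeof relative !== "string" || relative.length === 0) return undefined;
  return path.join(runDir, relative);
}
```

Returning undefined for a missing field keeps the caller honest: an absent sidecar is reported as absent instead of being looked up at a reconstructed path that may not exist.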
Do not treat derived artifacts as canonical:
- Dashboard indexes are caches over the run bundle.
- Search indexes are caches over rows and sidecars.
- SQLite databases are query accelerators.
- HTML reports are renderings.
- Vendor-neutral projection bundles are adapter handoffs.
- Phoenix, Langfuse, Opik, or other backend views are external projections or correlations, not AgentV’s source of truth.
User Examples
Run an eval and inspect the portable bundle:

    agentv eval evals/support/refunds.eval.yaml --experiment with_skills
    ls .agentv/results/<run_id>
    cat .agentv/results/<run_id>/summary.json
    cat .agentv/results/<run_id>/index.jsonl

Find failed rows without loading every sidecar:

    jq -r 'select(.execution_status != "ok" or .score < 0.5) | [.eval_path, .test_id, .target, .grading_path] | @tsv' \
      .agentv/results/<run_id>/index.jsonl

Compare two completed runs by their row indexes:

    agentv compare \
      .agentv/results/<baseline-run-id>/index.jsonl \
      .agentv/results/<candidate-run-id>/index.jsonl

Generate a shareable report from the same canonical bundle:

    agentv results report .agentv/results/<run_id>

Integration Author Examples
An adapter that exports run results should treat index.jsonl as the row
catalog:

    import { createReadStream } from "node:fs";
    import path from "node:path";
    import { createInterface } from "node:readline";

    // Stream index.jsonl one row at a time instead of loading the whole file.
    export async function* rows(runDir: string) {
      const rl = createInterface({
        input: createReadStream(path.join(runDir, "index.jsonl"), "utf8"),
        crlfDelay: Infinity,
      });

      for await (const line of rl) {
        if (!line.trim()) continue; // skip blank lines
        yield JSON.parse(line) as Record<string, unknown>;
      }
    }

    // Resolve each row's grading sidecar relative to the run directory.
    for await (const row of rows(".agentv/results/2026-run")) {
      const gradingPath = row.grading_path;
      if (typeof gradingPath === "string") {
        console.log(path.join(".agentv/results/2026-run", gradingPath));
      }
    }

Adapter guidance:
- Preserve unknown row fields when possible.
- Prefer path fields such as grading_path, metrics_path, timing_path, transcript_path, and transcript_raw_path over ad hoc path construction.
- Use artifact_pointers only for detached payload lookup; do not make pointers the discovery path for ordinary sidecars that are present in the run tree.
- If you build a database or search index, store enough source metadata to rebuild it from the run bundle and invalidate it when summary.json or index.jsonl changes.
- Keep backend-specific anonymization, upload, and schema mapping in the adapter layer. AgentV’s canonical bundle remains backend-neutral.
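The invalidation point can be sketched with a small source fingerprint stored next to the derived index. Everything here is an assumption for illustration: `SourceStamp`, `stampOf`, and `isStale` are hypothetical names, and comparing modification time plus size is a cheap heuristic; a content hash would be stricter.

```typescript
import { statSync } from "node:fs";
import * as path from "node:path";

// Hypothetical fingerprint an adapter might store alongside its derived index.
interface SourceStamp {
  summaryMtimeMs: number;
  summarySize: number;
  indexMtimeMs: number;
  indexSize: number;
}

// Capture the current state of the two canonical files in a run directory.
export function stampOf(runDir: string): SourceStamp {
  const summary = statSync(path.join(runDir, "summary.json"));
  const index = statSync(path.join(runDir, "index.jsonl"));
  return {
    summaryMtimeMs: summary.mtimeMs,
    summarySize: summary.size,
    indexMtimeMs: index.mtimeMs,
    indexSize: index.size,
  };
}

// The derived index is stale when nothing was stored yet or any value differs.
export function isStale(stored: SourceStamp | undefined, current: SourceStamp): boolean {
  if (stored === undefined) return true;
  return (
    stored.summaryMtimeMs !== current.summaryMtimeMs ||
    stored.summarySize !== current.summarySize ||
    stored.indexMtimeMs !== current.indexMtimeMs ||
    stored.indexSize !== current.indexSize
  );
}
```

An adapter would call `stampOf(runDir)` before serving from its database or search index and rebuild from the bundle whenever `isStale` returns true.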