Result Artifact Contract

AgentV writes each eval invocation as a portable run bundle. The bundle is the source of truth for Dashboard, reports, compare/trend tooling, CI gates, and external adapters.

The contract is run-centric:

summary.json owns aggregate run facts.
index.jsonl owns row-level discovery and filtering.
Per-case sidecars own detailed payloads such as grading, metrics, transcripts, timing, generated files, and raw provider evidence.
Dashboard, search, SQLite, HTML reports, and vendor exports are rebuildable projections over the bundle.

Directory Layout

The default local layout is:

.agentv/results/
  <run_id>/
    summary.json
    index.jsonl
    tags.json                 # optional mutable Dashboard tags
    <case-or-allocation>/
      summary.json            # optional per-case aggregate, especially repeats
      test/                   # optional generated test bundle
        EVAL.yaml
        targets.yaml
        files/
        graders/
      run-1/
        result.json
        grading.json
        metrics.json
        timing.json
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff
      run-2/
        result.json
        grading.json
        metrics.json
        timing.json
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff
  .indexes/                  # reserved rebuildable/local indexes
  .cache/                    # reserved local cache

<run_id> is the only path component whose identity the contract commits to. It gives AgentV a predictable place to put completed runs, but readers must not infer meaning from folder names. Read experiment, target, variant, attempt, eval path, case identity, timing, scores, and artifact paths from fields in summary.json and index.jsonl instead.

The run bundle does not add target, model, variant, or cases/ folders below <run_id>. Per-result directories are allocated from row identity, usually with a readable test-id or slug prefix plus a short hash suffix, and remain opaque to consumers.

experiment is metadata: it is how users label a condition such as baseline, candidate, with_skills, or without_skills. It is recorded in summary.json and rows, not as a parent directory and not as a runtime-policy object. If a bundle is copied, combined, published, or imported under a different directory, its metadata still carries the facts consumers should query.

Top-level dot-prefixed directories such as .indexes/ and .cache/ are reserved for rebuildable local state and are skipped by run discovery.
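
The skip rule above can be sketched as a small filter over a listing of the results root. This is a minimal sketch, not AgentV's discovery code: `selectRunIds` and the `DirEntry` shape are hypothetical names introduced for illustration, and treating every remaining directory as a run is an assumption.

```typescript
// Subset of node:fs Dirent that this sketch needs (hypothetical local type).
interface DirEntry {
  name: string;
  isDirectory(): boolean;
}

// Hypothetical helper: keep directories that could be run bundles and
// skip reserved dot-prefixed entries such as .indexes/ and .cache/.
function selectRunIds(entries: DirEntry[]): string[] {
  return entries
    .filter((entry) => entry.isDirectory())
    .filter((entry) => !entry.name.startsWith("."))
    .map((entry) => entry.name)
    .sort();
}
```

With Node, the entries could come from `readdirSync(".agentv/results", { withFileTypes: true })`. A stricter reader might also require index.jsonl to exist inside each candidate before listing it.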

File Roles

File or field	Owns	Use it for
`summary.json`	Aggregate run metadata and rollups: run id, experiment metadata, counts, pass rate, score summaries, duration, token/cost totals, and writer metadata.	Listing runs, CI summaries, quick dashboards, trend cards, and validating that a run is complete enough to inspect.
`index.jsonl`	Canonical row index: one row per result, attempt, or case-level aggregate, with identity fields, filter metadata, scores, status, and explicit run-relative paths to sidecars.	Filtering, compare/trend inputs, Dashboard detail routing, rerun/resume lookup, export adapters, and artifact discovery.
`result.json`	Compact per-attempt manifest for one attempt directory, including AgentV `execution_status` and `verdict`.	Loading one attempt without scanning the whole run index.
`grading.json`	Grader outputs, assertions, rubric evidence, execution-metric grader facts, and scoring provenance.	Explaining why a row passed or failed.
`metrics.json`	Derived executor behavior summary, such as tool calls, files touched, shell commands, errors, turns, and output sizes.	Dashboard behavior views, metric-style graders, adapter projections, and lightweight analysis.
`outputs/file_changes.diff`	Full unified diff of workspace file changes when file changes are captured.	Human review and external artifact inspection; LLM and script graders still receive the same full diff through `file_changes`.
`timing.json`	Duration, token usage, cost usage, and source labels such as `provider_reported`, `token_estimated`, `aggregate`, or `unavailable`.	Cost/latency reporting and provider-accounting audits.
`transcript.json`	AgentV-normalized transcript/timeline document with canonical `tool_name` values and `transcript_summary`.	Portable human review, transcript-aware graders, and tool-trajectory analysis.
`transcript-raw.jsonl`	Native provider or harness evidence when available.	Parser debugging, forensic review, and preserving source bytes without making provider schemas public AgentV fields.
`test/`	Generated test bundle for the exact eval slice and target settings that produced a row.	Audit, external review, and rerun workflows that should not depend on a mutable source checkout.
`artifact_pointers`	Offload indirection for large detached payload bytes.	Finding payloads published outside the primary metadata/control-plane branch, such as transcript bytes on `agentv/artifacts/v1`.

summary.json and index.jsonl are complementary, not redundant. A run list should not scan every row just to show pass rate or total duration, and a row reader should not parse aggregate summary structures to find one case’s grading or transcript. Keep aggregate questions on summary.json; keep row and artifact discovery on index.jsonl.

Row Contract

Each index.jsonl line is a JSON object. The exact field set grows as AgentV adds providers and projections, but stable rows follow these rules:

Field names are snake_case.
Identity and filter fields live on the row, not only in directory names.
Sidecar references are explicit path fields, relative to the run directory.
Large detached payloads may also have artifact_pointers, but ordinary sidecars should still be discoverable through path fields.
Unknown fields should be preserved by adapters when they rewrite or project rows.
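
Two of the rules above, snake_case names and preserving unknown fields, lend themselves to a short sketch. The helper names and the `backend_id` field are hypothetical, and the regular expression is one reasonable reading of "snake_case", not a pattern AgentV is known to enforce.

```typescript
type IndexRow = Record<string, unknown>;

// Assumed definition of snake_case: lowercase words joined by single underscores.
const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Hypothetical check: list top-level keys that break the naming rule.
function nonSnakeCaseKeys(row: IndexRow): string[] {
  return Object.keys(row).filter((key) => !SNAKE_CASE.test(key));
}

// Hypothetical adapter step: attach a backend identifier while carrying
// over every existing field, including ones this adapter does not know.
function withBackendId(row: IndexRow, backendId: string): IndexRow {
  // Spread first so unknown fields survive the rewrite unchanged.
  return { ...row, backend_id: backendId };
}
```

Copying the whole row and then adding to it, rather than building a new object from a fixed list of fields, is what keeps fields introduced by later AgentV versions intact.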

Example row:

{
  "timestamp": "2026-06-30T08:15:00.000Z",
  "run_id": "2026-06-30T08-15-00-000Z",
  "experiment": "with_skills",
  "tags": { "experiment": "with_skills", "team": "support" },
  "eval_path": "evals/support/refunds.eval.yaml",
  "test_id": "refund-eligibility",
  "target": "codex-gpt5",
  "variant": "skills-v2",
  "attempt": 1,
  "execution_status": "ok",
  "score": 0.92,
  "duration_ms": 184200,
  "result_dir": "refund-eligibility--4f9a7c2d1b6e",
  "summary_path": "refund-eligibility--4f9a7c2d1b6e/summary.json",
  "grading_path": "refund-eligibility--4f9a7c2d1b6e/run-1/grading.json",
  "metrics_path": "refund-eligibility--4f9a7c2d1b6e/run-1/metrics.json",
  "timing_path": "refund-eligibility--4f9a7c2d1b6e/run-1/timing.json",
  "transcript_path": "refund-eligibility--4f9a7c2d1b6e/run-1/transcript.json",
  "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/run-1/transcript-raw.jsonl",
  "transcript_summary": {
    "total_turns": 4,
    "tool_calls": { "file_read": 2, "shell": 1, "unknown": 0 },
    "files_read": ["src/refunds.ts"],
    "files_modified": ["src/refunds.ts"],
    "shell_commands": ["bun test refunds.test.ts"],
    "web_fetches": [],
    "errors": [],
    "thinking_blocks": 1
  },
  "output_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/answer.md",
  "answer_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/answer.md",
  "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/file_changes.diff",
  "test_dir": "refund-eligibility--4f9a7c2d1b6e/test"
}

Rows can represent repeated attempts, multi-target runs, imported suites, manual prepare/grade attempts, or imported provider sessions. That is why experiment, eval_path, test_id, target, variant, attempt, and source metadata belong in index.jsonl: tools can filter dynamically without requiring every run to be pre-split into semantic folders.
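
As a rough illustration of that kind of dynamic filtering, the sketch below selects rows by any combination of the identity fields named above. `RowFilter`, `matchesFilter`, and `filterRows` are hypothetical names, and exact equality is an assumed matching rule.

```typescript
type Row = Record<string, unknown>;

// Dimensions named in the text above; every one is optional in a filter.
type RowFilter = {
  experiment?: string;
  eval_path?: string;
  test_id?: string;
  target?: string;
  variant?: string;
  attempt?: number;
};

// Hypothetical helper: true when the row equals the filter on every
// dimension the filter actually sets.
function matchesFilter(row: Row, filter: RowFilter): boolean {
  return Object.entries(filter).every(
    ([field, wanted]) => wanted === undefined || row[field] === wanted,
  );
}

// Hypothetical helper: apply the filter to rows already read from index.jsonl.
function filterRows(rows: Row[], filter: RowFilter): Row[] {
  return rows.filter((row) => matchesFilter(row, filter));
}
```

For example, `filterRows(rows, { experiment: "with_skills", target: "codex-gpt5" })` keeps only rows for that experiment and target, without any assumption about how the run directory is laid out.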

When a run resolves a promptfoo-shaped tags map (from suite tags, project config tags, or --tag key=value), the resolved map is emitted as tags on each row and as summary.json.metadata.tags. Its reserved experiment key matches the row experiment field, so trend/compare views can group by tags.experiment.
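
Grouping by tags.experiment might look like the sketch below, which averages score per experiment label. `meanScoreByExperiment` is a hypothetical helper, and skipping rows that lack a string tags.experiment or a numeric score is an assumption about how a trend view would treat incomplete rows.

```typescript
type Row = Record<string, unknown>;

// Hypothetical helper: mean score per tags.experiment value.
function meanScoreByExperiment(rows: Row[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();

  for (const row of rows) {
    const tags = row.tags;
    const score = row.score;
    if (typeof tags !== "object" || tags === null) continue;
    const experiment = (tags as Record<string, unknown>).experiment;
    // Assumption: rows without a usable label or score are left out.
    if (typeof experiment !== "string" || typeof score !== "number") continue;

    const running = totals.get(experiment) ?? { sum: 0, count: 0 };
    running.sum += score;
    running.count += 1;
    totals.set(experiment, running);
  }

  const means = new Map<string, number>();
  totals.forEach((running, experiment) => {
    means.set(experiment, running.sum / running.count);
  });
  return means;
}
```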

The run-1/, run-2/, and later folders under a result directory hold the artifacts for each attempt or execution. Do not treat those folder names as the comparison dimension. Repeated stochastic samples should be represented by explicit metadata such as sample_index and sample_count; infrastructure retries should use retry metadata such as retry_index, retry_count, and retry_reason when available.

Reader Rules

Consumers should read a bundle in this order:

1. Resolve the run directory from either a directory path or an index.jsonl path.
2. Load summary.json for aggregate metadata and run-level display.
3. Stream index.jsonl for row identity, filters, status, scores, and sidecar paths.
4. Resolve sidecar paths relative to the run directory.
5. Redo any local cache, search index, SQLite table, static report, or vendor projection from summary.json, index.jsonl, and sidecars.
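
The two path-resolution steps in that sequence can be sketched as pure helpers. `resolveRunDir` and `resolveSidecar` are hypothetical names, and recognizing an index path by its file name alone is an assumption made to keep the sketch free of file-system calls.

```typescript
import { basename, dirname, join } from "node:path";

// Hypothetical helper: accept either a run directory or a path to its
// index.jsonl and return the run directory.
function resolveRunDir(input: string): string {
  return basename(input) === "index.jsonl" ? dirname(input) : input;
}

// Hypothetical helper: sidecar paths in a row are relative to the run
// directory, so they are joined onto it rather than rebuilt from ids.
function resolveSidecar(runDir: string, relativePath: string): string {
  return join(runDir, relativePath);
}
```

A caller would pass a row's grading_path or transcript_path straight to `resolveSidecar` and never assemble the path from test_id or target.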

Do not reconstruct paths from suite, name, test_id, target, or directory names. result_dir is readable when possible, but it is still an opaque run-local allocation that may be suffixed or otherwise changed to avoid collisions.

Do not treat derived artifacts as canonical:

Dashboard indexes are caches over the run bundle.
Search indexes are caches over rows and sidecars.
SQLite databases are query accelerators.
HTML reports are renderings.
Vendor-neutral projection bundles are adapter handoffs.
Phoenix, Langfuse, Opik, or other backend views are external projections or correlations, not AgentV’s source of truth.

User Examples

Run an eval and inspect the portable bundle:

agentv eval evals/support/refunds.eval.yaml --experiment with_skills
ls .agentv/results/<run_id>
cat .agentv/results/<run_id>/summary.json
cat .agentv/results/<run_id>/index.jsonl

Find failed rows without loading every sidecar:

jq -r 'select(.execution_status != "ok" or .score < 0.5) |
  [.eval_path, .test_id, .target, .grading_path] | @tsv' \
  .agentv/results/<run_id>/index.jsonl

Compare two completed runs by their row indexes:

agentv compare \
  .agentv/results/<baseline-run-id>/index.jsonl \
  .agentv/results/<candidate-run-id>/index.jsonl
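
For integrations that need a comparable view outside the CLI, a row-level comparison can be sketched as a join on identity fields. This is not a description of how agentv compare works internally: `identityKey`, `scoreDeltas`, and the choice of eval_path, test_id, and target as the join key are all assumptions for illustration.

```typescript
type Row = Record<string, unknown>;

interface ScoreDelta {
  key: string;
  baseline: number;
  candidate: number;
  delta: number;
}

// Assumed join key: the same eval file, test, and target in both runs.
function identityKey(row: Row): string {
  return JSON.stringify([row.eval_path, row.test_id, row.target]);
}

// Hypothetical helper: pair candidate rows with baseline rows and report
// the score change for every pair where both scores are numeric.
function scoreDeltas(baseline: Row[], candidate: Row[]): ScoreDelta[] {
  const baselineScores = new Map<string, number>();
  for (const row of baseline) {
    const score = row.score;
    // If a key repeats (for example several attempts), the last row wins.
    if (typeof score === "number") baselineScores.set(identityKey(row), score);
  }

  const deltas: ScoreDelta[] = [];
  for (const row of candidate) {
    const score = row.score;
    const key = identityKey(row);
    const before = baselineScores.get(key);
    if (typeof score !== "number" || before === undefined) continue;
    deltas.push({ key, baseline: before, candidate: score, delta: score - before });
  }
  return deltas;
}
```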

Generate a shareable report from the same canonical bundle:

agentv results report .agentv/results/<run_id>

Integration Author Examples

An adapter that exports run results should treat index.jsonl as the row catalog:

import { createReadStream } from "node:fs";
import path from "node:path";
import { createInterface } from "node:readline";

// Stream index.jsonl one row at a time so large runs are never loaded whole.
export async function* rows(runDir: string) {
  const rl = createInterface({
    input: createReadStream(path.join(runDir, "index.jsonl"), "utf8"),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  for await (const line of rl) {
    if (!line.trim()) continue; // skip blank lines
    yield JSON.parse(line) as Record<string, unknown>;
  }
}

// Placeholder run directory; substitute a real <run_id>.
const runDir = ".agentv/results/2026-run";

for await (const row of rows(runDir)) {
  // Sidecar paths are relative to the run directory, so join them onto it.
  const gradingPath = row.grading_path;
  if (typeof gradingPath === "string") {
    console.log(path.join(runDir, gradingPath));
  }
}

Adapter guidance:

Preserve unknown row fields when possible.
Prefer path fields such as grading_path, metrics_path, timing_path, transcript_path, and transcript_raw_path over ad hoc path construction.
Use artifact_pointers only for detached payload lookup; do not make pointers the discovery path for ordinary sidecars that are present in the run tree.
If you build a database or search index, store enough source metadata to rebuild it from the run bundle and invalidate it when summary.json or index.jsonl changes.
Keep backend-specific anonymization, upload, and schema mapping in the adapter layer. AgentV’s canonical bundle remains backend-neutral.
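
One way to implement the invalidation guidance above is to store a fingerprint of the two canonical files alongside the derived index. This is a minimal sketch under assumptions: `bundleFingerprint` and `isStale` are hypothetical names, and hashing file contents with SHA-256 is one possible choice (comparing modification time and size would be a cheaper alternative).

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: fingerprint the contents of summary.json and
// index.jsonl. Labels and separators keep the two inputs from blurring
// together when content shifts from one file to the other.
function bundleFingerprint(summaryJson: string, indexJsonl: string): string {
  const hash = createHash("sha256");
  hash.update("summary.json\0");
  hash.update(summaryJson);
  hash.update("\0index.jsonl\0");
  hash.update(indexJsonl);
  return hash.digest("hex");
}

// Hypothetical helper: a derived index is stale when it has no stored
// fingerprint or the stored one no longer matches the bundle.
function isStale(storedFingerprint: string | undefined, current: string): boolean {
  return storedFingerprint !== current;
}
```

An adapter would record the fingerprint when it builds its database or search index, then recompute and compare it before serving queries, rebuilding from the run bundle whenever `isStale` returns true.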