Layer 7 of 9

Evaluation & verification. Metrics turn opinions into evidence.

Prove quality before users feel it.

Automatic metrics (BLEU, ROUGE, perplexity) proxy human judgment until annotators intervene.
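
As a rough illustration of one of these metrics, perplexity is the exponential of the average negative log-likelihood per token. The sketch below is not Deadwood's implementation; the function name `perplexity` and the sample log-probabilities are assumptions made for the example.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical input: four tokens, each assigned probability 0.25 by the model.
uniform = [math.log(0.25)] * 4
ppl = perplexity(uniform)
print(round(ppl, 6))
```

A model that spreads probability evenly over four choices at every step has a perplexity of about 4; lower values mean the model was less surprised by the text.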

Fairness audits stress-test slices — geography, income bands, dialect.
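
One simple way to picture a slice audit is to compute the same metric per group and look at the gap between the best and worst group. This is a minimal sketch under assumed field names (`dialect`, `label`, `prediction`) and made-up records; it does not describe how Deadwood's fairness suite works internally.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Group labeled predictions by one attribute and return accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[slice_key]
        totals[group] += 1
        correct[group] += int(r["prediction"] == r["label"])
    return {g: correct[g] / totals[g] for g in totals}

# Hypothetical records; the field names are assumptions for illustration only.
records = [
    {"dialect": "A", "label": 1, "prediction": 1},
    {"dialect": "A", "label": 0, "prediction": 0},
    {"dialect": "B", "label": 1, "prediction": 0},
    {"dialect": "B", "label": 1, "prediction": 1},
]
per_slice = accuracy_by_slice(records, "dialect")
# The spread between the best and worst slice is one simple disparity signal.
gap = max(per_slice.values()) - min(per_slice.values())
print(per_slice, gap)
```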

What this layer does

Evidence stack

Human evaluation uses Likert scales or pairwise battles.
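
Pairwise battles are often summarized as a win rate per model. The sketch below assumes each judgment is a tuple of two model names plus a winner (or `None` for a tie) and counts a tie as half a win for each side. The data format and the model names are hypothetical.

```python
from collections import Counter

def win_rates(battles):
    """battles: iterable of (model_a, model_b, winner); winner is None for a tie."""
    wins = Counter()
    games = Counter()
    for a, b, winner in battles:
        games[a] += 1
        games[b] += 1
        if winner is None:
            # A tie counts as half a win for each side.
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1
    return {m: wins[m] / games[m] for m in games}

# Hypothetical annotator judgments between two models.
battles = [
    ("base", "finetuned", "finetuned"),
    ("base", "finetuned", "finetuned"),
    ("base", "finetuned", "base"),
    ("base", "finetuned", None),
]
rates = win_rates(battles)
print(rates)
```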

Anchoring writes fingerprints — model hash, dataset hash, metric tuple — to Cosmos/Avalanche relays.
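
A fingerprint of this kind can be sketched with standard hashing: hash the model, hash the dataset, and hash a canonical serialization of all three parts. The payload layout, function names, and metric values below are placeholder assumptions; the real format and the relay write are not shown here.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

def fingerprint(model_bytes, dataset_bytes, metrics):
    """Combine model hash, dataset hash, and metrics into one deterministic digest."""
    payload = {
        "model_hash": sha256_hex(model_bytes),
        "dataset_hash": sha256_hex(dataset_bytes),
        "metrics": metrics,
    }
    # sort_keys and fixed separators keep the serialization stable across runs.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return payload, sha256_hex(canonical.encode("utf-8"))

# Placeholder bytes and metric values, for illustration only.
payload, digest = fingerprint(b"model-weights", b"held-out-set", {"bleu": 0.41, "f1": 0.88})
print(digest)
```

Because the serialization is canonical, anyone holding the same model, dataset, and metrics can recompute the digest and compare it with the anchored value.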

The problem without Deadwood

Without custodianship, your team inherits every sharp edge below.

  • Glue spreadsheets and notebooks ad hoc.
  • Lose audit trails when notebooks diverge.
  • Explain governance without cryptographic receipts.

Typical DIY cost

Timeline: 2–5 weeks per release gate
Budget: $20k–$80k for annotators
Expertise: Risk + ML QA leads

Deadwood's solution

Opinionated APIs wire custodied data, runners, and proofs together — no boilerplate archaeology.

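from deadwood import Evaluator

# `finetuned` and `held_out` are assumed to be the model and held-out test set
# produced by earlier layers.
evaluator = Evaluator(
    model=finetuned,
    test_set=held_out,
    metrics=["bleu", "rouge", "f1", "fairness"],
)

# Run every metric, then print the chain transaction ID attached to the report.
report = evaluator.run()
print(report.chain_tx)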

How Deadwood custodies this layer

Evaluator binds fairness suites to regulatory posture — failing slices block promotion automatically.
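
The gating idea can be pictured as a threshold check over per-slice scores: if any slice falls below the bar, promotion is refused. The threshold, slice names, and scores below are hypothetical, and the helper functions are illustrative rather than part of the Evaluator API.

```python
def failing_slices(slice_scores, threshold):
    """Return the names of slices whose score falls below the threshold."""
    return sorted(name for name, score in slice_scores.items() if score < threshold)

def can_promote(slice_scores, threshold):
    """Promotion is allowed only when no slice fails."""
    return not failing_slices(slice_scores, threshold)

# Hypothetical per-slice scores and an assumed policy threshold of 0.85.
scores = {"dialect:A": 0.91, "dialect:B": 0.78, "income:low": 0.88}
blocked_by = failing_slices(scores, threshold=0.85)
promote = can_promote(scores, threshold=0.85)
print(promote, blocked_by)
```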

Chain TX IDs ride beside dashboards so external auditors reproduce claims.

Next steps

Continue the tour

Follow how custody chains into Deployment & monitoring.


Run a workload

Provision runners and metered jobs — describe the outcome, not every knob.


Talk to custodians

White-glove onboarding for regulated teams and bespoke stacks.
