Layer 4 of 9

Training algorithm. Optimization sculpts noise into skill.

Gradients are fragile currency.

Training minimizes a loss function: cross-entropy for language modeling, contrastive objectives for embeddings.
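To make the language-modeling case concrete, here is a minimal sketch of cross-entropy for a single next-token prediction. The four-token vocabulary and the logit values are invented for illustration, and the helper `cross_entropy` is a hypothetical name, not part of any Deadwood API.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical 4-token vocabulary; the correct next token is index 2.
logits = [1.0, 0.5, 3.0, -1.0]
loss = cross_entropy(logits, target_index=2)

# A model with no preference (all logits equal) pays log(vocab_size).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

print(f"confident and correct: {loss:.4f}")
print(f"uniform guess:         {uniform_loss:.4f}")
```

The loss falls as the model puts more probability on the correct token, which is the quantity training pushes down.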

Optimizers like AdamW accumulate momentum and adaptive per-parameter scales; learning-rate schedules typically pair a warmup phase with cosine decay.
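A rough sketch of both pieces, in plain Python on a single scalar parameter. The function names, default hyperparameters, and step counts are illustrative assumptions, not Deadwood settings; production optimizers apply the same arithmetic per tensor.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

def adamw_step(param, grad, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad           # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad * grad    # adaptive scale (second moment)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * weight_decay * param    # decoupled weight decay
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# One update at the very start of warmup, where the learning rate is tiny.
param, m, v = 1.0, 0.0, 0.0
param, m, v = adamw_step(param, grad=0.5, m=m, v=v, t=1, lr=lr_at_step(0))
```

Because the update is normalized by the running gradient scale, the first step moves the parameter by roughly the learning rate regardless of gradient magnitude, which is one reason warmup matters.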

What this layer does

Mechanics

Backpropagation propagates gradients backward through the computation graph; mixed precision accepts some numerical noise in exchange for throughput.
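The numerical noise is easy to see in a toy setting. The sketch below emulates bfloat16 rounding in software with a hypothetical helper, `to_bf16` (it ignores NaN and infinity), and shows a small update vanishing when weights are stored only in bf16. This is why mixed-precision setups commonly keep a higher-precision copy of the weights.

```python
import struct

def to_bf16(x):
    """Round a float to the nearest bfloat16 value, returned as a Python float.

    Hypothetical helper for illustration; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that bf16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

update = 0.001
weight_bf16 = to_bf16(1.0)
weight_full = 1.0  # Python floats are 64-bit; stands in for a high-precision copy

for _ in range(100):
    # Near 1.0, bf16 values are spaced 2**-7 apart, so +0.001 rounds away.
    weight_bf16 = to_bf16(weight_bf16 + update)
    weight_full = weight_full + update

print("bf16-only weight:     ", weight_bf16)
print("high-precision weight:", round(weight_full, 6))
```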

Gradient accumulation simulates huge batches when VRAM is tight.
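A minimal sketch of why this works, using a one-parameter linear model with made-up data. Summing micro-batch gradients, each scaled by the number of accumulation steps, reproduces the full-batch gradient, assuming equal-sized micro-batches and a mean-reduced loss.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Invented toy data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# What we would compute if the whole batch fit in memory.
full_batch_grad = mse_grad(w, xs, ys)

# What we compute instead: four micro-batches of two examples each.
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size
accumulated = 0.0
for i in range(accum_steps):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    # Scale by 1/accum_steps so the sum of micro-batch means equals the full mean.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

# The optimizer step happens only here, once per accumulation cycle.
print(full_batch_grad, accumulated)
```

The cost is wall-clock time: each optimizer step now takes several forward and backward passes instead of one.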

The problem without Deadwood

Without custodianship, your team inherits every sharp edge below.

  • Author bespoke PyTorch loops per experiment.
  • Tune LR finder plots manually.
  • Babysit NCCL topology failures nightly.

Typical DIY cost

Timeline: multiple multi-week burn-ins
Budget: $100k–$600k GPU
Expertise: Distributed training specialists

Deadwood's solution

Opinionated APIs wire custodied data, runners, and proofs together — no boilerplate archaeology.

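from deadwood import Trainer

# Assumes `model` and `clean_data` are already defined.
trainer = Trainer(
    model=model,
    dataset=clean_data,
    optimizer="adamw",
    precision="bf16",  # bfloat16 precision
)

# Train for three epochs, evaluating every 2,500 steps.
trainer.fit(epochs=3, eval_every="2500steps")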

How Deadwood custodies this layer

Trainer pipelines inherit manifests from DataCustodian — no rogue shards slip into gradients.

Automatic checkpoint diffing pairs metrics with on-chain attestations when policies demand it.

Next steps

Continue the tour

Follow how custody chains into the Inference engine.

Next: Inference engine

Run a workload

Provision runners and metered jobs — describe the outcome, not every knob.

Start a job

Talk to custodians

White-glove onboarding for regulated teams and bespoke stacks.

Schedule a demo