Layer 4 of 9

Training algorithm. Optimization sculpts noise into skill.

Gradients are fragile currency.

Training minimizes a loss functional — cross-entropy for language modeling, contrastive objectives for embeddings.
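To make the language-modeling case concrete, here is a minimal sketch of cross-entropy for a single next-token prediction, in plain Python. The function name and the four-token toy vocabulary are illustrative assumptions, not part of any Deadwood API.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one prediction: -log softmax(logits)[target_index].

    Computed as logsumexp(logits) - logits[target_index]; subtracting the
    max logit first keeps the exponentials from overflowing.
    """
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]

# Toy vocabulary of 4 tokens; the correct next token is index 2.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)     # model has no preference
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], 2)  # model favors the right token

print(f"uniform: {uniform_loss:.4f}  confident: {confident_loss:.6f}")
```

With uniform logits the loss equals log(4), the cost of guessing blindly among four tokens; training pushes it toward zero by raising the logit of the correct token.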

Optimizers like AdamW accumulate momentum and adaptive per-parameter scales; learning-rate schedules typically pair a warmup phase with cosine decay.
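A rough sketch of both pieces, written in plain Python for a single scalar parameter. The function names, default hyperparameters (peak rate, warmup length, betas, weight decay), and the linear shape of the warmup are assumptions chosen for illustration; real schedules and optimizer settings vary by model.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine

def adamw_step(param, grad, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum (first moment)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # adaptive scale (second moment)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * weight_decay * param            # decoupled weight decay
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical usage: one step with weight decay switched off.
new_param, m1, v1 = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
print(lr_at(0), lr_at(999), lr_at(5500), lr_at(10000), new_param)
```

On the first step the bias-corrected update is roughly `lr * sign(grad)`, which is why Adam-family optimizers are sensitive to the learning rate early on and why a warmup is commonly used.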


What this layer does

Mechanics

Backpropagation walks gradients backward through the graph; mixed precision accepts some numerical noise in exchange for throughput.
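Both ideas fit in a small standard-library sketch. The first part applies the chain rule by hand to a two-weight toy model and checks it against finite differences. The second part emulates low-precision storage: fp16 via `struct`'s half-precision format, and bf16 by truncating a float32 to its top 16 bits (an approximation, since hardware usually rounds rather than truncates). All names and values are illustrative assumptions.

```python
import math
import struct

def forward_backward(w1, w2, x, target):
    """Toy model y = w2 * tanh(w1 * x) with squared-error loss."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    # Backward pass: walk the graph in reverse, one chain-rule factor per node.
    dy = 2.0 * (y - target)
    dw2 = dy * h
    dh = dy * w2
    dw1 = dh * (1.0 - h * h) * x   # derivative of tanh is 1 - tanh^2
    return loss, dw1, dw2

def to_fp16(value):
    """Round-trip through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_bf16(value):
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Check the hand-written gradient against a central finite difference.
w1, w2, x, target, h_step = 0.3, -1.2, 0.7, 0.5, 1e-6
_, dw1, dw2 = forward_backward(w1, w2, x, target)
numeric_dw1 = (forward_backward(w1 + h_step, w2, x, target)[0]
               - forward_backward(w1 - h_step, w2, x, target)[0]) / (2 * h_step)

# A tiny gradient underflows to zero in fp16 unless the loss is scaled first.
tiny_grad = 1e-8
lost = to_fp16(tiny_grad)
recovered = to_fp16(tiny_grad * 1024.0) / 1024.0   # scale, store, unscale

# bf16 keeps the float32 exponent range but only about 8 bits of mantissa.
coarse = to_bf16(1.001)
print(dw1, numeric_dw1, lost, recovered, to_bf16(tiny_grad), coarse)
```

The trade-off shows up in the last lines: fp16 loses small gradients outright unless they are scaled, while bf16 keeps them but rounds away fine detail near 1.0.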

Gradient accumulation simulates huge batches when VRAM is tight.
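A minimal sketch of why this works, using a one-parameter linear model in plain Python. The data, the micro-batch size, and the helper name are made up for illustration. The key assumption is that micro-batches are equal in size and the loss is a mean, so averaging the micro-batch gradients reproduces the full-batch gradient.

```python
import math

def mse_grad(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

# What we would compute if the whole batch fit in memory at once.
full_grad = mse_grad(w, data)

# Accumulation: process small micro-batches, sum their scaled gradients,
# and only then take a single optimizer step.
micro_batch_size = 2
micro_batches = [data[i:i + micro_batch_size]
                 for i in range(0, len(data), micro_batch_size)]
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += mse_grad(w, micro_batch) / len(micro_batches)

print(full_grad, accumulated)
```

Only one micro-batch needs to be resident at a time, so peak memory follows the micro-batch size while the update matches the large batch, at the cost of more forward and backward passes per step.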

The problem without Deadwood

Without custodianship, your team inherits every sharp edge below.

Author bespoke PyTorch loops per experiment.
Read LR-finder plots and tune learning rates by hand.
Babysit NCCL topology failures nightly.

Typical DIY cost

Timeline: multiple multi-week burn-ins
Budget: $100k–$600k in GPU spend
Expertise: Distributed training specialists

Deadwood's solution

Opinionated APIs wire custodied data, runners, and proofs together — no boilerplate archaeology.

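from deadwood import Trainer

# `model` and `clean_data` are assumed to be defined earlier in your pipeline.
trainer = Trainer(
    model=model,
    dataset=clean_data,
    optimizer="adamw",   # AdamW, as described above
    precision="bf16",    # bfloat16 mixed precision
)

# Train for three epochs, evaluating every 2,500 steps.
trainer.fit(epochs=3, eval_every="2500steps")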

How Deadwood custodies this layer

Trainer pipelines inherit manifests from DataCustodian — no rogue shards slip into gradients.

Automatic checkpoint diffing pairs metrics with on-chain attestations when policies demand it.

Next steps

Continue the tour

Follow how custody chains into the Inference engine layer.

Next: Inference engine

Run a workload

Provision runners and metered jobs — describe the outcome, not every knob.

Start a job

Talk to custodians

White-glove onboarding for regulated teams and bespoke stacks.

Schedule a demo

← Previous: Weights