Layer 4 of 9
Gradients are fragile currency.
Training minimizes a loss functional — cross-entropy for language modeling, contrastive objectives for embeddings.
Optimizers like AdamW accumulate momentum and adaptive scales; learning rates schedule warmup plus cosine decay.
Backpropagation walks gradients backward through the graph; mixed precision trades numerical noise for throughput.
Gradient accumulation simulates huge batches when VRAM is tight.
Without custodianship, your team inherits every sharp edge below.
Typical DIY cost
Opinionated APIs wire custodied data, runners, and proofs together — no boilerplate archaeology.
from deadwood import Trainer
trainer = Trainer(
model=model,
dataset=clean_data,
optimizer="adamw",
precision="bf16",
)
trainer.fit(epochs=3, eval_every="2500steps")Trainer pipelines inherit manifests from DataCustodian — no rogue shards slip into gradients.
Automatic checkpoint diffing pairs metrics with on-chain attestations when policies demand it.