Layer 1 of 9
Garbage in, garbage out — except faster.
Models never see the internet raw. They see tokens sliced from text that humans have filtered, deduplicated, de‑biased, and normalized.
This layer decides whether your downstream billion‑parameter model memorizes noise or generalizes. Deadwood treats datasets like regulated infrastructure.
Raw corpora include duplicates, toxic snippets, PII, and contradictory facts. Training on them blindly wastes compute and creates liabilities.
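A minimal sketch of two of the cleaning steps named above, exact deduplication and PII redaction, using only the Python standard library. The `EMAIL_RE` pattern and the `normalize` and `clean_rows` helpers are hypothetical names for illustration, not part of Deadwood's API, and the pattern catches simple email addresses only.

```python
import hashlib
import re
import unicodedata

# Hypothetical pattern: matches simple email addresses only, not all PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())

def clean_rows(rows):
    """Redact emails, then drop rows whose normalized form was already seen."""
    seen = set()
    kept = []
    for row in rows:
        redacted = EMAIL_RE.sub("[EMAIL]", row)
        digest = hashlib.sha256(normalize(redacted).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(redacted)
    return kept

rows = [
    "Contact alice@example.com for access.",
    "contact  ALICE@example.com for access.",
    "Quarterly revenue rose 4%.",
]
cleaned = clean_rows(rows)
print(cleaned)
```

The second row differs from the first only in case and spacing, so it hashes to the same digest after normalization and is dropped. Near-duplicate detection (for example MinHash) would need more than this exact-match approach.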
Curation trades brute scale for signal density: a million disciplined rows can outperform a hundred million noisy ones.
Tokenization maps language into discrete IDs your architecture consumes; tokenizer choice changes sequence lengths and vocabulary coverage.
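To make the sequence-length and coverage trade-off concrete, here is a toy comparison of word-level and character-level tokenization. It is a sketch under simplifying assumptions, not a subword tokenizer such as SentencePiece; `build_vocab` and `encode` are hypothetical helpers.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in first-seen order."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, unk_id=-1):
    """Map tokens to IDs; tokens missing from the vocabulary get unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]

text = "low lower lowest"
word_tokens = text.split()  # word-level split on whitespace
char_tokens = list(text)    # character-level split, spaces included

word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)

word_ids = encode(word_tokens, word_vocab)
char_ids = encode(char_tokens, char_vocab)

# An unseen word falls outside the word vocabulary,
# but its characters are all covered by the character vocabulary.
unseen = "slow"
word_unseen_ids = encode([unseen], word_vocab)
char_unseen_ids = encode(list(unseen), char_vocab)

print(len(word_ids), len(char_ids))
print(word_unseen_ids, char_unseen_ids)
```

The same text yields a short sequence with poor coverage at word level and a long sequence with full coverage at character level. Subword tokenizers sit between these two extremes.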
Financial telemetry, licensed documentation, GitHub corpora, Wikipedia extracts — each needs bespoke consent and preprocessing pipelines.
Without custodianship, your team inherits every sharp edge below.
Typical DIY cost
Opinionated APIs wire custodied data, runners, and proofs together — no boilerplate archaeology.
from deadwood import DataCustodian

custodian = DataCustodian(
    source="your_raw_data.csv",
    cleaning=True,
    curation_quality="high",
    tokenizer="sentencepiece",
)
clean_data = custodian.prepare()
# dedupe · normalize · token-ready manifests

Deadwood custodies ingestion contracts: schemas, consent tags, and reproducible manifests ride beside every batch.
Cleaning recipes stay versioned like infrastructure-as-code. When auditors ask what trained this model, you point to an anchored manifest hash.
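One way a manifest hash could be computed, sketched with the standard library: serialize the manifest to canonical JSON, then take a SHA-256 digest. The manifest fields and the `manifest_hash` helper are assumptions for illustration; the source does not specify Deadwood's actual manifest format or hashing scheme.

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical manifest fields, chosen for illustration only.
manifest = {
    "source": "your_raw_data.csv",
    "recipe_version": "1.0.0",
    "consent_tags": ["licensed"],
    "row_count": 2,
}

# Key order does not affect the hash, because keys are sorted before hashing.
reordered = dict(reversed(list(manifest.items())))
h1 = manifest_hash(manifest)
h2 = manifest_hash(reordered)

# Any change to the recipe version produces a different hash.
changed = manifest_hash({**manifest, "recipe_version": "1.0.1"})

print(h1 == h2, h1 != changed)
```

Sorting keys and fixing separators keeps the digest stable across machines and runs, which is what lets a single hash stand in for the whole manifest.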
| Approach | Throughput | Cost |
|---|---|---|
| Manual QC throughput | baseline | linear headcount |
| Deadwood custodied ingest | 12× batches/week | fractional ops time |
Figures are illustrative; they were measured against synthetic enterprise uploads.