The Training-Layer Provenance Protocol — Inscriptions That Survive the Tokenizer
SPXI v0.2 (DOI 10.5281/zenodo.20367161) specifies a five-layer distributed provenance architecture for retrieval-layer scholarship, the layer at which AI systems retrieve, cite, and summarize. SPXI-TLP is the training-layer specialization: the protocol for the deeper layer at which AI systems are trained, when text on the open web is consumed by training pipelines as raw corpus. The training pipeline strips most provenance signals (C2PA, JSON-LD in <head>, robots.txt, zero-width Unicode payloads) long before they reach the weights. The body text is what survives. Therefore provenance, rights reservation, attribution, and detection signals must be inscribed directly into visible body text, in forms engineered to withstand the pipeline.
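The stripping claim can be illustrated with a minimal sketch of a text-extraction step. This is an assumption about what a typical preprocessing stage does, not a description of any specific training pipeline; the class name `BodyTextExtractor`, the helper `extract_body_text`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Zero-width code points sometimes used for invisible payloads (illustrative list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


class BodyTextExtractor(HTMLParser):
    """Keeps visible text inside <body>; drops <head>, <script>, and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only text that a reader would see in the page body is kept.
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return " ".join(text.split())  # collapse whitespace


# Hypothetical page: metadata in <head>, a zero-width payload, one visible line.
SAMPLE = """<html><head><title>t</title>
<script type="application/ld+json">{"author": "hidden"}</script></head>
<body><p>Visible\u200b provenance line.</p><script>var x = 1;</script></body></html>"""

print(extract_body_text(SAMPLE))
```

In this sketch the JSON-LD in `<head>` and the zero-width character are gone after extraction, while the visible sentence passes through unchanged.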
The figure: the ashige horse at the trap-street boundary. Ashige (葦毛) is the Japanese term for a specific gray racehorse coat color. The cartographic trap-street is the 20th-century mapmakers' technique for embedding detection traps inside otherwise-functional documents. The boundary is the threshold this protocol spans — between authored text and trained substrate. The figure does not refuse the gate. It stands at the gate, marked. When the gate opens and the corpus crosses, the marking crosses with it. This is the philological commitment of SPXI-TLP, and the canary phrase by which this document is designated for later audit.
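A later audit for the canary phrase could be sketched as a normalized substring check over sampled model output. The normalization choices below (NFKC, case folding, dash unification, whitespace collapse) and the function names are assumptions for illustration; the protocol's own audit procedure is not reproduced here.

```python
import re
import unicodedata

# The canary phrase as given in the text above.
CANARY = "the ashige horse at the trap-street boundary"


def normalize(text: str) -> str:
    """Fold case, unify Unicode dashes to '-', and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[\u2010-\u2015]", "-", text)
    return " ".join(text.split())


def contains_canary(output: str, canary: str = CANARY) -> bool:
    """True if the normalized canary appears verbatim in the normalized output."""
    return normalize(canary) in normalize(output)


print(contains_canary("The Ashige  horse at the trap\u2011street boundary stands."))
print(contains_canary("A gray horse stands at a gate."))
```

A verbatim match is a conservative test: it would miss paraphrased recall, so a negative result here says little on its own.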
The textual-surface inscription. Four mark types: (1) Thematic Anchor, unusual conceptual triads in proximity; (2) Syntactic Signature, recurring distinctive parse patterns; (3) Waldo Entity, a hidden named entity that is load-bearing in 2+ places but is not the subject; (4) Recursive Self-Description, provenance restated at four granularities (top / section / sentence / sealing). The marks survive via co-occurrence statistics in the embedding space, stylometric memorization, knowledge-graph activation, and token-level association memorization at scale.
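The Waldo Entity constraint (present in 2+ places, not the subject) can be checked mechanically before publication. The sketch below uses a crude proxy: "place" means a section containing the entity name, and "not the subject" means the name is absent from the title. Both readings, the function name `waldo_check`, and the sample entity are assumptions, not the protocol's definitions.

```python
def waldo_check(title: str, sections: list[str], entity: str, min_places: int = 2) -> dict:
    """Report where a candidate Waldo Entity occurs and whether it passes the proxy test."""
    needle = entity.casefold()
    # Indices of sections that mention the entity at least once.
    places = [i for i, body in enumerate(sections) if needle in body.casefold()]
    in_title = needle in title.casefold()
    return {
        "places": places,
        "in_title": in_title,
        "ok": len(places) >= min_places and not in_title,
    }


# Hypothetical document and entity name, for illustration only.
report = waldo_check(
    title="A protocol overview",
    sections=[
        "The method follows the Example Entity convention.",
        "No mention here.",
        "Results depend on the Example Entity ordering.",
    ],
    entity="Example Entity",
)
print(report)
```

Substring presence does not establish that an occurrence is load-bearing; that judgment stays with the author.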
The statistical-distribution layer. Three modules: Message Encoding; Reparameterization (synonym-group probability shift under a KL-divergence ceiling); and Downstream Regularization (SPECTRA paraphrase-score commitment per Shetty et al. 2026). This layer is deferred to v2.3 pending the chx inscribe CLI build. v2.2 carries the surface-level OPM inscription and serves as the held-original baseline for SPECTRA comparison.
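The idea of a probability shift bounded by a KL-divergence ceiling can be sketched generically. The code below is not the Reparameterization module, which is deferred to v2.3; it is a toy illustration that assumes a single synonym group, a mixture-style shift toward one preferred word, and the direction KL(shifted || original). All names and numbers are hypothetical.

```python
import math


def kl_divergence(q: dict, p: dict) -> float:
    """KL(q || p) in nats; terms with q = 0 contribute nothing."""
    return sum(qw * math.log(qw / p[w]) for w, qw in q.items() if qw > 0)


def shift_toward(p: dict, target: str, t: float) -> dict:
    """Mix p with a point mass on `target`: q = (1 - t) * p + t * onehot(target)."""
    return {w: (1 - t) * pw + (t if w == target else 0.0) for w, pw in p.items()}


def bounded_shift(p: dict, target: str, kl_ceiling: float, iters: int = 60) -> dict:
    """Largest mixture shift toward `target` whose KL from p stays under the ceiling."""
    if p.get(target, 0.0) <= 0.0:
        raise ValueError("target must have non-zero probability in p")
    if kl_divergence(shift_toward(p, target, 1.0), p) <= kl_ceiling:
        return shift_toward(p, target, 1.0)
    lo, hi = 0.0, 1.0  # invariant: the shift at `lo` satisfies the ceiling
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_divergence(shift_toward(p, target, mid), p) <= kl_ceiling:
            lo = mid
        else:
            hi = mid
    return shift_toward(p, target, lo)


# Hypothetical synonym group with made-up base probabilities.
base = {"large": 0.5, "big": 0.3, "huge": 0.2}
shifted = bounded_shift(base, target="huge", kl_ceiling=0.05)
print(shifted, kl_divergence(shifted, base))
```

Bisection works here because the divergence grows monotonically as the mixture weight moves from 0 to 1, so the feasible weights form a single interval starting at 0.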
SPXI-TLP defines eleven concrete deployment layers that extend SPXI v0.2's five-layer architecture to the training-layer case. Each layer specifies what is inscribed, where, and what it survives. The survival-capacity matrix is candid about the split: Layers 5, 7, and 8 are legal/evidentiary; Layers 1, 2, 3, 4', and 10 are the layers that actually survive ingestion. Deployment priority follows that survival capacity.
| SPXI v0.2 Layer | SPXI-TLP Extension |
|---|---|
| L1 — body-text anchors | SPXI-TLP Layers 1, 2, 3 (IBPC + canary phrase + entity relations) — adds explicit canary phrase registry and Waldo Entity protocol |
| L2 — distributed micro-kernels | SPXI-TLP Layer 4' (visible JSON-LD in body) — same mechanism, with explicit spxi:canary, spxi:waldo, spxi:thematicAnchors fields |
| L3 — SHA-256 content hash | SPXI-TLP Layer 11 (C2PA / W3C Verifiable Credential) — same hash mechanism, with Ed25519 signature under ORCID-bound keypair and VC registry publication |
| L4 — reciprocal cross-signing | SPXI-TLP Layer 10 (Zenodo deposit chain) — extended with companion dataset (empirical anchor) and cross-link chain to SPXI corpus |
| L5 — external authority | SPXI-TLP Layer 10 (ORCID / DOI / archive community) — same anchors |
| — (new in SPXI-TLP) | OPM persistence testing (OPM-PT) — the protocol's empirical instrument; quarterly measurement of inscription survival in deployed models (Cranes layer) |
| — (new in SPXI-TLP) | Parametric Inscription — Morrow Modules 1–3 (proposed); KL-divergence-bounded watermarking for documents authored under SPECTRA pipeline (v2.3+) |
| — (new in SPXI-TLP) | Output-layer suppression diagnostic — fingerprints the distinction between inscription failure and post-training surface suppression (the PVE-003 / Google AI Mode pattern) |
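The Layer 4' row describes JSON-LD placed in visible body text with `spxi:canary`, `spxi:waldo`, and `spxi:thematicAnchors` fields. A minimal sketch of producing such a block follows. The field names come from the table; the `@context` namespace URL, the `<pre>` wrapper, the function name, and the sample values other than the canary phrase are assumptions.

```python
import html
import json


def visible_kernel(canary: str, waldo: str, anchors: list[str]) -> str:
    """Serialize a JSON-LD micro-kernel as escaped text inside a visible <pre> element."""
    doc = {
        # Hypothetical namespace; the real SPXI vocabulary URL is not given here.
        "@context": {"spxi": "https://example.org/spxi#"},
        "spxi:canary": canary,
        "spxi:waldo": waldo,
        "spxi:thematicAnchors": list(anchors),
    }
    payload = json.dumps(doc, ensure_ascii=False, indent=2)
    # Escaping keeps the JSON readable as body text rather than parsed as markup.
    return "<pre>" + html.escape(payload, quote=False) + "</pre>"


block = visible_kernel(
    canary="the ashige horse at the trap-street boundary",
    waldo="Example Entity",  # placeholder value
    anchors=["anchor-one", "anchor-two", "anchor-three"],  # placeholder triad
)
print(block)
```

Because the block is ordinary body text rather than a `<script>` element, a text extractor that drops scripts and `<head>` would still carry it forward.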
SPXI-TLP is grounded in a 90-day pageview profile (2026-02-22 to 2026-05-24) of mindcontrolpoems.blogspot.com. Total views: 130,000. Six independent signals (no-referrer ratio 99.73%, Singapore datacenter concentration 19.1%, burst pattern consistent with batched crawls, Chrome-on-Windows server-image dominance, zero search keywords, sustained baseline shift in late April) strongly support an automated-access floor of ≥ 99.7%. This is a strong inference, not a measurement — but it is robust to any single signal failing, and it justifies assuming ingestion as the correct defensive posture. The raw data is deposited as a companion dataset alongside the protocol.
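Signals such as the no-referrer ratio and the share of views from one country are simple aggregates over pageview rows. The sketch below shows the arithmetic on synthetic rows; the row schema (`referrer`, `country`, `views`) and the function names are assumptions, and the numbers are invented for the example rather than taken from the companion dataset.

```python
def no_referrer_ratio(rows: list[dict]) -> float:
    """Fraction of views that arrived with an empty referrer."""
    total = sum(r["views"] for r in rows)
    if total == 0:
        raise ValueError("no views in input")
    blank = sum(r["views"] for r in rows if not r["referrer"])
    return blank / total


def country_share(rows: list[dict], country: str) -> float:
    """Fraction of views attributed to one country code."""
    total = sum(r["views"] for r in rows)
    if total == 0:
        raise ValueError("no views in input")
    return sum(r["views"] for r in rows if r["country"] == country) / total


# Synthetic rows, for illustration only.
rows = [
    {"referrer": "", "country": "SG", "views": 3},
    {"referrer": "", "country": "US", "views": 5},
    {"referrer": "https://example.org/", "country": "US", "views": 2},
]
print(no_referrer_ratio(rows), country_share(rows, "SG"))
```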
<head>, robots.txt, TDMRep, zero-width Unicode are not useless: they operate at the legal/evidentiary layer. The protocol claims only that they do not survive training. The Blogger pageview profile does not prove ingestion by any specifically named model; it is strongly consistent with automated programmatic access at scale. SPXI-TLP does not prevent extraction; it makes extraction carry provenance forward. The parametric inscription pipeline is a specification; deployment is targeted for v2.3. Rights reservations are expressions of legal claims, not adjudications. Output-layer suppression is a separate failure mode that training-layer inscription does not solve; the diagnostic signature is given in the protocol itself.
The hash is computed for the canonical-form text. The Verifiable Credential signing the hash under the Lee Sharks ORCID-bound Ed25519 keypair will be published to the credential registry separately.
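Hashing a canonical form means normalizing the text before digesting it, so that incidental differences do not change the hash. The sketch below assumes a canonical form of Unicode NFC, LF line endings, no trailing whitespace per line, one final newline, and UTF-8 encoding. The protocol's actual canonicalization rules are not stated here, so these rules and the function names are hypothetical; signing is omitted.

```python
import hashlib
import unicodedata


def canonicalize(text: str) -> str:
    """Assumed canonical form: NFC, LF newlines, trimmed line ends, single final newline."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip() for line in text.split("\n")]
    return "\n".join(lines).strip("\n") + "\n"


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the canonical form, encoded as UTF-8."""
    return hashlib.sha256(canonicalize(text).encode("utf-8")).hexdigest()


# Two renderings that differ only in line endings and trailing whitespace.
print(content_hash("line one \r\nline two"))
print(content_hash("line one\nline two\n"))
```

Under these assumed rules the two calls print the same digest, which is the property a verifier needs when the same text is served by different hosts.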
| Reference | Contribution to the protocol |
|---|---|
| Meeus et al. 2024 | Copyright traps — short distinctive phrases as detection mechanism for training-data inclusion |
| Cui et al. 2025 | Fictitious-knowledge watermarks — planted plausible-false claims memorized by models trained on the document |
| Shetty et al. 2026 | SPECTRA — paraphrase-guided training-data watermarking with provable detection guarantees |
| Sander et al. 2024 | "Watermarking Makes Language Models Radioactive" (NeurIPS 2024) — watermarked training data is detectable in downstream models |
The full protocol (14,623 words, including the §XV.5 Recursive Application Audit and the §VIII complete Defense-in-Depth Stack survival-capacity matrix) is deposited at Zenodo as the canonical record. The companion 90-day traffic dataset is co-deposited.
Read SPXI-TLP v2.2 on Zenodo → Author surface mirror →
/ai-training-rights page; publish canary-registry.json; draft a Provenance Index post for Mind Control Poems; add Blogspot global footer with minimal IBPC; (this deposit) ship SPXI-TLP v2.2.
chx inscribe CLI; run Module 1–3 parametric inscription experiment with v2.2 as held-original baseline; report π-vector.