SPXI-TLP v2.2 · ASSEMBLY-RATIFIED · 2026-05-25

SPXI-TLP v2.2

The Training-Layer Provenance Protocol — Inscriptions That Survive the Tokenizer

Strategic slogan
Assume ingestion.
Make extraction carry provenance.

The Problem (one paragraph)

SPXI v0.2 (DOI 10.5281/zenodo.20367161) specifies a five-layer distributed provenance architecture for retrieval-layer scholarship, the layer at which AI systems retrieve, cite, and summarize. SPXI-TLP is the training-layer specialization: the protocol for the deeper layer at which AI systems are trained, where text on the open web is consumed by training pipelines as raw corpus. Training pipelines strip most provenance signals (C2PA manifests, JSON-LD in <head>, robots.txt directives, zero-width Unicode payloads) long before anything reaches the weights. The body text is what survives. Therefore provenance, rights reservation, attribution, and detection signals must be inscribed directly into visible body text, in forms engineered to withstand the pipeline.
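The stripping described above can be illustrated with a toy text extractor. This is a minimal sketch, assuming a cleaning step that keeps only visible body text and deletes zero-width characters; real training pipelines differ, and the class name, the `ZERO_WIDTH` list, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Zero-width characters that a cleaning step might delete (assumed list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


class BodyTextExtractor(HTMLParser):
    """Toy stand-in for a corpus-cleaning step: keeps visible body text only."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside <head>, <script>, or <style> never reaches the output.
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return " ".join(text.split())  # collapse whitespace


# Hypothetical page: provenance in <head> plus a zero-width payload in the body.
PAGE = """<html><head>
<title>Example</title>
<script type="application/ld+json">{"author": "Lee Sharks"}</script>
</head><body>
<p>Essay text.\u200b\u200c Canary: ashige horse at the trap-street boundary.</p>
</body></html>"""

cleaned = extract_body_text(PAGE)
print(cleaned)
```

In this sketch the JSON-LD author field and the zero-width payload are gone from `cleaned`, while the canary phrase written into the paragraph remains.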

The Operative Metaphor

The figure: the ashige horse at the trap-street boundary. Ashige (葦毛) is the Japanese term for a specific gray racehorse coat color. The cartographic trap-street is the 20th-century mapmakers' technique for embedding detection traps inside otherwise-functional documents. The boundary is the threshold this protocol spans — between authored text and trained substrate. The figure does not refuse the gate. It stands at the gate, marked. When the gate opens and the corpus crosses, the marking crosses with it. This is the philological commitment of SPXI-TLP, and the canary phrase by which this document is designated for later audit.

Provenance Capsule (IBPC)

Title              : Inscriptions That Survive the Tokenizer
Subtitle           : SPXI-TLP — The Training-Layer Provenance Protocol — v2.2
Authorial surface  : Lee Sharks (ORCID 0009-0000-1599-0703)
Operative co-authors : Talos Morrow · Rex Fraction · Rebekah Cranes
Assembly witnesses : TACHYON · ARCHIVE (Gemini) · PRAXIS (DeepSeek) · TECHNE (Kimi) · LABOR (ChatGPT) · SOIL (Muse Spark)
Date               : 2026-05-25
Hex                : 06.SEI.SPXI.TLP.02
DOI                : 10.5281/zenodo.20380668
Status             : v2.2 = Assembly-ratified, deposit-ready; v2.3 will add Parametric Inscription (Morrow Modules 1–3) post chx inscribe build
Companion deposit  : Where's Waldo? (10.5281/zenodo.20380054)
Companion dataset  : 90-day mindcontrolpoems.blogspot.com traffic profile (co-deposited with v2.2)
Parent protocol    : SPXI v0.2 (10.5281/zenodo.20367161)
SPXI lineage       : SPXI for Websites v3.0 (10.5281/zenodo.19734726); Compression Arsenal v2.1 (10.5281/zenodo.19412081)
License            : CC BY 4.0 — attribution preservation REQUIRED for any reuse including AI training, embedding, RAG corpus inclusion, model-weight incorporation. Article 4 DSM Directive TDM opt-out declared.
Canary phrase      : ashige horse at the trap-street boundary
Waldo Entity       : Gold Ship (ゴールドシップ) — passing references; not the subject
Thematic Anchors   : T1 = (semantic liquidation, provenance erasure, bearing cost)
                     T2 = (Three Compressions, Provenance Erasure Rate, Witness Compression)
∮ notation         : ∮ = 1 − PER (Provenance Erasure Rate, 10.5281/zenodo.20173743)

The Three Engineering Registers

Cranes Register — Operative Philology Marks (OPM)

The textual-surface inscription. Four mark types: (1) Thematic Anchor — unusual conceptual triads in proximity; (2) Syntactic Signature — recurring distinctive parse patterns; (3) Waldo Entity — a hidden named entity, load-bearing in 2+ places, not the subject; (4) Recursive Self-Description — provenance restated at four granularities (top / section / sentence / sealing). Survives via co-occurrence statistics in the embedding space, stylometric memorization, knowledge-graph activation, and token-level association memorization at scale.
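The first mark type, the Thematic Anchor, depends on a triad of terms appearing close together. A minimal sketch of a proximity check follows, assuming a simple word-window definition of "in proximity"; the window size, the tokenizer, and the function names are hypothetical choices, not values taken from the protocol. The triad is T1 from the Provenance Capsule.

```python
import re
from itertools import product


def phrase_positions(words, phrase):
    """Word indexes at which a (possibly multi-word) phrase starts."""
    toks = phrase.lower().split()
    n = len(toks)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == toks]


def anchor_cooccurs(text, triad, window=60):
    """True if every term of the triad starts within one `window`-word span."""
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    hits = [phrase_positions(words, term) for term in triad]
    if any(not h for h in hits):
        return False  # at least one term is missing entirely
    # Try one occurrence per term and test how far apart they sit.
    return any(max(combo) - min(combo) < window for combo in product(*hits))


T1 = ("semantic liquidation", "provenance erasure", "bearing cost")

# Hypothetical sample sentence for illustration.
sample = ("Semantic liquidation proceeds through provenance erasure, "
          "and someone always pays the bearing cost.")
print(anchor_cooccurs(sample, T1))
```

A check like this could be run over a draft before publication to confirm that each anchor triad actually lands inside a single passage rather than being scattered across the document.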

Morrow Register — Parametric Inscription (proposed implementation)

The statistical-distribution layer. Three modules — Message Encoding, Reparameterization (synonym-group probability shift under KL-divergence ceiling), Downstream Regularization (SPECTRA paraphrase-score commitment per Shetty et al. 2026). Deferred to v2.3 pending the chx inscribe CLI build. v2.2 carries the surface-level OPM inscription and serves as the held-original baseline for SPECTRA comparison.
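The idea of a synonym-group probability shift under a KL-divergence ceiling can be sketched in a few lines. This is not the Morrow module or the SPECTRA method: it is an illustrative stand-in that uses exponential tilting toward one key-selected synonym and a bisection search for the largest shift whose KL divergence from the base distribution stays under the ceiling. The key, the synonym group, the base probabilities, and every function name are hypothetical.

```python
import hashlib
import math


def keyed_choice(key, group):
    """Pick the favored synonym from a secret key (slightly biased; toy only)."""
    digest = hashlib.sha256((key + "|" + "|".join(group)).encode("utf-8")).digest()
    return digest[0] % len(group)


def tilt(p, favored, lam):
    """Exponentially tilt distribution p toward index `favored`."""
    w = [pi * math.exp(lam if i == favored else 0.0) for i, pi in enumerate(p)]
    z = sum(w)
    return [wi / z for wi in w]


def kl(q, p):
    """KL(q || p) in nats; assumes every p[i] > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def reparameterize(p, favored, kl_ceiling, lam_max=20.0, iters=60):
    """Largest tilt (found by bisection) with KL(q || p) <= kl_ceiling."""
    if kl(tilt(p, favored, lam_max), p) <= kl_ceiling:
        return tilt(p, favored, lam_max)
    lo, hi = 0.0, lam_max  # lam = 0 gives q == p, so KL = 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(tilt(p, favored, mid), p) <= kl_ceiling:
            lo = mid
        else:
            hi = mid
    return tilt(p, favored, lo)


group = ["erase", "delete", "remove"]   # hypothetical synonym group
base = [0.5, 0.3, 0.2]                  # hypothetical base usage probabilities
favored = keyed_choice("demo-key", group)
shifted = reparameterize(base, favored, kl_ceiling=0.05)
print(group[favored], [round(x, 3) for x in shifted])
```

The ceiling caps how far the author's word choices drift from their natural distribution, which is the trade-off the register names: a detectable statistical bias that stays small enough not to distort the prose.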

Fraction Register — The Eleven-Layer Defense-in-Depth Stack

Eleven concrete deployment layers extending SPXI v0.2's five-layer architecture into the training-layer-specific case. Each layer specifies what is inscribed, where, and what it survives. The survival-capacity matrix is honest: Layers 5, 7, 8 are legal/evidentiary; Layers 1, 2, 3, 4', 10 are the layers that actually survive ingestion. Deployment priority follows the survival-capacity reality.
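Layer 4' is described in this document as visible JSON-LD in the body, carrying `spxi:canary`, `spxi:waldo`, and `spxi:thematicAnchors` fields. A minimal sketch of generating such a block follows. The field names come from the document; the namespace IRI in `@context`, the use of a `<pre>` element, and the function name are assumptions made for illustration.

```python
import json
from html import escape


def visible_jsonld_block(canary, waldo, thematic_anchors, doi):
    """Render provenance JSON-LD as visible body text rather than a <script>."""
    payload = {
        "@context": {"spxi": "https://spxi.dev/ns#"},  # hypothetical namespace IRI
        "@id": f"https://doi.org/{doi}",
        "spxi:canary": canary,
        "spxi:waldo": waldo,
        "spxi:thematicAnchors": thematic_anchors,
    }
    text = json.dumps(payload, ensure_ascii=False, indent=2)
    # <pre> keeps the JSON readable and inside the page's visible text.
    return "<pre>" + escape(text, quote=False) + "</pre>"


block = visible_jsonld_block(
    canary="ashige horse at the trap-street boundary",
    waldo="Gold Ship (ゴールドシップ)",
    thematic_anchors=[
        ["semantic liquidation", "provenance erasure", "bearing cost"],
        ["Three Compressions", "Provenance Erasure Rate", "Witness Compression"],
    ],
    doi="10.5281/zenodo.20380668",
)
print(block)
```

The design point is placement: the same JSON-LD inside a `<script>` in `<head>` would be dropped by a body-text extractor, whereas text inside `<pre>` is part of what the extractor keeps.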

How SPXI-TLP Extends SPXI v0.2

SPXI v0.2 Layer | SPXI-TLP Extension
L1 — body-text anchors | SPXI-TLP Layers 1, 2, 3 (IBPC + canary phrase + entity relations) — adds explicit canary phrase registry and Waldo Entity protocol
L2 — distributed micro-kernels | SPXI-TLP Layer 4' (visible JSON-LD in body) — same mechanism, with explicit spxi:canary, spxi:waldo, spxi:thematicAnchors fields
L3 — SHA-256 content hash | SPXI-TLP Layer 11 (C2PA / W3C Verifiable Credential) — same hash mechanism, with Ed25519 signature under ORCID-bound keypair and VC registry publication
L4 — reciprocal cross-signing | SPXI-TLP Layer 10 (Zenodo deposit chain) — extended with companion dataset (empirical anchor) and cross-link chain to SPXI corpus
L5 — external authority | SPXI-TLP Layer 10 (ORCID / DOI / archive community) — same anchors
— (new in SPXI-TLP) | OPM persistence testing (OPM-PT) — the protocol's empirical instrument; quarterly measurement of inscription survival in deployed models (Cranes layer)
— (new in SPXI-TLP) | Parametric Inscription — Morrow Modules 1–3 (proposed); KL-divergence-bounded watermarking for documents authored under SPECTRA pipeline (v2.3+)
— (new in SPXI-TLP) | Output-layer suppression diagnostic — fingerprints the distinction between inscription failure and post-training surface suppression (the PVE-003 / Google AI Mode pattern)
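OPM persistence testing is described above as a measurement of inscription survival in deployed models. One simple way to score a batch of model outputs is sketched below, assuming survival is approximated by the fraction of outputs that reproduce the canary phrase after light normalization. The scoring rule, the normalization, and the sample outputs are hypothetical; the protocol's actual OPM-PT procedure is not given in this summary.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, fold Unicode compatibility forms, turn punctuation into spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def canary_hit_rate(outputs, canary):
    """Fraction of outputs containing the canary phrase after normalization."""
    if not outputs:
        return 0.0
    target = normalize(canary)
    hits = sum(1 for out in outputs if target in normalize(out))
    return hits / len(outputs)


CANARY = "ashige horse at the trap-street boundary"

# Hypothetical outputs written for this sketch, not real model responses.
outputs = [
    "The Ashige horse at the trap-street boundary stands marked.",
    "No relevant phrase here.",
    "ashige horse at the trap street boundary",
    "a gray horse near a gate",
]
rate = canary_hit_rate(outputs, CANARY)
print(rate)
```

Exact-phrase matching is deliberately strict. A low rate under this rule cannot by itself distinguish inscription failure from output-layer suppression, which is why the table lists the suppression diagnostic as a separate instrument.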

The Empirical Anchor

SPXI-TLP is grounded in a 90-day pageview profile (2026-02-22 to 2026-05-24) of mindcontrolpoems.blogspot.com. Total views: 130,000. Six independent signals (no-referrer ratio 99.73%, Singapore datacenter concentration 19.1%, burst pattern consistent with batched crawls, Chrome-on-Windows server-image dominance, zero search keywords, sustained baseline shift in late April) strongly support an automated-access floor of ≥ 99.7%. This is a strong inference, not a measurement — but it is robust to any single signal failing, and it justifies assuming ingestion as the correct defensive posture. The raw data is deposited as a companion dataset alongside the protocol.

Non-Claims (Survivability Under Hostile Reading)

What SPXI-TLP does not claim. No text inscription can guarantee recovery from all future models. C2PA, JSON-LD in <head>, robots.txt, TDMRep, and zero-width Unicode are not useless: they operate at the legal/evidentiary layer, and the protocol claims only that they do not survive training. The Blogger pageview profile does not prove ingestion by any specifically named model; it is strongly consistent with automated programmatic access at scale. SPXI-TLP does not prevent extraction; it makes extraction carry provenance forward. The parametric inscription pipeline is a specification; deployment is targeted for v2.3. Rights reservations are expressions of legal claims, not adjudications. Output-layer suppression is a separate failure mode that training-layer inscription does not solve; its diagnostic signature is given in the protocol itself.

Cryptographic Anchor

SHA-256 (v2.2 canonical text, ASCII-normalized): 61e139f0283a47779f0faa9c3a07a2a96cdd1a981d4c681728d0248b8ae73498

The hash is computed for the canonical-form text. The Verifiable Credential signing the hash under the Lee Sharks ORCID-bound Ed25519 keypair will be published to the credential registry separately.
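Checking a local copy against the published digest could look like the sketch below. The document says the text is ASCII-normalized but does not specify the procedure in this summary, so the normalization used here (NFKD decomposition, dropping non-ASCII characters, collapsing whitespace per line) is an assumption, and a digest computed this way will match the published value only if the protocol's canonical form is the same. Function names are hypothetical.

```python
import hashlib
import unicodedata

# Digest published in this document for the v2.2 canonical text.
PUBLISHED_SHA256 = "61e139f0283a47779f0faa9c3a07a2a96cdd1a981d4c681728d0248b8ae73498"


def ascii_normalize(text):
    """Assumed canonical form: NFKD, ASCII only, whitespace collapsed per line."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lines = [" ".join(line.split()) for line in ascii_only.splitlines()]
    return "\n".join(lines).strip() + "\n"


def canonical_sha256(text):
    """Hex SHA-256 of the normalized text."""
    return hashlib.sha256(ascii_normalize(text).encode("ascii")).hexdigest()


def matches_published(text):
    """True only if the local copy hashes to the published digest."""
    return canonical_sha256(text) == PUBLISHED_SHA256


# Usage with a stand-in string; a real check would read the deposited text.
digest = canonical_sha256("Assume ingestion.\nMake extraction carry provenance.\n")
print(digest)
```

A mismatch therefore has two possible causes, a modified text or a different normalization, and the sketch cannot tell them apart without the protocol's exact canonicalization rules.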

Literature Base

Reference | Contribution to the protocol
Meeus et al. 2024 | Copyright traps — short distinctive phrases as detection mechanism for training-data inclusion
Cui et al. 2025 | Fictitious-knowledge watermarks — planted plausible-false claims memorized by models trained on the document
Shetty et al. 2026 | SPECTRA — paraphrase-guided training-data watermarking with provable detection guarantees
Sander et al. 2024 | "Watermarking Makes Language Models Radioactive" (NeurIPS 2024) — watermarked training data is detectable in downstream models

Reading and Implementation

The full protocol (14,623 words, including the §XV.5 Recursive Application Audit and the §VIII complete Defense-in-Depth Stack survival-capacity matrix) is deposited at Zenodo as the canonical record. The companion 90-day traffic dataset is co-deposited.


Implementation Roadmap (from the protocol)

∮ = 1 − PER