v0.1 (Character Blocks, content-address + embedding, Skillchain, Federation) + v0.2 (royalty cascade, lift measurement, lineage score) + v0.3 (trainable license-class, cross-lab corpus treaties) · §3.4 Side-Effect Emission, §4.2 The Unified Embedding, §4.3 Vector-Space Proximity, §6.3.5 Lift Measurement, §6.9 Bees for Honey
Training Data with a Lineage
The open web is filling with model output, and the labs training the next generation of models can no longer reliably tell which text was written by a person solving a real problem and which was generated to fill a page. Meanwhile the people producing the genuinely good data — in the flow of real work — are paid nothing for it and have every reason to stop publishing where a scraper can reach them. HOP attaches a verifiable lineage to every piece of work at the moment it is made, lets a lab find exactly the data it needs by vector proximity across a global tree, ingest only attested-author content, and pay each author per token included — and pay more when the data measurably improved the model. Pay the bees for the honey.
The situation
Two failures are composing at the frontier.
The first is the data wall. The bottleneck on the most capable models is no longer compute; it is the supply of high-quality human text, and the supply is finite. The labs have read the good corpus already.
The second is slop. The open web is now substantially model-generated, and a corpus crawled in 2026 is partly the exhaust of last year’s models. Training a model on the undifferentiated output of other models degrades it — the distribution narrows, the errors compound, the rare and the true get washed out by the fluent and the average. The labs know this. What they cannot cheaply do is tell the difference at scale — separate the engineer’s hard-won post-mortem from the thousand SEO regurgitations of the same Stack Overflow thread, when both arrive as anonymous bytes on a page.
The deeper problem underneath both is that provenance was thrown away at creation time. A scraped document has no bloodline. The bytes do not carry who made them, in what context, whether the work was real, whether anyone relied on it afterward. That signal — the thing that would let a lab distinguish in-the-flow human work from farmed filler — existed at the moment the work was done, and was discarded the instant it became a webpage. The entire data-quality industry is the attempt to reconstruct, post-hoc and probabilistically, a provenance signal that was free to capture at the source and is expensive to recover after.
And the incentive is now inverted. The person who writes the genuinely excellent thing in the open gets scraped for free, indistinguishably from the slop, and paid nothing. So the rational move is to stop: put the good work behind a login, a paywall, a private repo, a closed Slack. The commons of high-quality human work is being enclosed precisely because there is no way to be paid for leaving it open. This is the exact failure Paying the Bees for Honey names: you cannot reward the bees for the honey they already make by paying them a wage; you reward them by giving them a share of the price of the honey. No share exists, so the honey stops flowing into the open.
The best training data is the byproduct of real work done for real stakes — not data farmed to be a dataset. That data exists. It is just unattributed, unfindable, and unpaid.
The example
A senior engineer spends a Tuesday fixing a race condition that has been intermittently corrupting a ledger for months. She finds it, fixes it, and writes it up — the diff, the failure mode, the reasoning about why the obvious fix was wrong and the real fix worked. Two colleagues review it. Over the next year, four other fixes in the codebase cite her write-up as the thing that explained the bug class to them.
This is precisely the data a lab would pay a premium for: dense reasoning, produced under real constraints, verified by peers, demonstrably useful downstream. It is in the flow. And today it is either locked in a private repository where the lab will never see it, or scraped without attribution or payment and dropped into a 10-trillion-token pile where it is statistically indistinguishable from a content farm’s paraphrase of someone else’s answer.
Put numbers on the lab’s side. A lab assembling a reasoning corpus might find that a low single-digit percentage of what it crawls is genuinely high-lineage in-the-flow human work; the rest is scrape of unknown provenance, increasingly self-polluted. The lab then spends heavily — on copyright legal exposure, on dedup, on quality classifiers, on contractors rating samples — trying to recover, after the fact, the provenance signal that was free to capture at creation. It is paying a fortune to approximate a number the substrate could have carried for nothing.
What HOP does
Five steps. Each is a real protocol operation, not a metaphor.
1. Work emits attributed content as a side-effect
When the fix lands, the engineer’s tooling emits a Character Block (§3) — the same §3.4 side-effect emission the org-walkability case (02) runs on. The block carries a content hash of the write-up, the work context, and the dual signatures of the author and her reviewers:
{
  "block_type": "work",
  "content_hash": "sha256:a91f…",
  "inventory": {"medium": "text", "kind": "post_mortem",
                "license_class": "trainable_cascade"},
  "context": {"work_class": "concurrency_bugfix",
              "complexity_class": "high",
              "reviewed_by": 2,
              "produced_in": "real_work"},
  "derived_from": [
    {"hash": "sha256:3b7c…", "kind": "incident_thread", "weight": 0.2}
  ],
  "signatures": {"worker_signature": "ed25519:…",
                 "poster_signature": "ed25519:…",
                 "validator_signature": "ed25519:…"},
  "prev_block_hash": "sha256:7e02…"
}
The content was not farmed for a dataset. It is the exhaust of work that was happening anyway. That is the lineage — and it is captured at the only moment it is cheap to capture, the moment of creation.
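As a rough illustration of how tooling might assemble such a block, the sketch below hashes a write-up and chains the block to its predecessor. It is not the reference implementation: the helper names (`sha256_tag`, `canonical_bytes`, `make_work_block`), the sorted-key JSON canonicalisation, and the sample context values are all assumptions made for the example, and signing is left out entirely because it depends on the signers' own key tooling.

```python
import hashlib
import json
from typing import Optional


def sha256_tag(data: bytes) -> str:
    """Return a digest in the 'sha256:<hex>' form used in the block above."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def canonical_bytes(block: dict) -> bytes:
    """Serialise with sorted keys so equal blocks hash equally (assumed scheme)."""
    text = json.dumps(block, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def make_work_block(write_up: str, context: dict, derived_from: list,
                    prev_block: Optional[dict] = None) -> dict:
    """Assemble an unsigned work block; signatures are attached later."""
    prev_hash = None
    if prev_block is not None:
        prev_hash = sha256_tag(canonical_bytes(prev_block))
    return {
        "block_type": "work",
        "content_hash": sha256_tag(write_up.encode("utf-8")),
        "inventory": {"medium": "text", "kind": "post_mortem",
                      "license_class": "trainable_cascade"},
        "context": context,
        "derived_from": derived_from,
        "signatures": {},  # filled in by author and reviewer tooling
        "prev_block_hash": prev_hash,
    }


# Hypothetical usage: two blocks on one author's chain.
context = {"work_class": "concurrency_bugfix", "complexity_class": "high",
           "reviewed_by": 2, "produced_in": "real_work"}
first = make_work_block("Root cause: unsynchronised ledger write.", context, [])
second = make_work_block(
    "Follow-up: same bug class in the batch importer.",
    context,
    [{"hash": first["content_hash"], "kind": "post_mortem", "weight": 0.2}],
    prev_block=first,
)
```

The second block references the first twice: through `derived_from`, which records what the work built on, and through `prev_block_hash`, which records where it sits on the author's chain.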
2. Content is embedded and federated — globally findable
Each block contributes to the unified embedding (§4.2) and is addressable by its content hash. The content is therefore discoverable two ways at once: by exact hash, and by semantic proximity across the federated data tree (the same global tree the royalty cascade in 01 walks). A lab’s data-acquisition agent does not crawl and guess; it runs a vector query. Anything embedded in the tree can be found by vector proximity, and what the query returns arrives with its lineage attached rather than stripped.
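The proximity half of that lookup can be sketched as a plain cosine-similarity search. This is a toy, in-memory stand-in for whatever index the federated tree actually uses; the function names, the two-dimensional vectors, and the `block-a` style identifiers are placeholders invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_blocks(query, index, k=3):
    """Return the k block ids whose embeddings lie closest to the query."""
    scored = [(cosine(query, vector), block_id)
              for block_id, vector in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(block_id, score) for score, block_id in scored[:k]]


# Placeholder index: block id -> embedding (real embeddings are far larger).
index = {
    "block-a": [1.0, 0.0],
    "block-b": [0.0, 1.0],
    "block-c": [0.9, 0.1],
}
hits = nearest_blocks([1.0, 0.0], index, k=2)
```

Each hit is a block identifier, so the caller can fetch the full block, signatures and `derived_from` edges included, rather than a bare string of text.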
3. Lineage becomes a computable score
The provenance the bytes used to lack is now a graph the chain already holds: who signed the block, in what work context, whether the work was relied on downstream (other blocks pointing derived_from at it), and the author’s standing in the inverted reputation geometry (§4.2, §14 — reputation lives in what others signed about her, so it cannot be self-inflated). A lineage score is a query over this graph, not a new study:
- High lineage: dual-signed by author and reviewer, produced in real work, reused downstream by other signed work, authored by someone trusted by the relying party’s trusted set.
- Low lineage: unsigned, no work context, no downstream use, or self-referential (the content’s derived_from chain loops back into model output).
This is the number labs currently cannot compute because the inputs to it were discarded at creation. HOP carries them.
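To make the high and low cases concrete, here is one way such a score could be computed from facts the graph holds. The weights, the cap on downstream uses, the 0-to-1 scale, and the `LineageFacts` record are illustrative assumptions; the text does not fix a formula, and a standardised metric is listed as pending.

```python
from dataclasses import dataclass


@dataclass
class LineageFacts:
    """Facts read off the provenance graph for one block (hypothetical shape)."""
    signed_by_author: bool
    signed_by_reviewer: bool
    produced_in_real_work: bool
    downstream_uses: int          # signed blocks pointing derived_from at it
    author_trusted: bool          # within the relying party's trusted set
    loops_into_model_output: bool


def lineage_score(facts: LineageFacts) -> float:
    """Combine the facts into a score in [0, 1] using assumed weights."""
    if facts.loops_into_model_output:
        return 0.0  # self-referential content scores nothing
    score = 0.0
    if facts.signed_by_author and facts.signed_by_reviewer:
        score += 0.3
    if facts.produced_in_real_work:
        score += 0.2
    # Downstream reuse saturates after four citing blocks (assumed cap).
    score += 0.3 * min(facts.downstream_uses, 4) / 4
    if facts.author_trusted:
        score += 0.2
    return round(score, 6)


post_mortem = LineageFacts(True, True, True, 4, True, False)
content_farm = LineageFacts(False, False, False, 0, False, False)
high = lineage_score(post_mortem)
low = lineage_score(content_farm)
```

Every input is a lookup on the graph, which is the point: nothing in the function needs a classifier or a human rater.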
4. The lab ingests only what it chooses
The acquisition agent queries the tree for the reasoning patterns it needs, filtered to lineage_score ≥ threshold and license_class = trainable_cascade. It ingests attested-author content with a clean licence trail and an intact provenance graph — and skips the slop, not by a leaky post-hoc classifier but because the slop never carried the lineage in the first place. Model collapse is held off at the source: the lab can deterministically weight toward high-lineage human work and hold generated content out, because provenance is a field on the data rather than a guess about it. Copyright exposure collapses in the same move: every included token carries a signed licence from an attributable author.
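Because lineage and licence are fields on the data, the ingest decision reduces to a filter. The sketch below assumes each candidate arrives as a record with `lineage_score` and `license_class` keys; the record shape, the sample values, and the 0.7 threshold are invented for illustration.

```python
def select_for_ingest(candidates, threshold):
    """Keep only candidates that clear the lineage bar and are licensed to train on."""
    return [
        c for c in candidates
        if c["license_class"] == "trainable_cascade"
        and c["lineage_score"] >= threshold
    ]


# Hypothetical query results from the acquisition agent.
candidates = [
    {"id": "block-a", "lineage_score": 0.92, "license_class": "trainable_cascade"},
    {"id": "block-b", "lineage_score": 0.95, "license_class": "all_rights_reserved"},
    {"id": "block-c", "lineage_score": 0.10, "license_class": "trainable_cascade"},
]
chosen = select_for_ingest(candidates, threshold=0.7)
```

Here only `block-a` survives: `block-b` has high lineage but no training licence, and `block-c` is licensed but falls below the lineage threshold.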
5. Payment cascades — and lift pays the best more
Two payments compose, exactly as in the crystallised-labour case (01).
Per token included. Each token the lab trains on settles a sub-cent micropayment back along the derived_from graph to its author. The rail argument is identical to 01: a payment is a state change on a chain, not a card-network transaction, so paying a millionth of a cent a trillion times costs a trillion writes, not a trillion interchange fees. The author off-ramps to fiat on whatever cadence she likes; the rail fee is paid once, at the off-ramp, not per token.
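A minimal sketch of the per-token cascade follows, assuming each `derived_from` weight is the fraction of a block's payment passed up to that parent, with the remainder kept by the block's author. That reading of `weight`, the `settle` function, the author names, and the per-token rate are assumptions for the example, not the protocol's settlement rule.

```python
from collections import defaultdict
from fractions import Fraction


def settle(block_id, amount, blocks, payouts, seen=frozenset()):
    """Split `amount` between a block's author and its derived_from parents."""
    if block_id in seen:
        raise ValueError("cycle in derived_from graph at " + block_id)
    block = blocks[block_id]
    passed_up = Fraction(0)
    for parent_id, weight in block["derived_from"]:
        share = amount * weight
        settle(parent_id, share, blocks, payouts, seen | {block_id})
        passed_up += share
    payouts[block["author"]] += amount - passed_up


# Hypothetical graph: the write-up derives from an incident thread at weight 0.2.
blocks = {
    "write_up": {"author": "engineer",
                 "derived_from": [("incident_thread", Fraction(1, 5))]},
    "incident_thread": {"author": "reporter", "derived_from": []},
}
rate_per_token = Fraction(1, 1_000_000)  # assumed rate, in arbitrary units
tokens_included = 1_000
payouts = defaultdict(Fraction)
settle("write_up", tokens_included * rate_per_token, blocks, payouts)
```

Exact fractions are used so that the shares always sum to the amount paid in, with no rounding loss at sub-cent sizes.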
Per unit of lift. Per-token pays for inclusion. It does not yet distinguish the document that taught the model something from the document that merely filled a batch. The §6.3.5 Bean-Chain lift measurement, applied to data instead of mentorship, closes that gap:
| Bean Chain (mentorship) | Same shape, applied to training data |
|---|---|
| T1 — baseline mentee skill vector | T1 — eval the model without document cluster X |
| T2 — mentorship event; Bean staked | T2 — include cluster X in the training mix |
| T3 — mentee at +12 months | T3 — eval the model again, project the delta onto the capability the lab cares about |
| T4 — confirm sustained outperformance | T4 — confirm the gain holds across model generations; bonus settles |
Positive projection means the data demonstrably moved the model on a capability the lab values; the author collects a lift bonus proportional to the measured gain. The engineer whose race-condition write-up actually taught the model to reason about concurrency earns against the lift, not merely the token count.
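The T1 to T4 shape in the table can be sketched numerically. The example assumes evals are reported as a vector of scores, that the capability the lab cares about is expressed as a direction in that same space, and that the bonus is linear in the positive projection; the function names, the scores, and the bonus rate are all hypothetical.

```python
import math


def lift_projection(eval_without, eval_with, capability):
    """Project the T1 -> T3 eval delta onto the capability direction."""
    delta = [w - wo for w, wo in zip(eval_with, eval_without)]
    norm = math.sqrt(sum(c * c for c in capability))
    if norm == 0:
        raise ValueError("capability direction must be non-zero")
    return sum(d * c for d, c in zip(delta, capability)) / norm


def lift_bonus(projection, bonus_per_unit, sustained):
    """Pay in proportion to positive lift, and only once T4 confirms it held."""
    if not sustained or projection <= 0:
        return 0.0
    return projection * bonus_per_unit


# Hypothetical evals on two axes: [concurrency reasoning, general fluency].
t1_without_cluster = [0.60, 0.70]
t3_with_cluster = [0.66, 0.70]
capability = [1.0, 0.0]  # the lab cares about concurrency reasoning

projection = lift_projection(t1_without_cluster, t3_with_cluster, capability)
bonus = lift_bonus(projection, bonus_per_unit=1000.0, sustained=True)
```

This sketch pays zero for non-positive lift; a negative settlement for data that degrades the model would belong to the clawback work described as pending.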
This is the operational meaning of “the model company distinguishes the lineage of high-quality data and trains on it”: lineage tells the lab which honey is real before it ever trains; lift tells it which honey was worth the most after. Both are queries over data the substrate already holds.
Distinguishing the data that fills from the data that teaches
The structure here is the same one 01 draws for snares, and it is worth stating plainly because it is the load-bearing claim.
Per-token-included payment is the floor. It is fair: it pays every author whose words were used, in proportion to how much was used. But two corpora of identical token count can have wildly different value — one moves the model, the other is ballast — and per-token alone pays them the same. Lineage is the pre-training filter (which data is genuinely in-the-flow human work worth ingesting at all); lift is the post-training settlement (which of the ingested data actually taught the model something). Together they let a lab do what no current data pipeline can: pay for honey in proportion to how good the honey turned out to be, and tell the difference before buying.
It also resists the obvious attack. An author cannot farm a million near-duplicate documents to inflate a per-token cascade, because the lineage score craters on unsigned, no-downstream-use, self-referential content, and the lift bonus pays zero (or, under v0.3 clawback, negative) for data that does not move the eval. The §6.3.4 anti-collusion safeguards — pairwise discount caps, the Christiano-style trust matrix — carry over from the mentorship case directly.
What’s different
A world in which every piece of real work — every fix, every write-up, every dataset row, every answer given in the flow of solving an actual problem — is signed, findable, and paid per token-included-plus-lift is a world where:
- Model collapse is averted at the source. The corpus is provenance-stamped, so the lab weights toward high-lineage human work deterministically instead of fighting its own crawl with classifiers.
- The commons re-opens. Because publishing in the open now pays — per token, and more for data that teaches — the rational move flips from enclosure back to contribution. Everyone doing real work in the flow is, as a side-effect, contributing training data, and being paid for it. This is Paying the Bees for Honey made operational: the share of the price of the honey finally exists, so the honey flows back into the open.
- “In-the-flow” data becomes a priced, distinguished asset class. The market can finally tell the snare that drives the play from the snare that merely gets played (01), and the document that teaches the model from the document that fills the batch.
- The lab sheds legal exposure and gains a clean licence trail, while the authors gain a cascade and the open web gains an incentive to stay genuinely human.
The frame from the corpus, stated plainly: the honey still forms downstream of the work, but the honey is now cryptographically attached to the bee that made it. The vault is no longer the crawler’s. The vault is the chain.
Why this is impossible today and tractable in HOP
Provenance discarded at creation. Today the signal — who, in what context, reused or not — is thrown away the moment text becomes a webpage; reconstructing it post-hoc is the entire data-quality industry. HOP attaches it at creation, as a side-effect of work that was happening anyway, for the cost of one signed block.
The micropayment rail. The same impossibility as 01: you cannot pay a millionth of a cent per token over card rails, because the rail overhead dwarfs the payment a millionfold. On a chain the payment is a write; the chain batches; the author pays the rail fee once, at off-ramp.
The causal-lift measurement. Today only the largest labs can afford data-incrementality studies, and they run them for the corpus, never for the individual author. In HOP the provenance graph (step 1) and the eval telemetry (step 5) are already on the substrate, so the T1–T4 measurement is a query over data the protocol already holds rather than a study commissioned per author. The cost of measuring an author’s causal contribution drops by orders of magnitude.
What’s in v0.1 and what’s pending
In v0.1, today — works in the Python reference implementation:
- Character Blocks with content_hash and derived_from references, dual-signed by author and reviewer
- Skillchains accumulating signed creations as a side-effect of existing tooling (§3.4)
- Embedding-space search across blocks (§4.3)
- Federation treaties between an author’s chain and a relying party’s chain
- Per-attestation payment from a relying party — the §9.5 Banks as Trust Sources pattern, applied to per-token inclusion instead of identity attestation
Pending v0.2:
- The walk-the-derived_from royalty cascade and a reference cascade-walker (shared build with 01)
- The lineage score as a standardised derived metric over the provenance graph (proposed in the concerns log, §J.13)
- Lift measurement on data — the Bean-Chain T1–T4 shape (§6.3.5) applied to corpus inclusion, against held-out eval
Pending v0.3:
- A standardised trainable license-class and the cross-lab federation treaties that let a corpus move between labs under attested terms
- Clawback on lift bonuses for data that degrades the model
- Cross-jurisdiction tax-withholding at the author off-ramp
A v0.1 implementer could ship the attributed-corpus-plus-per-token-included-payment loop for a single tooling integration — one IDE or forge emitting blocks, one lab’s acquisition agent querying and paying — in roughly a month. The lift bonus and the lineage score are the v0.2 build. The cross-lab corpus federation is diplomacy, not implementation.
See also
- The sibling materialisation — 01 Crystallised Labour, Paid Forever — the same cascade-plus-lift mechanism, applied to media instead of text and data
- The next materialisation — 04 Connecting the Halves — the same matching engine, pointed at objects instead of data
- The underlying framework — Paying the Bees for Honey (bees-for-honey/, Brendan) — you reward the bee with a share of the price of the honey, not a wage
- The spec primitives this uses — §3.4 Side-Effect Emission, §4.2 The Unified Embedding, §4.3 Vector-Space Proximity, §6.3.5 Lift Measurement, §6.9 Bees for Honey