By some estimates, over half the content on the open web is now generated by models. The next training run is going to scrape it. That should bother you more than it does.
The hall of mirrors
The failure mode has a name: model collapse. Train a model on the output of an earlier model and you lose the tails of the distribution first. The rare and surprising patterns go, and what is left drifts toward a bland statistical mean. Researchers at Oxford and Cambridge showed this with iterative training back in 2023: each generation narrower than the last, more confident about less. It is a photocopy of a photocopy. The tenth copy is still legible. It has just lost the detail that made it worth keeping.
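The tail-loss mechanism can be seen in a toy setting. The sketch below is not the experiment from the 2023 work; it is a minimal stand-in that assumes a "model" is nothing more than a table of token frequencies, re-estimated each generation from a finite sample of the previous generation's output. The function name, vocabulary size, sample size, and Zipf-like starting weights are all illustrative assumptions.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size=50, sample_size=200, generations=10, seed=0):
    """Return the number of tokens with non-zero probability per generation.

    Toy assumption: each generation's "model" is just the empirical token
    frequencies of a finite sample drawn from the previous generation.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Zipf-like start: a few common tokens and a long tail of rare ones.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    support_sizes = [vocab_size]
    for _ in range(generations):
        # "Generate" a corpus from the current model.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        # "Train" the next model on that corpus. A token that was never
        # sampled gets weight zero and can never be sampled again.
        weights = [counts.get(token, 0) for token in tokens]
        support_sizes.append(sum(1 for w in weights if w > 0))
    return support_sizes


if __name__ == "__main__":
    print(simulate_collapse())
```

The count of surviving tokens can only stay flat or fall from one generation to the next, and the rare tokens are the ones most likely to miss a sample. Real models are far richer than a frequency table, but the one-way loss of rare patterns is the same shape of problem.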
You cannot scale out of it
The comfortable response is to scale. More tokens, more parameters, better benchmark scores. That worked while the bottleneck was genuinely scale and the data was overwhelmingly human. We are not in that regime any more. Throwing compute at a dataset that is feeding on itself does not fix the degradation. It amplifies it. Benchmarks hide this, because they measure capability on narrow tasks, not the diversity of the distribution. We optimise the metric and use the metric to reassure ourselves.
So the interesting question stopped being how big the model is. It became where the data came from, and whether a model has already touched it.
Provenance, the part nobody wants to build
That is a provenance question, and provenance is the part nobody wants to build. Everyone wants the frontier model. Almost nobody wants the plumbing that makes the frontier model trustworthy. Detection classifiers are losing an arms race to the generators they are trying to catch. Watermarking has no universal standard and no adoption incentive, and it can be stripped after the fact. The metadata layer of the internet was never designed to record whether a sentence was written by a person or a machine.
Provenance is checkable
Here is the part I want to argue for. Treat provenance as a recording problem solved at the source and it becomes tractable. The mistake is to treat it as a detection problem solved after the fact, once the text is already in the corpus and indistinguishable from human writing.
You cannot reliably tell, from the text alone, whether a paragraph came from a person or a model. You can record, at the moment data is produced or ingested, where it came from and how. A cited source is checkable. A signed, timestamped record of what was collected and from where is checkable. A pipeline that refuses to interpolate, that logs a gap instead of guessing, produces data you can audit later. None of that depends on winning a classifier arms race. It depends on treating provenance as an engineering discipline rather than a marketing checkbox.
This is the principle we build on at Nativerse. Our ecosystem observatory, GAWK, holds one rule above the rest: cited, not invented. Every figure links to the source it came from, and when a source goes down the pipeline keeps the last verified value and marks it stale rather than inventing a plausible number. That is a small example. It is also the shape of the discipline the data supply needs at scale: a known origin and a recorded method, with no synthetic guesswork, kept as a record you can re-check.
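The "keep the last verified value and mark it stale" rule is small enough to sketch. This is not GAWK's actual code, whose internals are not shown here; `Reading`, `refresh`, and the `fetch` callable are hypothetical names, and timestamps are passed in as plain strings to keep the example self-contained.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Reading:
    value: Optional[float]      # None means a recorded gap, never a guess
    source: str                 # where the figure came from
    verified_at: Optional[str]  # when it was last actually fetched
    stale: bool                 # True when the source could not be re-checked


def refresh(previous: Optional[Reading], source: str,
            fetch: Callable[[str], float], now: str) -> Reading:
    """Fetch a fresh value, or fall back without inventing one."""
    try:
        value = fetch(source)
    except Exception:
        # Broad catch for the sketch: any fetch failure counts as "source down".
        if previous is None:
            # Nothing verified yet: log the gap instead of guessing.
            return Reading(None, source, None, True)
        # Keep the last verified value and its original timestamp, flagged stale.
        return replace(previous, stale=True)
    return Reading(value, source, now, False)
```

The design choice is that a failure never produces a new number. The only possible outcomes are a freshly verified value, an old value that is visibly marked stale, or an explicit gap.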
The same instinct runs through verification more broadly. When you can produce a signed receipt of what actually ran, or what was actually collected, you stop arguing about trust and start checking it. That is the difference between "this dataset is clean, believe me" and "here is where every part of it came from."
None of this is solved. Building real data-provenance infrastructure for training corpora is a hard systems problem, and most of it is unbuilt. But the direction is clear, and it does not run through bigger models. If the foundation is rotting, it does not matter how tall the building is. The serious work now is on the data: where it came from, how it was made, and whether we can prove it.
Prompted by "Model Collapse Is Already Happening, We Just Pretend It Isn't" (CACM, March 2026).