Commissioning a new frontier dataset versus current AI revenue
This working note traces one unresolved tension in the calculator: replacement-cost estimates for commissioned LLM-scale data climb quickly, while current public frontier-AI revenue anchors remain much smaller.
This post is a placeholder for a longer walkthrough tying the commissioning scenario to public revenue anchors.
It should eventually explain where the biggest uncertainties come from, which assumptions matter most, and why the gap between replacement cost and current revenue is politically important.
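Until the longer walkthrough exists, the comparison at the heart of this note can be sketched as simple arithmetic. The snippet below is a minimal illustration, not the calculator's actual code: the function names, the flat per-word pricing model, and every input value are hypothetical placeholders chosen only to show the shape of the calculation.

```python
def replacement_cost_usd(tokens: float, words_per_token: float, usd_per_word: float) -> float:
    """Cost to commission `tokens` tokens of new text at a flat per-word rate."""
    return tokens * words_per_token * usd_per_word


def cost_to_revenue_ratio(cost_usd: float, annual_revenue_usd: float) -> float:
    """How many years of the revenue anchor the commissioning cost represents."""
    if annual_revenue_usd <= 0:
        raise ValueError("annual_revenue_usd must be positive")
    return cost_usd / annual_revenue_usd


# Placeholder inputs (assumptions for illustration, NOT estimates from any source).
corpus_tokens = 1e12        # hypothetical corpus size in tokens
words_per_token = 0.75      # hypothetical tokens-to-words conversion
usd_per_word = 0.10         # hypothetical commissioning rate per word
annual_revenue_usd = 1e10   # hypothetical public revenue anchor

cost = replacement_cost_usd(corpus_tokens, words_per_token, usd_per_word)
ratio = cost_to_revenue_ratio(cost, annual_revenue_usd)
print(f"replacement cost: ${cost:,.0f}; ratio to revenue: {ratio:.1f}x")
```

Because the cost is a straight product of three inputs, the ratio moves linearly with each of them: doubling the per-word rate or the corpus size doubles the gap. That is why the corpus-size and AI-revenue disclosures asked for below matter so much to this comparison.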
Public asks
What we want made legible
Concrete asks that would make these estimates easier to ground in public evidence.
Publish deal values and renewal terms in ways researchers and journalists can cite.
Make data-licensing deal terms legible enough that outsiders can compare bargaining power, royalties, and renewal dynamics over time.
Break out AI-specific revenue from total company revenue whenever possible.
Separate AI revenue from broader cloud, subscription, or advertising lines so public comparisons are not forced to rely on rough estimates.
Disclose pretraining size, post-training size, and mix composition as first-class metrics.
Treat corpus size, data mix, and post-training volume as standard disclosures rather than occasional exceptions in model cards or lawsuits.
Make contributor compensation for benchmarks and labeling more legible.
We need clearer reporting on what evaluators, labelers, and domain experts are paid when their labor becomes critical model infrastructure.
Treat data access, licensing, and opt-out infrastructure as measurable public-interest indicators.
Data access rules shape both model performance and creator leverage, so they should be tracked like other core AI governance metrics.
External sources
Papers, trackers, and filings
Public sources tied to the shared inputs, scenarios, and surrounding policy argument.
Position paper plus public tracker on AI data deals, useful for grounding bargaining-power and transparency questions in a concrete set of disclosed agreements.
Useful for: Count of public AI data deals with disclosed counterparties; Share of deals with public pricing or royalty terms; Renewal and exclusivity terms across publisher-platform agreements
Nikhil Kandpal and Colin Raffel, ICML 2025 position paper
Strong anchor for the claim that scarce, high-quality data should be treated as the costly input in the AI stack rather than an effectively free byproduct.
Useful for: Share of total model-development cost attributable to training data; Cost of replacing high-quality corpora with commissioned data; Public count of high-value data suppliers versus model builders
Early and still-useful argument that fair use is not automatic for model training, especially when developers can foresee substitution, market harm, or memorization.
Useful for: Rates of regurgitation or memorization from copyrighted sources; Degree of source substitution for downstream users; Availability of technical mitigations for copyrighted material
Clear legal-policy proposal for paying authors without making ex ante licensing the only path for model training.
Useful for: Royalty basis for AI-generated outputs that substitute for books; Revenue share allocated to author-remuneration pools; Administrative cost of output-based versus input-based compensation
Important economics paper arguing that training-data rules feed back into future creator incentives and therefore into the long-run supply of high-quality data.
Useful for: Elasticity of creator participation under different compensation rules; Change in data supply under opt-out, licensing, or royalty regimes; Time lag between policy changes and observable corpus shrinkage
Best empirical anchor I found for the claim that the open web is becoming less permissive for AI training, with large-scale evidence from Common Crawl and robots restrictions.
Useful for: Share of C4 URLs carrying AI-restrictive terms or robots rules; Growth rate of AI-specific restrictions over time; Share of top websites that newly block AI crawlers
Useful empirical check on whether "remove the bad data later" is technically realistic once a model has already been trained.
Useful for: Success rate of post hoc unlearning or takedown methods; Retained capability loss after copyright-focused removal; Residual memorization after takedown attempts
2024 empirical paper on copyrighted-character generation
Strong evidence that generative models can reproduce protected characters or recognizable copyrighted elements even without direct prompt copies.
Useful for: Rate of recognizable copyrighted-character generation; Prompt sensitivity for protected-character outputs; Effect of mitigation methods on copyrighted-content generation
Primary filing for the public record on Meta's position and for exhibits discussing LibGen provenance, preprocessing, and book-data use.
Useful for: Whether LibGen and related book datasets were used in training; Which preprocessing steps removed copyright markers or deduped rows; What internal evidence exists about torrenting, seeding, or data provenance
Useful primary source for how Meta employees compared licensed books, public-domain books, and legal risk in practice.
Useful for: Approximate token count for Books3 plus Gutenberg; Internal estimates of book and journal counts in a sample licensing dataset; Internal comparisons between licensing costs, speed, and legal risk
Important roadmap to the plaintiffs' theory of the case and to which exhibits may matter most for empirical questions about book-data use.
Useful for: Plaintiffs' theory of copying across Books3, LibGen, Z-Library, and Internet Archive; Claimed overlaps between internal model-development milestones and copied corpora; Exhibits that may contain benchmark or ablation evidence about books
Parallel case that helps show how another major lab valued books and reasoned about acquisition cost.
Useful for: Internal valuation of books as training data; Comparison between licensing costs and alternative acquisition paths; Evidence about how companies rank books relative to other training sources