Resources
Project notes first, source material underneath.
This page keeps our own essays and asks near the top, then collects the external papers,
trackers, and filings that anchor the calculator’s inputs and arguments.
This post is a placeholder for a longer walkthrough tying the commissioning scenario to public revenue anchors.
It should eventually explain where the biggest uncertainties come from, which assumptions matter most, and why the gap between replacement cost and current revenue matters politically.
Publish deal values and renewal terms in ways researchers and journalists can cite.
Make data-licensing deal terms legible enough that outsiders can compare bargaining power, royalties, and renewal dynamics over time.
Break out AI-specific revenue from total company revenue whenever possible.
Separate AI revenue from broader cloud, subscription, or advertising lines so public comparisons are not forced to rely on rough estimates.
Disclose pretraining size, post-training size, and mix composition as first-class metrics.
Treat corpus size, data mix, and post-training volume as standard disclosures rather than occasional exceptions in model cards or lawsuits.
Make contributor compensation for benchmarks and labeling more legible.
We need clearer reporting on what evaluators, labelers, and domain experts are paid when their labor becomes critical model infrastructure.
Treat data access, licensing, and opt-out infrastructure as measurable public-interest indicators.
Data access rules shape both model performance and creator leverage, so they should be tracked like other core AI governance metrics.
Position paper plus public tracker on AI data deals, useful for grounding bargaining-power and transparency questions in a concrete set of disclosed agreements.
Strong anchor for the claim that scarce, high-quality data should be treated as the costly input in the AI stack rather than an effectively free byproduct.
Early and still-useful argument that fair use is not automatic for model training, especially when developers can foresee substitution, market harm, or memorization.
Clear legal-policy proposal for paying authors without making ex ante licensing the only path for model training.
Important economics paper arguing that training-data rules feed back into future creator incentives and therefore into the long-run supply of high-quality data.
Best empirical anchor I found for the claim that the open web is becoming less permissive for AI training, with large-scale evidence from Common Crawl and robots.txt restrictions.
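The robots.txt trend above is easy to check in miniature. The sketch below uses Python's standard `urllib.robotparser` to evaluate a hypothetical robots.txt of the shape the audits describe: open to a search crawler, closed to AI-training crawlers. The user-agent tokens are real crawler names, but the rules and the example URL are invented for illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt illustrating the common post-2023 pattern:
# stay open to search indexing while opting out of AI-training crawlers.
# The crawler tokens are real; these specific rules are invented.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

def allowed(agent: str, url: str = "https://example.com/article") -> bool:
    """Return whether `agent` may fetch `url` under ROBOTS_TXT."""
    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp.can_fetch(agent, url)

for agent in ("Googlebot", "GPTBot", "CCBot"):
    print(agent, allowed(agent))
```

Run at scale over crawl snapshots, the same check is what turns "the web is getting less permissive" from an anecdote into a measurable time series.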
Useful empirical check on whether "remove the bad data later" is technically realistic once a model has already been trained.
Strong evidence that generative models can reproduce protected characters or recognizable copyrighted elements even when the prompt does not itself quote or copy the protected work.
Primary filing for the public record on Meta's position and for exhibits discussing LibGen provenance, preprocessing, and book-data use.
Useful primary source for how Meta employees compared licensed books, public-domain books, and legal risk in practice.
Important roadmap to the plaintiffs' theory of the case and to which exhibits may matter most for empirical questions about book-data use.
Parallel case that helps show how another major lab valued books and reasoned about acquisition cost.