Resources

Project notes first, source material underneath.

This page keeps our own essays and asks near the top, then collects the external papers, trackers, and filings that anchor the calculator’s inputs and arguments.

Our writing

Essays and working notes

Interpretive notes, open questions, and longer-form walkthroughs connected to the calculator.

Commissioning a new frontier dataset versus current AI revenue

This working note traces one unresolved tension in the calculator: replacement-cost estimates for commissioned LLM-scale data climb quickly, while current public frontier-AI revenue anchors remain much smaller.

This post is a placeholder for a longer walkthrough tying the commissioning scenario to public revenue anchors.

It should eventually explain where the biggest uncertainties come from, which assumptions matter most, and why the gap between replacement cost and current revenue is politically important.

Project asks

What we want made legible

Short, concrete asks that would make these estimates easier to ground in public evidence.

Publish deal values and renewal terms in ways researchers and journalists can cite.

Make data-licensing deal terms legible enough that outsiders can compare bargaining power, royalties, and renewal dynamics over time.

Break out AI-specific revenue from total company revenue whenever possible.

Separate AI revenue from broader cloud, subscription, or advertising lines so public comparisons are not forced to rely on rough estimates.

Disclose pretraining size, post-training size, and mix composition as first-class metrics.

Treat corpus size, data mix, and post-training volume as standard disclosures rather than occasional exceptions in model cards or lawsuits.

Make contributor compensation for benchmarks and labeling more legible.

We need clearer reporting on what evaluators, labelers, and domain experts are paid when their labor becomes critical model infrastructure.

Treat data access, licensing, and opt-out infrastructure as measurable public-interest indicators.

Data access rules shape both model performance and creator leverage, so they should be tracked like other core AI governance metrics.

External sources

Papers, trackers, and filings

Public sources that matter for the shared inputs, the calculator scenarios, or the surrounding policy argument.

Trackers and live references

1 reference

Ruoxi Jia et al., NeurIPS 2025 position paper

A Sustainable AI Economy Needs Data Deals That Work for Generators

Open source

Position paper plus public tracker on AI data deals, useful for grounding bargaining-power and transparency questions in a concrete set of disclosed agreements.

  • Count of public AI data deals with disclosed counterparties
  • Share of deals with public pricing or royalty terms
  • Renewal and exclusivity terms across publisher-platform agreements

Position papers

1 reference

Nikhil Kandpal and Colin Raffel, ICML 2025 position paper

The Most Expensive Part of an LLM Should be its Training Data

Open source

Strong anchor for the claim that scarce, high-quality data should be treated as the costly input in the AI stack rather than an effectively free byproduct.

  • Share of total model-development cost attributable to training data
  • Cost of replacing high-quality corpora with commissioned data
  • Public count of high-value data suppliers versus model builders

Policy and legal analysis

2 references

Peter Henderson et al., 2023

Foundation Models and Fair Use

Open source

Early and still-useful argument that fair use is not automatic for model training, especially when developers can foresee substitution, market harm, or memorization.

  • Rates of regurgitation or memorization from copyrighted sources
  • Degree of source substitution for downstream users
  • Availability of technical mitigations for copyrighted material

Martin Senftleben, 2024

Win-win: How to Remove Copyright Obstacles to AI Training While Ensuring Author Remuneration

Open source

Clear legal-policy proposal for paying authors without making ex ante licensing the only path for model training.

  • Royalty basis for AI-generated outputs that substitute for books
  • Revenue share allocated to author-remuneration pools
  • Administrative cost of output-based versus input-based compensation

Empirical work

4 references

Alexander Peukert et al., 2025

AI and the Dynamic Supply of Training Data

Open source

Important economics paper arguing that training-data rules feed back into future creator incentives and therefore into the long-run supply of high-quality data.

  • Elasticity of creator participation under different compensation rules
  • Change in data supply under opt-out, licensing, or royalty regimes
  • Time lag between policy changes and observable corpus shrinkage

Shayne Longpre et al., 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons

Open source

Best empirical anchor we found for the claim that the open web is becoming less permissive for AI training, with large-scale evidence from Common Crawl and robots restrictions.

  • Share of C4 URLs carrying AI-restrictive terms or robots rules
  • Growth rate of AI-specific restrictions over time
  • Share of top websites that newly block AI crawlers

Alexander Wei et al., 2024

Evaluating Copyright Takedown Methods for Language Models

Open source

Useful empirical check on whether "remove the bad data later" is technically realistic once a model has already been trained.

  • Success rate of post hoc unlearning or takedown methods
  • Loss of general capabilities after copyright-focused removal
  • Residual memorization after takedown attempts

2024 empirical paper on copyrighted-character generation

Fantastic Copyrighted Beasts and How (Not) to Generate Them

Open source

Strong evidence that generative models can reproduce protected characters or recognizable copyrighted elements even without direct prompt copies.

  • Rate of recognizable copyrighted-character generation
  • Prompt sensitivity for protected-character outputs
  • Effect of mitigation methods on copyrighted-content generation

Primary-source filings

4 references

Meta filing, January 8, 2025

Kadrey v. Meta opposition filing on copyright and LibGen

Open source

Primary filing for the public record on Meta's position and for exhibits discussing LibGen provenance, preprocessing, and book-data use.

  • Whether LibGen and related book datasets were used in training
  • Which preprocessing steps removed copyright markers or deduped rows
  • What internal evidence exists about torrenting, seeding, or data provenance

CourtListener Exhibit C, filed February 20, 2025

Kadrey v. Meta Exhibit C on book-data acquisition emails

Open source

Useful primary source for how Meta employees compared licensed books, public-domain books, and legal risk in practice.

  • Approximate token count for Books3 plus Gutenberg
  • Internal estimates of book and journal counts in a sample licensing dataset
  • Internal comparisons between licensing costs, speed, and legal risk

Plaintiffs' filing, March 19, 2025

Kadrey v. Meta plaintiffs' partial summary judgment motion

Open source

Important roadmap to the plaintiffs' theory of the case and to which exhibits may matter most for empirical questions about book-data use.

  • Plaintiffs' theory of copying across Books3, LibGen, Z-Library, and Internet Archive
  • Claimed overlaps between internal model-development milestones and copied corpora
  • Exhibits that may contain benchmark or ablation evidence about books

Anthropic filing, June 23, 2025

Anthropic books copyright filing

Open source

Parallel case that helps show how another major lab valued books and reasoned about acquisition cost.

  • Internal valuation of books as training data
  • Comparison between licensing costs and alternative acquisition paths
  • Evidence about how companies rank books relative to other training sources