Reading list

Project notes, asks, and source material.

This page keeps the project writing together with the papers, trackers, and filings behind the calculator.

Project writing

Essays and working notes

Notes, open questions, and longer walkthroughs connected to the calculator.

Public asks

What we want made legible

Concrete asks that would make these estimates easier to ground in public evidence.

External sources

Papers, trackers, and filings

Public sources tied to the shared inputs, scenarios, and surrounding policy argument.

Trackers and live references

1 reference
  • A Sustainable AI Economy Needs Data Deals That Work for Generators

    Ruoxi Jia et al., NeurIPS 2025 position paper

    Position paper plus public tracker on AI data deals, useful for grounding bargaining-power and transparency questions in a concrete set of disclosed agreements.

    Useful for: Count of public AI data deals with disclosed counterparties; Share of deals with public pricing or royalty terms; Renewal and exclusivity terms across publisher-platform agreements

Position papers

1 reference
  • The Most Expensive Part of an LLM Should be its Training Data

    Nikhil Kandpal and Colin Raffel, ICML 2025 position paper

    Strong anchor for the claim that scarce, high-quality data should be treated as the costly input in the AI stack rather than an effectively free byproduct.

    Useful for: Share of total model-development cost attributable to training data; Cost of replacing high-quality corpora with commissioned data; Public count of high-value data suppliers versus model builders

Policy and legal analysis

2 references
  • Foundation Models and Fair Use

    Peter Henderson et al., 2023

    Early and still-useful argument that fair use is not automatic for model training, especially when developers can foresee substitution, market harm, or memorization.

    Useful for: Rates of regurgitation or memorization from copyrighted sources; Degree of source substitution for downstream users; Availability of technical mitigations for copyrighted material

Empirical work

6 references
  • Humanity's Last Exam

    Phan et al. / CAIS and Scale AI, 2025-2026

    Central benchmark source for the calculator's expert-question size and prize-pool-derived per-question compensation assumptions.

    Useful for: Finalized public question count; Expert-contributor prize pool; Question review and filtering process

  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Rein et al., 2023

    Expert-written benchmark source that anchors smaller eval-set sizes and the idea of hard, domain-specific question production.

    Useful for: Expert-written multiple-choice question counts; Domain-expert validation process; Smaller comparison point for HLE-scale evaluation sets

  • AI and the Dynamic Supply of Training Data

    Christian Peukert et al., 2025

    Important economics paper arguing that training-data rules feed back into future creator incentives and therefore into the long-run supply of high-quality data.

    Useful for: Elasticity of creator participation under different compensation rules; Change in data supply under opt-out, licensing, or royalty regimes; Time lag between policy changes and observable corpus shrinkage

  • Consent in Crisis: The Rapid Decline of the AI Data Commons

    Shayne Longpre et al., 2024

    Best empirical anchor I found for the claim that the open web is becoming less permissive for AI training, with large-scale evidence from Common Crawl-derived corpora and robots.txt restrictions.

    Useful for: Share of C4 URLs carrying AI-restrictive terms or robots.txt rules; Growth rate of AI-specific restrictions over time; Share of top websites that newly block AI crawlers

  • Evaluating Copyright Takedown Methods for Language Models

    Boyi Wei et al., 2024

    Useful empirical check on whether "remove the bad data later" is technically realistic once a model has already been trained.

    Useful for: Success rate of post hoc unlearning or takedown methods; Retained capability loss after copyright-focused removal; Residual memorization after takedown attempts

  • Fantastic Copyrighted Beasts and How (Not) to Generate Them

    2024 empirical paper on copyrighted-character generation

    Strong evidence that generative models can reproduce protected characters or recognizable copyrighted elements even when the prompt does not name them directly.

    Useful for: Rate of recognizable copyrighted-character generation; Prompt sensitivity for protected-character outputs; Effect of mitigation methods on copyrighted-content generation
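
The "prize-pool-derived per-question compensation" assumption mentioned in the Humanity's Last Exam entry above can be read as simple division. Here is a minimal sketch, assuming the pool is spread evenly over the finalized questions. The function name and the input values are placeholders for illustration, not figures taken from this page or confirmed as the calculator's own inputs.

```python
def per_question_compensation(prize_pool_usd: float, question_count: int) -> float:
    """Average payout per question if the pool were split evenly (an assumption)."""
    if question_count <= 0:
        raise ValueError("question_count must be positive")
    return prize_pool_usd / question_count


# Placeholder inputs for illustration only; substitute the sourced values.
example = per_question_compensation(prize_pool_usd=500_000, question_count=2_500)
print(f"${example:,.2f} per question")
```

An even split is the simplest reading. A tiered prize structure would give a different distribution around the same average.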

Primary legal sources and filings

5 references
  • Copyright Act statutory damages

    17 U.S.C. § 504(c)

    Primary legal benchmark for comparing settlement-style per-work payments against ordinary statutory copyright damages.

    Useful for: Ordinary statutory damages range per work; Lower-bound legal damages comparison; Context for settlement-style per-work benchmarks

  • Kadrey v. Meta opposition filing on copyright and LibGen

    Meta filing, January 8, 2025

    Primary filing for the public record on Meta's position and for exhibits discussing LibGen provenance, preprocessing, and book-data use.

    Useful for: Whether LibGen and related book datasets were used in training; Which preprocessing steps removed copyright markers or deduplicated rows; What internal evidence exists about torrenting, seeding, or data provenance

  • Kadrey v. Meta Exhibit C on book-data acquisition emails

    CourtListener Exhibit C, filed February 20, 2025

    Useful primary source for how Meta employees compared licensed books, public-domain books, and legal risk in practice.

    Useful for: Approximate token count for Books3 plus Gutenberg; Internal estimates of book and journal counts in a sample licensing dataset; Internal comparisons between licensing costs, speed, and legal risk

  • Kadrey v. Meta plaintiffs' partial summary judgment motion

    Plaintiffs' filing, March 19, 2025

    Important roadmap to the plaintiffs' theory of the case and to which exhibits may matter most for empirical questions about book-data use.

    Useful for: Plaintiffs' theory of copying across Books3, LibGen, Z-Library, and Internet Archive; Claimed overlaps between internal model-development milestones and copied corpora; Exhibits that may contain benchmark or ablation evidence about books

  • Anthropic books copyright filing

    Anthropic filing, June 23, 2025

    Parallel case that helps show how another major lab valued books and reasoned about acquisition cost.

    Useful for: Internal valuation of books as training data; Comparison between licensing costs and alternative acquisition paths; Evidence about how companies rank books relative to other training sources
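
For the statutory damages entry above, here is a rough sketch of comparing a settlement-style per-work payment against the per-work range in 17 U.S.C. § 504(c). The dollar bounds are the statutory ones: $750 to $30,000 per work ordinarily, up to $150,000 for willful infringement, and as low as $200 for innocent infringement. The function names and the example payment are hypothetical. Actual awards are set by a court within the range, so this only brackets the comparison.

```python
# Per-work bounds from 17 U.S.C. 504(c)(1) and (c)(2), in US dollars.
ORDINARY_MIN = 750
ORDINARY_MAX = 30_000
WILLFUL_MAX = 150_000
INNOCENT_MIN = 200


def statutory_range(num_works: int, willful: bool = False, innocent: bool = False):
    """Return (low, high) total statutory damages for num_works infringed works."""
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    low = INNOCENT_MIN if innocent else ORDINARY_MIN
    high = WILLFUL_MAX if willful else ORDINARY_MAX
    return num_works * low, num_works * high


def payment_vs_ordinary_minimum(per_work_payment: float) -> float:
    """Ratio of a per-work payment to the ordinary statutory minimum."""
    return per_work_payment / ORDINARY_MIN


# Hypothetical per-work payment, used only to show the comparison.
low, high = statutory_range(num_works=1_000)
print(f"Ordinary range for 1,000 works: ${low:,} to ${high:,}")
print(f"Payment as multiple of minimum: {payment_vs_ordinary_minimum(1_500):.1f}x")
```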