Interactive calculator
Interactive napkin math, back-of-the-envelope estimates, and ballpark figures for training-data value, compensation, and distribution questions. Start with a concrete public benchmark, then inspect the assumptions and share the exact state you used.
The full workspace brings the calculator, inputs library, inspector, trust filters, and shareable URLs into one place.
Section 1
Start with plain numbers from public sources, grouped by the kind of question you want to ask. This home page surfaces a few anchor inputs; the full editable library lives in the calculator workspace.
I'm curious about money
Revenue anchors, deal values, labour rates, and inference pricing.
Reported annualized revenue run rate for OpenAI.
Reported yearly value of the Google-Reddit data licensing deal.
I'm curious about training sizes
Token counts, example counts, benchmark sizes, and other scale assumptions.
Total tokens used to pre-train a model
Approximate number of tokens used in the released OLMo 3 7B pretraining mix.
A rule-of-thumb conversion between English words and tokens.
The number of public benchmark questions in Humanity's Last Exam.
Number of supervised fine-tuning examples in the public Tulu 3 SFT mixture.
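The words-to-tokens conversion above can be sketched as a pair of helpers. The 0.75 words-per-token ratio below is an illustrative assumption for English text, not a value taken from the site's inputs library:

```python
# Rough conversion between English words and tokens, assuming the
# common rule of thumb of roughly 0.75 English words per token.
# The ratio is illustrative, not a site-confirmed value.
WORDS_PER_TOKEN = 0.75

def words_to_tokens(words: float) -> float:
    """Estimate token count from a word count."""
    return words / WORDS_PER_TOKEN

def tokens_to_words(tokens: float) -> float:
    """Estimate word count from a token count."""
    return tokens * WORDS_PER_TOKEN
```

Under this assumption, 750,000 words correspond to about one million tokens.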
I'm curious about people or audience size
Population, audience, workforce, and contributor counts used in per-person math.
Average daily active unique users on Reddit.
Projected world population.
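The per-person math these counts feed into is a single division. A minimal sketch with invented placeholder values (not the site's cited anchors):

```python
# Per-person napkin math: spread a money anchor across an audience
# count. Both figures below are invented placeholders, not the
# site's cited anchors.
deal_value_usd_per_year = 6.0e7   # hypothetical data-licensing deal value
daily_active_users = 1.0e8        # hypothetical audience size

usd_per_user_per_year = deal_value_usd_per_year / daily_active_users
print(f"~${usd_per_user_per_year:.2f} per user per year")
```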
I'm curious about data mix
Composition shares, source slices, document size, and related dataset structure.
Approximate share of Dolma v1.6 tokens that come from web crawls.
Need the full set? Open the library to search by scenario, filter to official or primary sources, and inspect every cited assumption in one place.
Section 2
Try the live scenarios right here, then open the full workspace when you want the entire inputs library, source filters, inspector details, and shareable state.
Best path for first-time readers: pick a starter scenario, tweak one or two assumptions, then inspect the source notes or switch to the inputs library for deeper work.
Change the assumptions directly on each card. Inline comparison menus let you swap in related public benchmarks without leaving the page, and every input card links to its cited source.
Build a custom formula from the shared input library when the curated scenarios do not quite match the question you want to ask.
Enter a formula to start exploring.
Keep this advanced builder tucked away until the curated scenarios stop being enough.
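A custom formula over the shared input library amounts to evaluating an expression with named inputs in scope. A minimal sketch, with input names and values invented for illustration:

```python
# Minimal sketch of a custom formula over a library of named inputs.
# The input names and values are invented for illustration.
inputs = {
    "pretraining_tokens": 6.0e12,   # hypothetical token count
    "usd_per_1k_tokens": 0.02,      # hypothetical per-1k-token rate
}

def evaluate(formula: str, library: dict) -> float:
    # Evaluate with only the named inputs in scope. eval() is fine
    # for a sketch, not for untrusted input.
    return eval(formula, {"__builtins__": {}}, dict(library))

cost = evaluate("pretraining_tokens / 1000 * usd_per_1k_tokens", inputs)
```

With these placeholder values the formula prices the hypothetical token count at 120 million dollars.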
Section 3
Keep the calculator close to the evidence: working notes, highlighted papers, live indicators, and concrete asks all sit in one supporting section instead of competing with the calculator.
This working note traces one unresolved tension in the calculator: replacement-cost estimates for commissioned LLM-scale data climb quickly, while current public frontier-AI revenue anchors remain much smaller.
At the current defaults, the commissioning estimate is roughly 59.6x the combined OpenAI-plus-Anthropic revenue anchor. That mismatch is the core tension this placeholder post should unpack.
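The shape of that comparison is one division: replacement cost over the sum of the revenue anchors. A sketch with invented placeholder values (the real defaults live in the calculator and do not match these numbers):

```python
# Recomputing the headline ratio: replacement-cost estimate divided
# by the combined revenue anchors. All three values below are
# placeholders, not the calculator's actual defaults.
commissioning_cost_usd = 1.0e12   # hypothetical replacement-cost estimate
openai_run_rate_usd = 1.2e10      # hypothetical annualized revenue
anthropic_run_rate_usd = 5.0e9    # hypothetical annualized revenue

ratio = commissioning_cost_usd / (openai_run_rate_usd + anthropic_run_rate_usd)
print(f"commissioning estimate is {ratio:.1f}x the combined revenue anchor")
```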
Reading list
These are the sources doing real work for the numbers on this site.
Ruoxi Jia et al., NeurIPS 2025 position paper
Position paper plus public tracker on AI data deals, useful for grounding bargaining-power and transparency questions in a concrete set of disclosed agreements.
Nikhil Kandpal and Colin Raffel, ICML 2025 position paper
Strong anchor for the claim that scarce, high-quality data should be treated as the costly input in the AI stack rather than an effectively free byproduct.
Peter Henderson et al., 2023
Early and still-useful argument that fair use is not automatic for model training, especially when developers can foresee substitution, market harm, or memorization.
Martin Senftleben, 2024
Clear legal-policy proposal for paying authors without making ex ante licensing the only path for model training.
Alexander Peukert et al., 2025
Important economics paper arguing that training-data rules feed back into future creator incentives and therefore into the long-run supply of high-quality data.
Shayne Longpre et al., 2024
Best empirical anchor I found for the claim that the open web is becoming less permissive for AI training, with large-scale evidence from Common Crawl and robots restrictions.
Watch list
The calculator stays useful only if disclosures and public reporting around these numbers keep improving.