Exploring AI in public numbers
This homepage moves from raw inputs to calculations, then to snapshot-style takeaways, outside reading, and finally the indicators and disclosures we still want to watch.
Section 1
Start with plain numbers from public sources. These are the least interpretive pieces of the site: counts, rates, sizes, and benchmark totals that anchor the rest of the math. A short sketch of how these inputs could be wired together follows the list.
Total tokens used to pre-train a model
Approximate number of tokens used in the released OLMo 3 7B pretraining mix.
A rule-of-thumb conversion between English words and tokens.
The number of public benchmark questions in Humanity's Last Exam.
Average daily active unique users on Reddit.
Projected U.S. population.
Projected world population.
Number of supervised fine-tuning examples in the public Tulu 3 SFT mixture.
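As one way to picture these inputs feeding the calculators, here is a minimal TypeScript sketch of a typed input registry. Only the HLE question count and the Tulu 3 example count are taken from this page; the OLMo token value is an order-of-magnitude placeholder, and the 0.75 words-per-token figure is the usual rule of thumb for English text, not an official conversion.

```ts
// Sketch of the Section 1 inputs as a typed registry.
// Values marked PLACEHOLDER are illustrative stand-ins, not site data.

interface PublicInput {
  label: string;
  value: number;
  unit: string;
  source: string; // where the number is publicly disclosed
}

const inputs: Record<string, PublicInput> = {
  olmoPretrainTokens: {
    label: "OLMo 3 7B pretraining tokens",
    value: 6e12, // PLACEHOLDER: order-of-magnitude stand-in
    unit: "tokens",
    source: "OLMo 3 dataset card",
  },
  wordsPerToken: {
    label: "English words per token",
    value: 0.75, // common rule of thumb (~4 characters per token)
    unit: "words/token",
    source: "tokenizer heuristic",
  },
  hleQuestions: {
    label: "Humanity's Last Exam public questions",
    value: 2_500,
    unit: "questions",
    source: "HLE paper",
  },
  tuluSftExamples: {
    label: "Tulu 3 SFT examples",
    value: 939_344,
    unit: "examples",
    source: "Tulu 3 SFT mixture card",
  },
};

// The rule-of-thumb conversion the calculators lean on.
const tokensToWords = (tokens: number): number =>
  tokens * inputs.wordsPerToken.value;
```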
Section 2
This is the interactive layer. Take the inputs above, swap comparable benchmarks, and see how quickly the outputs move when the assumptions change.
Change the assumptions directly on each card. Inline comparison menus let you swap in related public benchmarks without leaving the page.
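As one illustration of what an inline comparison menu might do under the hood, the sketch below recomputes a derived figure when the selected benchmark changes. Only the HLE question count comes from this page; the other benchmark sizes and the hours-per-question figure are placeholders.

```ts
// Sketch: swapping a comparable benchmark just re-runs the same formula.
// Sizes other than HLE's are PLACEHOLDERS for illustration.

interface Benchmark {
  name: string;
  questions: number;
}

const comparisonMenu: Benchmark[] = [
  { name: "Humanity's Last Exam", questions: 2_500 }, // from this page
  { name: "Comparable benchmark A", questions: 1_000 }, // PLACEHOLDER
  { name: "Comparable benchmark B", questions: 12_000 }, // PLACEHOLDER
];

// Hypothetical derived output: expert-hours to author the benchmark,
// given an hours-per-question assumption the reader can change.
function authoringHours(b: Benchmark, hoursPerQuestion: number): number {
  return b.questions * hoursPerQuestion;
}

for (const b of comparisonMenu) {
  console.log(`${b.name}: ${authoringHours(b, 2).toLocaleString()} expert-hours at 2 h/question`);
}
```

Swapping a selection never touches the formula; it only substitutes a different input row, which is what keeps the cards auditable.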
Section 3
Snapshot cards for the current default calculations. These are the kinds of numbers we can quickly turn into short posts, explainers, or shareable references. A combined worked sketch follows the cards.
A replacement-cost thought experiment for commissioning new pretraining data rather than scraping it.
A top-down framing of AI revenue at population scale. Useful when people talk about broad-based benefit sharing.
A concrete licensing example: what happens when a disclosed platform deal gets spread across the people who made the platform valuable?
A labor-cost framing for expert benchmarks, using a public eval as the anchor point.
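Here is a combined worked sketch of the four snapshots, in the same TypeScript register as above. Every rate, revenue figure, deal size, and user count below is an explicit assumption chosen for round numbers; only the 15-trillion-token scale and the 2,500-question count echo figures this site tracks.

```ts
// Worked sketch of the four snapshot calculations.
// Every rate, revenue, deal size, and user count is an ASSUMPTION.

const TOKENS = 15e12;                  // Llama 3 scale pretraining tokens (tracked on this site)
const WORDS_PER_TOKEN = 0.75;          // rule of thumb
const USD_PER_COMMISSIONED_WORD = 0.1; // ASSUMPTION: freelance-style writing rate

// 1. Replacement cost: commissioning the data instead of scraping it.
const replacementCost = TOKENS * WORDS_PER_TOKEN * USD_PER_COMMISSIONED_WORD;

// 2. Revenue at population scale: hypothetical AI revenue per person per year.
const AI_REVENUE_USD = 200e9; // ASSUMPTION: stand-in annual revenue figure
const WORLD_POP = 8.2e9;      // projected world population, rounded
const revenuePerPerson = AI_REVENUE_USD / WORLD_POP;

// 3. Licensing example: a disclosed platform deal spread across its users.
const DEAL_USD_PER_YEAR = 60e6;   // ASSUMPTION: ballpark annual deal value
const DAILY_ACTIVE_USERS = 100e6; // ASSUMPTION: stand-in platform DAU
const dealPerUser = DEAL_USD_PER_YEAR / DAILY_ACTIVE_USERS;

// 4. Labor cost of an expert benchmark.
const QUESTIONS = 2_500;      // HLE public set (tracked on this site)
const HOURS_PER_QUESTION = 2; // ASSUMPTION
const EXPERT_RATE_USD = 150;  // ASSUMPTION: hourly expert rate
const evalLaborCost = QUESTIONS * HOURS_PER_QUESTION * EXPERT_RATE_USD;

console.log({ replacementCost, revenuePerPerson, dealPerUser, evalLaborCost });
```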
Section 4
A short reading list of outside sources that do real work for the numbers on this site: benchmark papers, model cards, and dataset cards that add context beyond the calculator itself.
Total public questions (Humanity's Last Exam)
HLE's finalized public benchmark contains 2,500 questions.
Pretraining mix share from academic papers (Dolma v1.6)
Derived from the published PeS2o paper count in Dolma v1.6; the division behind the share is sketched after these cards.
Post-training SFT examples (Tulu 3)
The released Tulu 3 SFT mixture contains 939,344 examples.
Total pre-training tokens (Llama 3)
Meta reports that Llama 3 was pretrained on about 15 trillion multilingual tokens.
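The Dolma share card is the only derived number in this list, so here is the division behind it as a sketch; both counts are placeholders to be replaced with the figures published in the Dolma v1.6 dataset card.

```ts
// Sketch of the derivation behind the "share from academic papers" card.
// Both counts are PLACEHOLDERS; substitute the published Dolma v1.6 figures.

const pes2oCount = 60e9; // PLACEHOLDER: PeS2o contribution to Dolma v1.6
const dolmaTotal = 3e12; // PLACEHOLDER: total for the Dolma v1.6 mix

const academicShare = pes2oCount / dolmaTotal;
console.log(`${(academicShare * 100).toFixed(1)}% of the pretraining mix`);
```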
Section 5
The calculator is only as good as the disclosures around it. This section keeps the emphasis on what we still need from companies, publishers, and public reporting.