The Teapot Test — Orange Pill Wiki
CONCEPT

The Teapot Test

Gopnik's laboratory finding that, when asked how to draw a circle without a compass, children suggest teapots (round objects they know from the physical world) while LLMs suggest rulers (statistically close to compasses in training text). It is the sharpest empirical demonstration of the difference between discovery and imitation.

The teapot test is a class of experimental tasks developed in Gopnik's Berkeley laboratory to distinguish genuine innovation from sophisticated imitation. When asked to solve problems requiring novel solutions — not the application of known methods to familiar problems, but the generation of new methods for problems the training data did not contain — children and large language models produce systematically different answers. Children draw on their engagement with the physical, causal structure of the real world; LLMs draw on the statistical regularities of text. The children innovate. The machines imitate. The finding crystallizes in a single experiment what the entire cultural technology thesis argues at the theoretical level.

In the AI Story

Asked how to draw a circle without a compass, the language models suggested rulers — because in the statistical landscape of training data, rulers are close to compasses (both are drawing instruments, both appear in mathematics and geometry contexts, both are mentioned together in tutorials on how to draw shapes). The models were doing what they are trained to do: producing outputs consistent with the patterns in their corpus. By this standard, 'ruler' is a perfectly reasonable completion.

The children suggested teapots. Not because teapots appear frequently in texts about drawing circles — they do not — but because children live in the physical world and know, from direct causal interaction, that teapots are round. You can trace around a teapot and produce a circle. The solution did not come from statistical association. It came from the child's causal model of the physical world: round objects leave round traces when you draw around them. The reasoning was transferable to any round object the child had handled.

This difference maps precisely onto the distinction Gopnik's framework draws between exploration and exploitation. The children explored — they generated a solution from their engagement with physical reality. The machines exploited — they deployed the statistical regularities of their training corpus. Both produced outputs. Only one produced a genuine innovation.

The teapot test has been replicated and extended across multiple tasks in Gopnik's laboratory and in related research programs. The consistent finding is that LLMs perform impressively on tasks whose solutions are well-represented in training data and systematically fail on tasks requiring reasoning about causal structure not encoded in text. The failures are not random; they have the specific shape predicted by the cultural technology framework. What LLMs lack is not scale or sophistication but a particular kind of cognitive operation — the construction and testing of causal hypotheses against the world rather than against a corpus.

Origin

The experiments originated in Gopnik's Berkeley laboratory as part of a broader research program on children's causal innovation. The specific compass-drawing task and its variants were published in papers by Yang et al. and Didolkar et al. in the early 2020s. Gopnik has used the teapot finding in public talks, in her Wall Street Journal column, and in her 2025 Science paper as the single clearest empirical demonstration that LLMs are imitation engines rather than discovery engines.

Key Ideas

Innovation requires causal reasoning. Generating genuinely novel solutions requires engagement with the causal structure of the world, not just statistical patterns in text.

Children draw on physical knowledge. The teapot answer comes from direct interaction with round objects, not from text describing circles.

LLMs draw on statistical proximity. The ruler answer comes from what is close to 'compass' in the training corpus.

Imitation can look like innovation. Both outputs are plausible-sounding; only careful testing reveals which kind of cognitive operation produced them.

Empirical anchor for the cultural technology thesis. The teapot test makes the abstract distinction between imitation and discovery operational and measurable.

Appears in the Orange Pill Cycle

Further reading

  1. Yang, E., Griffiths, T. L., Gopnik, A. et al. 'Children's innovation in tool use compared with large language models.' (working paper, Berkeley, 2024)
  2. Farrell, H., Gopnik, A., Shalizi, C., and Evans, J. 'Large AI Models Are Cultural and Social Technologies.' Science (2025)
  3. Didolkar, A., Didolkar, V. et al. 'Comparing children and large language models in tool innovation.' (working paper, 2023)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.