In-context learning is the phenomenon by which a pretrained language model, given a handful of examples in its prompt, generalizes to new inputs of the same pattern — without any weight updates. A model that has never been trained on a specific translation task can, given three English–French pairs in the prompt, produce reasonable French translations of a fourth English sentence. The behavior looks like learning because the model's output adjusts to the examples; it is not learning in the parameter-updating sense because nothing in the model changes. It is, mechanically, a consequence of how the attention layers route information during inference. It is, operationally, one of the most-used and least-understood properties of contemporary LLMs.
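The translation example can be made concrete as simple prompt assembly. A minimal sketch (the pair format and example sentences are illustrative, not tied to any particular model or API):

```python
def build_few_shot_prompt(pairs, query):
    """Assemble a few-shot translation prompt from (English, French) pairs.

    The model sees only this text; no weights are updated. If in-context
    learning works, the completion after the final 'French:' continues
    the demonstrated pattern.
    """
    blocks = [f"English: {src}\nFrench: {tgt}" for src, tgt in pairs]
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)

pairs = [
    ("Hello.", "Bonjour."),
    ("Thank you.", "Merci."),
    ("Good night.", "Bonne nuit."),
]
prompt = build_few_shot_prompt(pairs, "See you tomorrow.")
print(prompt)
```

The prompt ends mid-pattern, at `French:`, which is the whole trick: the model's next-token prediction is steered by the three completed pairs above it.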
In-context learning was introduced as a named phenomenon by Brown et al.'s Language Models Are Few-Shot Learners (GPT-3, 2020), though it had been observed in smaller models earlier. The paper's headline finding was that a large enough model could achieve competitive task performance purely from prompt examples, often matching or exceeding fine-tuned smaller models. This changed the economics of AI deployment: you no longer needed a specialized model for each task; one large model plus prompt engineering could cover a wide range of applications.
The mechanism has been partially worked out through mechanistic interpretability. Olsson et al.'s In-context Learning and Induction Heads (2022) identified specific attention patterns — "induction heads" — that implement a simple form of in-context learning (completing patterns like A→B…A→B). More elaborate forms of in-context reasoning involve more elaborate circuits, and the field's understanding remains incomplete. What is clear is that in-context learning is a real computational behavior implemented by identifiable structures, not a conceptual illusion.
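The A→B…A→B behavior can be illustrated with a toy completion rule: look back for an earlier occurrence of the current token and copy whatever followed it. This is a sketch of the pattern induction heads complete, not a claim about how the actual attention circuit computes it:

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    last token and predict the token that followed it there."""
    last = tokens[-1]
    # Scan backwards over earlier positions for a match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence: the rule abstains

# "A B C A" -> predicts "B", completing the repeated A -> B pattern
print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

The real mechanism is a pair of attention heads doing roughly this lookup-and-copy in parallel over the whole context; the point of the toy version is only that the behavior is a concrete, checkable computation.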
The practical importance is hard to overstate. Chain-of-thought prompting, few-shot demonstrations, retrieval-augmented generation, agent scaffolding — all depend on the model's capacity to generalize from prompt content. Every prompt-engineering discipline, every RAG pipeline, every agentic workflow is built on the assumption that careful prompt construction will produce appropriately specialized behavior. The assumption is largely validated in practice, with characteristic brittleness at the edges.
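Retrieval-augmented generation makes the dependence explicit: it is in-context learning with the context fetched at query time. A minimal sketch using naive word-overlap retrieval (the documents, scoring, and prompt layout are illustrative placeholders for a real retriever):

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query; return the top k.
    A real pipeline would use embeddings, but the shape is the same."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, documents):
    """Place retrieved passages in the prompt so the model can generalize
    from them at inference time, with no weight updates."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The Eiffel Tower is in Paris.",
    "Photosynthesis occurs in chloroplasts.",
    "Paris is the capital of France.",
]
print(build_rag_prompt("What city is the Eiffel Tower in?", docs))
```

Everything downstream of retrieval rests on the assumption the paragraph states: that the model will specialize its behavior to whatever the prompt contains.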
The limits are becoming visible. In-context learning works well for pattern-continuation tasks and less well for tasks requiring novel strategic reasoning. It scales with context length (longer prompts can hold more examples and more detailed instructions), but with diminishing returns. It fails in characteristic ways: order-sensitivity (permuting the examples changes the output), distribution-sensitivity (examples drawn from different distributions interfere), and format-sensitivity (small changes to presentation shift behavior substantially). Understanding when in-context learning will succeed and when it will fail is one of the live practical problems for any team deploying LLMs.
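Order-sensitivity, in particular, can be probed empirically: hold the example set fixed, permute it, and count how many distinct answers the model produces. A sketch of the harness only; `query_model` is a hypothetical stand-in for your inference call, stubbed here with a deliberately order-dependent function so the scaffold runs:

```python
import itertools

def query_model(prompt):
    """Hypothetical stand-in for a real LLM call. This stub echoes the
    prompt's first line, so its 'answer' depends entirely on example
    order -- a worst-case model for exercising the harness."""
    return prompt.split("\n")[0]

def order_sensitivity(examples, query, max_perms=6):
    """Count distinct answers across example orderings.
    1 distinct answer = order-insensitive for this query."""
    answers = set()
    for perm in itertools.islice(itertools.permutations(examples), max_perms):
        prompt = "\n".join(perm) + "\n" + query
        answers.add(query_model(prompt))
    return len(answers)

examples = ["2+2=4", "3+5=8", "1+9=10"]
print(order_sensitivity(examples, "4+4="))  # -> 3 with this stub
```

With a real model the interesting cases are the ones between the extremes: tasks where most orderings agree but a minority flip the answer.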
Brown et al. (2020) named the phenomenon and demonstrated it at scale. Akyurek et al.'s What Learning Algorithm Is In-Context Learning? (2022) and von Oswald et al.'s Transformers Learn In-Context by Gradient Descent (2022) proposed theoretical frameworks. Olsson et al. (2022) identified the mechanistic implementation. The research program continues to develop.
No weight updates required. The model generalizes from prompt content alone; nothing in the parameters changes.
Induction heads are a partial mechanism. Specific attention structures implement pattern-completion versions of in-context learning.
Prompt engineering depends on it. Every modern prompting discipline exploits in-context generalization.
The limits are specific and brittle. Order, distribution, and format sensitivity produce characteristic failure modes.