TECHNOLOGY

Transformer Architecture

The 2017 neural network architecture, built around self-attention, that replaced recurrent networks for sequence modeling and became the substrate of every large language model since.

The transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani and colleagues at Google Brain. Its distinguishing mechanism — self-attention — allows each position in a sequence to weight its relationship to every other position directly, replacing the sequential processing of earlier recurrent neural networks. Every major LLM of the current era (GPT, Claude, Gemini, LLaMA) is a transformer variant.

In the AI Story

The transformer is the enabling technical precondition of the Orange Pill Cycle's subject matter. The jump from pre-2017 language models (recurrent, slower to train, shorter-context) to post-2017 (transformer-based, parallelizable, scalable to internet-sized corpora) was what made the current LLM era possible.

The Orange Pill Asimov volume invokes the transformer sparingly — Asimov's framing is more about neural networks as a category — but the implicit comparison is between the positronic brain's rule-following and the transformer's distribution-learning. The transformer is the thing whose behavior is neither hand-designed nor readily inspectable, yet nevertheless works at scale.

The transformer's historical significance is that it decoupled sequence length from serial computation. Prior recurrent architectures processed tokens one at a time; a long sequence required a long chain of sequential operations. The transformer processes the entire sequence in parallel through self-attention layers, turning the problem from one of depth (how many sequential steps) into one of width (how much parallel compute). This made it possible to train on sequences orders of magnitude larger than was feasible with RNNs, which in turn made training on the scale required by modern LLMs practical.
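
A minimal NumPy sketch of that contrast, using toy dimensions and no learned projections (the array names and sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                       # toy sequence length and model width
x = rng.normal(size=(T, d))       # one vector per token

# Recurrent-style processing: step t depends on step t-1, so a sequence of
# length T forces T sequential operations that cannot run in parallel.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):
    h = np.tanh(x[t] + W @ h)

# Self-attention: every position's relation to every other position is
# computed in one batched matrix operation, whatever the value of T.
scores = x @ x.T / np.sqrt(d)                     # (T, T) pairwise scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
attended = weights @ x                            # (T, d), all positions at once
```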

Origin

Introduced in Vaswani, A. et al. "Attention Is All You Need" (NeurIPS 2017). Designed initially for machine translation at Google. Adopted rapidly across NLP, then across vision (vision transformers), audio, and multimodal models.

Key Ideas

Self-attention. Each position attends to every other position, producing a weighted representation in a single step.
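
A sketch of a single attention head in roughly the form the paper gives it, with learned query, key, and value projections (shapes and names are illustrative):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (T, T): every pair of positions
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ v                               # weighted mix of value vectors

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)                  # (T, d)
```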

Parallelizable training. Unlike RNNs, transformers can process sequence positions in parallel, enabling efficient use of GPU hardware.

Encoder and decoder stacks. The original paper has both; modern LLMs typically use decoder-only (autoregressive) stacks.
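
A sketch of the causal mask that makes a decoder-only stack autoregressive: position t may attend only to positions 0 through t (toy scores, illustrative only):

```python
import numpy as np

T = 5
scores = np.zeros((T, T))                           # stand-in for raw attention scores
future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
scores[future] = -np.inf                            # masked positions get zero weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # row t is nonzero only for columns 0..t
```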

Positional encoding. Because self-attention is permutation-invariant, position information is added explicitly.
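
A sketch of the sinusoidal encoding used in the original paper; learned positional embeddings are an equally common alternative, and the function name below is illustrative:

```python
import numpy as np

def sinusoidal_positions(T, d):
    """(T, d) table of sin/cos waves at geometrically spaced frequencies,
    added to the token embeddings so attention can tell positions apart."""
    pos = np.arange(T)[:, None]              # (T, 1) position index
    i = np.arange(0, d, 2)[None, :]          # (1, d/2) even dimension index
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))            # toy embeddings
tokens = tokens + sinusoidal_positions(16, 8)
```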

Not the only architecture, but the dominant one. Structured state-space models (Mamba, RWKV), mixture-of-experts variants, and diffusion-transformer hybrids all challenge transformer dominance in particular regimes. But as of 2026 the transformer remains the default for frontier language modeling, largely because its failures are well-understood and its engineering ecosystem is deep.

Further reading

  1. Vaswani, A. et al. "Attention Is All You Need." NeurIPS 2017.
  2. Rush, A. "The Annotated Transformer" — implementation walkthrough.
  3. Phuong, M. & Hutter, M. "Formal Algorithms for Transformers" (2022) — concise mathematical treatment.
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.