
The data-centric insight is the cycle’s most precise account of why Software 2.0’s promise that “the dataset is the source code” carries both opportunity and danger. The opportunity: anyone who can curate the right data can shape the program without writing a line of traditional code. The danger: a system trained on a biased, incomplete, or distorted dataset will faithfully reproduce the biases, gaps, and distortions of its diet—and the distortions are often invisible to users who see only the polished output.
The frontier models that now dominate the public imagination are extreme cases of this dependency. A large language model is, in effect, a vast distillation of the textual record it was trained on; its capabilities and pathologies are both inherited from that corpus. The questions that increasingly preoccupy those who build and govern these systems—what was in the training data, whose perspectives it represents, what it absorbed that it should not have—are data-centric questions in exactly Ng’s sense. The orange pill cycle’s portrait of AI as an amplifier that faithfully reproduces the quality of its inputs is, at the technical level, a data-centric observation.
Ng developed the data-centric framework over years of deploying AI systems in real-world settings and articulated it most clearly around 2020, when he launched a campaign within the machine learning community to shift its emphasis from model-centric to data-centric thinking. The catalyst was his observation, drawn from work in industries ranging from manufacturing to healthcare, that the gap between research performance and production performance was overwhelmingly a data problem rather than a model problem. In research, the dataset is clean and fixed, the metric is agreed upon, and success means beating the previous number. In deployment, the data arrives messy and shifting, the metric that matters is whether the system actually helps someone, and success means functioning reliably amid all the friction the controlled setting was designed to eliminate.
Ng connected this practical observation to a deeper point about the profession’s incentive structure. Designing a novel architecture feels like science; it produces papers, citations, and the satisfaction of novelty. Cleaning a dataset feels like janitorial work; it produces no papers and little prestige. The field was structured to reward the former and ignore the latter. Data-centric AI was, in part, an argument about misaligned incentives—that the field had optimized for what was publishable rather than what was useful, and that the gap between the two was widest precisely in the data work nobody wanted to do.
The dataset is the real source code. An AI system is composed of two things: code and data. The code is the model: the architecture, the algorithm, the framework. The data is everything the model learns from. The model-centric tradition holds the data fixed and improves the code; data-centric AI holds the code fixed and improves the data. In most real-world applications the leverage is overwhelmingly on the data, because the architecture is already a commodity and the data is whatever the institution happens to have: unique, idiosyncratic, and full of problems that no architectural improvement can compensate for.
Consistent labeling is precision engineering. One of Ng’s most practical contributions is the emphasis on label consistency over label quantity. A dataset in which ten annotators each label examples according to slightly different implicit criteria produces a signal that is systematically confused rather than merely noisy. The remedy is not more data but clearer labeling guidelines, enforced consistently across the dataset. A handful of carefully corrected labels can do more for system performance than thousands of additional examples labeled with inconsistent criteria—a finding that practitioners discover repeatedly and that the benchmark culture of the field systematically obscures.
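One way to make label inconsistency visible is to measure how often annotators agree on each example and to review the guidelines wherever agreement is low. The following is a minimal sketch of that idea, not a method attributed to Ng: the function name `agreement_report`, the 0.8 threshold, and the sample defect labels are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def agreement_report(annotations, threshold=0.8):
    """Flag examples whose annotators disagree.

    annotations: dict mapping an example id to the list of labels that
    different annotators assigned to it (hypothetical input format).
    threshold: minimum share of annotators that must back the majority
    label for the example to count as consistent (assumed value).
    """
    flagged = []
    for example_id, labels in annotations.items():
        counts = Counter(labels)
        # Share of annotators who chose the most common label.
        _, majority_count = counts.most_common(1)[0]
        agreement = majority_count / len(labels)
        if agreement < threshold:
            flagged.append((example_id, agreement, dict(counts)))
    # Least consistent examples first: these point at unclear guidelines.
    flagged.sort(key=lambda item: item[1])
    return flagged


# Hypothetical annotations from three annotators per image.
annotations = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch"],
    "img_003": ["dent", "scratch", "ok"],
}

flagged = agreement_report(annotations)
for example_id, agreement, counts in flagged:
    print(f"{example_id}: agreement={agreement:.2f} labels={counts}")
```

Examples that land in the flagged list are candidates for a clarified labeling rule rather than for more data: if annotators split between "scratch" and "dent" on the same image, the definition of each category is probably what needs fixing.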
Debugging is data collection. In the model-centric tradition, when a system fails you look at the model. In the data-centric tradition, when a system fails you look at the data: which region of the input space is the system handling badly, and what data would need to exist for it to do better? Debugging becomes a process of identifying gaps in the training distribution and curating data to fill them rather than tuning hyperparameters or modifying architecture. This is the data engine philosophy that Andrej Karpathy developed independently at Tesla Autopilot and that Ng generalized as a field-wide recommendation.
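The question "which region of the input space is the system handling badly" can be approached by grouping evaluation results by some attribute of the input and ranking the groups by error rate. The sketch below assumes a hypothetical record format with `predicted` and `actual` fields plus a slicing attribute (`lighting` here); it illustrates the general idea and does not describe the pipeline used at Tesla or any tool published by Ng.

```python
from collections import defaultdict


def error_rate_by_slice(records, slice_key):
    """Rank slices of the evaluation set by error rate, worst first.

    records: list of dicts with "predicted" and "actual" fields plus
    the attribute named by slice_key (hypothetical format).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        group = record[slice_key]
        totals[group] += 1
        if record["predicted"] != record["actual"]:
            errors[group] += 1
    rates = {group: errors[group] / totals[group] for group in totals}
    # The worst slices are where new or corrected data is most needed.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical evaluation results for an object classifier.
records = [
    {"lighting": "day", "predicted": "car", "actual": "car"},
    {"lighting": "day", "predicted": "car", "actual": "car"},
    {"lighting": "day", "predicted": "truck", "actual": "car"},
    {"lighting": "night", "predicted": "car", "actual": "truck"},
    {"lighting": "night", "predicted": "car", "actual": "car"},
]

ranked = error_rate_by_slice(records, "lighting")
for group, rate in ranked:
    print(f"{group}: error rate {rate:.2f}")
```

In this toy data the night slice fails more often than the day slice, so the data-centric response would be to collect or relabel night examples before touching the model.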
The bias lives in the data. The biases a system exhibits, the gaps in what it knows, the failure modes that surprise its builders—these are very often properties of the data rather than of the model. The system that produces a prejudiced output has not developed an opinion; it has a dataset, and the dataset reflected a world in which certain patterns were overrepresented and others absent. To control the system, you must control its diet. This conclusion reorients AI safety and fairness work: the question is not primarily what values to encode in the model but what histories, perspectives, and patterns are present or absent in the data.
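Auditing what is present or absent in a dataset can begin with something as simple as comparing each group's share of the data against a reference share. The sketch below is an illustration under stated assumptions: the function name `representation_gaps`, the `dialect` attribute, and the reference shares are hypothetical, and choosing the right reference distribution is itself a judgment that no code can make.

```python
from collections import Counter


def representation_gaps(examples, group_key, reference_shares):
    """Compare observed group shares in a dataset with reference shares.

    Returns observed share minus reference share per group: a negative
    value means the group is underrepresented relative to the reference.
    reference_shares is an assumption supplied by the auditor.
    """
    counts = Counter(example[group_key] for example in examples)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        gaps[group] = observed - expected
    return gaps


# Hypothetical dataset: eight examples of dialect A, two of B, none of C.
examples = [{"dialect": "A"}] * 8 + [{"dialect": "B"}] * 2
reference_shares = {"A": 0.5, "B": 0.3, "C": 0.2}

gaps = representation_gaps(examples, "dialect", reference_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A group that is missing entirely, like dialect C here, shows up as a negative gap equal to its whole reference share, which is the kind of absence that stays invisible when only the model's outputs are inspected.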