CONCEPT

Data-Centric AI

Andrew Ng’s campaign to invert the prevailing emphasis of the machine learning field: rather than holding the data fixed and improving the model—the dominant research paradigm—hold the model fixed and systematically improve the data, because in most real-world deployments the dataset is the decisive variable that the field has trained itself to disdain.

For most of the modern history of machine learning, prestige and attention flowed to the model. Researchers competed to design more sophisticated architectures, progress was measured by improvements on benchmarks whose data everyone held fixed while iterating on the algorithm, and the overwhelming majority of papers were model-centric in exactly this sense. Andrew Ng spent years arguing that this emphasis is, for most real-world applications, precisely backward. The data-centric AI campaign proposes the opposite: hold the model fixed and systematically improve the data. Acquire more data where coverage is thin, cleaner data where it is noisy, and remove what is mislabeled, ambiguous, or irrelevant. Ng’s claim is that engineering the data with the same rigor the field had reserved for the model typically delivers larger performance gains than any architectural tweak, especially in the settings where AI actually has to work: a hospital, a factory, or a bank that has only a few hundred examples of a defect, photographed inconsistently and labeled by people who disagreed about what counted as a defect. The concept connects directly to the dependency that the Google Brain “cat” experiment first revealed: a neural network learns whatever it is fed, which means the composition of the diet determines the character of the result, and improving the diet is hard, unglamorous, human-judgment-laden work that the glamour of the architecture tends to obscure.

In the [YOU] on AI Field Guide

The data-centric insight is the cycle’s most precise account of why Software 2.0’s promise that “the dataset is the source code” carries both opportunity and danger. The opportunity: anyone who can curate the right data can shape the program without writing a line of traditional code. The danger: a system trained on a biased, incomplete, or distorted dataset will faithfully reproduce the biases, gaps, and distortions of its diet—and the distortions are often invisible to users who see only the polished output.

The frontier models that now dominate the public imagination are extreme cases of this dependency. A large language model is, in effect, a vast distillation of the textual record it was trained on; its capabilities and pathologies are both inherited from that corpus. The questions that increasingly preoccupy those who build and govern these systems—what was in the training data, whose perspectives it represents, what it absorbed that it should not have—are data-centric questions in exactly Ng’s sense. The orange pill cycle’s portrait of AI as an amplifier that faithfully reproduces the quality of its inputs is, at the technical level, a data-centric observation.

Origin

Ng developed the data-centric framework over years of deploying AI systems in real-world settings and articulated it most clearly around 2020, when he launched a campaign within the machine learning community to shift its emphasis from model-centric to data-centric thinking. The catalyst was his observation, drawn from deployments in industries from manufacturing to healthcare, that the gap between research performance and production performance was overwhelmingly a data problem rather than a model problem. In research, the dataset is clean and fixed, the metric agreed upon, and success means beating the previous number. In deployment, the data arrives messy and shifting, the metric that matters is whether the system actually helps someone, and success means functioning reliably amid all the friction the controlled setting was designed to eliminate.

Ng connected this practical observation to a deeper point about the profession’s incentive structure. Designing a novel architecture feels like science; it produces papers, citations, and the satisfaction of novelty. Cleaning a dataset feels like janitorial work; it produces no papers and little prestige. The field was structured to reward the former and ignore the latter. Data-centric AI was, in part, an argument about misaligned incentives—that the field had optimized for what was publishable rather than what was useful, and that the gap between the two was widest precisely in the data work nobody wanted to do.

Key Ideas

The dataset is the real source code. An AI system is composed of two things: code and data. The code is the model—the architecture, the algorithm, the framework. The data is everything the model learns from. The model-centric tradition holds the data fixed and improves the code; data-centric AI proposes the opposite. In most real-world applications the leverage is overwhelmingly on the data, because the architecture is already a commodity and the data is whatever the institution happens to have—unique, idiosyncratic, and full of problems that no architectural improvement can compensate for.

Consistent labeling is precision engineering. One of Ng’s most practical contributions is the emphasis on label consistency over label quantity. A dataset in which ten annotators each label examples according to slightly different implicit criteria produces a signal that is systematically confused rather than merely noisy. The remedy is not more data but clearer labeling guidelines, enforced consistently across the dataset. A handful of carefully corrected labels can do more for system performance than thousands of additional examples labeled with inconsistent criteria—a finding that practitioners discover repeatedly and that the benchmark culture of the field systematically obscures.
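One way to make label consistency concrete is to measure how often annotators agree on each example and send the low-agreement examples back for guideline review. The sketch below is illustrative only: the example ids, the label names, and the 0.8 threshold are hypothetical, and simple majority agreement stands in for whatever consistency measure a real project would choose.

```python
from collections import Counter


def agreement(labels):
    """Fraction of annotators who chose the most common label for one example."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def flag_inconsistent(annotations, threshold=0.8):
    """Return ids of examples whose agreement is below the threshold,
    least consistent first. The threshold is an arbitrary illustrative value."""
    scored = [(agreement(labels), ex_id) for ex_id, labels in annotations.items()]
    return [ex_id for score, ex_id in sorted(scored) if score < threshold]


# Hypothetical annotations: three annotators labeled each image.
annotations = {
    "img_001": ["defect", "defect", "defect"],
    "img_002": ["defect", "ok", "defect"],
    "img_003": ["ok", "defect", "scratch"],
    "img_004": ["ok", "ok", "ok"],
}

print(flag_inconsistent(annotations))  # → ['img_003', 'img_002']
```

The flagged examples are candidates for discussion, not automatic relabeling: disagreement on them usually points to a gap in the labeling guidelines that a clearer rule can close for the whole dataset.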

Debugging is data collection. In the model-centric tradition, when a system fails you look at the model. In the data-centric tradition, when a system fails you look at the data: which region of the input space is the system handling badly, and what data would need to exist for it to do better? Debugging becomes a process of identifying gaps in the training distribution and curating data to fill them rather than tuning hyperparameters or modifying architecture. This is the data engine philosophy that Andrej Karpathy developed independently at Tesla Autopilot and that Ng generalized as a field-wide recommendation.
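The question "which region of the input space is the system handling badly" can be approximated by tagging evaluation examples with a slice and ranking slices by error rate. This is a minimal sketch under stated assumptions: the slice names, the record fields (`slice`, `label`, `predicted`), and the toy data are all hypothetical, and it is not presented as Ng's or Karpathy's actual tooling.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Return (slice, error_rate) pairs, worst-performing slice first."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        totals[r["slice"]] += 1
        if r["predicted"] != r["label"]:
            errors[r["slice"]] += 1
    rates = {s: errors[s] / totals[s] for s in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical evaluation records from a fixed model.
records = [
    {"slice": "daylight", "label": "defect", "predicted": "defect"},
    {"slice": "daylight", "label": "ok", "predicted": "ok"},
    {"slice": "daylight", "label": "ok", "predicted": "ok"},
    {"slice": "daylight", "label": "defect", "predicted": "ok"},
    {"slice": "low_light", "label": "defect", "predicted": "ok"},
    {"slice": "low_light", "label": "defect", "predicted": "ok"},
    {"slice": "low_light", "label": "ok", "predicted": "ok"},
    {"slice": "low_light", "label": "defect", "predicted": "defect"},
]

for name, rate in error_rate_by_slice(records):
    print(f"{name}: {rate:.2f}")
# low_light: 0.50
# daylight: 0.25
```

In this toy case the worst slice, `low_light`, is where additional or better-labeled data would be sought first, while the model itself stays unchanged.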

The bias lives in the data. The biases a system exhibits, the gaps in what it knows, the failure modes that surprise its builders—these are very often properties of the data rather than of the model. The system that produces a prejudiced output has not developed an opinion; it has a dataset, and the dataset reflected a world in which certain patterns were overrepresented and others absent. To control the system, you must control its diet. This conclusion reorients AI safety and fairness work: the question is not primarily what values to encode in the model but what histories, perspectives, and patterns are present or absent in the data.
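A first, crude check on a system's "diet" is to compare how often each group appears in the training data with how often it would be expected to appear. The sketch below assumes that a reference distribution is available, which in practice is itself a judgment call; the group names and shares are hypothetical.

```python
from collections import Counter


def representation_gaps(groups, reference_shares):
    """Observed share minus expected share for each group in the reference.
    Positive values mean overrepresented, negative mean underrepresented."""
    counts = Counter(groups)
    total = len(groups)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in reference_shares.items()
    }


# Hypothetical dataset: one group tag per training example.
groups = ["urban"] * 8 + ["rural"] * 2
# Hypothetical reference distribution the dataset is meant to reflect.
reference_shares = {"urban": 0.5, "rural": 0.25, "remote": 0.25}

for group, gap in representation_gaps(groups, reference_shares).items():
    print(f"{group}: {gap:+.2f}")
# urban: +0.30
# rural: -0.05
# remote: -0.25
```

A count like this only surfaces gaps that someone thought to tag; patterns that are absent and unlabeled still require human judgment to notice.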
