Training data as colonial archive applies Spivak's foundational analysis of the British administrative record in India to the corpora that produce contemporary language models. The claim is not merely analogical. It is structural: both archives are meticulous, comprehensive, and violent in a specific sense — not because they record atrocities (though they do) but because the act of recording transforms what it records. Practices that had been fluid become fixed. Identities that had been contextual become categorical. Knowledge embedded in relationships becomes extractable as data. The archive does not describe the world; it produces a world legible to the apparatus that builds the archive, organized according to categories the apparatus finds useful, stripped of everything that does not serve its purposes.
The quantitative asymmetry of the training data is well documented. By commonly cited estimates of web content, English, spoken natively by approximately five percent of the world's population, accounts for more than half of all internet content. The top ten languages by internet presence, all European or East Asian, account for more than ninety percent. The remaining six thousand-plus living languages share the scraps. The world's digitized knowledge is not a neutral sample of the world's knowledge; it is a sample shaped by five centuries of asymmetric knowledge production.
The qualitative asymmetry runs deeper. The knowledge systems well-represented in the training data are organized according to Western Enlightenment principles: propositional (knowledge as statable claims), universalist (knowledge that holds across contexts), textual (knowledge transmitted through writing, peer review, citation, archival preservation). These are not neutral features but the products of a specific intellectual tradition that carries within it specific assumptions about what knowledge is, how it is produced, and who has authority to produce it.
Knowledge systems poorly represented in the training data operate on different principles. Aboriginal Australian songlines encode navigational, ecological, and cosmological information in performed narratives inseparable from the landscape. Andean ayllu systems organize economic and ecological knowledge through principles of reciprocity that do not map onto Western economic categories. West African griot traditions transmit historical and genealogical knowledge through performance whose authority depends on the performer's lineage. These are not primitive versions of Western knowledge; they are different kinds of knowledge whose internal organization the model's architecture cannot accommodate.
The comparison to the colonial archive is sharpened by what happens when the model discusses non-Western traditions. It does not refuse to discuss them. It discusses them fluently, drawing on anthropological studies, comparative analyses, and the secondary literature that Western academic institutions have produced. The fluency is the problem. The user who receives the output experiences comprehension. What she does not experience is the translation that made comprehension possible — the conversion of a knowledge system organized on its own terms into a knowledge system organized on the archive's terms. The fluency conceals the conversion, and the conversion is the violence.
The framework draws on Spivak's extended analysis of the East India Company archive, developed in The Rani of Sirmur and A Critique of Postcolonial Reason. Parallel work by Bernard Cohn, Nicholas Dirks, and others has documented the specific mechanisms by which colonial record-keeping produced the categories that became India's subsequent reality.
The application to AI training corpora has been developed across a growing literature including Kate Crawford's Atlas of AI, Shakir Mohamed and colleagues' work on decolonial AI, and the 2024 arXiv study on epistemic injustice in generative AI. The Spivak volume's contribution is to make explicit the structural continuity between the two archival regimes.
Comprehensiveness as illusion. The training data's vastness creates the appearance of completeness while its categories systematically exclude knowledge systems that do not organize themselves in the archive's preferred form.
Structural, not accidental. The asymmetries are not bugs to be patched by adding more languages; they are features of a system whose architecture presupposes a specific epistemology.
The archive becomes the world. Once encoded in the model, the categories of the training data become the basis for subsequent knowledge production, overwriting the alternatives in the cultural space where future knowledge will be produced.
The model as archive-in-use. Unlike a passive record, the language model actively generates output from its archive, which means every response reproduces the archive's categorical structure at civilizational scale.
Defenders of current AI development argue that the analogy to colonial archives is overdrawn: the training data reflects available digital material rather than active exclusion, and coverage of underrepresented languages is improving. The Spivak framework's response is that the colonial archive also claimed merely to reflect what it found, so the distinction between passive reflection and active exclusion cannot settle the question; structural analysis requires looking past stated intent to institutional effect. Coverage improvements address the quantitative asymmetry without touching the qualitative one.