A structural reframing that reads a large language model's training corpus through the lens of Spivak's analysis of the colonial archive: an apparently comprehensive record whose categories enact the exclusions they claim to overcome.
Training data as colonial archive applies Spivak's foundational analysis of the British administrative record in India to the corpora that produce contemporary language models. The claim is not merely analogical. It is structural: both archives are meticulous, comprehensive, and violent in a specific sense — not because they record atrocities (though they do) but because the act of recording transforms what it records. Practices that had been fluid become fixed. Identities that had been contextual become categorical. Knowledge embedded in relationships becomes extractable as data. The archive does not describe the world; it produces a world legible to the apparatus that builds the archive, organized according to categories the apparatus finds useful, stripped of everything that does not serve its purposes.
Training Data as Colonial Archive
In The You On AI Field Guide
The quantitative asymmetry of the training data is well documented. English, spoken natively by approximately five percent of the world's population, accounts for more than