Epistemic decolonization is the project of dismantling the hierarchies of knowledge that colonial modernity established — the arrangement by which Western, male, Anglophone, written, institutionally certified knowledge has been treated as universal while other forms have been treated as local, partial, or merely cultural. The project long predates AI; it runs through Frantz Fanon, Ngũgĩ wa Thiong'o, Sylvia Wynter, the Rhodes Must Fall movement, and the broader decolonial turn in higher education. What the AI era adds is urgency: a large language model trained on the current corpora is the single most powerful mechanism ever built for inscribing a particular epistemic hierarchy into the infrastructure of future knowledge production.
The training corpus of a frontier model is, in effect, the curriculum of a universal university. What is in the corpus is taught; what is absent is not. The current corpora are dominated by English-language, Western-authored, digitized text — a specific archive that represents perhaps 15 percent of the world's languages and a tiny fraction of its living knowledge traditions. The knowledge that exists in oral form, in minority languages, in unpublished manuscripts, in living practice rather than in text, is functionally absent from the model.
This matters more than most AI discussions acknowledge, because the model does not merely reflect the corpus; it operationalizes it. When a user asks the model a question about, say, agricultural practice, the answers that are readily available to it are the answers that were common in its training data. Answers that come from traditions underrepresented in the data are absent, distorted, or retrievable only through heroic prompting. Over time, as the model becomes the default interface to knowledge, the corpus's biases compound: users get Western answers, they incorporate them into their own work, that work becomes training data for the next generation of models, and the hierarchy deepens.
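The compounding can be made concrete with a toy simulation of the loop just described. Everything in it is an assumption chosen for illustration rather than a measured quantity: the dominant tradition is assumed to start at 85 percent of the corpus, the model is assumed to over-produce whatever already dominates its training data, and 30 percent of each new corpus is assumed to be model-generated text folded back into training.

```python
# Toy simulation of the compounding-bias loop. All numbers are illustrative
# assumptions, not measurements.

def model_output_share(p: float, alpha: float = 2.0) -> float:
    """Share of model outputs drawn from the dominant tradition, given its
    corpus share p. alpha > 1 means the model over-produces the majority."""
    return p ** alpha / (p ** alpha + (1 - p) ** alpha)

def simulate(p0: float = 0.85, synthetic_fraction: float = 0.3, generations: int = 6) -> None:
    """Each generation's corpus mixes the original human-authored data (share
    fixed at p0) with model-generated text from the previous generation."""
    p = p0
    for gen in range(1, generations + 1):
        p = (1 - synthetic_fraction) * p0 + synthetic_fraction * model_output_share(p)
        print(f"generation {gen}: dominant-tradition share = {p:.3f}")

if __name__ == "__main__":
    simulate()
```

Even in this toy model, with the original human-authored data held fixed as an anchor, the dominant tradition's share drifts upward across generations; let the anchor shrink as the web fills with model output and the drift accelerates.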
Epistemic decolonization in the AI context therefore requires intervention at multiple levels. At the corpus level, it means systematically assembling training data from underrepresented traditions — which requires not just scraping more text but working with communities to digitize oral traditions in ways that respect their protocols. At the model level, it means architectures that can hold multiple epistemic frames without collapsing them into the dominant one. At the infrastructure level, it means building AI systems from and for communities whose knowledge the metropolitan systems have erased. Projects like Masakhane (for African languages), Mozilla Common Voice (for underrepresented speech), and various indigenous-language initiatives are early examples of what this work can look like.
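At the corpus level, one existing rebalancing technique gives a sense of what even a partial intervention involves: temperature-scaled sampling across languages, used in multilingual pretraining to upsample low-resource languages rather than sampling in proportion to raw corpus size. The sketch below uses invented corpus sizes and language codes purely for illustration.

```python
# Temperature-scaled language sampling, as used in multilingual pretraining.
# Corpus sizes and language codes below are invented placeholders.

from typing import Dict

def sampling_weights(corpus_sizes: Dict[str, int], alpha: float = 0.3) -> Dict[str, float]:
    """Per-language sampling probabilities proportional to (corpus share) ** alpha.

    alpha = 1.0 reproduces the raw, skewed distribution; alpha closer to 0
    moves toward uniform sampling, upweighting low-resource languages."""
    total = sum(corpus_sizes.values())
    scaled = {lang: (size / total) ** alpha for lang, size in corpus_sizes.items()}
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}

# Hypothetical corpus in which English dwarfs everything else.
example = {"en": 1_000_000_000, "sw": 5_000_000, "yo": 1_000_000, "mi": 200_000}
for lang, weight in sampling_weights(example).items():
    print(f"{lang}: {weight:.3f}")
```

Rebalancing of this kind only redistributes text that already exists in digitized form; it does nothing for the oral, unwritten, or protocol-bound knowledge described above, which is why corpus-level work cannot stop at sampling weights.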
None of this is straightforward. The risk of bringing indigenous knowledge into training corpora without adequate governance is that it becomes another form of extraction — the community contributes its knowledge, the corporation extracts the value, the community receives nothing. Epistemic decolonization therefore requires institutional innovation: data sovereignty frameworks, community consent mechanisms, revenue-sharing structures, and governance arrangements in which the communities whose knowledge is used have meaningful control over how it is used.
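Some of those governance requirements do eventually surface as code, for instance as metadata checks in an ingestion pipeline. The sketch below is hypothetical: the field names (consent_scope, revenue_share_agreement, steward_contact) are invented for illustration, and real frameworks such as the CARE Principles or Te Hiku Media's Kaitiakitanga License define their own terms and are governance instruments first, technical filters second.

```python
# Hypothetical consent-aware ingestion filter. Field names are invented for
# illustration; real data sovereignty frameworks define their own terms.

from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    text: str
    community: str
    consent_scope: str           # e.g. "none", "research_only", "commercial_training"
    revenue_share_agreement: bool
    steward_contact: str

def admissible_for_training(records: List[Record], purpose: str) -> List[Record]:
    """Keep only records whose declared consent covers the stated purpose and
    that carry an active benefit-sharing agreement."""
    return [r for r in records if r.consent_scope == purpose and r.revenue_share_agreement]
```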
The concept draws on a long tradition: Frantz Fanon's The Wretched of the Earth (1961), Ngũgĩ wa Thiong'o's Decolonising the Mind (1986), Walter Mignolo's work on the colonial difference, and Sylvia Wynter's analysis of the overrepresentation of Man. Its application to AI has been developed by scholars and practitioners including Abeba Birhane, Timnit Gebru, Rediet Abebe, and the Masakhane and Te Hiku Media communities.
The corpus is a curriculum. What is in the training data is taught; what is absent is erased.
Compounding bias. Models trained on biased corpora produce outputs that become the training data for subsequent models, deepening the hierarchy over time.
Multiple levels of intervention. Decolonization must operate at corpus, model, and infrastructure levels simultaneously.
Data sovereignty matters. Including underrepresented knowledge without governance arrangements reproduces extraction under a new name.
Alternative infrastructure. The most radical dimension of the project is the construction of AI systems built from and for communities whose knowledge the dominant systems have erased.