The mechanism is structurally unlike the network effects Katz and Shapiro formalized in 1985. In the direct effect, each user adds value by being reachable or present on the network. In the indirect effect, each user adds value by attracting complementary goods producers. In the data effect, each user adds value by teaching the model — providing the implicit and explicit signal that shapes future capability through reinforcement learning from human feedback (RLHF), capability-gap identification, and domain-specific pattern accumulation.
The competitive consequence is severe. A platform with a billion user interactions has a model refined by a billion interactions' worth of behavioral signal. A new entrant begins with whatever capability its initial training provides. The quality gap between incumbent and entrant widens with every interaction on the incumbent's platform. This inverts the dynamic of most markets, where incumbent advantages erode as competitors learn and improve. In the data network effect, the incumbent learns faster by virtue of having more users from whom to learn.
Hal Varian identified this dynamic in his 2018 NBER working paper "Artificial Intelligence, Economics, and Industrial Organization," a chapter originally conceived as a joint project with Shapiro. Varian's analysis of data access and returns to scale in AI markets became one of the earliest formal economic treatments of exactly the dynamics now playing out in frontier model competition.
The data network effect interacts with traditional forms to produce compound feedback: a better model (from data effects) attracts more users (strengthening direct effects), which attracts more complementary goods developers (strengthening indirect effects), which makes the platform more valuable, which attracts more users, which generates more training signal. Each circuit through the three-way loop makes the next circuit faster and stronger.
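The self-reinforcing character of this loop can be made concrete with a toy simulation. Everything here is an illustrative assumption rather than an empirical model: quality is taken to grow linearly with accumulated interaction data, and users are assumed to allocate between two platforms in proportion to relative quality. Under those stipulations, the incumbent's quality lead widens every round.

```python
# Toy model of the compound feedback loop: quality attracts users,
# users generate training signal, signal raises quality.
# Functional forms and parameters are illustrative, not estimates.

def simulate(rounds=10, total_users=1000, incumbent_q=1.5, entrant_q=1.0):
    q = {"incumbent": incumbent_q, "entrant": entrant_q}
    data = {"incumbent": 0.0, "entrant": 0.0}
    for _ in range(rounds):
        # Users split in proportion to relative model quality.
        share = q["incumbent"] / (q["incumbent"] + q["entrant"])
        users = {"incumbent": total_users * share,
                 "entrant": total_users * (1 - share)}
        for p in q:
            data[p] += users[p]            # usage accumulates behavioral signal
            q[p] = 1.0 + 0.001 * data[p]   # quality rises with accumulated data
    return q

quality = simulate()
gap = quality["incumbent"] - quality["entrant"]
```

Because the incumbent's initial quality edge gives it a majority user share, its data (and hence quality) lead grows monotonically: each circuit through the loop enlarges the share imbalance that drives the next circuit, which is the compounding the paragraph above describes.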
The concept emerged from the empirical observation in the 2010s that machine learning systems improved with the scale of their training data and from the theoretical work of Varian and others applying industrial organization theory to AI markets. The term gained traction in the early 2020s as it became clear that large language models improved not merely through pretraining but through iterative refinement based on deployment feedback.
Usage teaches the model. Every interaction — prompts accepted, responses modified, sessions abandoned — generates signal that shapes future model capability through post-training refinement.
The advantage compounds. Unlike most incumbent advantages, which erode as competitors catch up, the data advantage widens with every interaction that occurs on the incumbent's platform and not the entrant's.
Local effects create market segmentation. Within professional domains, specialized usage creates domain-specific model improvements that benefit practitioners of that profession more than general users.
Mitigation requires structural intervention. Data portability mandates do not address the data network effect because the improvement is embedded in the model, not in user data.
Some scholars argue the data network effect is weaker than often claimed — that marginal training data beyond a certain volume produces diminishing returns, and that model improvements from post-training innovations may outweigh those from additional user data. The empirical question remains open, but the structural mechanism — incumbents learning from their installed base in ways competitors cannot — is unambiguous.