Introduction
This paper argues that the dominant framing of large language models as "general purpose AI" with emergent reasoning capabilities mischaracterizes the nature of these systems. The empirical evidence supports a different interpretation: LLMs function as probabilistic caches that compress statistical regularities from training data and retrieve outputs associated with similar inputs during inference. They are autoregressive models of their input data, nothing more and nothing less.
This framing is not merely terminological. It generates specific, testable predictions: strong performance on interpolation within training distributions, weak performance on extrapolation beyond them, sensitivity to surface-level perturbations, and systematic failures on knowledge underrepresented in training data. I present evidence here that bears out these predictions across multiple domains. Readers with opposing evidence are welcome to contact me.
The implications of such a conceptualization extend beyond technical assessment. If LLMs are fundamentally retrieval systems rather than reasoning engines, then their capabilities are bounded by their training data, their apparent intelligence is transferred rather than generated, and the human labor underlying that data becomes central rather than incidental to understanding what these systems are.
I. The Fragility of Benchmarks
I.I Surface Modifications Degrade Performance
The evidence for surface-level fragility is extensive and consistent across studies. The GSM-Symbolic study (Mirzadeh et al., 2024) tested the effect of minimal modifications to mathematical word problems, such as changing names, adding irrelevant information, or altering numerical values, and found performance degradation of up to 65%, with variance between best and worst performance on semantically identical problems exceeding 15%. The researchers concluded that they "found no evidence of formal reasoning in language models" and that "the behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results." This fragility extends beyond mathematics: a NeurIPS 2025 workshop study on surface-form brittleness demonstrated that meaning-preserving paraphrases produce significant accuracy differences, indicating that models respond to linguistic surface features rather than semantic content. The SpuriVerse benchmark (2025) confirmed this pattern in multimodal settings, finding that even the best-performing closed-source model achieved only 37.10% accuracy when tested on susceptibility to misleading correlations, because models rely on dominant correlations frequent in training data but not causally relevant to the task. An ICLR 2025 study provided a mechanistic explanation: as correlation strength in training data increases, models shift from deep structure to surface structure, recognizing patterns rather than solving problems.
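The kind of perturbation test described above can be sketched in a few lines. This is a minimal illustration, not the GSM-Symbolic protocol itself: the template, the name list, and both stand-in solvers are hypothetical, and in practice `answer_fn` would wrap a call to the model under evaluation.

```python
import random
import re
import statistics

# Hypothetical template: names and numbers vary, the arithmetic does not.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Sophie", "Arjun", "Mei", "Tomas"]


def make_variants(n_variants, seed=0):
    """Return (question, correct_answer) pairs that differ only on the surface."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        name = rng.choice(NAMES)
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        variants.append((TEMPLATE.format(name=name, a=a, b=b), a + b))
    return variants


def accuracy(answer_fn, variants):
    """Fraction of variants for which answer_fn returns the correct answer."""
    correct = sum(1 for question, truth in variants if answer_fn(question) == truth)
    return correct / len(variants)


def accuracy_spread(answer_fn, n_sets=10, set_size=20):
    """Accuracy over several independently sampled variant sets: (min, max, stdev)."""
    scores = [accuracy(answer_fn, make_variants(set_size, seed=s)) for s in range(n_sets)]
    return min(scores), max(scores), statistics.pstdev(scores)


def reference_solver(question):
    """Stand-in that actually computes: parses the two numbers and adds them."""
    a, b = (int(x) for x in re.findall(r"\d+", question))
    return a + b


def name_sensitive_solver(question):
    """Contrived stand-in that only 'knows' problems about one memorized name."""
    return reference_solver(question) if question.startswith("Sophie") else -1
```

A solver that computes the answer shows no spread across variant sets, whereas one keyed to surface features does. The spread returned by `accuracy_spread` is the quantity of interest when a real model is substituted for the stand-ins.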
Chain-of-Thought prompting, often presented as enabling genuine reasoning, does not survive empirical scrutiny. Zhao et al. (2025) investigated whether CoT reflects reasoning or learned patterns and concluded that "CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions," with deviations in task structure, length, or format eliminating the apparent effect. A theoretical analysis (2025) reinforced this finding, arguing that "CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate": on this account, CoT constrains models to produce outputs structurally similar to training examples without inducing actual reasoning. Even long-CoT reasoning models such as OpenAI o1 fail on pattern-based in-context learning benchmarks, as the Curse of CoT study (2025) demonstrated, challenging "existing assumptions regarding the universal efficacy of CoT."
I.II Benchmark Scores Are Systematically Inflated
The benchmarks used to evaluate these models are themselves unreliable. An analysis of 640 LLM-for-software-engineering papers (2017-2025) documented systematic reproducibility failures, with problems including missing dependency specifications, unpinned library versions, and vague references such as "latest model release," increasing from 12.5% in 2022 to over 40% in 2024-2025. A NeurIPS study examining whether prompt engineering techniques replicate found dramatic inconsistencies across Zero-Shot CoT, ExpertPrompting, and EmotionPrompting, describing the situation as "A Looming Replication Crisis in Evaluating Behavior in Language Models." Of 35 benchmark results for coding models, only 12 were successfully reproduced, with failures attributable to different model variants, prompt formats, and evaluation harnesses.
The problem extends to the benchmarks themselves. Forty-two researchers from Oxford, Stanford, Berkeley, and Yale examined 445 AI benchmarks and found that only 16% employed statistical methods when comparing models, concluding that "without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to." Andrej Karpathy summarized the situation in 2025: "Training on the test set is a new art form," describing "general apathy and loss of trust in benchmarks." A manual review of 5,700 MMLU questions found that 6.5% contained errors, meaning models are evaluated against flawed tests. Benchmark contamination compounds the problem: removing contaminated examples from GSM8K reduces accuracy by at least 13%, and models perform approximately three times better on SWE-bench Verified than on equivalent unpublished benchmarks. High benchmark scores may therefore reflect memorization rather than generalizable capabilities.
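To make concrete what "statistical methods when comparing models" could mean, here is a minimal sketch of one common option, a paired percentile bootstrap over benchmark items. It is an illustration under stated assumptions, not the procedure used by any study cited here: the function name and defaults are hypothetical, and the inputs are assumed to be per-item 0/1 correctness for two models on the same questions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same items.

    correct_a and correct_b are equal-length lists of 0/1 per-item outcomes.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length outcome lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the returned interval contains zero, the observed accuracy gap is not distinguishable from resampling noise on that benchmark, which is the check the paragraph above reports as mostly missing.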
II. Hallucination as Symptom
The scaling strategy of LLM development is straightforward: increase data exposure to reduce hallucination, with the objective that almost any query matches something in the training corpus. This is not algorithmic improvement but brute-force coverage. The core algorithm, the transformer architecture with attention mechanisms trained by backpropagation, has remained essentially unchanged since 2017; what scales is data. If models genuinely learned by extracting abstract principles applicable to new situations, improved understanding should require less data, not more. Instead, scaling studies consistently show that performance correlates primarily with data volume and model size: Llama 3 was trained on 1,875 tokens per parameter, roughly thirteen times the 142 tokens per parameter of Llama 1.
If LLMs are probabilistic caches, then hallucination is not a bug but the expected behavior when the cache misses, and the evidence consistently supports this interpretation. Research has shown that "LLMs are especially prone to hallucinations when dealing with long-tail knowledge that appears infrequently in the training data." Kandpal et al. (2023) demonstrated that "QA accuracy declines for entities with limited presence in the pretraining corpus." A study on bibliographic recommendations (Niimi, 2025) made this relationship precise: citation count correlated strongly with factual accuracy, with highly-cited papers "almost verbatimly memorized" while less-cited papers triggered hallucination, leading the author to conclude that "hallucination and memorization are not opposite errors but two sides of the same probabilistic process, determined by the density of knowledge in the pretraining corpus." The asymmetry extends to entity recognition more broadly: Yao et al. (2025) found that LLMs exhibit a systematic asymmetry when recognizing equivalent facts: "Entity frequency in pre-training induces asymmetry in LLMs," with facts expressed from frequent to rare entities recognized more reliably than the reverse. Another study described "knowledge overshadowing," where dominant facts suppress rare ones following a log-linear law: the rarer the knowledge, the higher the hallucination rate.
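The cache-miss reading of hallucination can be made concrete with a deliberately crude toy. The sketch below is not a language model and is not evidence for the claim; it only illustrates the mechanism the framing proposes. The class, its back-off rule, and the tiny corpus are all hypothetical.

```python
from collections import Counter, defaultdict


class ToyProbabilisticCache:
    """Count-based completion table with back-off; a toy, not a language model."""

    def __init__(self):
        self.by_key = defaultdict(Counter)       # (subject, relation) -> answer counts
        self.by_relation = defaultdict(Counter)  # relation -> answer counts

    def train(self, triples):
        for subject, relation, answer in triples:
            self.by_key[(subject, relation)][answer] += 1
            self.by_relation[relation][answer] += 1

    def complete(self, subject, relation):
        """Return (answer, hit). On a miss, back off to the relation-wide mode."""
        counts = self.by_key.get((subject, relation))
        if counts:
            return counts.most_common(1)[0][0], True
        fallback = self.by_relation.get(relation)
        if fallback:
            # Fluent-looking guess with no support for this particular subject.
            return fallback.most_common(1)[0][0], False
        return None, False


# Hypothetical corpus: two well-covered subjects, one subject never seen.
cache = ToyProbabilisticCache()
cache.train([("France", "capital", "Paris")] * 50 + [("Japan", "capital", "Tokyo")] * 30)

print(cache.complete("France", "capital"))  # ('Paris', True): cache hit
print(cache.complete("Tuvalu", "capital"))  # ('Paris', False): miss, dominant answer wins
```

On the miss, the toy does not abstain; it returns the answer that dominates the relation overall, which is a simple analogue of dominant facts overshadowing rare ones.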
Xu et al. (2024) formalized the inevitable conclusion: under open-world conditions where possible queries are unbounded, hallucination is mathematically inevitable for any computable LLM, "regardless of model architecture, learning algorithms, prompting techniques, or training data." This is not a technical limitation that better engineering can overcome but a limit that holds for any computable model. The industrial response is not algorithmic improvement but data accumulation: more samples annotated, more web text scraped, more synthetic data generated, hoping that a sufficiently large cache will almost always hit. This is pseudo-eliminative induction in Mill's sense, covering enough cases to minimize errors, but Mill's method works only in closed worlds. The real world is open, and each percentage point of hallucination reduction costs exponentially more data, compute, and human annotation labor. This is not a solution; it is an arms race against mathematics.
III. The Hidden Core: The Model as Aggregated Human Capacity
Behind "general purpose AI" stands a global industry of precarious labor. Data labeling, the manual annotation that enables pattern recognition, is predominantly outsourced to the Global South: Kenya, the Philippines, India, Pakistan, Venezuela, Colombia. The wage differentials are stark: in Venezuela, 90 cents to 2 dollars per hour; in the Philippines, often below minimum wage; in the United States, 10 to 25 dollars for comparable work. Working conditions include crowded, dusty environments, no employment contracts, no grievance mechanisms, and psychological trauma from content moderation. Some providers employ child labor. Workers often do not know which systems their labor supports; an investigation found that Kenyan data labelers for Remotasks were unaware that the platform is a subsidiary of Scale AI, which supplies major technology companies. Yet the result is sold as "Artificial Intelligence," which raises the question of what is artificial about human-labeled data and whose intelligence it involves (see https://www.404media.co/ai-is-african-intelligence-the-workers-who-train-ai-are-fighting-back/).
The demand for training data has now extended beyond annotation to direct behavioral capture. In April 2026, Meta installed tracking software on employee computers under a program called the Model Capability Initiative, recording mouse movements, keystrokes, and screen activity to train AI agents capable of performing knowledge work autonomously. According to leaked audio from an internal meeting, CEO Mark Zuckerberg explained that the system "learns from watching really smart people do things," characterizing elite engineers as superior training subjects to outside contractors. The following month, approximately 8,000 employees received layoff notices. The sequence is instructive: workers are compelled to generate the training data that enables their own replacement.
The pattern extends to the gig economy. In March 2026, DoorDash launched a standalone app called Tasks, paying its 8 million U.S. delivery couriers to film themselves washing dishes, folding clothes, and making beds. The purpose is not to improve food delivery but to generate training data for humanoid robots. Notably, the program excluded California, New York City, Seattle, and Colorado, jurisdictions with stricter data privacy protections and gig worker regulations. Similar programs have emerged globally: Scale AI and Encord recruit data recorders across more than 50 countries; California-based Sunday Robotics ships "skill capture gloves" that record motion data during household tasks; in China, the government has funded 40 dedicated robot training centers where human trainers repeat motions like folding clothes hundreds of times daily alongside humanoid robots.
Workers in Nigeria and India strap smartphones to their foreheads and film themselves doing chores for companies like Micro1, which sells the data to robotics firms. When interviewed, workers reported that they did not know how their data would be stored, shared, or passed to third parties. Requests for data deletion went unanswered. The economic logic is familiar: workers in positions of structural weakness, whether employed and fearing termination or precariously self-employed and seeking income, generate the data assets that accrue to corporations.
Every capability of an LLM or robotic system results from massive human labor. The model is not inherently multi-purpose; each new capability requires new training data, new annotation, new human work. The strategy is to show enough samples that the model encounters almost nothing new, reducing error rates as data coverage increases. This is not intelligence but brute-force memorization with statistical smoothing, sold back to us as artificial intelligence (see https://www.404media.co/ai-is-african-intelligence-the-workers-who-train-ai-are-fighting-back/).
IV. Corporate Interests
LLM companies have commercial interests in the credibility of their capability claims, a structural conflict of interest that distorts public discourse. Benchmark scores are presented as measures of intelligence, but in practice they measure how well a model performs on a specific test, under specific prompt conditions, with a specific evaluation harness. Changes in prompting, scaffolding, tool access, temperature, or evaluation method shift rankings substantially. When a model finds a security vulnerability, this is presented as reasoning, but the methodology involves hundreds or thousands of parallel agent sessions, iteration through known exploit techniques, systematic input fuzzing, and massive compute resources. Automated fuzzing has existed since the 1980s; what LLMs add is a more flexible input generator, while the methodology of exhaustive search with validation remains unchanged.
Infrastructure determines performance more than the model itself. Top SWE-bench configurations use thousands of parallel attempts, multiple models, and elaborate orchestration, and these setups are expensive. The question rarely asked is who has access to this infrastructure, given that compute resources are concentrated in a small number of corporations. When a company announces a benchmark success, the relevant question is not only "What does this measure?" but "Under what conditions, and who can replicate it?" If AI capability increasingly depends on infrastructure rather than model quality, the problem is not AI regulation but monopoly regulation. The solution is not better benchmarks but open infrastructure.
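The arithmetic behind parallel attempts is worth spelling out. The sketch below assumes, purely for illustration, that attempts are independent and share one per-attempt success probability; real agent runs are correlated, so the numbers are at best an optimistic guide. Both function names are hypothetical.

```python
import math


def success_within_k(p_single, k):
    """Probability that at least one of k independent attempts succeeds."""
    if not 0.0 <= p_single <= 1.0 or k < 0:
        raise ValueError("need 0 <= p_single <= 1 and k >= 0")
    return 1.0 - (1.0 - p_single) ** k


def attempts_needed(p_single, target):
    """Smallest k whose success probability reaches target (independence assumed)."""
    if not 0.0 < p_single < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p_single < 1 and 0 < target < 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


# A system that succeeds on 1% of single attempts reaches 95% with enough retries.
print(attempts_needed(0.01, 0.95))  # 299
```

Under these assumptions a reported success rate says as much about the retry budget, and therefore about who can afford it, as about single-attempt capability.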
V. Memorization Rather Than Generalization
If LLMs genuinely generalized by learning abstract principles transferable to new situations, we would expect smaller models with better algorithms to achieve more. Instead, scaling laws show the opposite: performance correlates primarily with model size and data volume. The Chinchilla study (Hoffmann et al., 2022) established that improving performance requires scaling both dimensions simultaneously. Llama 1 trained with 142 tokens per parameter; Llama 2 doubled to 284; Llama 3 reached 1,875 tokens per parameter. One study estimates that by 2028, the total available high-quality text on the web will be exhausted at approximately 4×10¹⁴ tokens. This is the behavior of a system that memorizes, not one that generalizes; a system learning genuine abstraction would become more efficient, achieving more with less.
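The tokens-per-parameter figures can be reproduced with simple division. The token and parameter counts below are my assumptions about the smallest model in each family (they are not stated in the text and should be checked against the respective model reports); with them, the ratios match the figures above up to rounding.

```python
# Assumed (training tokens, parameters) for the smallest model in each family.
# These inputs are illustrative assumptions, not values taken from the text.
RUNS = {
    "Llama 1 (7B)": (1.0e12, 7e9),
    "Llama 2 (7B)": (2.0e12, 7e9),
    "Llama 3 (8B)": (15.0e12, 8e9),
}


def tokens_per_parameter(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params


ratios = {name: tokens_per_parameter(t, p) for name, (t, p) in RUNS.items()}
growth = ratios["Llama 3 (8B)"] / ratios["Llama 1 (7B)"]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f} tokens per parameter")
print(f"Llama 3 vs Llama 1: {growth:.1f}x")
```

Under these assumed inputs the increase from Llama 1 to Llama 3 is about thirteenfold.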
The evidence suggests that larger models may memorize more, not less. Carlini et al. (2022) found "higher prevalence of verbatim recall in larger models," and a 2024 study of scaling laws for fact memorization estimates that memorizing all Wikidata triples would require 1,000 billion non-embedding parameters, with knowledge capacity scaling directly with parameter count. Another study distinguished memorization from generalization at the level of individual neurons, finding that "memorization and generalization activate distinct neuron subsets within the same LLM," indicating that memorization is not a bug that disappears with better training but a separate function the model executes in parallel. Bender et al. (2021) captured this with the term "stochastic parrot": LLMs "stitch together sequences of linguistic forms... observed in vast training data, according to probabilistic information about how they combine, but without any reference to meaning." The PhysiCo study (Yu et al., 2025) tested this empirically, finding that LLMs lag 40% behind humans in physical concept understanding: "The stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language." They can discuss gravity but do not understand what gravity is; they have memorized linguistic patterns, not the concept.
VI. Basically A Better Search Engine
If LLMs primarily memorize rather than generalize, they are functionally equivalent to a sophisticated search engine that returns not links but the most probable answer reconstructed from the training corpus. Sam Altman stated that "the right way to think of the models that we create is a reasoning engine, not a fact database," but the evidence suggests the opposite: they are primarily a fact database with elaborate retrieval. Retrieval-Augmented Generation makes this equivalence explicit by combining LLMs with classical search because the LLM alone is not reliable. The telling phrase from RAG literature: "By augmenting the LLM with a search engine, we no longer need to fine-tune the LLM to reason about our particular data." The sentence reveals the truth: the LLM never reasoned; it retrieved. When retrieval fails, a real search engine is required.
If the learning algorithm contributes little to performance and scaling is the primary variable, then the actual value lies not in the model but in the data: the thousands of hours of annotation, the millions of RLHF feedback samples, the precarious labor of workers in Kenya, the Philippines, and Venezuela producing training samples for 2 dollars per hour. This is the actual asset. The model itself is interchangeable; any transformer with sufficient parameters can learn what the data provides. The data is the scarce resource, and the data is human labor, compressed and sold as "artificial intelligence."
VII. The Probabilistic Cache And Its Implications
A theoretical framework emerges: LLMs are probabilistic caches that compress statistical regularities from training data and return outputs associated with similar inputs during training. This framework generates predictions, all confirmed by the evidence: strong interpolation within training distributions, weak extrapolation beyond them, performance degradation on surface modifications, and hallucination on underrepresented queries. The strategy of LLM developers is pseudo-eliminative induction: show enough samples that almost all relevant cases are covered. This is not genuine generalization but approximate coverage that breaks down when the world is more open than the training data.
Grokking studies (Power et al., 2022) showed that neural networks can suddenly generalize after extended training, raising the question of whether this will scale to LLMs. The evidence is discouraging. Grokking was demonstrated on small algorithmic datasets such as XOR and modular addition, and the mechanisms remain poorly understood. A systematic study (2025) found that "grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization." These are correlative observations on toy datasets, not explanations for real-world generalization. LLMs showing genuine generalization remains a distant prospect.
LLMs are useful tools within their limitations: tasks similar to training data, static and well-documented contexts, situations where errors are tolerable. For open problems, novel situations, and high-stakes decisions, they are unsuitable. AGI is not in sight; what we observe is pattern matching at scale, not reasoning, and the architecture itself, predicting the next token based on statistical patterns, bounds what is possible.
Behind every apparently intelligent output stands human labor: annotation, curation, evaluation. This labor is invisible, precarious, and underpaid. This is not a peripheral aspect; it is the core. The problem is not AI but concentration of data, compute, and capital. The solution is not AI ethics but structural intervention: open infrastructure, labor rights along global supply chains, and transparency about what these systems actually are.
References
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., & Zhang, C. (2022). Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646. Published at ICLR 2023.
Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Kandpal, N., Deng, H., Roberts, A., Wallace, E., & Raffel, C. (2023). Large language models struggle to learn long-tail knowledge. Proceedings of the 40th International Conference on Machine Learning (ICML).
Mirzadeh, S. I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229. Presented at ICLR 2025.
Niimi, J. (2025). Hallucinations in bibliographic recommendation: Citation frequency as a proxy for training data redundancy. arXiv preprint arXiv:2510.25378.
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.
Vaugrante, L., Niepert, M., & Hagendorff, T. (2024). A looming replication crisis in evaluating behavior in language models? Evidence and solutions. arXiv preprint arXiv:2409.20303.
Villalobos, P., et al. (2022). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325.
Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817.
Yang, Y., Lee, C. P., Feng, S., Zhao, D., Wen, B., Liu, A. Z., Tsvetkov, Y., & Howe, B. (2025). Escaping the SpuriVerse: Can large vision-language models generalize beyond seen spurious correlations? Proceedings of ICML 2025.
Yao, S., et al. (2025). Supposedly equivalent facts that aren't? Entity frequency in pre-training induces asymmetry in LLMs. arXiv preprint arXiv:2503.22362.
Yu, M., et al. (2025). The stochastic parrot on LLM's shoulder: A summative assessment of physical concept understanding. arXiv preprint arXiv:2502.08946.
[Sources on data annotation labor]:
The Conversation (2026). AI is a multi-billion dollar industry. It's underpinned by an invisible and exploited workforce.
MIT Technology Review (2022). How the AI industry profits from catastrophe.
Brookings Institution (2025). Reimagining the future of data and AI labor in the Global South.
Privacy International. Humans in the AI loop: The data labelers behind some of the most powerful LLMs' training datasets.