M1E1: From Turing to the First AI Winter: Promises and Limits of Symbolic AI

The Question That Launched a Field

There is a particular kind of philosophical move that changes everything by refusing to answer the question it was asked. Alan Turing made that move in 1950. His paper, published in Mind, considers the question "Can machines think?" — and then immediately sidesteps it. Turing argued that since the words "think" and "machine" cannot clearly be defined, we should "replace the question by another, which is closely related to it and is expressed in relatively unambiguous words." What he proposed instead became one of the most consequential framings in intellectual history: the imitation game.

Rather than trying to determine if a machine is thinking, Turing asked whether the machine can win a game. The imitation game involves three players: a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. In the variant that matters to AI history, a machine replaces one player. If, after a set period, the interrogator fails to distinguish one respondent from the other, the computer wins — and such a machine could be said to think.

The elegance of this move is worth pausing on, because its consequences extend directly into contemporary debates about AI regulation and evaluation. Turing didn't ask what intelligence is. He asked whether a machine could perform intelligence indistinguishably from a human — replacing an ontological question about the nature of mind with a behavioral one about what intelligence looks like from the outside. Turing argued this test avoids issues of physical form and focuses on intellectual ability. That substitution is brilliant and also deeply problematic. Any system sufficiently skilled at imitation passes the test, regardless of whether anything like understanding is occurring. The test is, fundamentally, about appearances.

Turing himself predicted that in about fifty years it would be possible to program computers to play the imitation game so well that an average interrogator would have no more than a seventy percent chance of correct identification after five minutes of questioning. He was roughly right on the timeline, though the systems that eventually passed various versions of the test — Joseph Weizenbaum's ELIZA in 1966, the chatbot Eugene Goostman in 2014, and eventually large language models far exceeding any prior benchmark — did so through mechanisms that would have surprised him. They didn't reason. They generated plausible text. Whether that distinction matters is a debate that remains unresolved.

What matters for this course is that the behavioral framing Turing established in 1950 still structures how governments, regulators, and executives talk about AI capability today. When a policy briefing asks whether a model "performs at human level" on some benchmark, or whether a chatbot can "pass as human," it is operating inside the conceptual frame Turing drew. The imitation game was a liberating intellectual move that freed AI from unanswerable questions about machine consciousness — and it simultaneously created a trap, because it made AI evaluation about performance on observable outputs rather than about the underlying mechanism. That trap has never stopped being relevant.


Dartmouth 1956: The Conjecture That Became a Program

Six years after Turing's paper, a group of mathematicians and computer scientists gathered at Dartmouth College in New Hampshire for what has since been regarded as the founding event of AI as a formal research discipline. The Dartmouth Summer Research Project on Artificial Intelligence, the 1956 summer workshop at which the field was named and constituted, has been called "the Constitutional Convention of AI." The project's four organizers were Claude Shannon, John McCarthy, Nathaniel Rochester, and Marvin Minsky.

The proposal they submitted, authored in August 1955, contains a single sentence that encodes everything that would go wrong with symbolic AI for the next four decades. The workshop was based on the conjecture that "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." This is not an empirical claim about what had been demonstrated. It is a conjecture — an act of faith that intelligence is decomposable, formalizable, and ultimately programmable. McCarthy, Minsky, Shannon, and Rochester were not naive. They were operating at the absolute frontier of what computers could do, and they were making a bet that turned out to be only partially correct.

As McCarthy put it in the conference proposal, the aim was to find out "how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans." The ambition was total: not narrow task completion, but the full range of human cognitive capability, approached through the systematic encoding of knowledge as explicit symbols and rules. Several research directions are generally considered to have been initiated or encouraged by the workshop: the rise of symbolic methods, systems focused on limited domains (the forerunners of expert systems), and the split between deductive and inductive approaches.

The intellectual core of what emerged from Dartmouth — and dominated AI for the next thirty years — was the belief that the right representation could capture the right knowledge, and that an inference engine operating on that representation could reproduce expert performance. This is not trivially wrong. It is wrong in a specific and instructive way. The Dartmouth conjecture gets the mechanism of intelligence backwards. It assumes that intelligence is a system of rules that humans possess but haven't yet written down. What the subsequent decades demonstrated is that most of what humans know is not available to introspection in rule form — and that the parts that are available, when written down, turn out to be far more fragile than anyone expected.

The optimism among participants regarding timelines was striking. Herbert Simon stated in 1965 that "machines will be capable, within twenty years, of doing any work a man can do." Minsky predicted in 1967 that "within a generation the problem of creating artificial intelligence will substantially be solved." These predictions were not outliers from fringe figures. They came from the people who knew the field best. The confidence was structural — it followed directly from the assumption that intelligence was explicit, formalizable, and therefore encodable. If you genuinely believe that intelligence is just rules humans haven't written down yet, then writing those rules down seems like a finite, tractable engineering project.

The gap between that belief and reality would define the field's next two decades and its first great crisis.


Expert Systems: The Art of Bounded Competence

Before the crisis came a genuine achievement, and understanding that achievement is essential to understanding why the collapse was so devastating and so instructive.

Beginning in the late 1960s and accelerating through the 1970s, a group of researchers — led principally by Edward Feigenbaum at Stanford — pursued a more modest and ultimately more productive version of the Dartmouth program. Instead of trying to build general intelligence, they focused on encoding the knowledge of specific human experts in specific narrow domains. Feigenbaum introduced expert systems that mimicked the decision-making process of a human expert: developers would ask an expert in a field how to respond in a given situation, encode each response as a rule, and once this had been done for virtually every anticipated situation, the result was a rule engine from which non-experts could receive advice.
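To make that architecture concrete, here is a minimal sketch of a forward-chaining rule engine in Python. The rules, fact names, and recommendation are invented for illustration; they are not drawn from MYCIN or any other historical system.

```python
# Minimal forward-chaining rule engine (illustrative only).
# Each rule is (conditions, conclusion): when every condition is a known fact,
# the conclusion is added to the fact base, and matching repeats until nothing new fires.

RULES = [
    ({"gram_negative", "rod_shaped", "anaerobic"}, "suspect_organism_x"),
    ({"suspect_organism_x"}, "recommend_drug_y"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"gram_negative", "rod_shaped", "anaerobic"}, RULES))
# -> includes 'suspect_organism_x' and 'recommend_drug_y'
```

Every piece of competence such a system has must be written into the rule base by hand; nothing in the loop learns anything. That property is what made expert systems auditable, and also what made them expensive to build and maintain.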

The flagship examples are worth knowing in detail, because they demonstrate both what symbolic AI could do and where its ceiling was.

DENDRAL, begun in 1965, was an early collaboration between Feigenbaum and Nobel laureate Joshua Lederberg, designed to identify organic molecular structures from mass spectrometry data — the technique of bombarding molecules with electrons and measuring the resulting fragments to infer structure. DENDRAL's performance was superior to most human experts in its domain. Mass spectrometry interpretation requires navigating enormous combinatorial spaces of possible molecular structures against measured fragmentation patterns, exactly the kind of bounded, rule-governed problem where symbolic AI excelled.

MYCIN, developed at Stanford starting in 1972, tackled bacterial infection diagnosis and antibiotic treatment recommendation. It encoded approximately 600 rules derived from clinical expertise and achieved a success rate of sixty-nine percent on bacterial meningitis cases — better than human specialists in that specific domain. The result astonished the medical community, because it demonstrated that encoded rules could match and sometimes exceed specialist clinical judgment within a tightly bounded problem space. Yet MYCIN never reached clinical deployment. Physician resistance, liability concerns, and the brittleness that emerged as soon as the system encountered cases its rules had not anticipated combined to keep it in the laboratory. The gap between laboratory performance and real-world deployment would prove to be a permanent feature of the expert system era.

Then came the commercial success that made everyone believe the paradigm had finally cracked the problem. Digital Equipment Corporation's XCON (eXpert CONfigurer) — also called R1 — was an expert system designed to handle the complexity of computer hardware configuration. Deployed in 1980, it was implemented in the rule-based programming language OPS5 and by 1989 contained more than 15,000 rules. XCON was in daily use at Digital, configuring more than ninety percent of all their computer orders and saving $25 million annually. Here was an expert system that worked — not in a laboratory, but in production at scale, replacing a process that had required teams of highly trained engineers and still produced errors.

The expert systems boom of the early 1980s was real. Companies across industries hired "knowledge engineers" to interview domain experts and translate their expertise into rule bases. By the mid-1980s, expert systems had become a significant industry, with corporations spending billions of dollars building, deploying, and maintaining them. The configuration problem that XCON solved was exactly what symbolic AI was built for: clear inputs, deterministic constraints, verifiable outputs, and a finite rule space that a human expert could articulate and a programmer could encode.

The question that everyone avoided asking was: how many real-world problems look like that?


The First AI Winter: When Promises Met Measurement

The first AI winter was not caused by the expert systems boom — it preceded it. Understanding the first winter correctly requires separating its two components: a crisis in government funding driven by specific report-triggered skepticism, and a deeper structural crack in symbolic AI's theoretical foundations.

The pattern began in 1966, when a sharply critical report on machine translation appeared. ALPAC — the Automatic Language Processing Advisory Committee, a body convened by the U.S. government to evaluate progress in computational linguistics — produced a report formally titled Language and Machines: Computers in Translation and Linguistics. Machine translation had been one of the most publicly visible and heavily funded AI applications since the early 1950s, driven by Cold War logic: automatic translation of Russian scientific literature would maintain intelligence advantage. The ALPAC report concluded that machine translation was slower, less accurate, and twice as expensive as human translation, and that no immediate or predictable prospect of useful machine translation existed. After spending $20 million, the National Research Council ended all support.

The ALPAC finding was damaging not just because it cut funding to machine translation, but because it crystallized a broader skepticism about whether symbolic approaches could handle the inherent ambiguity and context-dependence of natural language. A symbolic system can parse grammar. It cannot easily resolve the difference between "I saw the man with the telescope" meaning I used a telescope to observe a man, versus I observed a man who had a telescope. Human language is saturated with this kind of ambiguity, and resolving it requires world knowledge, contextual inference, and uncertainty handling that rule-based systems handle poorly.

Then came the Lighthill Report. In 1973, Professor Sir James Lighthill of Cambridge University produced "Artificial Intelligence: A General Survey" to help the British Science Research Council evaluate requests for support in AI research. Lighthill stated that "in no part of the field have discoveries made so far produced the major impact that was then promised." The paper led the British government to end support for AI research in most British universities.

The Lighthill Report attacked symbolic AI on three grounds that remain analytically important. First, AI had systematically overestimated the tractability of complex tasks by demonstrating success on toy problems that didn't scale. A chess program that plays well in a restricted environment tells you little about whether the same approach will work in real-world problem domains with open-ended state spaces. Second, the combinatorial explosion problem: the number of possible states in any reasonably complex problem grows exponentially, and no amount of clever rule-writing could tame that growth within the computational resources of 1973. Third, the report questioned whether the whole project of symbolic AI was grounded in a realistic model of how human intelligence works.
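The combinatorial explosion argument is easy to make concrete. A few lines of arithmetic (the branching factors and depths below are round illustrative numbers, not figures from the report) show why exhaustive rule-driven search stops scaling:

```python
# Illustrative arithmetic: exhaustive search over a game or planning tree visits
# roughly b**d states for branching factor b and depth d. Values are examples only.
for b, d in [(10, 5), (10, 10), (35, 10)]:  # 35 is an often-quoted chess branching factor
    print(f"branching {b}, depth {d}: {b ** d:.2e} states")
# branching 10, depth 5:  1.00e+05
# branching 10, depth 10: 1.00e+10
# branching 35, depth 10: 2.76e+15
```

No improvement in the quality of individual rules changes that growth rate; only heuristics that prune the search space, or approaches that avoid explicit search altogether, do.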

DARPA was disappointed with researchers working on the Speech Understanding Research program at Carnegie Mellon and canceled an annual grant of three million dollars. By 1974, funding for AI projects was hard to find.

The first AI winter — roughly 1974 to 1980 — was not a complete shutdown. Academic research continued at reduced scale. But the institutional confidence that had characterized the decade after Dartmouth evaporated, and with it the resources that had sustained ambitious projects. What ended the first winter was not a breakthrough in symbolic AI's underlying capabilities. It was the pivot to expert systems — a tactical retreat from general intelligence to narrow domain mastery that temporarily made symbolic AI look viable again. That retreat bought another decade. It didn't fix the underlying problems. It moved the failure mode to a different location.


The Second Winter: When Commercial Deployment Found the Ceiling

The term "AI winter" first appeared in 1984 as the topic of a public debate at the annual meeting of the AAAI — the Association for the Advancement of Artificial Intelligence, the field's principal professional organization. Roger Schank and Marvin Minsky — two leading AI researchers who had lived through the 1970s contraction — warned the business community that enthusiasm for AI had spiraled out of control and that disappointment would certainly follow. They described a chain reaction, similar to a "nuclear winter," that would begin with pessimism in the AI community, followed by pessimism in the press, followed by a severe cutback in funding, followed by the end of serious research.

Three years later, the billion-dollar AI industry began to collapse.

The second AI winter had a hardware trigger. It began with the sudden collapse of the market for specialized AI hardware in 1987. Through the early and mid-1980s, companies like Symbolics had built a business around purpose-built hardware for running Lisp-based AI systems — machines that cost six figures and were optimized for the symbolic processing that expert systems required. Desktop computers from Apple and IBM had been steadily gaining market share, and 1987 was the turning point: they became more powerful and cheaper than the specialized Lisp machines. An entire industry worth half a billion dollars collapsed in a single year.

The hardware collapse exposed a deeper crisis in the expert system model itself. XCON, the paradigm success story, illustrated this clearly. The system celebrated for its 15,000 rules and $25 million in annual savings became a maintenance nightmare as Digital's product line evolved. Adding new products, retiring old components, accommodating new configurations — all of it required expensive knowledge engineers to manually revise and extend the rule base. The system couldn't update itself. It couldn't generalize from new examples. It required constant human intervention to remain accurate, and the cost of that intervention steadily eroded the economics that had justified building it.

From the geopolitical stage came a more dramatic collapse. In 1981, the Japanese Ministry of International Trade and Industry set aside $850 million for the Fifth Generation Computer Project. Its objectives were to write programs and build machines that could carry on conversations, translate languages, interpret pictures, and reason like human beings. By 1991, the list of goals set in 1981 had not been met. The Fifth Generation project bet on parallel processing and logic programming — specifically the Prolog language rather than Lisp. It was technically interesting and it trained hundreds of engineers, but it produced no commercial products. The world had moved toward open systems and general-purpose hardware, and Japan's specialized approach was obsolete before it launched. The project quietly wound down in 1992 without achieving any of its headline objectives.

At DARPA, the reaction was clinical and decisive. Jack Schwartz, who took over leadership of IPTO — the Information Processing Techniques Office, the primary government funder of AI research — in 1987, dismissed expert systems as "clever programming" and cut AI funding sharply, effectively ending the Strategic Computing Initiative. The phrase "clever programming" sounds like a dismissal, but it is analytically precise. Schwartz's point was that expert systems had not discovered any principles of intelligence. They had cleverly encoded specific knowledge in a specific form for specific problems. That wasn't a technology platform with generalization potential.

The second AI winter had arrived. In some ways it was more demoralizing than the first — because the field had genuinely believed, with good reason, that it had learned from its mistakes.


Why Symbolic AI Failed: The Structural Account

Funding cuts and hardware economics explain the timing of the winters. They do not explain why symbolic AI could not solve the problems that would have prevented the funding cuts. The structural account matters more, because it explains what had to change before AI could become what it is now.

The first structural failure is the knowledge acquisition bottleneck. Complex systems such as DENDRAL, MYCIN, or PROSPECTOR (a mineral exploration expert system that in the late 1970s correctly predicted the location of a molybdenum deposit worth roughly $100 million) could demand over thirty person-years to complete — and any improvement of performance depended on further attention from expensive developers. Every rule had to be extracted from a human expert, validated, formalized, and encoded. That process was slow, expensive, and epistemically limited in a specific way: it could only capture knowledge that experts could articulate. Most expert knowledge is tacit. A radiologist who reliably identifies a subtle finding on a chest X-ray often cannot explain the rules by which they do so. They have internalized patterns from thousands of cases in a way that doesn't decompose into explicit if-then propositions. When asked to explain their reasoning, they produce a post-hoc narrative that only partially captures what happened during recognition. Expert systems could only encode the narrative, not the underlying competence.

The second failure is brittleness outside the training distribution — in the terminology of that era, the "qualification problem." A symbolic rule fires when its conditions are precisely met and does nothing otherwise. Real-world inputs are noisy, incomplete, and novel in ways that no finite rule base can fully anticipate. A physician encountering an atypical presentation of a known disease, or a genuinely novel pathogen, can bring contextual reasoning, analogical inference, and probabilistic judgment to bear. A system like MYCIN, faced with an atypical presentation, either fired the wrong rules or fired no rules at all — producing outputs that were confidently wrong or uselessly silent. Technologies that worked brilliantly on carefully selected test cases failed when confronted with real-world complexity.
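The failure mode is easy to reproduce in the toy setting. In the sketch below (the rule and fact names are again invented, not taken from MYCIN), a case that matches the rule exactly gets a confident answer, and a case that almost matches gets nothing at all:

```python
# Brittleness sketch (illustrative rule, not from any historical system):
# a symbolic rule either matches exactly or stays silent; there is no "almost".

RULES = [({"fever", "stiff_neck", "positive_csf_culture"}, "treat_for_meningitis")]

def conclusions(facts, rules):
    return {conclusion for conditions, conclusion in rules if conditions <= facts}

print(conclusions({"fever", "stiff_neck", "positive_csf_culture"}, RULES))
# {'treat_for_meningitis'}  -- textbook presentation: the rule fires

print(conclusions({"fever", "confusion", "ambiguous_csf_culture"}, RULES))
# set()  -- atypical presentation: silence, with no fallback, no warning, no degree of match
```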

The third failure is the inability to handle uncertainty in a principled way. Early symbolic AI worked with crisp logic: propositions were true or false, rules fired or didn't. But real medical diagnosis, real engineering configuration, real natural language interpretation all involve degrees of confidence, competing hypotheses, and probabilistic inference over incomplete evidence. MYCIN added certainty factors — hand-tuned numerical weights on rules — but this was a patch over the structural problem, not a solution to it. Judea Pearl's work on Bayesian networks in the late 1980s and 1990s would show what principled uncertainty reasoning looked like, and it looked nothing like a rule base.
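The difference between a patch and a principled treatment of uncertainty can be shown in a few lines. The first function below uses the widely documented MYCIN-style combination rule for two supporting pieces of evidence; the second is an ordinary Bayesian update. All numeric values are invented for illustration:

```python
# MYCIN-style certainty factors vs. a Bayesian update (illustrative values only).

def combine_cf(cf1, cf2):
    # Published MYCIN combination rule for two positive (supporting) certainty factors.
    return cf1 + cf2 * (1 - cf1)

print(combine_cf(0.6, 0.4))  # 0.76 -- hand-tuned weights with no probabilistic semantics

def posterior(prior, sensitivity, false_positive_rate):
    # P(condition | positive finding) via Bayes' rule.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

print(posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05))
# ~0.154 -- the answer depends on the prior, which certainty factors simply ignore
```

Certainty factors gave usable answers inside MYCIN's narrow domain, but they do not compose the way probabilities do, which is exactly the structural gap Pearl's Bayesian networks later addressed.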

Underlying all three failures was the same fundamental mistake: the belief that intelligence could be fully captured as explicit, human-readable rules. The Dartmouth conjecture — that every feature of intelligence can in principle be precisely described — turned out to be false in a specific way. Not all intelligence can be precisely described by the people who possess it. Much of it lives in pattern recognition below the level of verbal articulation, in statistical regularities accumulated over vast experience, in the weighted combination of countless subtle signals that experts integrate automatically but cannot explain.

This is precisely where connectionism — the other tradition in AI, the one that Minsky and Papert had tried to kill in 1969 with their critique of perceptrons, the one that had been working quietly through the dark years of both winters — was different in kind. Connectionism didn't try to encode intelligence as rules. It tried to learn statistical patterns from examples. The AI spring that eventually followed was powered by growing computational capacity, the availability of large digital datasets, and the replacement of formal logic-based approaches by various kinds of machine learning algorithms. The backpropagation revival of the mid-1980s, the neural network renaissance of the 1990s, and the deep learning explosion of the 2010s were all versions of the same answer to the same question: if intelligence can't be encoded as rules, can it be learned from data?
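The contrast in mechanism fits in a few lines. In the toy perceptron below, nobody writes the rule; the weights are adjusted from labeled examples until they implement it. The data, learning rate, and epoch count are invented for illustration:

```python
# Toy perceptron: the decision rule is learned from examples, not hand-coded.
# The data is the logical AND function; all constants are illustrative.

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(50):                       # a few passes over the data suffice here
    for (x1, x2), target in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        error = target - pred             # -1, 0, or +1
        w[0] += lr * error * x1
        w[1] += lr * error * x2
        b += lr * error

print(w, b)  # learned weights now implement AND, though no one wrote an AND rule
```

The same idea, scaled up by many orders of magnitude in parameters, data, and compute, is what the deep learning era runs on.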

The answer, it turned out, was yes — at sufficient scale, with sufficient data, and with sufficient compute. None of those conditions were met during the winter years. The connectionist researchers who were right about the fundamental mechanism were wrong about the feasibility timeline, and their ideas sat largely dormant until hardware and data caught up.


The Map of Where You Are

What this history gives you is a conceptual map with three marked locations.

The first is the original sin of AI: the conflation of behavioral performance with internal mechanism. Turing's imitation game was useful precisely because it avoided the unanswerable question of what intelligence is — but that avoidance created a research culture that optimized for looking intelligent rather than understanding what intelligence requires. Expert systems looked intelligent inside their training distribution. Large language models look intelligent on benchmarks. Neither observation tells you what the underlying system is doing or where it will fail.

The second is the knowledge acquisition bottleneck as a structural property, not an implementation detail. Symbolic AI didn't fail because researchers weren't smart enough to write enough rules. It failed because human expertise is not primarily rule-based. The tacit knowledge problem — the gap between what experts know and what they can articulate — is real, and it means that any AI approach depending on humans explicitly encoding knowledge will hit a ceiling. The approaches that broke through that ceiling were the ones that could extract structure from data without requiring humans to describe that structure first.

The third is the recurring pattern of hype, retreat, and structural reckoning. Both winters followed the same logic: a genuine success in a narrow domain generated extrapolated confidence about general applicability; that confidence attracted resources and expectations that the underlying technology couldn't satisfy; when the gap between expectation and performance became undeniable, institutional confidence collapsed faster than the underlying research had warranted. The main cause of those failures was hype that ran far ahead of what had actually been established and proven, in both theoretical capability and empirical performance.

That pattern did not end with the second AI winter.

The question worth carrying into the rest of this module is not whether current AI systems are impressive — they are, measurably, on benchmarks from MMLU (Massive Multitask Language Understanding, a standard AI benchmark spanning dozens of academic subjects) to GPQA (Graduate-Level Google-Proof Q&A, a benchmark of expert-level science questions) to SWE-bench (a benchmark testing whether AI can resolve real software engineering issues from GitHub), in ways that would have seemed like science fiction during the winters. The question is whether the structural constraints that ended symbolic AI — brittleness outside distribution, tacit knowledge, uncertainty handling, the gap between task performance and genuine understanding — have been solved, or merely moved to a different location in the architecture.

The answer to that question will define what decisions you can safely delegate to AI systems, and what decisions you cannot.