The syntax of English is undecidable

The question of parsing English and other natural languages has come up in the course of my work on the Marpa parser. As in the case of Perl, I first posed the question of whether any algorithm running on a Turing machine can parse the target language. This post contains what I hope the reader will find to be a rigorous demonstration that the syntax of the English language is undecidable.

When I say “undecidable”, I mean that term in the strict sense. Undecidability is not vagueness or uncertainty -- undecidability is the certainty that a “decision” of the matter is not possible. I will give a specific example of an English-language sentence which is unsyntactic if and only if it is syntactic.

The Demonstration

The sentence hinges on the distinction between sentences in the passive voice, and sentences which contain a copular verb (in this case, “to be”) and an adjectival complement. For example,

(A) “The door was opened”
is a passive voice sentence. In the sentence
(B) “The door was open”,
on the other hand, “open” is an adjectival complement, and “was” is a copular verb. (While the passive voice of English is the subject of considerable discussion, the terminology and analysis of this post follows Quirk et al., A comprehensive grammar of the English language, pp. 159-171. and Language Log, and sticks to terrain which should not be controversial, even for non-specialists.)

Now consider the following sentence.

(C) “The window was closed.”
Is this a sentence with a predicate in the passive voice, or is “was” a copular verb with an adjectival complement? In standard dictionaries, you will find “closed” is listed both as an adjective and as the past participle of the verb “close”, so that either could be the case. Further information would be needed to decide the syntax of (C).

Next I make two observations. The first observation is that a sentence can be syntactically correct, but not meaningful. The traditional example is

(G) “Green dreams sleep furiously”.
Another example is
(H) “The first even prime greater than two crossed the street symmetrically”.
Both these sentences allow easy syntactic analysis: each has a subject, and a predicate. Nouns, adjectives, adverbs and verbs can be readily identified and given fixed locations in a formal syntactic structure. But neither sentence has any meaning, except in a poetic or highly figurative sense.

The second observation is that, while we can have correct syntax without a semantics, we cannot have meaning without syntax. That is, sentences like

(I) “Airplane from to the and for of or pilot up”,
while they might contain words which could be selected and rearranged to be meaningful, do not mean anything.

We now return to the passive-versus-adjective decision, where we observed that

(P) “The window was closed”
could be an example of either a verb in the passive voice or of a copular verb with an adjectival complement. In the sentence
(Q) “The window was closed, but the door was open”,
the question is resolved by coordination. Since in (Q) “open” can only be an adjectival complement, the expectation is that “closed” will also be. Similarly, in the sentence
(R) “The window was closed, but the door was opened”,
coordination with “opened” decides the matter in favor of the passive voice. In a sentence where the coordination is closer, and includes a by-phrase to show agency for both verbs,
(S) “The window was closed and the door opened by the same person”,
the presumption that both verbs are in the passive voice becomes a certainty.

We are now in a position to consider our undecidable sentence,

(U) “The window was closed and the door opened by the burglar after he discovered that the window was in fact a beautifully executed trompe d'oeil mural.”
The window is fake. It therefore could not have been closed by the burglar, which makes its syntax wrong. The verb “was closed”, before semantic feedback is taken in account, clearly is in the passive voice. After the semantic feedback to the syntax is considered, however, it is clear that “was closed” cannot be in the passive voice, and that therefore (U) is unsyntactic.

But if a sentence does not have correct syntax, it cannot have a semantics. Since the only problem with the syntax of (U) was a result of the semantic feedback, if semantics is not considered (U) has correct syntax. So (U) is incorrect syntactically if and only if it has correct syntax.

We cannot decide (U) to have incorrect syntax, because in that case it becomes meaningless and the syntax becomes unobjectionable. But neither can we decide (U) to have correct syntax, because that makes (U) meaningful. When (U) is meaningful, it has incorrect syntax, because the passive voice cannot be correct given the meaning.

Our only choice is to say that the correctness of the syntax of (U) cannot be decided. It is not true that (U) has correct syntax, but neither is it true that (U) has incorrect syntax. This concludes the demonstration.

A Thought Problem

To a reader unconvinced by the preceding demonstration, I would urge that, as a thought problem, she consider a computer program attempting to parse English. Such a program would have to know at least some semantics, and would have to feed this knowledge back to the parsing process. But the preceding demonstration shows that such a feedback loop, if completely effective, will encounter sentences like (U), where it cannot decide whether the syntax of the sentence is correct or incorrect. I believe that, by reflecting on this thought problem, the candid reader will be able to convince herself that a human being parsing English sentences like (U) will have the same problems, and for similar reasons.

About Syntax and Semantics

It needs to be emphasized that the diagonalization in this post's demonstration is not about semantic truth and has nothing to do with semantic paradox. This post is about the feedback from semantics into syntax, and the relationship between them. By way of contrast, consider these sentences:

(X) “Two plus two is five.”
(Y) “This sentence has incorrect syntax.”
(Z) “This sentence is false.”
(X) is false; (Y) attempts a contradiction but fails; and (Z) achieves a semantic contradiction. But in none of them is the semantics an issue for the syntax, and all three are correct syntactically. In each case, even when there are semantic issues, the feedback that the semantics provide the syntax is unproblematic.


This is a wrong approach to showing the undecidability of English (NB: in the framework of a Chomsky grammar!) because the problem here is simply ambiguity, something natural language parsers have to be able to deal with anyway, and because it unnecessarily mixes syntax and semantics.
To be able to successfully parse natural language syntax, a parser only has to come up with a finite number of alternative parse trees in finite time. E.g. the sentence "They saw the man with the binoculars" has four different readings: "saw" can be past tense of "to see" or present of "to saw" while "with the binoculars" can be an object attribute to "the man" (the man has binoculars) or an adverbial phrase (the binoculars are the instrument he is seen/sawed with). The seeing variants are both completely plausible and there's no way to tell wich is the correct one without further context, so a syntactic parser simply has to produce alternative parse trees. The sawing ones are at least dubious. Can you saw a person? The Spanish Inquisition would definitely say yes, and just this morning my my 2yo tried to saw me with his plastic saw. Can binoculars be used for sawing? Maybe less so, but then again when I look at my son and what kind of objects can double as a saw for him, I'm not so sure. In any case that's not for the syntax to decide. That exactly was Chomsky's point with "Colorless green ideas sleep furiously"---it *is* grammatical even though it makes no sense.
Of course there are other ambiguities that we're usually not even aware of but a parser would have to deal with them anyway. Take "The train left at 6 and all the coaches were full". "train" can be a verb in English and a parser's lexicon would at first supply both the verb and the noun reading. The preceding definite article immediately disambiguates this lexical ambiguity though. The sentence's second clause could also mean "all the sports team leaders were drunk", but again the syntax doesn't have to concern itself with that; if a semantic analysis is wanted, the first clause would have to supply the disambiguating context.

The undecidability of English grammar is usually argued for based on the problem of reference resolution in "Bach-Peters sentences" of the form "The man who deserves it gets the reward that he works for", where "it" refers to "the reward that he works for" and "he" to "The man who deserves it"---this would clearly send a naïve parser into an endless recursion. It's controversial whether this can be worked around with some (usually not clearly formalized) modifications to a context-sensitive parser.

I don't think you use the same definition of 'undecidable' as I was thought. Wikipedia gives this definition of undecidable:

In computability theory and computational complexity theory, an undecidable problem is a decision problem for which it is impossible to construct a single algorithm that always leads to a correct yes-or-no answer.

and then

a decision problem is a question in some formal system with a yes-or-no answer, depending on the values of some input parameters

your example is not a decision problem - because the 'yes-or-no' answer does not exist - this is different from the impossibility of algorithm solving the problem.

We cannot have meaning without syntax.

Sure can. Limiting meaning yes inevitable. Unnecessary however syntax anyway.

In other words, syntax in natural language serves as a form of redundancy that narrows down the possible inferences. A sentence that follows no syntax doesn’t have no meaning, but rather has almost any meaning. Syntax constrains the space of meanings instead of defining it. A highly expressive, finely nuanced language is thus not so because it allows the expression of meaning with high precision but because it allows its qualification to the exclusion of less specific interpretations.

Hi. Long-time reader, first-time poster. Absolutely looking forward to what Marpa can do for my field, which is linguistics.

I take issue with only one point of this post, but it turns out to be a crucial one:

"The window is fake. It therefore could not have been closed by the burglar, which makes its syntax wrong."

No. At the most, that makes its syntax a bad match for your intended semantics. "The window was closed ... by the burglar" is fine syntax. It's just that it demands change-of-state semantics, in which the window is closable and therefore real. It's quite irrelevant that "the window was closed" in isolation is also susceptible to stative semantics, because it is not in isolation here. "By the burglar" resolves its syntax and semantics before "the window was in fact a ... mural" can get to it.

I'm willing to believe that that resolution is via semantic feedback, and/or perhaps by formulating the grammar of passives with a special role for "by"-phrases separate from ordinary prepositional-phrase modifiers. The semantic account must dispose of the spatial reading of "by the burglar," which does allow stative semantics, but that's easily done. At any rate, I am no longer entertaining the adjectival parse of "the window was closed" eight words later.

So sentence (U) parses fine, and its undecidability is just semantic after all -- in the contradiction between the real window of "closed ... by the burglar" and the fake one of "was in fact a ... mural." And the fact that it does not represent the all-stative semantics you intended is peripheral. It also does not represent the semantics of "My dog has fleas," but that's not the syntax's fault.

"It is a yes/no decision problem. Syntactic/Unsyntactic." - you have just shown that the for the sentence that you constructed the answer cannot be neither.

As a full-time computational and sometime regular linguist, I feel the need to comment here. As several people have pointed out, the error here lies in the conflation of syntax and semantics.

The syntax of (U) is perfectly valid. It has two coordinated passives, with the agent PP distributing over the coordinator, and a temporal subordinate clause. The semantics certainly are iffy, but that doesn't change the validity of the syntax.

As far as I know, no mainstream theory of language has semantic feedback into syntax. Information flows from syntax into semantics (it'd be hard to envision a theory where that isn't the case, to be hones), but you don't really need the inverse flow.

Meaning informs parsing. Meaning, and prior experience, clearly have a big impact on how our brains interpret a series of words. But it's not absolutely necessary. We can see unfamiliar words, intuit a part of speech for them, and move on. We can take deliberately constructed nonsense and come up with a parse for it. We can read and enjoy Jabberwocky. The parse may be more ambiguous when you have no idea what the utterance actually means, but it still seems clear to me that underlying structure provides part of the information, and semantics another part, and that the whole process is a probabilistic one.

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.