## Error Analysis of the Week: Ratnaparkhi on POS tagging

This is the first of a planned series of posts showcasing good error analysis for NLP. This is motivated by a conversation with Lori Levin about "linguists who improve your score".  It is intended as a positive response to the common lament that linguistics is getting squeezed out of modern NLP. I think that error analysis is an undervalued skill, and that this is an area where trained linguists can make an especially useful contribution.

The plan is that the series will cover several different areas of computational linguistics, and perhaps also be useful to NLP researchers seeking to deepen their understanding of how language works. The coverage is necessarily going to be sparse. For a more thorough introduction to linguistic concepts, aimed at an NLP audience, I recommend Emily Bender's Linguistic Fundamentals for Natural Language Processing. For a general audience introduction to the ways computers can do things to language, I self-servingly recommend Dickinson, Brew and Meurers' Language and Computers.

We start with part-of-speech tagging, which might seem to be an easy case (it isn't!). Modern statistical NLP relies on linguistically annotated corpora, which are used to train and test machine learning models. The shallowest layer of linguistic annotation assigns parts of speech, such as noun and verb, to each word of each sentence. You probably learned something about parts of speech at school, but as with many things, the devil is in the detail. For some languages it is unclear that even the distinction between nouns and verbs makes sense. So it should be no surprise that when the annotators working to build the famous Penn Treebank decided to assign parts of speech, they ran into not only detailed technical issues but also genuine scientific concerns.

The treebank annotators were by no means the first to label corpora with parts of speech. They chose a much smaller tag set than the rich sets used in previous projects: 36 POS tags and 12 other tags (for punctuation, currency symbols and the like). The reasons for the Penn Treebank design are outlined in a journal article by Santorini et al. The annotators produced the final corpus in two stages: an initial automated annotation by a computer program, followed by careful correction by human analysts. As we will see, this gave rise to interesting patterns of error.

The analysis that we will focus on is part of a paper by Adwait Ratnaparkhi, who was a student at Penn, and had the benefit of insider access to the treebank annotators. The primary concern of the paper is a so-called maximum entropy model, which is shown to produce what was then state-of-the-art performance. The claim of the paper is that the maximum entropy framework succeeds because it is flexible, which makes it possible to design and use linguistically informed features. Since the paper was written in 1996, and statistical NLP has progressed since then, it should surprise no one that its raw performance is lower than that achieved by the latest systems. However, the paper does include an exemplary error analysis: one that could usefully be emulated in present-day papers.

### Error analysis

The point of error analysis is to understand the behaviour of the system with which one is working. Sometimes this leads directly to improvements in the system, but sometimes it instead identifies errors that are likely to be hard or impossible to fix. In either case the analysis is useful, because it helps to direct effort toward changes that might be useful and away from ones that will not be.

By the numbers, part-of-speech taggers are pretty good. Ratnaparkhi's system got nearly 97% of the tags right. It does much better on words that it has seen in the training set than on words it has never seen before.

| | Correct | Fraction | Mistakes |
|---|---|---|---|
| sentence | 3,804/8,000 | 0.476 | 4,196 |
| token | 185,942/192,826 | 0.964 | 6,884 |
| known | 180,676/186,719 | 0.968 | 6,043 |
| unknown | 5,266/6,107 | 0.862 | 841 |

The counts for Ratnaparkhi's system are reconstructed from what is in the paper, which only gives percentages.
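As a quick sanity check on the reconstructed counts, the derived columns can be recomputed from the correct/total pairs. This is only a sketch: the function name `summarise` and the dictionary layout are my own choices, not anything from the paper.

```python
# (correct, total) pairs as given in the table of counts above.
counts = {
    "sentence": (3_804, 8_000),
    "token": (185_942, 192_826),
    "known": (180_676, 186_719),
    "unknown": (5_266, 6_107),
}


def summarise(correct, total):
    """Return (fraction correct, number of mistakes) for one row."""
    return correct / total, total - correct


for unit, (correct, total) in counts.items():
    fraction, mistakes = summarise(correct, total)
    print(f"{unit:<9}{fraction:.3f}{mistakes:>7,}")

# Known and unknown tokens partition the token set, so the rows must add up.
assert counts["known"][0] + counts["unknown"][0] == counts["token"][0]
assert counts["known"][1] + counts["unknown"][1] == counts["token"][1]
```

The two assertions at the end check that the known and unknown rows sum to the token row, which is a cheap way to catch transcription slips in a table like this.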

There's also a more detailed error analysis, indicating which words caused problems.

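| Word | Correct | Prediction | Frequency |
|---|---|---|---|
| that | DT | IN | 389 |
| more | RBR | JJR | 221 |
| up | IN | RB | 187 |
| that | WDT | IN | 184 |
| as | RB | IN | 176 |
| up | IN | RP | 176 |
| more | JJR | RBR | 175 |
| that | IN | WDT | 159 |
| that | IN | DT | 127 |
| out | RP | IN | 126 |
| that | IN | WDT | 123 |
| much | JJ | RB | 118 |
| yen | NN | NNS | 117 |
| chief | NN | JJ | 116 |
| up | RP | RB | 114 |
| ago | IN | RB | 112 |
| much | RB | JJ | 111 |
| out | IN | RP | 109 |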

These numbers matter because of their general trend. If we could fix the top two categories without breaking anything else, we would have removed more than 10% of the system's errors. They also matter because of the linguistic patterns. The RB(R) (adverb), JJ(R) (adjective) and IN (preposition) parts of speech seem to be tricky. But why, and what can be done about it?
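For readers who want to build a word-by-word error table of this kind for their own tagger, the tally is simple to sketch. The function name `confusion_by_word` and the toy data below are invented for illustration; none of it comes from the paper or the treebank.

```python
from collections import Counter


def confusion_by_word(words, gold_tags, predicted_tags):
    """Count (word, gold tag, predicted tag) triples where the tagger was wrong."""
    errors = Counter()
    for word, gold, predicted in zip(words, gold_tags, predicted_tags):
        if gold != predicted:
            errors[(word, gold, predicted)] += 1
    return errors


# Toy data, invented for illustration only (not taken from the treebank).
words = ["that", "is", "more", "than", "that", "man", "said", "that", "day"]
gold = ["DT", "VBZ", "JJR", "IN", "DT", "NN", "VBD", "DT", "NN"]
pred = ["IN", "VBZ", "RBR", "IN", "IN", "NN", "VBD", "DT", "NN"]

# Most frequent confusions first, one row per (word, correct, prediction).
for (word, g, p), n in confusion_by_word(words, gold, pred).most_common():
    print(f"{word}\t{g}\t{p}\t{n}")
```

Keying on the full triple, rather than on the tag pair alone, is what lets the same word show up in several rows, as "that" and "up" do above.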

### Rich Features

A potential advantage of Ratnaparkhi's technique is that he has more options in choosing features than are available in (say) the Hidden Markov Model taggers that were current at the time he was writing. So, what is a feature then? Basically, anything that is measurable on the history up to the point where a decision needs to be made. In Ratnaparkhi's case, the history consists of the word for which the decision is needed, the previous two part-of-speech tags, the two previous words, and the two following words. A feature is a predicate that relates a particular choice of tag ($$t_i$$) to properties of the history. We call them rich features, because they can refer to any measurable property of the history.

The history is $$h_i = \{ w_i, w_{i+1},w_{i+2},w_{i-1},w_{i-2}, t_{i-1},t_{i-2}\}$$ and predicates are defined over $$(h_i, t_i)$$. There are lots of potentially active features, because the words and tags can be filled in in many different ways, so the standard way of defining which features are used is to create templates, then automate the process of generating actual features from the templates. The standard templates used by Ratnaparkhi were the following:

| Condition | Feature |
|---|---|
| $$w_i$$ is not rare | $$w_i = X \; \& \; t_i = T$$ |
| $$w_i$$ is rare | Prefixes: $$w_i[:n] = X \; \& \; t_i = T$$ where $$n \in 1..4$$ |
| | Suffixes: $$w_i[-n:] = X \; \& \; t_i = T$$ where $$n \in 1..4$$ |
| | Number: $$w_i$$ contains number $$\& \; t_i = T$$ |
| | Uppercase: $$w_i$$ contains uppercase letter $$\& \; t_i = T$$ |
| | Hyphen: $$w_i$$ contains hyphen $$\& \; t_i = T$$ |
| $$\forall w_i$$ | $$t_{i-1} = X \; \& \; t_i = T$$ |
| | $$t_{i-1} = X \; \& \; t_{i-2} = Y \; \& \; t_i = T$$ |
| | $$w_{i-2} = X \; \& \; t_i = T$$ |
| | $$w_{i-1} = X \; \& \; t_i = T$$ |
| | $$w_{i+1} = X \; \& \; t_i = T$$ |
| | $$w_{i+2} = X \; \& \; t_i = T$$ |

These templates, taken together, can be filled in for each of the positions in the training corpus. They are sufficient to get the performance reported in the paper. But it is plausible that there might be other, more specialised features that will improve performance on particular words or in particular situations.
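To make the idea of filling in templates concrete, here is a minimal sketch of how the history and the templates might be instantiated at one position. Several things here are my assumptions rather than details from the paper: the dictionary keys, the `<PAD>` symbol for positions outside the sentence, the caller-supplied `is_rare` test, the rule that a prefix or suffix is never longer than the word, and the encoding of each feature as a (context string, tag) pair.

```python
PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def history(words, tags, i):
    """Build h_i from the words and from the tags already assigned before i."""
    def word(j):
        return words[j] if 0 <= j < len(words) else PAD

    def tag(j):
        return tags[j] if 0 <= j < len(tags) else PAD

    return {
        "w0": word(i), "w+1": word(i + 1), "w+2": word(i + 2),
        "w-1": word(i - 1), "w-2": word(i - 2),
        "t-1": tag(i - 1), "t-2": tag(i - 2),
    }


def contexts(h, is_rare):
    """Instantiate the left-hand side of every template for one history."""
    w = h["w0"]
    out = []
    if not is_rare(w):
        out.append(f"w0={w}")
    else:
        for n in range(1, 5):
            if n <= len(w):  # assumption: affixes never exceed the word
                out.append(f"prefix={w[:n]}")
                out.append(f"suffix={w[-n:]}")
        if any(c.isdigit() for c in w):
            out.append("has_number")
        if any(c.isupper() for c in w):
            out.append("has_uppercase")
        if "-" in w:
            out.append("has_hyphen")
    # Templates that apply to every word, rare or not.
    out.append(f"t-1={h['t-1']}")
    out.append(f"t-2,t-1={h['t-2']},{h['t-1']}")
    for key in ("w-2", "w-1", "w+1", "w+2"):
        out.append(f"{key}={h[key]}")
    return out


def features(h, t, is_rare):
    """Pair each instantiated context with the candidate tag t_i = t."""
    return [(c, t) for c in contexts(h, is_rare)]


# Toy example: treat "well-heeled" as the only rare word.
words = ["the", "well-heeled", "investors", "sold"]
tags = ["DT"]  # tags assigned so far, left to right
h = history(words, tags, 1)
feats = features(h, "JJ", is_rare=lambda w: w == "well-heeled")
for f in feats:
    print(f)
```

In a real system `is_rare` would be driven by training-set frequencies; passing it in as a function keeps the sketch independent of any particular threshold.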

Ratnaparkhi's hypothesis was that specialised features would help. The first step in testing this hypothesis was to make precise the notion of "specialised feature", which required examination of the errors that the system was making, and decisions about what kinds of specialised feature to try. For this, the breakdown by words was useful.

The specialised features were similar to the original ones, but they mention specific words. They are not derived from templates, but selected on the basis of the word-by-word error analysis. For example:

$$w_i = \text{"about"} \; \& \; t_i = \text{IN} \; \& \; t_{i-2} = \text{DT} \; \& \; t_{i-1} = \text{NNS}$$
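Written as code, a specialised feature of this kind is just a hand-written predicate over the history and the candidate tag. The sketch below assumes the history is a dictionary with keys `w0`, `t-1` and `t-2`; those names, and the function name, are illustrative choices of mine.

```python
def about_after_dt_nns(h, t):
    """Fire when 'about' follows a DT NNS tag sequence and the candidate tag is IN."""
    return (
        h["w0"] == "about"
        and t == "IN"
        and h["t-2"] == "DT"
        and h["t-1"] == "NNS"
    )


# Hypothetical history for a phrase like "the rumours about ...".
h = {"w0": "about", "t-2": "DT", "t-1": "NNS"}
print(about_after_dt_nns(h, "IN"))  # True
print(about_after_dt_nns(h, "RB"))  # False
```

Unlike the template-generated features, nothing here is filled in automatically: the word and the tags are fixed by whoever wrote the predicate.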

Features like this were made for 50 "difficult" words, and the experiment was re-run. Unfortunately, this did not improve performance. At this point, Ratnaparkhi did something truly exemplary. Realising that the "correct" answers in the gold standard might not actually be correct, he went back and looked at the identities of the annotators, and discovered that, for example, the ratio of about/IN to about/RB changes when the annotator changes. This should not have happened, because the tagging guidelines are supposed to be written in such a way as to ensure that annotators are interchangeable. But of course they are not, and what we have is prima facie evidence that, for words like "about", the gold standard is inconsistent. If different choices had been made about which annotator did what, the corpus would be different.
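The annotator check can be sketched in a few lines, assuming you have a record of who tagged each token. The record format, the annotator ids and all of the data below are invented for illustration; they are not real treebank figures.

```python
from collections import Counter, defaultdict


def tag_distribution_by_annotator(records, word):
    """Map each annotator to a Counter of the gold tags they gave `word`."""
    dist = defaultdict(Counter)
    for annotator, token, tag in records:
        if token == word:
            dist[annotator][tag] += 1
    return dist


# Invented records of (annotator id, word, gold tag); not real treebank data.
records = [
    ("A", "about", "IN"), ("A", "about", "IN"), ("A", "about", "RB"),
    ("B", "about", "RB"), ("B", "about", "RB"), ("B", "about", "IN"),
    ("A", "the", "DT"),
]

dist = tag_distribution_by_annotator(records, "about")
for annotator, tags in sorted(dist.items()):
    total = sum(tags.values())
    shares = {t: round(n / total, 2) for t, n in sorted(tags.items())}
    print(annotator, shares)
```

If the guidelines really made annotators interchangeable, the per-annotator shares for a given word should look roughly the same; a large gap is a hint that the gold standard itself deserves a second look.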