Wednesday, October 1, 2014

Henry 10: Shakespeare's history plays have a new marketing department

Shakespeare's (English) history plays are too confusing for today's audiences. The problem, according to the new marketing director, is largely a matter of inconsistent branding. From a branding perspective, this list has a number of obvious problems:

For a start, is the product called "Henry" or not? Far too many different names. And they're all kings, right, so why is only John called out as such? Also, some of these guys sound like sequels or minor revisions, not full-fledged new kings, which is going to affect sales. Let's get this sorted out by regularizing the names and choosing a natural and consistent numbering scheme.

Sunday, September 14, 2014

Error Analysis of the Week: Ratnaparkhi on POS tagging

This is the first of a planned series of posts showcasing good error analysis for NLP. This is motivated by a conversation with Lori Levin about "linguists who improve your score".  It is intended as a positive response to the common lament that linguistics is getting squeezed out of modern NLP. I think that error analysis is an undervalued skill, and that this is an area where trained linguists can make an especially useful contribution. 

The plan is that the series will cover several different areas of computational linguistics, and perhaps also be useful to NLP researchers seeking to deepen their understanding of how language works. The coverage is necessarily going to be sparse. For a more thorough introduction to linguistic concepts, aimed at an NLP audience, I recommend Emily Bender's Linguistic Fundamentals for Natural Language Processing. For a general audience introduction to the ways computers can do things to language, I self-servingly recommend Dickinson, Brew and Meurers' Language and Computers.

We start with part-of-speech tagging, which might seem to be an easy case (it isn't!). Modern statistical NLP relies on linguistically annotated corpora, which are used to train and test machine learning models. The shallowest layer of linguistic annotation assigns parts of speech, such as noun and verb, to each word of each sentence. You probably learned something about parts of speech at school, but as with many things, the devil is in the detail. For some languages it is unclear that even the distinction between nouns and verbs makes sense. So it should be no surprise that when the annotators working to build the famous Penn Treebank decided to assign parts of speech, they ran into not only detailed technical issues but also genuine scientific concerns. 

The treebank annotators were by no means the first to label corpora with parts of speech. Previous projects had used much richer tag sets; the Penn Treebank annotators chose a smaller one, consisting of 36 POS tags and 12 other tags (for punctuation, currency symbols and the like). The reasons for the Penn Treebank design are outlined in a journal article by Santorini et al. To produce the final corpus, the treebank annotators first ran an automated annotation program, then had human analysts carefully correct its output. As we will see, this gave rise to interesting patterns of error.

The analysis that we will focus on is part of a paper by Adwait Ratnaparkhi, who was a student at Penn, and had the benefit of insider access to the treebank annotators. The primary concern of the paper is a so-called maximum entropy model, which is shown to produce what was then state-of-the-art performance. The claim of the paper is that the maximum entropy framework succeeds because it is flexible and makes it possible to design and use linguistically informed features. Since the paper was written in 1996, and statistical NLP has progressed since then, it should surprise no-one that its raw performance is lower than that achieved by the latest systems. However, the paper does include an exemplary error analysis: one that could usefully be emulated in present-day papers.

Error analysis

The point of error analysis is to understand the behaviour of the system with which one is working. Sometimes this leads directly to improvements in the system, but sometimes it instead identifies errors that are likely to be hard or impossible to fix. In either case the analysis is useful, because it helps to direct effort toward changes that might be useful and away from ones that will not be.

By the numbers, part-of-speech taggers are pretty good. Ratnaparkhi's system got nearly 97% of the tags right. It does much better on words that it has seen in the training set than on words it has never seen before.


The counts for Ratnaparkhi's system are reconstructed from what is in the paper, which only gives percentages.
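
The known-word versus unknown-word split is straightforward to compute once gold and predicted tags are available side by side. Here is a minimal sketch in Python; the function name `accuracy_by_familiarity` and the toy tokens are my own inventions for illustration, not data or code from the paper.

```python
def accuracy_by_familiarity(tokens, training_vocab):
    """Split tagging accuracy into words seen in training and words not seen.

    tokens: iterable of (word, gold_tag, predicted_tag) triples.
    training_vocab: set of words that occurred in the training data.
    Returns a dict mapping "known"/"unknown" to accuracy (None if no tokens).
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for word, gold, predicted in tokens:
        bucket = "known" if word in training_vocab else "unknown"
        counts[bucket][1] += 1
        if gold == predicted:
            counts[bucket][0] += 1
    return {b: (c / n if n else None) for b, (c, n) in counts.items()}


# Invented toy data, purely to show the shape of the computation.
training_vocab = {"the", "dog", "barks"}
tokens = [
    ("the", "DT", "DT"),
    ("dog", "NN", "NN"),
    ("barks", "VBZ", "NNS"),   # a known word, tagged wrongly
    ("loudly", "RB", "RB"),    # an unknown word, tagged correctly
    ("zorbly", "RB", "JJ"),    # an unknown word, tagged wrongly
]
result = accuracy_by_familiarity(tokens, training_vocab)
print(result)
```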

There's also a more detailed error analysis, indicating which words caused problems.


These numbers matter because of their general trend. If we could fix the top two categories without breaking anything else, we would have removed more than 10% of the system's errors. They also matter because of the linguistic patterns. The RB(R) (adverb), JJ(R) (adjective) and IN (preposition) parts of speech seem to be tricky. But why, and what can be done about it?
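
A word-by-word breakdown of this kind can be produced with a few lines of counting. The sketch below is an illustration under my own assumptions: the function name `error_breakdown` is hypothetical, and the toy tokens are invented rather than taken from the paper.

```python
from collections import Counter


def error_breakdown(tokens):
    """Rank (word, gold_tag, predicted_tag) confusions by frequency.

    tokens: iterable of (word, gold_tag, predicted_tag) triples.
    Returns a list of (confusion, count, share_of_all_errors),
    most frequent confusion first.
    """
    errors = Counter((w, g, p) for w, g, p in tokens if g != p)
    total = sum(errors.values())
    return [(key, n, n / total) for key, n in errors.most_common()]


# Invented toy data: six errors among sixteen tokens.
tokens = (
    [("about", "RB", "IN")] * 3
    + [("that", "IN", "DT")] * 2
    + [("more", "JJR", "RBR")]
    + [("the", "DT", "DT")] * 10
)
breakdown = error_breakdown(tokens)
for (word, gold, predicted), count, share in breakdown:
    print(f"{word}: gold {gold}, predicted {predicted}, {count} errors ({share:.0%})")
```

Reading the shares from the top of the list down tells you how much of the total error would disappear if a given confusion were fixed without breaking anything else.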

Rich Features

A potential advantage of Ratnaparkhi's technique is that he has more options in choosing features than are available in (say) the Hidden Markov Model taggers that were current at the time he was writing. So, what is a feature then? Basically, anything that is measurable on the history up to the point where a decision needs to be made. In Ratnaparkhi's case, the history consists of the word for which the decision is needed, the previous two part of speech tags, the two previous words, and the two following words. A feature is a predicate that relates a particular choice of tag (\(t_i\)) to properties of the history. We call them rich features, because they can refer to any measurable property of the history.
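
To make the idea of "a predicate that relates a choice of tag to properties of the history" concrete, here is a small sketch in Python. The class `History`, the helper `make_feature` and the `<s>` boundary marker are all my own assumptions for illustration; the paper does not prescribe any particular representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class History:
    """Everything the tagger may look at when deciding the tag of one word."""
    word: str            # the word to be tagged
    next_word: str       # the following word
    next_next_word: str  # the word after that
    prev_word: str       # the previous word
    prev_prev_word: str  # the word before that
    prev_tag: str        # tag already assigned to the previous word
    prev_prev_tag: str   # tag already assigned to the word before that


def make_feature(predicate, tag):
    """Build a binary feature: 1 when the candidate tag equals `tag`
    and `predicate` holds of the history, otherwise 0."""
    def feature(history, candidate_tag):
        return int(candidate_tag == tag and predicate(history))
    return feature


# A feature that fires when the previous tag is DT and the candidate tag is NN.
after_determiner_noun = make_feature(lambda h: h.prev_tag == "DT", "NN")

# "<s>" is an assumed padding symbol for positions before the sentence start.
h = History("dog", "barks", "loudly", "the", "<s>", "DT", "<s>")
print(after_determiner_noun(h, "NN"), after_determiner_noun(h, "VB"))
```

Because the predicate is an arbitrary function of the history, any measurable property can be turned into a feature in the same way.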

The history is \( h_i = \{ w_i, w_{i+1},w_{i+2},w_{i-1},w_{i-2}, t_{i-1},t_{i-2}\}\) and predicates are defined over \((h_i, t_i)\). There are lots of potentially active features, because the words and tags can be filled in in many different ways, so the standard way of defining which features are used is to create templates, then automate the process of generating actual features from the templates. The standard templates used by Ratnaparkhi were the following:

\(w_i\) is not rare:
\(w_i = X  \; \& \;  t_i = T\)

\(w_i\) is rare:
Prefixes: \(w_i[:n] = X  \; \& \;  t_i = T\) where \(n \in 1..4\)
Suffixes: \(w_i[-n:] = X  \; \& \;  t_i = T\) where \(n \in 1..4\)
Number: \(w_i\) contains number \( \& \;  t_i = T\)
Uppercase: \(w_i\) contains uppercase letter  \( \& \;  t_i = T\)
Hyphen: \(w_i\) contains hyphen  \( \& \;  t_i = T\)

\( \forall w_i \):
\(t_{i-1} = X \;  \& \;  t_i = T\)
\(t_{i-1} = X\;   \&\;  t_{i-2} = Y\;   \& \;  t_i = T\)
\(w_{i-2} = X \;  \& \;  t_i = T\)
\(w_{i-1} = X \;  \& \;  t_i = T\)
\(w_{i+1} = X \;  \& \;  t_i = T\)
\(w_{i+2} = X \;  \& \;  t_i = T\)
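
The templates listed above can be instantiated mechanically at each position of a tagged sentence. The following Python sketch shows one way to do it. The function names, the string encoding of the predicates and the `<pad>` symbol are my own assumptions, and whether a word counts as rare is passed in as a flag, because the rarity threshold is not stated here.

```python
def contextual_predicates(words, tags, i, is_rare):
    """Fill in the templates for position i, returning predicate strings."""
    def word_at(j):
        return words[j] if 0 <= j < len(words) else "<pad>"

    def tag_at(j):
        return tags[j] if 0 <= j < len(tags) else "<pad>"

    w = words[i]
    preds = []
    if not is_rare:
        preds.append(f"w_i={w}")
    else:
        # Prefixes and suffixes of length 1 to 4.
        for n in range(1, 5):
            if len(w) >= n:
                preds.append(f"prefix={w[:n]}")
                preds.append(f"suffix={w[-n:]}")
        if any(c.isdigit() for c in w):
            preds.append("has_number")
        if any(c.isupper() for c in w):
            preds.append("has_uppercase")
        if "-" in w:
            preds.append("has_hyphen")
    # Templates that apply to every word.
    preds.append(f"t_i-1={tag_at(i - 1)}")
    preds.append(f"t_i-2,t_i-1={tag_at(i - 2)},{tag_at(i - 1)}")
    preds.append(f"w_i-2={word_at(i - 2)}")
    preds.append(f"w_i-1={word_at(i - 1)}")
    preds.append(f"w_i+1={word_at(i + 1)}")
    preds.append(f"w_i+2={word_at(i + 2)}")
    return preds


def features_for(words, tags, i, is_rare):
    """Pair each predicate with the tag at position i, giving (predicate, T)."""
    return [(p, tags[i]) for p in contextual_predicates(words, tags, i, is_rare)]


words = ["the", "well-heeled", "communities"]
tags = ["DT", "JJ", "NNS"]
feats = features_for(words, tags, 1, is_rare=True)
for feat in feats:
    print(feat)
```

Running this over every position in a training corpus and collecting the distinct pairs yields the candidate feature set.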

These templates, taken together, can be filled in for each of the positions in the training corpus. They are sufficient to get the performance reported in the paper. But it is plausible that there might be other, more specialised features that will improve performance on particular words or in particular situations.

Ratnaparkhi's hypothesis was that specialised features would help. The first step in testing this hypothesis was to make precise the notion of "specialised feature", which required examination of the errors that the system was making, and decisions about what kinds of specialised feature to try. For this, the breakdown by words was useful.

The specialised features were similar to the original ones, but they mention specific words. They are not derived from templates, but selected on the basis of the word-by-word error analysis. For example:

\(w_i = "about"  \; \& \;  t_i = "IN" \; \& \; t_{i-2} = "DT" \; \& \; t_{i-1} = "NNS" \)

Features like this were made for 50 "difficult" words, and the experiment was re-run. Unfortunately, this did not improve performance. At this point, Ratnaparkhi did something truly exemplary: realizing that the "correct" answers in the gold standard might not actually be correct, he went back and looked at the identities of the annotators, and discovered that, for example, the ratio of about/IN to about/RB changes when the annotator changes. This should not have happened, because the tagging guidelines are supposed to be written in such a way as to ensure that annotators are interchangeable. But of course they are not, and what we have is prima facie evidence that in the case of words like "about", the gold standard is almost certainly inconsistent. If different choices had been made about which annotator did what, the corpus would be different.
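
A check of this kind needs nothing more than a count of tags per annotator for the word in question. The Python sketch below assumes that each token comes with an annotator identifier; the record layout, the function name and the numbers are all invented for illustration and are not the figures from the paper.

```python
from collections import Counter, defaultdict


def tag_distribution_by_annotator(records, word):
    """For one word, compute each annotator's relative frequency of each tag.

    records: iterable of (annotator_id, word, gold_tag) triples.
    Returns {annotator_id: {tag: proportion}}.
    """
    counts = defaultdict(Counter)
    for annotator, w, tag in records:
        if w.lower() == word:
            counts[annotator][tag] += 1
    return {
        a: {t: n / sum(c.values()) for t, n in c.items()}
        for a, c in counts.items()
    }


# Invented toy records: two hypothetical annotators, "A" and "B".
records = (
    [("A", "about", "IN")] * 3
    + [("A", "about", "RB")]
    + [("B", "about", "IN")]
    + [("B", "About", "RB")] * 3
    + [("B", "the", "DT")]
)
dist = tag_distribution_by_annotator(records, "about")
print(dist)
```

If the guidelines really made annotators interchangeable, the distributions for different annotators should look roughly alike. A large gap between them, as in the toy data, is a warning sign about the gold standard rather than about the tagger.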

Wednesday, September 11, 2013

System 1.5: the dialog system that expects the expected unexpected, but does the conventional thing.

A while ago, Jon Oberlander wrote a squib for Computational Linguistics called "Do the Right Thing ... but Expect the Unexpected", in which he argued that when people speak, they often succeed in choosing what to say in accordance with the maxim to do the right thing, which in this case means to produce the utterance that the listener will find easiest to interpret and make sense of. But they also sometimes fail, and the article points out that reasonable generation algorithms may well also do the wrong (i.e. unexpected) thing, and that this is no surprise.

In the intervening years, things have happened, some of them expected, some of them unexpected. Among them was the invention, by Donald Rumsfeld, of a useful meme for dichotomization. He split the unknowns of a situation into the "known unknowns" and the "unknown unknowns". This also works for the "unexpected": we have the "expected unexpected" and the "unexpected unexpected". Both of these turn up in information-seeking natural language dialog systems. The expected exchanges of such a dialog system are things like a question-answer pair. The "expected unexpected" are the points at which knowledge gaps and other misfires result in the need for dialog moves, such as the initiators for clarification sub-dialogs, that are there in order to fix difficulties and get the system and its interlocutor out of the ditch next to the royal road of goal-directed dialog and onto the smooth well-maintained tarmac. The "unexpected unexpected" is when something happens that leads the system to believe that its interlocutor is off in the next field climbing a tree, talking to a cow or even climbing a cow and talking to a tree. The system has no conventional moves for getting things back on track.

At this point the system may be tempted to think that it ought to engage in sophisticated reasoning in order to work out what the appropriate repair is, by, for example, recruiting extra knowledge from somewhere until it can work out that its interlocutor is after all doing something rational. This is going to take work, which is scary. What is even more scary, this work is like the effortful, rational, slow work that Daniel Kahneman calls "System 2". Kahneman points out that System 2 thought comes less naturally than System 1 thought, which is more automatic. I think that modern dialog systems, especially the ones that work by reinforcement learning, are basically operating in a way that mirrors System 1, choosing dialog moves that, from experience, tend to work out well in moving the interaction along. They actually do a bit more than this, because they can often tell when the dialog is in a ditch, and get it out. Their remedies (such as clarification requests) are conventional, stereotyped and maybe under-informed, but they are a bit flexible, and they usually work in handling the expected unexpected. Maybe they are something like "System 1.5", with a bit of flexibility, but not enough to handle the cases where the dialog seems to be off in the next field. I doubt that there is any hope of learning System 2 thought by reinforcement learning over dialog traces.

That's OK, because there are two quite distinct reasons why the system might think the dialog is off in the next field. Either the situation will become sensible when the system manages to find the chain of reasoning that will allow it to understand that the interlocutor is acting reasonably, or, perhaps just as likely, the interlocutor really is up a cow, talking to a tree, and no amount of inference will get the dialog back where it needs to be. This interlocutor is beyond the pale, and the best that they can expect is kindness (which coincidentally, Don Rumsfeld ... hmm, let's leave that thought unfinished).

So, if I ever get to design a dialog system, it will be called System 1.5, and it will adopt the Rumsfeldian philosophy of expecting the expected unexpected, then doing the conventional thing.

Tuesday, August 21, 2012

Thursday, June 28, 2012

One for negation experts

Northern Ireland expert Denis Murray, on BBC, talking with an interviewer about the Queen’s historic meeting with Martin McGuinness.
Interviewer: Would you say that until recently something like this would have been unthinkable?
Murray: I’d say more than that, until two years ago it was not even thinkable.
It’s pretty clear that Murray felt he was adding information. But why does “not even thinkable” mean more than “unthinkable”?
I believe he took the interviewer’s “unthinkable” to mean “you can (indeed must) think it, but you really should not do it”, and augmented it by using “thinkable” to imply that “nobody would even have thought of it, or have needed to judge that it was a bad idea”.

Wednesday, November 9, 2011




In his IJCNLP keynote address, among many other things, Matt Lease mentioned a crowdsourcing service that is more focused on ethics than many others. If I do crowdsourcing, I would like to consider using this.

Thursday, September 1, 2011

Please call me, Ishmael
In my quest for the white wael
I lost your voicemael