## Wednesday, August 24, 2011

### "What Shakespeare Knew"

It's not every day that you learn a new -- & fairly generally useful -- trick. This one answers the following kind of question:
Shakespeare wrote 31534 different words, of which 14376 appear only once, 4343 twice, etc. The question considered is how many words he knew but did not use.

["Estimating the number of unseen species: How many words did Shakespeare know?", Biometrika 63, 435 (1976), via Language Log] The reason this came up was the new paper about the total number of species on earth. Mark Liberman's lecture notes on this type of estimation problem are clear enough that I'll just excerpt at length:

It often happens that scientists, engineers and other biological organisms need to predict the relative probability of a large number of alternatives that don't individually occur very often. This is especially troublesome in cases where many of the things that happen have never happened before: where "rare events are common".

The simple "maximum likelihood" method for predicting the future from the past is to estimate the probability of an event-type that has occurred r times in N trials as r/N. This generally works well if r is fairly large (and if the world doesn't change too much). But as r gets smaller, the maximum likelihood estimate gets worse. And if r is zero, it may still be quite unwise to bet that the event-type in question will never occur in the future. Even more important, the fraction of future events whose past counts are zero may be substantial.

There are two problems here. One is that the r/N formula divides up all of the probability mass -- all of our belief about the future -- among the event-types that we happen to have seen. This doesn't leave anything for the unseen event-types (if there are any). How can we decide how much of our belief to reserve for the unknown? And how should we divide up this "belief tax" among the event-types that we've already seen?
There isn't a particularly nice general formula answering the original question, but there is one -- apparently due to Alan Turing -- for a closely related question:
given a representative sample of N words containing m hapax legomena (words that occur exactly once in the sample), the probability that the next word picked out of the full corpus will be something hitherto unseen is approximately m/N.
(NB this clearly has the right limiting behavior. If the sample consists entirely of hapax legomena, then m/N = 1, so the prediction is that the next pick is certain to be something you have not seen so far, which is the natural guess when nothing has ever repeated. Similarly, if there are no hapax legomena, you wouldn't expect to suddenly start finding new words.)
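To make the contrast with the r/N rule concrete, here is a minimal sketch in Python. The function names and the ten-word toy sample are my own illustrative choices, not anything from the paper or the lecture notes; the only thing taken from the text is the m/N rule itself, with m counted as the number of words seen exactly once.

```python
from collections import Counter


def mle_probability(tokens, word):
    """Maximum likelihood estimate r/N for one word (hypothetical helper)."""
    counts = Counter(tokens)
    return counts[word] / len(tokens)


def unseen_probability(tokens):
    """Estimate m/N: the chance that the next token is a word not yet seen.

    m is the number of hapax legomena (words occurring exactly once)
    and N is the total number of tokens in the sample.
    """
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens)


# Toy sample, chosen only for illustration.
sample = "to be or not to be that is the question".split()

# r/N gives an unseen word no probability at all ...
print(mle_probability(sample, "slings"))
# ... while m/N reserves some probability mass for the unseen.
print(unseen_probability(sample))

# The two limiting cases mentioned in the text.
all_hapaxes = ["a", "b", "c", "d"]
no_hapaxes = ["a", "a", "b", "b"]
print(unseen_probability(all_hapaxes), unseen_probability(no_hapaxes))
```

A ten-word sample is far too small for the estimate to mean much; the point is only to show what gets counted.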

This is potentially a nice trick for Fermi problems (how many words do Chicagoans have for "piano tuner"?) but does not extend very well to the original problem -- which is what the limiting distribution would be as the corpus size goes to infinity. (Asked to do that as a Fermi problem I would just draw the histogram and extrapolate backwards. Of course I am not a statistician.)
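A rough synthetic check of that distinction, as a sketch under loud assumptions: the "corpus" here is an invented vocabulary of 20000 word types with Zipf-like probabilities (proportional to 1/rank), which is my choice for illustration and not a claim about Shakespeare or about anything in the paper. Because the vocabulary is known in the simulation, m/N can be compared with the true probability of drawing an unseen word, and the number of unseen word types can be read off separately.

```python
import random


def simulate(vocab_size=20000, sample_size=5000, seed=0):
    """Draw a sample from an assumed Zipf-like vocabulary and compare
    the m/N estimate with the true unseen probability mass."""
    rng = random.Random(seed)

    # Assumed word probabilities: proportional to 1/rank (illustrative only).
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Words are represented by their index 0 .. vocab_size - 1.
    sample = rng.choices(range(vocab_size), weights=probs, k=sample_size)

    counts = {}
    for word in sample:
        counts[word] = counts.get(word, 0) + 1

    # m/N: fraction of the sample made up of words seen exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    estimate = hapaxes / sample_size

    # Ground truth, available only because the vocabulary is simulated.
    true_unseen_mass = sum(p for word, p in enumerate(probs) if word not in counts)
    unseen_types = vocab_size - len(counts)
    return estimate, true_unseen_mass, unseen_types


estimate, true_unseen_mass, unseen_types = simulate()
print("m/N estimate of unseen probability:", round(estimate, 3))
print("true unseen probability:           ", round(true_unseen_mass, 3))
print("word types never seen in sample:   ", unseen_types)
```

In runs like this the first two numbers should land close together, but nothing in m/N says how many distinct word types share that unseen probability. That count depends on the shape of the tail, which is exactly what the original question asks about.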

(It strikes me that Shakespeare is a shitty choice for the original estimation problem as posed. The question is something like this: suppose S. had written an infinitely large -- or at least much much larger -- corpus, of which what we have is a representative sample, how many different words would it have contained? This is not a sensible question to ask about Shakespeare -- one way to imagine him writing more plays is if he'd lived longer and/or written more rapidly, either of which would change the nature of the corpus -- but is not an unreasonable question re, say, Sophocles or other ancients.)