Monday, January 11, 2010

Counting Badgers

There was a Language Log post the other day about how some phrases seem to return more Google hits when you search for them with quotes than when you search without quotes. This clearly oughtn't to happen, because every page that contains the phrase "badger hair" also contains the words "badger" and "hair." I would suppose Google is finding all the results but just not doing a very good job of estimating how many there are.

I tried the following variant on this: I googled strings of the form "badger badger ... badger" (i.e. arbitrarily long sequences of the word "badger"). I put quotes around the string to get rid of e.g. the Wikipedia page for badgers, which has a lot of nonsequential instances of "badger." I wanted to see if the hit count was monotonically non-increasing in the length of the string, which, again, it ought to be, as any page that contains "badger badger badger" also contains "badger badger." I was amused to find out that this is not in fact the case. Something weird happens with the seventh "badger": the hit count goes up from about 10000 to about 85000, and I have no real sense of why.

(By way of comparison, I also tried this with the string "seal seal seal ... seal," which was well-behaved.)
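For anyone who wants to repeat the experiment, here is a minimal sketch of the check described above. It does not query Google; it assumes you have typed the queries in by hand and written down the reported hit counts. The names `badger_query` and `monotonicity_violations` are made up for this illustration, and the only counts used are the two approximate figures mentioned above.

```python
def badger_query(n, word="badger"):
    """Build the quoted search phrase: `word` repeated n times."""
    return '"' + " ".join([word] * n) + '"'


def monotonicity_violations(counts):
    """Given {phrase_length: reported_hits}, return the (shorter, longer)
    length pairs where the longer phrase reports MORE hits.

    Every page containing the longer phrase also contains the shorter one,
    so an honest count should never go up as the phrase gets longer.
    """
    lengths = sorted(counts)
    return [
        (a, b)
        for a, b in zip(lengths, lengths[1:])
        if counts[b] > counts[a]
    ]


# Approximate counts from the post: six badgers vs. seven badgers.
reported = {6: 10_000, 7: 85_000}

print(badger_query(3))
print(monotonicity_violations(reported))
```

With only those two data points the check flags the jump between six and seven badgers; filling in the dictionary with counts for other lengths would show whether that is the only place the estimate misbehaves.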

1 comment:

Zed said...

Actually I guess this is explained by Mark Liberman's thing about common n-grams. Maybe it does an "honest" search up to "badger badger badger badger badger badger" and beyond that just looks up the string in a list of n-grams...