Friday, January 14, 2011

"Ubi cunt?"

Ray Girvan pointed out in the comments on "Light reading" that a lot of the early n-grams for obscenities were artifacts of bad OCR. A little digging confirms this theory; there is a lot of (Latin) "eunt" and "sunt" and the like that Google scans in as "cunt" -- and this is so even in the 18th century. Then there are cases where a word like "mitescunt" -- written with the long s -- gets scanned in as "mite/cunt." More egregiously, a lot of Latin words like "ducunt" and "dicunt" are broken up by the OCR for no reason. And then there is considerable misreading of italicized and Gothic lettering, so that, for instance, "Divers Presbyterian divines came also" is scanned in as "...d'vinu cunt alfi." While much of this is inevitable, I don't understand why Google can't do a better job of classifying Latin texts as Latin rather than English.

Note that this is unrelated to the noise issue, which only affects pre-1650 texts.

The thing is that with a lot of these words, the story that the graphs tell -- of a freewheeling past followed by Victorian repression followed by the 20th century -- is broadly consistent with the usual picture of English literary history. But (a) how often does one really see "cunt" in 17th century writing? And (b) in the age of the poem and pamphlet, literary writing perhaps constituted a smaller part by volume of the total output than it did in the age of the novel. Anyway, we report, you decide.

2 comments:

Jenny Davidson said...

And of the poems that used the word 'cunt' (obviously I am thinking of Rochester), they were vastly more likely to circulate - and circulate with obscenities - in manuscript than in print!

Sarang said...

True! Though I don't know what the publication history of these is (my copy of R's poems is away in a basement). At least some seem to have been printed by the early 1700s...