Saturday, February 27, 2010

"If all else fails, write in German"

Via Dan Crow, J.S. Milne has advice for authors of math papers (at least, those who want to seem profound). Some of these tips for obfuscating are quite fresh and potentially useful:
  • Use c, a, b respectively to denote elements of sets A, B, C
  • If, in a moment of weakness, you do refer to a paper or book for a result, never say where in the paper or book the result can be found. In addition to making it difficult for the reader to find the result, this makes it almost impossible for anyone to prove that the result isn't actually there.
  • Begin and end sentences with symbols wherever possible. Since periods are almost invisible (and may be mistaken for a mathematical symbol), most readers won't even notice that you've started a new sentence. Also, where possible, attach superscripts signaling footnotes to mathematical symbols rather than words.
I must mention, in connection with this, a pet peeve and an annoying habit I used to have. The peeve: people quite often use the two kinds of lowercase phi to mean different things; this makes things extremely hard to follow as my brain processes both characters as "lowercase phi." It also makes it difficult to talk through certain arguments because there aren't conventional words for the two kinds of phi. (The latter point also applies to the two kinds of epsilon but these are less problematic because they have fixed, different connotations -- one of them's an infinitesimal, the other is a dielectric constant.)

The annoying habit: I used to write A = B = C when what I meant was B = A = C. These statements are both true in the same contexts (they have the same "reference") but mean different things; the former says A is B [for some reason] and B is C [for some other reason] so A is C; an = sign implies an argument. Rob Benedetto wasn't very forgiving of this habit. What's always struck me as odd, however, is that I'd naturally put the chains of equalities in the wrong order. It seems like one should naturally get the sequence right in the course of making the argument, unless one actively tries to screw it up.

Thursday, February 25, 2010

Surname frequencies revisited

(This is a follow-up to the previous post.)

It looks like the log-log plots did a fairly shitty job of getting the power right: the correct dependence is much more like a square-root (0.5) than 3/5 -- at least that's what toying around with the powers suggests. Apparently this is a fairly standard issue with power laws; it's hard to spot because the graphs are only somewhat sensitive to the power you use. (The systematic discrepancy presumably happens because the intercept -- the extrapolated popularity of the 0th most popular name -- is nonzero.) With the new dependencies, the correct number for the US is something like 880 rather than 600; this is probably closer but I don't know if it's an underestimate or an overestimate. (Obviously extrapolations from the range (1,100) to 800-1000 are very sensitive; it's a long way off.)
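Here is a rough sketch of why a nonzero intercept can fool a log-log plot. Everything in it is an assumption made for illustration: the data are synthetic (a square-root law with a made-up offset of -0.5, not real surname counts), and the helper names `linear_fit`, `loglog_exponent`, and `scan_exponent` are mine. The scan is just an automated version of "toying around with the powers": for each candidate power p it fits C = c0 + a*N^p by ordinary least squares and keeps the p with the smallest squared error.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def loglog_exponent(ranks, values):
    """Slope of the straight line through (log N, log C): the naive estimate."""
    _, slope = linear_fit([math.log(n) for n in ranks],
                          [math.log(v) for v in values])
    return slope

def scan_exponent(ranks, values, candidates):
    """Try each power p, fit C = c0 + a * N**p, keep the p with least error."""
    best_p, best_sse = None, float("inf")
    for p in candidates:
        xs = [n ** p for n in ranks]
        c0, a = linear_fit(xs, values)
        sse = sum((c0 + a * x - v) ** 2 for x, v in zip(xs, values))
        if sse < best_sse:
            best_p, best_sse = p, sse
    return best_p

# Synthetic data (an assumption, NOT real surname counts):
# an exact square-root law with a small negative offset.
ranks = list(range(1, 101))
values = [2.0 * n ** 0.5 - 0.5 for n in ranks]

naive = loglog_exponent(ranks, values)            # biased above 0.5 by the offset
candidates = [i / 100 for i in range(30, 91)]     # powers 0.30, 0.31, ..., 0.90
scanned = scan_exponent(ranks, values, candidates)
print(naive, scanned)
```

With an offset of this sign the log-log slope comes out above the true power, while the scan lands on the square root. A positive offset would push the log-log slope the other way, so the direction of the bias in the real data depends on the sign of the intercept.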

Anyway, these are the graphs. First, the US, where P(N) ~ N^0.47:

Precisely the same exponent seems to work best for Japan:

For France and Australia, the exponent changes somewhat: France is dramatically higher at 0.68:

Australia seems happiest at 0.57:

The higher power might be an artifact of there being fewer data points; if you squint at the US or Japan graphs, the initial part curves upward slightly, which would suggest a slightly higher power, perhaps more like 0.6. Regardless, the overall behavior is evidently power-law. This does not appear to be the case for Hungary regardless of how you massage the data; it crosses over from a rather large power to a much smaller power around N = 10. (The previous post's comment thread suggests that something similar might be true for Korea.)

I must say that a power of 1/2 -- which is relatively ubiquitous in these sorts of phenomena -- is probably easier to explain, and certainly less interesting, than the esoteric 3/5, which mostly shows up in 3D self-avoiding random walks and critical exponents.

Wednesday, February 24, 2010

Surname frequencies

Last year a Chinese friend of mine asked me if I knew what the smallest N was such that the first N surnames would cover half the population of, e.g., the US or the UK. I estimated 100 or so and forgot all about it until -- looking up the physicist Kobayashi on Wikipedia and following a few links -- I discovered Wikipedia's useful list of common family names. This had enough information, at least for the US, to prove that my estimate of 100 was pretty far off, but didn't directly answer the original question, so I decided to extrapolate using some rather primitive data analysis in Excel. I discovered, a little to my surprise, that surname distributions follow a remarkably regular power law: the form is approximately
Cumulative percentage of population by Nth surname ~ N^(3/5)
for a fairly wide range of large countries. (3/5 is only true for the US and England; otherwise it sort of varies between 3/5 and 2/3: the US is at 0.588, England at 0.606, Germany at 0.61, Australia at 0.63, Russia at 0.637, and Japan at 0.65.) In all cases the power-law dependence is quite good. (I really ought to put up the graphs but I'm too lazy to.) For smaller countries the picture is somewhat mixed: the exponent tends to be somewhat higher on average (Scots at 0.76, N. Irish at 0.88), but in some cases -- Hungary and Sweden -- the dependence clearly isn't power-law.
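For anyone who wants to redo this outside Excel, here is a minimal sketch of the kind of fit described above: a straight line through (log N, log cumulative percentage), whose slope is the exponent. The function name `fit_power_law` is mine, and the data are synthetic (an exact N^0.6 law with a made-up prefactor of 1.1), not the Wikipedia numbers.

```python
import math

def fit_power_law(ranks, cumulative):
    """Least-squares line through (log N, log C); returns (exponent, prefactor)."""
    xs = [math.log(n) for n in ranks]
    ys = [math.log(c) for c in cumulative]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # the power-law exponent
    intercept = mean_y - slope * mean_x    # log of the prefactor
    return slope, math.exp(intercept)

# Synthetic stand-in for "cumulative % of population covered by the top N names".
# These are NOT real surname data: C(N) = 1.1 * N**0.6 exactly.
ranks = list(range(1, 101))
cumulative = [1.1 * n ** 0.6 for n in ranks]

exponent, prefactor = fit_power_law(ranks, cumulative)
print(exponent, prefactor)
```

On clean synthetic data the fit simply recovers the exponent and prefactor that went in; on real data the interesting part is how far the points stray from the line.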

Poking around in the literature I found nothing much except this old paper that describes finding something similar but doesn't have a theory (JSTOR required):
The Distribution of Surname Frequencies
Wendy Fox and Gabriel Lasker
International Statistical Review 51, 81-87 (1983)
I imagine this is one of those Zipf's law effects but I'm unclear about what kinds of processes would give rise to a Zipf's law in this context. It's esp. interesting to me that Japan fits so well in this series given cultural differences etc. Would also like to know if the exponent is really size-dependent.

Oh and to answer the original question I estimate that it'd take about 650 names to cover 50% of the US, and about 400 for the UK. I think these are underestimates.
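The extrapolation itself is just the power law run backwards: if the cumulative percentage goes as a * N^p, then the N that covers a given percentage is (target / a)^(1/p). A minimal sketch, where the function name `rank_for_coverage` and the prefactor 1.1 are assumptions chosen for illustration (only the exponent 0.588 is the US figure quoted above):

```python
def rank_for_coverage(target_percent, prefactor, exponent):
    """Invert C(N) = prefactor * N**exponent to get the N reaching target_percent."""
    return (target_percent / prefactor) ** (1.0 / exponent)

# Hypothetical prefactor (an assumption); exponent is the US value from the post.
prefactor = 1.1
exponent = 0.588
n_half = rank_for_coverage(50.0, prefactor, exponent)
print(round(n_half))
```

Because the exponent sits in the denominator of 1/p, small changes in the fitted power move the answer a lot, which is one reason to treat any number that comes out of this as rough.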

UPDATE: the graphs are up here.

Wednesday, February 10, 2010

The Way of the Word

I. It turns out that "interesting" was quite recently a euphemism for pregnant; see these OED examples:
1930 GALSWORTHY On Forsyte 'Change 171 Winifred, beginning to be ‘interesting’, owing to the approach of a little Dartie, kept her eyes somewhat watchfully on ‘Monty’. 1970 K. GILES Death in Church ii. 49 Her little maid got into An Interesting Condition and the young fellow was willing to solemnise it.
II. Here's a famous passage from Congreve's play The Way of the World (1700) about interest and pregnancy. The situation should be fairly self-explanatory: Mirabell and Mrs Fainall were [and are probably still] lovers, and he married her off to a friend of his to shield her potentially "interesting condition." (Fainall was induced to marry her for her money.)

While I only hated my husband, I could bear to see him; but since I have despised him, he's too offensive.

Oh, you should hate with prudence.

Yes, for I have loved with indiscretion.

You should have just so much disgust for your husband as may be sufficient to make you relish your lover.

You have been the cause that I have loved without bounds, and would you set limits to that aversion of which you have been the occasion? Why did you make me marry this man?

Why do we daily commit disagreeable and dangerous actions? To save that idol, reputation. If the familiarities of our loves had produced that consequence of which you were apprehensive, where could you have fixed a father's name with credit but on a husband? I knew Fainall to be a man lavish of his morals, an interested and professing friend, a false and a designing lover, yet one whose wit and outward fair behaviour have gained a reputation with the town, enough to make that woman stand excused who has suffered herself to be won by his addresses. A better man ought not to have been sacrificed to the occasion; a worse had not answered to the purpose. When you are weary of him you know your remedy.

III. The converse, of course, is true to a degree -- conception, abortion, pregnant, embryonic, fertile, and miscarriage are all potentially applicable to thoughts and actions. This paradigm does, however, lead to odd results if you substitute synonyms: e.g. a knocked-up pause, an absolutely third-trimester notion, a phrase gastrulating with meaning, a fetal vagueness, an amniotic fluidity.

Friday, February 5, 2010

Vowel Shift Trivia

Was re-reading that old Geoff Nunberg parody of Pope ("who would not weep if E.B. White were he"); happened upon an interesting discussion of English phonetic history in the comments. Someone quoted Dr. Johnson:
"Sir," said he, "what entitles Sheridan [this is Thomas, father of Richard Brinsley S. the playwright; he had undertaken to write a pronunciation dictionary] to fix the pronunciation of English? He has, in the first place, the disadvantage of being an Irishman; and if he says he will fix it after the example of the best company, why, they differ among themselves. I remember an instance: when I published the Plan for my Dictionary, Lord Chesterfield told me that the word great should be pronounced so as to rhyme to state; and Sir William Yonge sent me word that it should be pronounced so as to rhyme to seat, and that none but an Irishman would pronounce it grait. Now here were two men of the highest rank, the one the best speaker in the House of Lords, the other the best speaker in the House of Commons, differing entirely."
Incidentally this is related to the fact that the predecessors of words ending -ee- and -ea- in Modern English were pronounced differently -- resp. as "closed" and "open" forms of "ay" -- but spelled alike in Middle English; it was during the transitional period, when -ee- was pronounced in the modern way but -ea- was pronounced to rhyme with -ay- (e.g. where you, great Anna, whom three realms obey / do sometimes counsel take, and sometimes tea), that English spelling was settled; if the commenter is right the distinction was codified in spelling (by Johnson and others) just as it was becoming obsolete.

And here's Nunberg on a similar issue, which I think comes off as endearingly unprissy of the Augustans though of course it's nothing of the sort:

Or to take an example I'm more familiar with, consider the 18th c. blurring of the nuclei of words like line and loin, which turns up several times in the Essay on Criticism:

In Praise so just, let ev'ry Voice be join'd,
And fill the Gen'ral Chorus of Mankind!

And praise the Easie Vigor of a Line,
Where Denham's Strength, and Waller's Sweetness join.

Good-Nature and Good-Sense must ever join;
To err is Humane; to Forgive, Divine.

One notable point here is that this confusion was phonetically conditioned, limited to vowels before /n/, /l/ and some other sonorants, particularly when preceded by /p/ and /b/ (as in point and boil). Nobody rhymed toy and sigh as far as I know. So the rhymes here turned on a perception of phonetic closeness or identity, not simply a rhyming convention. A second is that the confusion left several doublets in its wake (rile and roil, for example) as well as some dialect variation: Dickens had his lower-class characters saying spile for spoil and jint for joint. A third is that the confusion was noted, and sometimes criticized as "abusive," by contemporary writers.

Thursday, February 4, 2010

You can't facebook the truth!

This NYRB piece about Facebook seems shoddily researched (either that or my memory's failing). For instance:
  1. It claims the original "relationship status" answers were: Single; In a Relationship; Engaged; Married; "It's Complicated"; "In an Open Relationship." I feel like "it's complicated" was a later addition, dating from 2004 or 2005.
  2. It claims that the site removed "Random Play" and "Whatever I Can Get" from the options of what members were "Looking For"—to be replaced by "Networking." Last I checked, "random play" was definitely still an option.
  3. It doesn't even mention my favorite erstwhile Facebook feature, the "social net" thing that allowed you to see which of your friends knew each other. I remember the feature was always plagued by various technical problems but it was actually a fun thing to mess with.
And there are some assertions that seem implausible. For instance, the piece says it "even became something of a norm to greet a friend in the dining hall by declaring, for example, 'I see you added Trotsky to your list of favorite authors -- but dropped Marx!'" This seems very far off; as the author remarks elsewhere, the site was generally understood to be a lark, and I certainly never overheard people talking about their profile updates at Valentine.

That said, this description of the decline and fall -- the growing tiresomeness -- of Facebook is exactly right:

... what might be called Facebook's "suburban period," which began in September 2006 and continues, in many ways, into the present. We can pinpoint the start date so precisely because at the same time that Zuckerberg opened Facebook to anyone who wanted to join, he launched a function that has since come to dominate the site: the "News Feed."

The News Feed, as the name suggests, resembled a personalized wire service. "Imagine a device that monitors the social marketplace the way a blinking Bloomberg terminal tracks incremental changes in the bond market," The New York Times described the new feature at its debut. But I would propose an alternate metaphor: the suburban backyard fence. Facebook, when restricted to colleges, had relied on the typically intense social lives of students in the dorm room and at the dining hall. It was possible to obsessively check the pages of a few good friends or a cute girl in your class, but you could easily ignore everyone else.

The News Feed, by contrast, made everyone and everything an object of gossip by automatically sending the minutest changes to a wide circle of "friends." Along with the pleasure of learning that a crush had added Godard to her list of favorite filmmakers, you had to endure image after image of the drunken escapades of people you hadn't seen in years. New features were supposed to screen out some "friends," but these settings barely worked.

Yes. This was a huge nuisance if you were already on Facebook and had the new features foisted on you. The problem was that under the old rules there was no reason not to be friends with everybody -- that way if they joined a different network you would still be able to Facebook-stalk them -- but suddenly your newsfeed was cluttered with, not the drunken escapades, but the vapid and inconsequential chatter of people you had no interest in.

In fact, there has always been a selection effect by which precisely the people you don't want to hear about are the kind that update their profile or join a group every three minutes. This effect extends, I think, to the Wall (the single most pointless and obnoxious thing about Facebook): I can't remember the last time someone actually posted something on my wall that I wanted to read; it's always random clowns I haven't seen in years saying something placeholderish and misspelled.

All the same, I can see that if I joined Facebook a long time after I actually did, I would perhaps like it better. Facebook does a lot of useful things: one could use the status updates as tweets, write facebook notes instead of blog posts, and use the privacy settings to get something almost like planworld. OTOH I don't really see myself defriending half my acquaintance; I might not want to know much about their lives, but I would like to know where they live, in case I'm ever in (say) Portland -- incidentally I'll be there in March for the APS meeting -- and have an afternoon to kill. Besides, the blog, twitter, Google reader, and planworld are entirely adequate as far as they go.