Wednesday, February 24, 2010

Surname frequencies

Last year a Chinese friend of mine asked me if I knew the smallest N such that the N most common surnames cover half the population of, e.g., the US or the UK. I estimated 100 or so and forgot all about it until, while looking up the physicist Kobayashi on Wikipedia and following a few links, I discovered Wikipedia's useful list of common family names. This had enough information, at least for the US, to show that my estimate of 100 was pretty far off, but it didn't directly answer the original question, so I decided to extrapolate using some rather primitive data analysis in Excel. I found, a little to my surprise, that surname distributions follow a remarkably regular power law: the form is approximately
Cumulative percentage of population covered by the N most common surnames ~ N^(3/5)
for a fairly wide range of large countries. (The exponent 3/5 is accurate only for the US and England; elsewhere it varies between about 3/5 and 2/3: the US is at 0.588, England at 0.606, Germany at 0.61, Australia at 0.63, Russia at 0.637, and Japan at 0.65.) In all these cases the power-law fit is quite good. (I really ought to put up the graphs but I'm too lazy to.) For smaller countries the picture is mixed: the exponent tends to be somewhat higher on average (Scots at 0.76, N. Irish at 0.88), but in some cases (Hungary and Sweden) the dependence clearly isn't a power law.
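
For anyone who wants to repeat this kind of fit without Excel, here is a minimal sketch in Python. It is a stand-in rather than the spreadsheet I actually used: the function name fit_power_law is made up for illustration, and the input below is synthetic data generated from an exact power law, not real surname counts. The idea is just an ordinary least-squares line through log(cumulative percentage) against log(N), whose slope is the exponent.

```python
import math

def fit_power_law(cumulative_pct):
    """Fit C(N) = a * N**b by least squares on log-log axes; return (a, b).

    cumulative_pct[i] is the percentage of the population covered by the
    i + 1 most common surnames.
    """
    xs = [math.log(n) for n in range(1, len(cumulative_pct) + 1)]
    ys = [math.log(c) for c in cumulative_pct]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the log-log line = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept gives the prefactor
    return a, b

# Synthetic input only (an assumption for illustration): an exact power law
# with prefactor 1.0 and exponent 0.6, so the fit should recover both.
synthetic = [1.0 * n ** 0.6 for n in range(1, 101)]
a, b = fit_power_law(synthetic)
print(round(a, 3), round(b, 3))
```

With real data you would pass in the running total of surname percentages, most common name first. A straight-line fit like this weights all ranks equally on the log scale, which is a choice and not the only reasonable one.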

Poking around in the literature, I found nothing much except this old paper, which describes a similar finding but doesn't offer a theory (JSTOR required):
The Distribution of Surname Frequencies
Wendy Fox and Gabriel Lasker
International Statistical Review 51, 81-87 (1983)
I imagine this is one of those Zipf's law effects, but I'm unclear about what kinds of processes would give rise to Zipf's law in this context. It's especially interesting to me that Japan fits so well in this series given the cultural differences. I would also like to know if the exponent really depends on population size.

Oh, and to answer the original question: I estimate that it'd take about 650 names to cover 50% of the US, and about 400 for the UK. I think these are underestimates.
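
The extrapolation itself is just the power law turned around: if the cumulative percentage is a * N^b, then the N that reaches a given percentage is (percentage / a)^(1/b). A rough sketch follows, under the assumption that the power law holds all the way out to the 50% point. The prefactor a is not quoted anywhere in this post, so the code backs it out from the US figures above (exponent 0.588, about 650 names for 50%); both function names are made up for illustration.

```python
def names_needed(a, b, target_pct=50.0):
    """Real-valued N at which a * N**b reaches target_pct.

    Assumes the power law holds all the way out to target_pct.
    """
    return (target_pct / a) ** (1.0 / b)

def implied_prefactor(n, b, target_pct=50.0):
    """Prefactor a for which a * n**b equals target_pct."""
    return target_pct / n ** b

# Back out the prefactor implied by the US numbers quoted in the post
# (exponent 0.588, about 650 names for 50%), then invert it again.
a_us = implied_prefactor(650, 0.588)
print(round(a_us, 2))
print(round(names_needed(a_us, 0.588)))
```

Because the exponent sits in the denominator of 1/b, small errors in b move the answer a lot at this range, which is one reason to treat numbers like these as crude.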

UPDATE: the graphs are up here.

11 comments:

Joseph said...

I'm baffled by your last paragraph. If you think they're underestimates, why did you pick those numbers?

Jim said...

As it's the nature of an estimate to be off in some direction, I believe it's just guidance that the real number is probably slightly higher.

Sarang said...

Yes, what Jim said. The relationship is mostly linear but if you stare at it sufficiently hard it seems to curve down a little. I'll put the graphs up this evening or so.

Matt Doar said...

Neat

joe said...

I think the number is around 10,000 to 12,000 to get to 150 million people in the U.S. This is based on white pages data extrapolated out.

name #1 is Smith with ~2 million people and name #10000 is Reck with ~2,300 people.

Zach said...

stop being lazy and give the damn plots :)

peregrine said...

In Korea, three last names suffice to address 54% of the population: Kim, Lee, and Park.

Looking at the Korean surname data indicates a problem with your power law once "obscure" names are reached. The five most common Korean names yield a power of 0.40, while when the ten most common names are considered the power is 0.34. This is because the list of "normal" names saturates quickly, and from there each new name only adds a few (hundred) people to the population, forcing the calculated power downward.

Sarang said...

Thanks for all the comments. (Just curious -- where did everybody come from? It's a pleasant surprise to find so many sentient readers...)

Joe -- I think the discrepancy is because the white pages data covers "over 200 million people", so you're finding their 75% point rather than the 50% point. (Their counts for common names are systematically lower than Wikipedia's, which itself assumes a total population of about 270 million.) That said, I don't think the power law holds that far out.

Peregrine -- I agree about the obscure names. The law of large numbers appears to be important for whatever mechanism is causing this power law: all the divergences occur in smaller, less diverse populations. (Of course all power laws break down when you get far enough into the tail.)

Joe said...

Census data indicates that it takes 1,712 surnames to cover 50% of the (US, census-taken) population. Data at http://www.census.gov/genealogy/names/names_files.html

Sarang said...

Yes, that seems reasonable. I downloaded the census data last night and tried to fit it; my power law works fine up to about 500 names, but the corrections pile up after that. This is of course a standard issue with extrapolating that far out; I was looking for a crude lower bound rather than anything particularly accurate, and it was quicker to extrapolate than to look up the data. (It would be interesting to know if the _corrected_ curve is the same across countries, i.e., whether there's "data collapse.")

Matt said...

I'm a bit late to the party here, but the frequency of surnames — or rather the decreasing frequency of aristocratic surnames — disturbed Victorians so much that they came up with a mathematical proof. It's called the Galton-Watson process.