The cumulative percentage of the population covered by the N most common surnames goes as ~ N^(3/5) for a fairly wide range of large countries. (3/5 is strictly true only for the US and England; otherwise the exponent varies between about 3/5 and 2/3: the US is at 0.588, England at 0.606, Germany at 0.61, Australia at 0.63, Russia at 0.637, and Japan at 0.65.) In all cases the power-law fit is quite good. (I really ought to put up the graphs but I'm too lazy to.) For smaller countries the picture is mixed: the exponent tends to be somewhat higher on average (Scots at 0.76, N. Irish at 0.88), but in some cases, namely Hungary and Sweden, the dependence clearly isn't power-law.
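For anyone who wants to try this on their own data, here is a minimal sketch of one way to get such an exponent: an ordinary least-squares fit of log(cumulative percentage) against log(N). This is not necessarily how the exponents quoted above were obtained; the function name `fit_power_law` and the synthetic input are assumptions made up for illustration, not real surname data.

```python
import math

def fit_power_law(cumulative_pct):
    """Fit log(C) = log(c) + a*log(N) by least squares and return (c, a).

    cumulative_pct[i] is the percentage of the population covered by the
    i+1 most common surnames (so the list must be increasing and positive).
    """
    xs = [math.log(n) for n in range(1, len(cumulative_pct) + 1)]
    ys = [math.log(p) for p in cumulative_pct]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                       # slope on the log-log plot = exponent
    c = math.exp(mean_y - a * mean_x)   # intercept = log of the prefactor
    return c, a

# Synthetic, made-up curve that follows C(N) = 1.0 * N^0.6 exactly.
synthetic = [1.0 * n ** 0.6 for n in range(1, 201)]
c, a = fit_power_law(synthetic)
print(f"prefactor = {c:.3f}, exponent = {a:.3f}")
```

On real data the fitted exponent will depend on how many ranks are included, so it is worth refitting over a few different ranges before trusting the second decimal place.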
Poking around in the literature I found nothing much except this old paper that describes finding something similar but doesn't have a theory (JSTOR required):
The Distribution of Surname Frequencies
Wendy Fox and Gabriel Lasker
International Statistical Review 51, 81-87 (1983)
I imagine this is one of those Zipf's law effects but I'm unclear about what kinds of processes would give rise to a Zipf's law in this context. It's esp. interesting to me that Japan fits so well in this series given cultural differences etc. Would also like to know if the exponent is really size-dependent.
Oh and to answer the original question I estimate that it'd take about 650 names to cover 50% of the US, and about 400 for the UK. I think these are underestimates.
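The extrapolation amounts to inverting C(N) = c * N^a for the N that gives C = 50%. Here is a rough sketch of that arithmetic. The exponents are the ones quoted above, but the prefactor c is not given in the post, so the value used here (1.0, i.e. the top surname covering 1% of the population) is purely a placeholder assumption, and the printed numbers will not reproduce the 650 and 400 figures.

```python
def coverage(n, prefactor, exponent):
    """Cumulative percentage covered by the top n surnames: c * n^a."""
    return prefactor * n ** exponent

def names_needed(target_pct, prefactor, exponent):
    """Invert c * N^a = target_pct for N."""
    return (target_pct / prefactor) ** (1.0 / exponent)

# ASSUMPTION: placeholder prefactor, not a fitted value from the post.
HYPOTHETICAL_PREFACTOR = 1.0

for label, exponent in [("US", 0.588), ("England", 0.606)]:
    n = names_needed(50.0, HYPOTHETICAL_PREFACTOR, exponent)
    print(f"{label}: roughly {n:.0f} names at exponent {exponent}")
```

Because N scales as c^(-1/a), with 1/a around 1.7, a small error in the prefactor or exponent moves the answer a lot, which is one reason to treat any number obtained this way as a crude estimate.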
UPDATE: the graphs are up here.
12 comments:
I'm baffled by your last paragraph. If you think they're underestimates, why did you pick those numbers?
As it's the nature of an estimate to be off in some direction, I believe it's just guidance that the real number is probably slightly higher.
Yes, what Jim said. The relationship is mostly linear but if you stare at it sufficiently hard it seems to curve down a little. I'll put the graphs up this evening or so.
Neat
I think the number is around 10,000 to 12,000 to get to 150 million people in the U.S. This is based on white pages data extrapolated out.
Name #1 is Smith with ~2 million people, and name #10,000 is Reck with ~2,300 people.
stop being lazy and give the damn plots :)
In Korea, three last names suffice to address 54% of the population: Kim, Lee, and Park.
Looking at the Korean surname data indicates a problem with your power law once "obscure" names are reached. The five most common Korean names yield a power of 0.40, while when the ten most common names are considered the power is 0.34. This is because the list of "normal" names saturates quickly, and from there each new name only adds a few (hundred) people to the population, forcing the calculated power downward.
Thanks for all the comments. (Just curious -- where did everybody come from? It's a pleasant surprise to find so many sentient readers...)
Joe -- I think the discrepancy is because the white pages (WP) data covers "over 200 million people", so you're finding their 75% point rather than the 50% point. (Their counts for common names are systematically lower than Wikipedia's, which itself assumes a total population of about 270 million.) That said, I don't think the power law holds that far out.
Peregrine -- I agree about the obscure names. The law of large numbers appears to be important for whatever mechanism is causing this power law, e.g. all the divergences occur in smaller, less diverse populations. (Of course all power laws break down when you get far enough into the tail.)
Census data indicates that it takes 1,712 surnames to cover 50% of the (US, census-taken) population. Data at http://www.census.gov/genealogy/names/names_files.html
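Counting directly, rather than extrapolating, is just a running sum over the surname counts in descending order. A minimal sketch follows; the function name and the toy counts are made up for illustration, and nothing here parses the actual census files. Note that the total has to be the whole population, not just the sum of the listed names.

```python
def names_to_cover(counts, total_population, fraction=0.5):
    """Return how many of the most common surnames are needed to reach
    `fraction` of total_population, or None if the list runs out first."""
    running = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= fraction * total_population:
            return rank
    return None

# Toy, made-up counts (NOT census data); 30 people have unlisted surnames.
toy_counts = [30, 15, 10, 8, 5, 2]
print(names_to_cover(toy_counts, total_population=100))
```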
Yes that seems reasonable. I downloaded the census data last night and tried to fit it; my power law works fine up to like 500 but the corrections pile up after that. This is of course a standard issue with extrapolating that far out; I was looking for a crude lower bound rather than anything particularly accurate, and it was quicker to extrapolate than to look up the data. (It would be interesting to know if the _corrected_ curve is the same across countries, i.e., whether there's "data collapse.")
I'm a bit late to the party here, but the frequency of surnames — or rather the decreasing frequency of aristocratic surnames — disturbed Victorians so much that they came up with a mathematical proof. It's called the Galton-Watson process.
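For the curious, a Galton-Watson process is easy to simulate: each carrier of a surname independently has a random number of sons, and the name dies out when a generation has none. The sketch below is a toy under stated assumptions; the offspring probabilities, the generation count, and the cap on lineage size are all invented for illustration and are not fitted to any real population.

```python
import random

def surname_survives(offspring_probs, generations, rng, cap=10_000):
    """Simulate one male line starting from a single carrier.

    offspring_probs[k] is the probability that a carrier has k sons.
    Returns True if at least one carrier remains after `generations`.
    """
    outcomes = list(range(len(offspring_probs)))
    carriers = 1
    for _ in range(generations):
        if carriers == 0:
            return False
        if carriers >= cap:
            return True  # shortcut: treat a very large lineage as surviving
        sons = rng.choices(outcomes, weights=offspring_probs, k=carriers)
        carriers = sum(sons)
    return carriers > 0

def extinction_fraction(offspring_probs, generations=30, trials=2000, seed=0):
    """Fraction of simulated lines that die out within `generations`."""
    rng = random.Random(seed)
    died = sum(
        not surname_survives(offspring_probs, generations, rng)
        for _ in range(trials)
    )
    return died / trials

# Made-up offspring distribution: 0, 1 or 2 sons, averaging exactly one.
print(extinction_fraction([0.25, 0.5, 0.25]))
```

Even with an average of one son per carrier, most simulated lines in a run like this end up extinct, which is the effect the Victorians were worried about.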