Thursday, February 25, 2010

Surname frequencies revisited

(This is a follow-up to the previous post.)

It looks like the log-log plots did a fairly shitty job of getting the power right: the correct dependence is much closer to a square root (0.5) than to 3/5 -- at least that's what toying around with the powers suggests. Apparently this is a fairly standard issue with power laws; it's hard to spot because the graphs are only somewhat sensitive to the power you use. (The systematic discrepancy presumably happens because the intercept -- the extrapolated popularity of the 0th most popular name -- is nonzero.) With the new fits, the correct number for the US is something like 880 rather than 600; this is probably closer, but I don't know whether it's an underestimate or an overestimate. (Obviously, extrapolating from the range (1,100) out to 800-1000 is very sensitive to the exponent; it's a long way off.)
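To illustrate how a nonzero intercept can throw off a log-log fit, here is a minimal sketch on made-up data. Nothing in it comes from the actual surname tables: the curve P(N) = c + a*N^b with b = 0.5, the values of a and c (including the sign of the offset), and all function names are assumptions chosen for illustration. It compares the slope of a straight-line fit on log-log axes with a crude scan over candidate powers, which is roughly what "toying around with the powers" amounts to.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope*x + intercept, plus the squared residual."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def loglog_exponent(ranks, values):
    """Exponent read off as the slope of log(value) against log(rank)."""
    slope, _, _ = linear_fit([math.log(n) for n in ranks],
                             [math.log(v) for v in values])
    return slope

def scan_exponent(ranks, values, candidates):
    """Try each candidate power b, fit value = a*rank**b + c, keep the best b."""
    best = None
    for b in candidates:
        a, c, sse = linear_fit([n ** b for n in ranks], values)
        if best is None or sse < best[3]:
            best = (b, a, c, sse)
    return best

# Synthetic data (hypothetical parameters, not the real surname data).
TRUE_B, TRUE_A, TRUE_C = 0.5, 2.0, -1.0
ranks = list(range(1, 101))
values = [TRUE_C + TRUE_A * n ** TRUE_B for n in ranks]

naive_b = loglog_exponent(ranks, values)
candidates = [0.30 + 0.01 * k for k in range(51)]  # powers 0.30 .. 0.80
best_b, best_a, best_c, _ = scan_exponent(ranks, values, candidates)

print(f"log-log slope:      {naive_b:.3f}")
print(f"scanned best power: {best_b:.2f} (offset {best_c:.2f})")
```

With this particular (negative) offset the log-log slope comes out above the true 0.5, while the scan that allows for an intercept recovers it. A positive offset would bias the log-log slope the other way.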

Anyway, these are the graphs. First, the US, where P(N) ~ N^0.47:

Precisely the same exponent seems to work best for Japan:

For France and Australia, the exponent changes. France is considerably higher, at 0.68:

Australia seems happiest at 0.57:

The higher power might be an artifact of there being fewer data points; if you squint at the US or Japan graphs, the initial part curves upward slightly, which would suggest a slightly higher power, perhaps more like 0.6. Regardless, the overall behavior is evidently a power law. This does not appear to be the case for Hungary, regardless of how you massage the data; it crosses over from a rather large power to a much smaller power around N = 10. (The previous post's comment thread suggests that something similar might be true for Korea.)
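A crossover like the Hungarian one can be located a little more systematically than by eye. The following is a rough sketch on synthetic data, not the Hungarian table: the two powers (1.0 and 0.3), the break at N = 10, and the function names are all made up for illustration. It tries every split point, fits a separate straight line on log-log axes to each side, and keeps the split with the smallest total residual.

```python
import math

def loglog_fit(ranks, values):
    """Slope of log(value) vs log(rank) and the squared residual of that line."""
    xs = [math.log(n) for n in ranks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, sse

def find_crossover(ranks, values, min_points=3):
    """Scan split points; return (last rank of first segment, slope 1, slope 2)."""
    best = None
    for k in range(min_points, len(ranks) - min_points + 1):
        s1, e1 = loglog_fit(ranks[:k], values[:k])
        s2, e2 = loglog_fit(ranks[k:], values[k:])
        if best is None or e1 + e2 < best[0]:
            best = (e1 + e2, ranks[k - 1], s1, s2)
    return best[1:]

# Synthetic curve (hypothetical): power 1.0 up to N = 10, power 0.3 beyond,
# scaled so the two pieces meet at N = 10.
BREAK, B_LOW, B_HIGH = 10, 1.0, 0.3
ranks = list(range(1, 101))
values = [n ** B_LOW if n <= BREAK
          else BREAK ** (B_LOW - B_HIGH) * n ** B_HIGH
          for n in ranks]

crossover, slope_low, slope_high = find_crossover(ranks, values)
print(f"crossover near N = {crossover}")
print(f"power below: {slope_low:.2f}, power above: {slope_high:.2f}")
```

On real data the two segment fits would be noisy, and with only a handful of points below the break the lower power is poorly constrained, so the result should be read as a rough guide at best.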

I must say that a power of 1/2 -- which is relatively ubiquitous in these sorts of phenomena -- is probably easier to explain, and certainly less interesting, than the esoteric 3/5, which mostly shows up in 3D self-avoiding random walks and critical exponents.
