Preamble

The source files are posted on GitHub

Outline

DATA LOADING AND SET UP
QUESTIONS
OTHER BITS OF ANALYSIS
- Some Basic Trends by Year and Gender
- Count Distributions by State, Time, and Gender
APPENDIX
- User Defined Functions

Note on the Report Reproducible Format

This report is generated from a Rmarkdown document which includes all code to perform data processing, modeling, plotting, etc. (except for functions defined in external scripts, included in the archive). However, for readability only a few bits of code are echoed explicitly in the compiled document.

The full straight reproducibility is only limited by the fact that in the interest of simplicity and for computational convenience some parts of the processing have been flagged as inactive (eval = FALSE), and in its current form some data are instead loaded from previously saved work, namely the model fitting. That said, the document includes the code to perform the entire analysis, and a few changes would allow to do so by compiling it.

DATA LOADING AND SETUP

states_codes <- list.files("./data/", pattern = "*.TXT") %>% gsub("\\.TXT", "", ., perl = TRUE)
data_dir <- "data"

df <- do.call(bind_rows, lapply( list.files(data_dir, pattern = "[A-Z][A-Z].TXT.gz", full.names = TRUE), 
                function(x) { read.csv(x, header = FALSE, stringsAsFactors = FALSE) }) )
colnames(df) <- c("state", "gender", "year", "name", "count")

unique_names_F <- filter(df, gender == "F") %>% dplyr::select(name) %>% unique(.)
unique_names_M <- filter(df, gender == "M") %>% dplyr::select(name) %>% unique(.) 
unique_names <- list(F = as.vector(unique_names_F), M = as.vector(unique_names_M))
bisex_names <- intersect(unique_names$F, unique_names$M)

rm(unique_names_F, unique_names_M)

QUESTIONS

Question #1

[A] Please describe the format of the data files.

For illustration, we show here the head of one of the state files (Alaska), showing the structure of the data as distributed.

file.head("data/AK.TXT.gz", n = 10)
# AK,F,1910,Mary,14
# AK,F,1910,Annie,12
# AK,F,1910,Anna,10
# AK,F,1910,Margaret,8
# AK,F,1910,Helen,7
# AK,F,1910,Elsie,6
# AK,F,1910,Lucy,6
# AK,F,1910,Dorothy,5
# AK,F,1911,Mary,12
# AK,F,1911,Margaret,7

The data for each state are stored in tall format in a comma separated list. Each row has five values:

State: the 2-letter code of the state.
Gender: one letter, M|F.
Year: the year of birth for this record, in 4-digit format.
Name: the name.
Count: an integer for the number of times this name occurred in that state in that year.

A few additional notes:

Entries are grouped (sorted) by gender, then year, and finally in decreasing order of counts. If there are multiple names with the same number of occurrences, they are then sorted alphabetically.
It is important to note that for each state+year subset only names with at least five occurrences are reported in this public data set (out of privacy concerns).
The total number of distinct names in the data set is 29828, and of these 2852 have been used for both females and males.

This is an example of the entries in the data frame where I loaded and combined all the data, a sample of random rows:

df[sample(1:nrow(df), 12), ]
# Source: local data frame [12 x 5]
# 
#    state gender year       name count
# 1     NC      F 1979    Georgia     9
# 2     IL      M 1996    Zackery    16
# 3     SC      F 2007      Emily   256
# 4     IA      F 1944      Pearl     7
# 5     MN      M 1999  Nathanial     6
# 6     DC      F 1922 Marguerite    21
# 7     GA      M 1951      Cleve     7
# 8     NY      F 2013       Izel     5
# 9     MI      M 2001    Kendall    19
# 10    MO      M 1948       Doug    12
# 11    RI      F 1945   Meredith    18
# 12    CA      F 1953      Dolly    20

[B] Can you identify any limitations or distortions of the data?

I did not get a chance to explore the data in much detail (and not for lack of interest) and at this stage my main concern, perhaps the only one for now, is about the effect of the cutoff imposed on the release of names with fewer than 5 occurrences in a state.

First off, given the wide range of state populations (about a factor of 50), this threshold at a fixed value effectively introduces a distortion in the data in the sense that \(n\ge5\) truncates the distribution of names at different levels depending on a state population (taken as a proxy for number of births). On the other hand, we are dealing with an inherently discrete quantity and below 5 there are always only four other “levels” 1, 2, 3, 4, no matter how large is a state population. So, it is not clear to me how much is the impact of the cutoff in practice. One pertinent question is how much data would be there, and a good starting point is looking at how the available data are distributed.

As with many things in nature, the distribution of the number of names vs. their frequency follows fairly closely a power law distribution. Here are a couple of examples, for two states, with data added over a 5-year period. Please note that this is a log-log plot, the best way to look at data following a power law distribution, which comes out just like a straight line. Pink symbols show the distribution of female names, blue symbols that for male names. Red and blue dashed lines are the best fit with a linear regression model to the log-log data (points covered by a grey square were exclude from the fit.)

Data were binned, with bin width \(0.25 \log_{10}(5)\), and the plots show the counts in each bin, not a density (i.e. count divided by bin width).

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1)+0.1, cex.axis = 1.0)
single_panel_mar <- c(2, 2, 2, 1)
single_panel_oma <- c(0 ,0, 2, 0)
gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
                cex = 1.0, cex.axis = 0.8, cex.lab = 1.0, 
                las = 0, mgp = c(1.0, 0.0, 0),
                tcl = 0.3)

par(gr_par)

mat.layout <- matrix(1:2, nrow = 1, byrow = TRUE)
layout(mat.layout)
plot_hist_state_year_with_fit(data = df, STATE = "OK", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "TX", YEAR = 2009:2013, subset = quote(y > log10(25)))

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1)+0.1, cex.axis = 1.0)
title("Examples of Distributions of Names by Count", cex.main = 1.25, outer = TRUE)

Data are sometimes noisy and bumpy, but overall, for all states and time intervals (provided that these latter were long enough to build up a decent statistic in states with low counts), these distributions are well modeled with steep declining power laws, possibly with an exponential cutoff at high counts. Assess that vs. simply a drop in statistics would require more careful analysis.

The slope of the power law is around \(-1\) (it would \(-2\) for density), and if extrapolated back below the cutoff at 5 counts it would translate into a large number of names with low-count, as much as 10 times more than the names accounted for.

One way of checking at a large scale level if this could indeed be the case, could be do compare the total counts of birth accounted for by these data with more comprehensive census of birth by state and year. Unfortunately I did not get a chance to pursue this avenue.

Frankly I was surprised to see that the certainly expected power law continued apparently undeterred until the lowest count limit. My naive expectation was that before that point it would begin to taper off and eventually drop, though perhaps below the 5-count limit. I can’t help but wonder how realistic/reasonable it is to imagine that there could be several thousands of “hidden” names caught by the \(n\ge5\) filter.
How many babies are given names that is unique to them (in a state and year)?!

Question #2

top10_FM <- group_by(df, name) %>% summarize(n = sum(count)) %>% arrange(desc(n)) %>% head( . , 10)
top10_F  <- filter(df, gender == "F") %>% group_by(name) %>% 
                                          summarize(n = sum(count)) %>% 
                                          arrange(desc(n)) %>% 
                                          head( . , 10)
top10_M  <- filter(df, gender == "M") %>% group_by(name) %>% 
                                          summarize(n = sum(count)) %>% 
                                          arrange(desc(n)) %>% 
                                          head( . , 10)

top10_all <- bind_cols(top10_FM, top10_F, top10_M)
colnames(top10_all) <- c("AnyGender", "N_any", "Female", "N_female", "Male", "N_male")

What is the most popular name of all time? (Of either gender.)

Ignoring gender, the most popular given-name in this data set is James.
For female name it is Mary.
For male names James.

For a little more information, here is the Table of Top 10 rankings, left to right, irrespective of gender, female-given names, male-given names.

# Source: local data frame [10 x 6]
# 
#    AnyGender   N_any    Female N_female    Male  N_male
# 1      James 4942431      Mary  3728041   James 4924235
# 2       John 4834422  Patricia  1567405    John 4818746
# 3     Robert 4718787 Elizabeth  1490772  Robert 4703680
# 4    Michael 4297230  Jennifer  1460272 Michael 4280040
# 5    William 3822209     Linda  1445838 William 3811998
# 6       Mary 3737679   Barbara  1422631   David 3541625
# 7      David 3549801  Margaret  1120006 Richard 2526927
# 8    Richard 2531924     Susan  1107916  Joseph 2467298
# 9     Joseph 2472917   Dorothy  1051262 Charles 2237170
# 10   Charles 2244693   Jessica  1036257  Thomas 2209169

Question #3

# Preparing the relevant data frames
c2013 <- intersect_MF_names_1year(data = df, select_year = 2013)$common
c1945 <- intersect_MF_names_1year(data = df, select_year = 1945)$common

# how many rows to show
N_show <- 5

# accepted ratio, and corresponding normalized difference.
r_threshold <- 0.8
rn <- abs((r_threshold - 1)/(r_threshold + 1))

[A] What is the most gender ambiguous name in 2013?

First, how many names were used for both female and male babies in 2013?
494, a relatively small fraction of the set of 2852 names that “historically” have been used for both females and males.

With the data ready, we can answer the question in a few different ways.

Approach 1

One option would be to select names for which the counts for female and male are identical, \(\Delta=0\), and among these pick the name with the highest count. This approach penalizes popular names given the increasingly smaller likelihood that the number of occurrences in females and males match.

# diff = 0, sorted by total number
top2013a <- filter(c2013, diff == 0) %>% arrange(desc(N_tot)) %>% as.data.frame(.) %>% head( . , N_show)
kable(top2013a, row.names = TRUE)

	Name	N_female	N_male	N_tot	ratio
1	Nikita	47	47	94	1
2	Aris	15	15	30	1
3	Cree	11	11	22	1
4	Tru	11	11	22	1
5	Devine	10	10	20	1

Approach 2

To mitigate this effect, on account of the fact that we would like to pick a name that is in some way popular, we could normalize the difference in male-female counts by the total counts (equivalent to using the \(\min(F/M, M/F)\)). Names with a \(\Delta=0\) would still come up at the top, hence we exclude them in this case, pending a check of whether or not they would have ranked higher than the top name resulting from this second method.

# diff > 0, sorted by normalized difference
top2013a <- filter(c2013, diff > 0) %>% arrange(diff_norm) %>% as.data.frame(.) %>% head( . , N_show)
kable(top2013a, row.names = TRUE)

	Name	N_female	N_male	N_tot	diff	diff_norm	ratio
1	Jael	123	130	253	7	0.0277	0.9462
2	Milan	422	450	872	28	0.0321	0.9378
3	Aven	81	87	168	6	0.0357	0.9310
4	Reilly	59	54	113	5	0.0442	0.9153
5	Lennon	224	249	473	25	0.0529	0.8996

Approach 3

To open up more the “name space” towards generally more common, hence more interesting, names we need allow for a range in the normalized difference or equivalently the count ratio.
If we include names with \(min(F/M, M/F) > 0.8\), we get this new list:

top2013 <- filter(c2013, diff_norm <= rn) %>% arrange(desc(N_tot)) %>% as.data.frame(.) %>% head( . , N_show)
kable(top2013, row.names = TRUE)

	Name	N_female	N_male	N_tot	diff	diff_norm	ratio
1	Charlie	1297	1531	2828	234	0.0827	0.8472
2	Dakota	1049	865	1914	184	0.0961	0.8246
3	Justice	669	543	1212	126	0.1040	0.8117
4	Milan	422	450	872	28	0.0321	0.9378
5	Lennon	224	249	473	25	0.0529	0.8996

This plot summarizes the analysis.

The green dashed diagonal lines mark the region allowed by \(min(F/M, M/F) > 0.8\).
The green diamonds mark the top-5 most ambiguous names according to Approach 3, also labeled with the corresponding names.
The blue diamonds mark the top-5 names with a count difference strictly equal to zero (approach 1).

single_panel_mar <- c(3, 3, 2, 2)
single_panel_oma <- c(0 ,0, 2, 0)
gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
                cex = 1.2, cex.axis = 1.0, cex.lab = 1.2, cex.main = 1.0,
                las = 0, mgp = c(1.75, 0.5, 0),
                tcl = 0.3)
par(gr_par)
lab_x <- top2013[, "N_female"]
lab_y <- top2013[, "N_male"]
lab_s <- top2013[, "Name"]
lab_pos <- rep(c(4, 2), length.out = N_show)

x  <- seq(5, 30000, by = 100)
y1 <- (1 - rn)/(1 + rn)*x
y2 <- (1 + rn)/(1 - rn)*x

plot(c2013$N_female, c2013$N_male, log = "xy", 
     xlim = c(4, 20000), ylim = c(4, 20000), asp = 1,
     pch = 21, col = "orangered2", bg = rgb(0.8, 0.3, 0, 0.3),
     xlab = "N female",
     ylab = "N male",
     main = "Male/Female Counts of Ambiguous Names (2013)")
points(c2013$N_female[c2013$diff_norm <= abs(rn)], c2013$N_male[c2013$diff_norm <= abs(rn)], pch = 23, col = "darkgreen", bg = rgb(0, 0.9, 0, 0.3))
points(c2013$N_female[c2013$diff == 0], c2013$N_male[c2013$diff == 0], pch = 23, col = "blue2", bg = rgb(0, 0, 0.9, 0.3))
lines(x, y1, lty = 4, lwd = 1.5, col = "forestgreen")
lines(x, y2, lty = 4, lwd = 1.5, col = "forestgreen")
grid()
text(lab_x, lab_y, labels = lab_s, pos = lab_pos, cex = 0.9, adj = c(NA, 0))

[B] In 1945?

We skip the two less smart approaches and show directly the results from Approach 3, with table and figure.
The total number of bisex names for 1945 is 237.

top1945 <- filter(c1945, diff_norm <= rn) %>% arrange(desc(N_tot)) %>% as.data.frame(.) %>% head( . , N_show)
kable(top1945, row.names = TRUE)

	Name	N_female	N_male	N_tot	diff	diff_norm	ratio
1	Leslie	1678	1977	3655	299	0.0818	0.8488
2	Jackie	1245	1473	2718	228	0.0839	0.8452
3	Jessie	1072	910	1982	162	0.0817	0.8489
4	Frankie	482	549	1031	67	0.0650	0.8780
5	Lavern	70	74	144	4	0.0278	0.9459

par(gr_par)
lab_x <- top1945[, "N_female"]
lab_y <- top1945[, "N_male"]
lab_s <- top1945[, "Name"]

plot(c1945$N_female, c1945$N_male, log = "xy", 
     xlim = c(4, 20000), ylim = c(4, 20000), asp = 1,
     pch = 21, col = "orangered2", bg = rgb(0.8, 0.3, 0, 0.3),
     xlab = "N female",
     ylab = "N male",
     main = "Male/Female Counts of Ambiguous Names (1945)")
points(c1945$N_female[c1945$diff_norm <= abs(rn)], c1945$N_male[c1945$diff_norm <= abs(rn)], 
       pch = 23, col = "darkgreen", bg = rgb(0, 0.9, 0, 0.3))
points(c1945$N_female[c1945$diff == 0], c1945$N_male[c1945$diff == 0], 
       pch = 23, col = "blue2", bg = rgb(0, 0, 0.9, 0.3))
lines(x, y1, lty = 4, lwd = 1.5, col = "forestgreen")
lines(x, y2, lty = 4, lwd = 1.5, col = "forestgreen")
grid()
text(lab_x, lab_y, labels = lab_s, pos = lab_pos, cex = 0.9, adj = c(NA, 0))

Question #4

all1980 <- filter(df, year == 1980) %>% mutate(nameg = paste0(name, "_", gender)) %>% 
                                        group_by(nameg) %>% 
                                        summarize(n1980 = sum(count))

all2013 <- filter(df, year == 2013) %>% mutate(nameg = paste0(name, "_", gender)) %>% 
                                        group_by(nameg) %>% 
                                        summarize(n2013 = sum(count))

#---------
inboth_namesG <- inner_join(all1980, all2013, by = "nameg") %>% mutate(ntot = n1980 + n2013, 
                                                                       change_pct = 100*(n2013 - n1980)/n1980,
                                                                       rev_pct = 100*(n1980 - n2013)/n2013,
                                                                       gender = str_sub(nameg, start = -1),
                                                                       name = str_replace(nameg, "_[FM]", "")) 
inboth_names <- inboth_namesG[, c(8, 7, 2:6)]
rm(inboth_namesG)

#---------
in2013_not1980 <- anti_join(all2013, all1980, by = "nameg") %>% mutate(change_if_1 = 100*(n2013 - 1),
                                                                       gender = str_sub(nameg, start = -1),
                                                                       name = str_replace(nameg, "_[FM]", ""))
in1980_not2013 <- anti_join(all1980, all2013, by = "nameg") %>% mutate(rev_change_if_1 = 100*(n1980 - 1),
                                                                       gender = str_sub(nameg, start = -1),
                                                                       name = str_replace(nameg, "_[FM]", ""))
in2013_not1980 <- in2013_not1980[, c(5, 4, 2, 3)]
in1980_not2013 <- in1980_not2013[, c(5, 4, 2, 3)]
#---------
max_increase <- max(inboth_names$change_pct)
max_rev_decrease <- max(inboth_names$rev_pct)

# how many names to show
N_show <- 5

names_in_both   <-  (all1980$nameg %in% all2013$nameg)
names_only_1980 <- !(all1980$nameg %in% all2013$nameg)
names_only_2013 <- !(all2013$nameg %in% all1980$nameg)

For this question we aggregate all the data nationally, summing all occurrences of each name over all states. However, we want to keep track separately of the female and male counts for bisex names, and for simplicity we append a _F or _M to the name itself for the instances where it is used for female or male.

After aggregation, the 1980 and 2013 data sets comprise 6185 and 10004 names, respectively, of which 3372 are in common.

[A] Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980.

Taking the straightforward approach of defining percentage increase as the ratio between the difference 2013 and 1980 counts over the 1980 counts, i.e. \((n_{2013} - n_{1980})/n_{1980}\), these are the names whose adoption increased the most between those two reference years:

tab4a <- arrange(inboth_names, desc(change_pct)) %>% dplyr::select( . , -ntot, -rev_pct) 
kable(head(tab4a, N_show), row.names = TRUE, digits = 0)

	name	gender	n1980	n2013	change_pct
1	Avery	F	5	9121	182320
2	Colton	M	5	6439	128680
3	Aria	F	5	5085	101600
4	Isabella	F	23	17490	75943
5	Mila	F	5	3661	73120


frac_up_F_20 <- sum(head(tab4a, 20)$gender == "F")/20
frac_up_F_50 <- sum(head(tab4a, 50)$gender == "F")/50

The majority of them barely made into the 1980 dataset because they total count is 5, meaning that they entered this data set only in one state for that year (there are 238 names with a count of only 5 in 1980 for the 1980-2013 common data set, 206 in 2013).

[B] Largest decrease?

For the largest decrease we do the same calculation and obtain the 5 worst performers shown in the following table. To allow a more direct comparison with the changes reported for the increases, we added to this table a column with the percentage change computed with 2013 as baseline (“rev_pct”).

tab4b <- arrange(inboth_names, change_pct) %>% dplyr::select( . , -ntot)
digits_vec <- rep(0, ncol(tab4b)) 
digits_vec[which(colnames(tab4b)=="change_pct")] <- 2
kable(head(tab4b, N_show), row.names = TRUE, digits = digits_vec)

	name	gender	n1980	n2013	change_pct	rev_pct
1	Misty	F	5534	12	-99.78	46017
2	Jill	F	4553	13	-99.71	34923
3	Rhonda	F	1515	5	-99.67	30200
4	Shawna	F	1420	5	-99.65	28300
5	Tracey	F	1237	5	-99.60	24640


frac_down_F_20 <- sum(head(tab4b, 20)$gender == "F")/20
frac_down_F_50 <- sum(head(tab4b, 50)$gender == "F")/50

Again, not surprisingly, several of the most forgotten names show up right at the minimum possible number of occurrences to enter the data set.

NOTE on gender and dramatic changes: it is interesting, though at this level little more than anedoctal, that the majority of the names that exhibit the most dramatic changes are female names: they comprise 60% of the 20 largest (measurable) increases and 95% of the 20 largest (measurable) decrease (60% and 84% of the top 50 increase/decreases).

Question #5

Can you identify names that may have had an even larger increase or decrease in popularity?

There may be names that enjoyed (suffered) an even larger increase (decrease) in popularity in the 2-point comparison between 1980 and 2013 and not appear in the preceding analyses because they were below threshold for inclusion in the data set in either 1980 or 2013.

Names present in 2013 but not in 1980 could have in principle jumped up more than the 182320% of Avery. The largest increase one of these names could have experienced is if they started with a count of 1 in 1980.
Of course, in principle some of the names recorded only in 2013 could truly not have “existed” in 1980, thus earning an “infinite” increase. However, we can quite handle properly the possibility that a 2013 name was not actually given to anybody in 1980 because we are blind to what “happened” below the 5-count threshold. Therefore we restrict our estimate to the scenario whereby we assign to every name a count equal to 1 for the epochs for which it is not present.
Viceversa, names present in 1980 but not in 2013, could have dropped more than 46016.7% of Misty and gone under threshold.

Assigning, as noted, a count = 1 for 1980 to names existing in 2013 and not in 1980, we compute their hypothetical percentage increase from 1 to their actual 2013 count.

tab5a <- filter(in2013_not1980, change_if_1 > max_increase) %>% arrange(desc(change_if_1))

kable(head(tab5a, N_show), row.names = TRUE, digits = 2)

	name	gender	n2013	change_if_1
1	Jayden	M	14656	1465500
2	Aiden	M	13527	1352600
3	Madison	F	10529	1052800
4	Harper	F	8222	822100
5	Addison	F	7677	767600

tab5b <- filter(in1980_not2013, rev_change_if_1 > max_rev_decrease) %>% arrange(desc(rev_change_if_1))

kable(head(tab5b, N_show), row.names = TRUE, digits = 0)

	name	gender	n1980	rev_change_if_1
1	Tonya	F	3073	307200
2	Beth	F	2843	284200
3	Kristi	F	2521	252000
4	Latoya	F	2479	247800
5	Kelli	F	2477	247600

NOTE: if we wanted to split hairs and be excruciatingly obsessive in estimating the largest potential increase, we could add to the total count for a name a “free \(+4\) contribution” for any state for which it does not appear in the data. That would be the maximum extra number of occurrences not accounted for. Viceversa for pushing the estimate for largest decrease we could add \(4\) free counts for each state where a name does not appear in 1980.
With a little R magic it would not be difficult but it is beyond the scope of this report. Moreover, considering that in the most extreme scenario of a name appearing for only one state, this would add \(200\) counts, it looks like it would not change substantially the rankings presented above.

OTHER BITS OF ANALYSIS

This dataset is extremely interesting and could the source of endless stimulating exploration, but I will have to leave that for another time. In the remainder of this report I will present some basic plots I have made as I was playing with the data.

First a few terse thoughts about ideas that crossed my mind while working with these data, not much more than bullet points at this stage, and not a particularly deeply thought-through order:

Understanding the effect of the \(n \ge 5\) cutoff.
- Get comprehensive aggregate birth data by state.
Clustering of States (also over time Time)
- Each state represented as a vector of names (kind of like a Document Term Matrix).
- Cluster them in space and time.
- Are there pattern developing over time, e.g. correlations between states with a time delay (waves?).
Anything related to immigration, “injection” of new names, perhaps spreading?
There is certainly an ethnicity factor in naming traditions, and so I would expect some effect to transpire from the data.
I would guess that socio-economic factors might have some relationship with given names.
- Census data could be combined with this data set.
Number of new names year (decade) after year (decade)
Find names that:
- disappeared.
- popped up of nowhere.
- peaked.
Sociological/cultural influences on popular names?
A lot of analysis would lend itself to interesting visualization, over time and space (maps).
- A ShinyApp?
About name count distributions (see plots below for some hints of the background of these points).
- Interpretation of the systematic changes of slope, mostly for female names, getting steeper over time.
- Interpretation of the difference between the distributions of female and male names. The distribution of female names declines more rapidly and it seems to have an actual “roll over” at high counts, sort of like an exponential cutoff.
- Is there really a bump in many cases (state/year) between the power law part and the steep drop, something like a “knee”? Flipping through sequences of distributions it would seem to be a real feature, and only present in the male names (see for instance in the plots included later, the blue points, for CO, SC, TN, maybe TX too).

SOME BASIC TRENDS BY YEAR AND GENDER

by_year_by_gender <- group_by(df, year, gender) %>% summarize(n = n(), 
                                                              tot_count = sum(count/1000), 
                                                              n_dist = n_distinct(name),
                                                              ratio1 = tot_count / n_dist, 
                                                              ratio2 = n_dist / tot_count)

Number of Distinct Names vs. Time

Right after WW2, or so it seems, female names’ variety started to grow while the used pool of male names remained fairly stable until the end of the 1960’s.
After that point male names have expanded but it does not look like the can catch up. Please note that the Y-axis scale is logarithmic, hence constant spacing means constant ratio (instead of constant difference).

colors_FM <- c("pink2", "dodgerblue2")

ggplot(by_year_by_gender, aes(x = year, y = n_dist)) + theme_bw() + 
            theme(legend.position = c(0.15, 0.85),
                axis.title = element_text(size = 14),
                axis.text= element_text(size = 12),
                axis.line = element_line(size = 1)
                ) +
       scale_color_manual(values = colors_FM) + 
       scale_fill_manual(values = colors_FM) + 
       scale_y_log10(breaks = c(1000, 2000, 4000)) + 
       # scale_y_log10(breaks = c(500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 8000)) + 
       labs(title = "Number of Distinct Names") + 
       ylab("Count") +
       geom_line(aes(color = gender), size = 1.2) + 
       geom_point(aes(fill = gender), alpha = 0.75, pch = 21, size = 3)

Total Counts vs. Time

Total number of births as counted from this dataset. These would be the figures to compare with accurate data about birthrates, to investigate the effect of the low-count threshold for inclusion in the data.

The fact that the total counts for female and male births are systematically offset from each other by an amount that would seem hard to explain with natural causes, may be related to the presence of the threshold. The count distributions for female names are steeper than those for males, quite systematically independently of the state and by a larger degree in more recent decades. Steeper power law distributions, if they continue to lower counts below the threshold would resuls in a larger loss of data for female names and male names. This in turn would bias the cumulative counts by gender.

ggplot(by_year_by_gender, aes(x = year, y = tot_count)) + theme_bw() + 
       theme(legend.position = c(0.15, 0.85),
           axis.title = element_text(size = 14),
           axis.text= element_text(size = 12),
           axis.line = element_line(size = 1)
           ) +
       scale_color_manual(values = colors_FM) + 
       scale_fill_manual(values = colors_FM) + 
       scale_y_log10(breaks = 100*2**(1:7)) +
       coord_cartesian(ylim = c(300, 3000)) +
       # scale_y_log10(breaks = c(1000, 2000, 4000)) + 
       labs(title = "Total count (~ total number of births?)") +
       ylab("Count (thousands)") +
       geom_line(aes(color = gender), size = 1.2) + 
       geom_point(aes(fill = gender), alpha = 0.75, pch = 21, size = 3)

Names Coverage

How far do popular names stretch?
How many distinct names are necessary to cover a given fraction of the new babies?

The curves in the following figure shows the fraction of the distinct names needed to cumulatively cover 90, 95, 99% of the new babies, as a function of time. The lines were largely flat or even decreasing until the 1950’s, when the highest percentile line started a long term upward trend. All percentiles unambiguously started an increasing trend at the beginning of the 1990’s.

At first glance I don’t see a reason why this trend could caused by population growth given that these lines represent the fraction of names given in a year, a relative measure. If population growth leads to the more names, as perhaps suggested by one of the previous figures, this by itself as a mass effect would not automatically cause a shift in the percentile lines.

My impression is that the trends reflect an increasing spreading of the tails of the distribution, with their cores able to account a decreasing portion of the population.

The spreading could be due to increased creativity, less

by_year_coverage <- group_by(df, year, name) %>% summarize(tot_count = sum(count)) %>% 
                                                 arrange(year, desc(tot_count))
by_year_coverage <- group_by(df, year, name) %>% summarize(tot_count = sum(count)) %>% 
                                                 arrange(year, desc(tot_count)) %>% mutate()

for_coverage2a <- group_by(df, year, name) %>% summarize(tot_count = sum(count)) %>% 
                                               arrange(year, desc(tot_count)) %>% 
                                               summarise(fn90 = get_coverage(tot_count, 90)$Fcov,
                                                         fn95 = get_coverage(tot_count, 95)$Fcov,
                                                         fn99 = get_coverage(tot_count, 99)$Fcov)

for_coverage2 <- gather(for_coverage2a, cutoff, fcov, fn90, fn95, fn99, -year)

cols <- c("red2", "blue2", "orange")
ggplot(for_coverage2, aes(x = year, y = fcov)) + theme_bw() + 
            theme(legend.position = c(0.2, 0.85),
                  axis.title = element_text(size = 14),
                  axis.text= element_text(size = 12),
                  axis.line = element_line(size = 1)
                  ) +
            scale_color_manual(values = cols) + 
            scale_fill_manual(values = cols) + 
            coord_cartesian(ylim = c(0.0, 0.8)) +
            labs(title = "Fraction of Names Needed to Cover Given Fraction of Newborn)") + 
            geom_line(aes(color = cutoff), size = 1.2) + 
            geom_point(aes(fill = cutoff), alpha = 0.75, pch = 21, size = 3)

COUNT DISTRIBUTIONS BY STATE, TIME, AND GENDER

I already shared some thoughts about the count distributions, and here I would like to focus on the results of fitting them with power laws, allowing for the effect of gender.

Examples

First a few more examples of the count distributions. They are all for the period 2009-2013, for eight states in different regions and of different size. In all of them the female name counts distributions are steeper than the male’s. Because broadly speaking their total integral should be approximately similar, their different slopes mean that they will cross-over, and so there are many more names with high-count among male names, and many more names with low-count among female names.

We noted above that the steeper slope of the female names distribution combined with the low-count threshold may have something to do with the offset between total number of female and male births counted from these data.

In a few of these plots there is a hint of a possible “knee” in the distribution of male names, between the power law part (left / lower counts side).

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1)+0.1, cex.axis = 1.0)
single_panel_mar <- c(2, 2, 2, 1)
single_panel_oma <- c(0 ,0, 2, 0)
gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
                cex = 1.0, cex.axis = 0.8, cex.lab = 1.0, 
                las = 0, mgp = c(1.0, 0.0, 0),
                tcl = 0.3)

par(gr_par)

mat.layout <- matrix(1:8, nrow = 4, byrow = TRUE)
layout(mat.layout)

plot_hist_state_year_with_fit(data = df, STATE = "AZ", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "CA", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "CO", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "DE", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "OK", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "SC", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "TN", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "TX", YEAR = 2009:2013, subset = quote(y > log10(25)))

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1)+0.1, cex.axis = 1.0)
title("Distributions", cex.main = 1.25, outer = TRUE)

Summary Plots of Power Law Fits

We split the data by state, decade and gender, binned them up and fit a linear regression model (lm()) to each name count distribution. Please note that the “decades” are actually partially overlapping, in the sense that we used a 10-year wide sliding window, moving it in steps of 5 years. So, consecutive data/models are not fully independent as they overlap for half of the time intervals.

The next three plots show the values of the slope of the female names distributions, the slope of the male names distributions and they difference. Points for each state are connected over time.
The thick long-dashed grey line connects the mean values at each time value.

The main observations from these figures are that:

The power law slope for the distributions of female names has been steadily decreasing since the 1960’s, shifting from about \(-0.75\) to \(-1.1\).
The slope of the male names has remained pretty stable at around \(-0.75\)
Looking at the means, we can appreciate that the slope for female names has been steeper than that of male names since the 1960’s.
This is illustrate more directly in the third figure, plotting the difference of the two slopes. It has been increasing for almost a century.

ggplot(powerlaw_fits, aes(x = Time, y = slopeF, color = State)) + theme_bw() + 
        theme(legend.position = "top", legend.title = element_blank()) +
        guides(col = guide_legend(nrow = 4, byrow = TRUE)) + 
        ylim(-2.2, -0.3) +
        geom_line(aes(group = State)) + 
        geom_point() +
        stat_summary(fun.y = "mean", geom = "line", aes(group=1), size = 2.0, lty = 5) +
        labs(title = "Slope of Power Law Fit to Female Names Count Distributions")

ggplot(powerlaw_fits, aes(x = Time, y = slopeM, color = factor(State))) + theme_bw() + 
        theme(legend.position = "top", legend.title = element_blank()) +
        guides(col = guide_legend(nrow = 4, byrow = TRUE)) + 
        ylim(-2.2, -0.3) +
        geom_line(aes(group = State)) + 
        geom_point() +
        stat_summary(fun.y = "mean", geom = "line", aes(group=1), size = 2.0, lty = 5) +
        labs(title = "Slope of Power Law Fit to Male Names Count Distributions")

ggplot(powerlaw_fits, aes(x = Time, y = slopeM-slopeF, color = factor(State))) + theme_bw() + 
        theme(legend.position = "top", legend.title = element_blank()) +
        guides(col = guide_legend(nrow = 4, byrow = TRUE)) + 
        ylim(-0.3, 0.9) +
        geom_line(aes(group = State)) + 
        geom_point() + 
        stat_summary(fun.y = "mean", geom = "line", aes(group=1), size = 2.0, lty = 5) +
        geom_abline(slope = 0, lty = 2, col = "grey40", size = 1) +
        labs(title = "Difference b/w Male and Female Power Law Slopes")

APPENDIX

Back to the Top

User Defined Functions

Source on GitHub or expand the following block of code:

#===================================================================================================
# Nothing Yey

intersect_MF_names_1year <- function(data = NULL, 
                                     select_year = NULL) {

    tmp_f <- filter(data, year == select_year & gender == "F") %>% group_by(name) %>% summarize(n = sum(count)) 
    tmp_m <- filter(data, year == select_year & gender == "M") %>% group_by(name) %>% summarize(n = sum(count)) 

    tmp_c <- inner_join(tmp_f, tmp_m, by = "name") %>% mutate(ntot = n.x + n.y, 
                                                              diff = abs(n.x - n.y), 
                                                              diff_norm = round(abs(n.x - n.y)/ntot, 4), 
                                                              ratio = round(pmin(n.x/n.y, n.y/n.x), 4))
    colnames(tmp_c)[1:4] <- c("Name", "N_female", "N_male", "N_tot")
    return(list(female = tmp_f, male = tmp_m, common = tmp_c))
}

#===-----------------------------------------------------------
# 
get_coverage <- function(cnt = NULL, 
                         cutoff = NULL) {

    tot <- sum(cnt)
    cov <- cumsum(cnt)/tot*100

    Ncov <- sum(cov <= cutoff)
    Fcov <- Ncov/length(cnt)
    return( list(Ncov = Ncov, Fcov = Fcov) )

}

#===-----------------------------------------------------------
# from: https://github.com/cran/descr/blob/master/R/file.head.R
# 
file.head <- function(file, 
                      n = 6, 
                      truncate.cols = TRUE) {

    lns <- readLines(file, n = n)
    lns <- gsub("\t", "\\\\t", lns)
    if(truncate.cols){
        try(file.head.lns <- substr(lns, 1, getOption("width") - 1), silent = TRUE)
        if(!exists("file.head.lns")){
            Encoding(lns) <- "bytes"
            file.head.lns <- substr(lns, 1, getOption("width") - 1)
        }
    }
    cat(file.head.lns, sep = "\n")
}

#===-----------------------------------------------------------
#  - subset must be passed enclosed in 'quote()'
#
plot_hist_state_year_with_fit <- function(data = NULL, 
                                          STATE = NULL, 
                                          YEAR = NULL, 
                                          subset = NULL,
                                          return_fit = FALSE, 
                                          debug = FALSE) {

    # require("dplyr")   
    if( is.null(STATE) & !is.null(YEAR)) { data_for_histo <- subset(data, year %in% YEAR) }
    if(!is.null(STATE) &  is.null(YEAR)) { data_for_histo <- subset(data, state %in% STATE) }
    if(!is.null(STATE) & !is.null(YEAR)) { data_for_histo <- subset(data, state %in% STATE & year %in% YEAR) }
    
    # log_breaks <- seq(log10(5), log10(2*max(data_for_histo$count)), by = 0.25*log10(5))
    log_breaks <- seq(0, 4.1, by = 0.2*log10(5)) + log10(5)
    
    #---
    fit_df <- data_for_histo %>% mutate(log_c = log10(count)) %>% dplyr::select( . , gender, log_c)
    if( debug ) { print(nrow(fit_df)) }
    hh_F <- hist(subset(fit_df$log_c, fit_df$gender == "F"), breaks = log_breaks, plot = FALSE)
    hh_M <- hist(subset(fit_df$log_c, fit_df$gender == "M"), breaks = log_breaks, plot = FALSE)
    
    hh_F_df <- data.frame(x = hh_F$mids, y = log10(hh_F$counts), g = "F", stringsAsFactors = FALSE) 
    hh_M_df <- data.frame(x = hh_M$mids, y = log10(hh_M$counts), g = "M", stringsAsFactors = FALSE) 
    hh_df <- bind_rows(hh_F_df, hh_M_df) %>% filter( . , y != -Inf)
    hh_df$xy <- hh_df$y + hh_df$x
    
    #---
    # needed for understanding the passed subset expression
    x <- hh_df$x
    y <- hh_df$y
    #
    if( is.null(subset) ) {
        flag_subset <- rep(TRUE, nrow(hh_df)) 
    } else {
        flag_subset <- eval(subset)
    }
    #---

    mod <- lm(y ~ I(x-1)*g, data = hh_df, subset = flag_subset)
    cx <- summary(mod)$coeff
    ab <- cx[, 1]
    
    cx_at_1 <- c(ab[1], ab[2], ab[1]+ab[3], ab[2]+ab[4])
    names(cx_at_1) <- c("nF", "slopeF", "nM", "slopeM")
    
    ab_F <- c( ab[1] - ab[2], ab[2] )
    ab_M <- c( (ab[1]+ab[3]) - (ab[2]+ab[4]), ab[2]+ab[4])

    if( debug ) { print(cx[, -3]) }
    if( debug ) { print(cx_at_1) }
    if( debug ) { print(rbind(ab_F, ab_M)) }
    
    #---
    if( !is.null(STATE) ) {
        label_STATE <- deparse(substitute(STATE)) %>% gsub('"', '', .)
    } else {
        label_STATE <- ""
    }
    if( !is.null(YEAR) ) {
        label_YEAR  <- deparse(substitute(YEAR)) %>% gsub('"', '', .)
    } else {
        label_YEAR  <- ""
    }

    label_main <- paste0(label_STATE, " / ", label_YEAR)
    #---
    
    # single_panel_mar <- c(3, 4, 2, 1)
    # single_panel_oma <- c(0 ,0, 2, 0)
    # gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
    #                 cex = 1.0, cex.axis = 1.0, cex.lab = 1.0, cex.main = 1.25,
    #                 las = 0, mgp = c(1.5, 0.5, 0),
    #                 tcl = 0.3)
    # par(gr_par)

    colors_FM <- c("pink2", "dodgerblue2")
    plot(hh_df$x, hh_df$y, 
         xlim = c(0, 4.1), 
         ylim = c(0, 4.5), 
         type = "n", 
         xlab = "log10(Counts)", 
         ylab = "log10(Frequency)",
         main = label_main)
    
    points(hh_df$x[hh_df$g == "F"], hh_df$y[hh_df$g == "F"], col = colors_FM[1], pch = 19, cex = 1.5)
    points(hh_df$x[hh_df$g == "M"], hh_df$y[hh_df$g == "M"], col = colors_FM[2], pch = 19, cex = 1.5)

    abline(ab_F, col = "red2", lty = 2)
    abline(ab_M, col = "blue2", lty = 2)
    grid()

    points(hh_df$x[!flag_subset], hh_df$y[!flag_subset], pch = 15, col = rgb(0,0,0,0.2), cex = 2.2)

    if( return_fit ) { return(cx_at_1) }

}

#===-----------------------------------------------------------
#
prepare_powerlaw_fits_data_frame <- function(data = NULL) {
    
    vec_states <- unique(data$state)
    # vec_states <- c("MA", "WY", "NM", "FL")
    
    vec_years <- unique(data$year) %>% sort( . , decreasing = FALSE)
    last_year <- 2013
    
    y_start_step <- 5
    start_years <- seq(1910, 2014, by = y_start_step)
    y_window <- 10
    
    store <- list()
    
    i <- 0
    for( S in vec_states ) {
        state_df <- filter(data, state == S)
        for( y1 in start_years ) {
            i <- i + 1
            # y2 <- min(y1 + y_start_step - 1, last_year)
            y2 <- min(y1 + y_window - 1, last_year)
            cat(" ", S, " : ", y1, " - ", y2, "\n")
            # name <- paste0(S, "_", y1, "_", y2)
            name <- paste0(S, y1)
            
            fname <- paste0("distr2_", S, "_", y1, "_", y2, ".png")
            png(file = fname, width = 900, height = 900, res = 96, bg = "white")
            cx <- plot_hist_state_year_with_fit(data = state_df, YEAR = y1:y2, subset = quote(y > log10(25)), return_fit = TRUE)
            # store[[i]] <- cx
            store[[name]] <- c(name, cx)
            
            dev.off()
        }
    }
    
    powerlaw_fits <- as.data.frame(t(as.data.frame(store, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
    colnames(powerlaw_fits)[1] <- "StateYear"
    
    powerlaw_fits[, 2:5] <- apply(powerlaw_fits[,2:5], 2, as.numeric)
    row.names(powerlaw_fits) <- NULL
    powerlaw_fits <- separate(powerlaw_fits, StateYear, c("State", "Time"), sep = 2)
    powerlaw_fits$Time <- as.integer(powerlaw_fits$Time) + 0.5*y_window
    
    return(powerlaw_fits)
    
}

#===================================================================================================

R Session Info

sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
# [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_1.0.1 stringr_1.0.0 magrittr_1.5  tidyr_0.2.0   dplyr_0.4.2   knitr_1.10.5 
# 
# loaded via a namespace (and not attached):
#  [1] DBI_0.3.1        MASS_7.3-41      R6_2.0.1         Rcpp_0.11.6      assertthat_0.1  
#  [6] colorspace_1.2-6 digest_0.6.8     evaluate_0.7     formatR_1.2      grid_3.1.3      
# [11] gtable_0.1.2     highr_0.5        htmltools_0.2.6  labeling_0.3     lazyeval_0.1.10 
# [16] munsell_0.4.2    parallel_3.1.3   plyr_1.8.3       proto_0.3-10     reshape2_1.4.1  
# [21] rmarkdown_0.7    scales_0.2.4     stringi_0.5-5    tools_3.1.3      yaml_2.1.13

Baby Names

Giovanni Fossati