Preamble

The source files are posted on GitHub

Note on the Reproducible Format of the Report

This report is generated from an R Markdown document that includes all the code to perform the data processing, modeling, plotting, etc. (except for functions defined in external scripts, which are included in the archive). However, for readability only a few pieces of code are echoed explicitly in the compiled document.

Full, turnkey reproducibility is limited only by the fact that, in the interest of simplicity and computational convenience, some parts of the processing have been flagged as inactive (eval = FALSE); in its current form the document instead loads some data from previously saved work, namely the model fitting. That said, the document includes the code to perform the entire analysis, and a few small changes would allow the whole analysis to be rerun by recompiling it.


DATA LOADING AND SETUP

# Packages used throughout the analysis
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(knitr)

data_dir <- "data"

# One gzipped CSV file per state, named like 'AK.TXT.gz'
states_codes <- list.files(data_dir, pattern = "[A-Z][A-Z]\\.TXT\\.gz") %>% gsub("\\.TXT\\.gz", "", ., perl = TRUE)

# Read all state files and stack them into a single data frame
df <- do.call(bind_rows, lapply( list.files(data_dir, pattern = "[A-Z][A-Z]\\.TXT\\.gz", full.names = TRUE), 
                function(x) { read.csv(x, header = FALSE, stringsAsFactors = FALSE) }) )
colnames(df) <- c("state", "gender", "year", "name", "count")

# Distinct names by gender, and names used for both genders ("bisex" names)
unique_names_F <- filter(df, gender == "F") %>% dplyr::select(name) %>% unique(.) %>% .[["name"]]
unique_names_M <- filter(df, gender == "M") %>% dplyr::select(name) %>% unique(.) %>% .[["name"]]
unique_names <- list(F = unique_names_F, M = unique_names_M)
bisex_names <- intersect(unique_names$F, unique_names$M)

rm(unique_names_F, unique_names_M)

QUESTIONS


Question #1

[A] Please describe the format of the data files.

For illustration, here is the head of one of the state files (Alaska), which shows the structure of the data as distributed.

file.head("data/AK.TXT.gz", n = 10)
# AK,F,1910,Mary,14
# AK,F,1910,Annie,12
# AK,F,1910,Anna,10
# AK,F,1910,Margaret,8
# AK,F,1910,Helen,7
# AK,F,1910,Elsie,6
# AK,F,1910,Lucy,6
# AK,F,1910,Dorothy,5
# AK,F,1911,Mary,12
# AK,F,1911,Margaret,7

The data for each state are stored in tall (long) format as comma-separated values, without a header row. Each row has five fields:

  • State: the 2-letter code of the state.
  • Gender: one letter, M|F.
  • Year: the year of birth for this record, in 4-digit format.
  • Name: the name.
  • Count: an integer for the number of times this name occurred in that state in that year.

A few additional notes:

  • Entries are grouped (sorted) by gender, then year, and finally in decreasing order of counts. If there are multiple names with the same number of occurrences, they are then sorted alphabetically.

  • It is important to note that for each state+year subset only names with at least five occurrences are reported in this public data set (out of privacy concerns).

  • The total number of distinct names in the data set is 29828, and of these 2852 have been used for both females and males.
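
These totals can be checked directly from the objects built in the setup section (a quick sketch, using the df data frame and the bisex_names vector defined above):

length(unique(df$name))   # 29828 distinct names overall
length(bisex_names)       #  2852 names used for both genders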

Here is a sample of random rows from the data frame in which I loaded and combined all the data:

df[sample(1:nrow(df), 12), ]
# Source: local data frame [12 x 5]
# 
#    state gender year       name count
# 1     NC      F 1979    Georgia     9
# 2     IL      M 1996    Zackery    16
# 3     SC      F 2007      Emily   256
# 4     IA      F 1944      Pearl     7
# 5     MN      M 1999  Nathanial     6
# 6     DC      F 1922 Marguerite    21
# 7     GA      M 1951      Cleve     7
# 8     NY      F 2013       Izel     5
# 9     MI      M 2001    Kendall    19
# 10    MO      M 1948       Doug    12
# 11    RI      F 1945   Meredith    18
# 12    CA      F 1953      Dolly    20

[B] Can you identify any limitations or distortions of the data?

I did not get a chance to explore the data in much detail (and not for lack of interest), and at this stage my main concern, perhaps the only one for now, is the effect of the cutoff that excludes from the release names with fewer than 5 occurrences in a state.

First off, given the wide range of state populations (spanning roughly a factor of 50), a fixed threshold effectively distorts the data: the \(n\ge5\) cutoff truncates the distribution of names at a different level depending on the state population (taken as a proxy for the number of births). On the other hand, counts are inherently discrete, and below 5 there are only four other possible "levels" (1, 2, 3, 4), no matter how large the state population is. So it is not clear how large the impact of the cutoff is in practice. A pertinent question is how much data would lie below the cutoff, and a good starting point is to look at how the available data are distributed.

As with many things in nature, the distribution of the number of names vs. their frequency of occurrence follows a power law fairly closely. Here are a couple of examples, for two states, with data aggregated over a 5-year period. Please note that these are log-log plots, the most natural way to look at power-law data, since a power law appears as a straight line. Pink symbols show the distribution of female names, blue symbols that of male names. Red and blue dashed lines are the best fits of a linear regression model to the log-log data (points covered by a grey square were excluded from the fit).

Data were binned, with bin width \(0.25 \log_{10}(5)\), and the plots show the counts in each bin, not a density (i.e. count divided by bin width).

# Reset the graphics state, then set compact margins for a two-panel layout
par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1) + 0.1, cex.axis = 1.0)
single_panel_mar <- c(2, 2, 2, 1)
single_panel_oma <- c(0, 0, 2, 0)
gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
                cex = 1.0, cex.axis = 0.8, cex.lab = 1.0, 
                las = 0, mgp = c(1.0, 0.0, 0),
                tcl = 0.3)

par(gr_par)

mat.layout <- matrix(1:2, nrow = 1, byrow = TRUE)
layout(mat.layout)

# Binned count distributions with power-law (linear in log-log) fits;
# the fit uses only bins containing more than 25 names (y > log10(25))
plot_hist_state_year_with_fit(data = df, STATE = "OK", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "TX", YEAR = 2009:2013, subset = quote(y > log10(25)))

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1) + 0.1, cex.axis = 1.0)
title("Examples of Distributions of Names by Count", cex.main = 1.25, outer = TRUE)

The data are sometimes noisy and bumpy, but overall, for all states and time intervals (provided the intervals are long enough to build up decent statistics in low-count states), these distributions are well modeled by steeply declining power laws, possibly with an exponential cutoff at high counts. Assessing whether that cutoff is real or simply a drop in statistics would require a more careful analysis.

The slope of the power law is around \(-1\) (it would be \(-2\) for the density), and extrapolating it below the cutoff at 5 counts would imply a large number of low-count names, as many as 10 times more than the names actually accounted for.
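
As a rough back-of-the-envelope check of this extrapolation (a sketch that is not part of the original analysis; it simply assumes that the power-law density \(p(n) \propto n^{-\alpha}\) for the number of names with exactly \(n\) occurrences continues unchanged below the cutoff):

# Ratio of "hidden" names (counts 1-4) to reported names (counts >= 5),
# assuming a pure power-law density extrapolated below the n >= 5 cutoff
hidden_to_reported <- function(alpha, n_max = 10000) {
    n <- 1:n_max
    p <- n^(-alpha)
    sum(p[n < 5]) / sum(p[n >= 5])
}
round(sapply(c(1.8, 2.0, 2.2), hidden_to_reported), 1)
# approximately 4.0, 6.4, 9.9

For density exponents around \(-2\) (i.e. a binned-count slope around \(-1\)) this crude extrapolation indeed suggests several times more names below the cutoff than above it, in line with the estimate quoted above.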

One way of checking at a larger scale whether this could indeed be the case would be to compare the total birth counts accounted for by these data with more comprehensive birth statistics by state and year. Unfortunately I did not get a chance to pursue this avenue.

Frankly, I was surprised to see the (certainly expected) power law continue apparently undeterred down to the lowest count limit. My naive expectation was that before that point it would begin to taper off and eventually drop, though perhaps only below the 5-count limit. I can't help but wonder how realistic/reasonable it is to imagine several thousand "hidden" names caught by the \(n\ge5\) filter.
How many babies are given a name that is unique to them (in a state and year)?!


Question #2

# Top-10 names by total count over all states and years:
# overall (either gender), females only, and males only
top10_FM <- group_by(df, name) %>% summarize(n = sum(count)) %>% arrange(desc(n)) %>% head( . , 10)
top10_F  <- filter(df, gender == "F") %>% group_by(name) %>% 
                                          summarize(n = sum(count)) %>% 
                                          arrange(desc(n)) %>% 
                                          head( . , 10)
top10_M  <- filter(df, gender == "M") %>% group_by(name) %>% 
                                          summarize(n = sum(count)) %>% 
                                          arrange(desc(n)) %>% 
                                          head( . , 10)

# Combine the three rankings side by side
top10_all <- bind_cols(top10_FM, top10_F, top10_M)
colnames(top10_all) <- c("AnyGender", "N_any", "Female", "N_female", "Male", "N_male")
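
As with the other tables in this report, the combined ranking can then be rendered with kable() (a minimal usage sketch):

kable(top10_all, row.names = TRUE)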

Question #3

# Preparing the relevant data frames
c2013 <- intersect_MF_names_1year(data = df, select_year = 2013)$common
c1945 <- intersect_MF_names_1year(data = df, select_year = 1945)$common

# how many rows to show
N_show <- 5

# accepted ratio, and corresponding normalized difference.
r_threshold <- 0.8
rn <- abs((r_threshold - 1)/(r_threshold + 1))

[A] What is the most gender ambiguous name in 2013?

First, how many names were used for both female and male babies in 2013?
494, a relatively small fraction of the set of 2852 names that “historically” have been used for both females and males.
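
This number can be read directly off the data frame prepared above (c2013, from the code at the top of this question):

nrow(c2013)   # 494 names used for both genders in 2013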

With the data ready, we can answer the question in a few different ways.

Approach 1

One option would be to select names for which the female and male counts are identical, \(\Delta=0\), and among these pick the name with the highest total count. This approach penalizes popular names, given the increasingly small likelihood that the female and male counts match exactly as a name becomes more common.

# diff = 0, sorted by total number
top2013a <- filter(c2013, diff == 0) %>% arrange(desc(N_tot)) %>% as.data.frame(.) %>% head( . , N_show)
kable(top2013a, row.names = TRUE)
Name N_female N_male N_tot diff diff_norm ratio
1 Nikita 47 47 94 0 0 1
2 Aris 15 15 30 0 0 1
3 Cree 11 11 22 0 0 1
4 Tru 11 11 22 0 0 1
5 Devine 10 10 20 0 0 1

Approach 2

To mitigate this effect, since we would like to pick a name that is in some way popular, we can normalize the difference between female and male counts by the total count (which is equivalent to ranking by \(\min(F/M, M/F)\)). Names with \(\Delta=0\) would still come out on top, so we exclude them here, pending a check of whether they would have ranked higher than the top name produced by this second method.
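
For reference, this equivalence (and the conversion between r_threshold and rn used in the code above) follows from writing \(r = \min(F/M, M/F) \le 1\):

\[
\frac{|F - M|}{F + M} \;=\; \frac{\max(F, M) - \min(F, M)}{\max(F, M) + \min(F, M)} \;=\; \frac{1 - r}{1 + r},
\]

which decreases monotonically as \(r\) grows, so ranking by the normalized difference is the same as ranking by the count ratio. In particular, \(r > 0.8\) corresponds to a normalized difference below \((1 - 0.8)/(1 + 0.8) \approx 0.111\), which is exactly the rn computed earlier.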

# diff > 0, sorted by normalized difference
top2013a <- filter(c2013, diff > 0) %>% arrange(diff_norm) %>% as.data.frame(.) %>% head( . , N_show)
kable(top2013a, row.names = TRUE)
Name N_female N_male N_tot diff diff_norm ratio
1 Jael 123 130 253 7 0.0277 0.9462
2 Milan 422 450 872 28 0.0321 0.9378
3 Aven 81 87 168 6 0.0357 0.9310
4 Reilly 59 54 113 5 0.0442 0.9153
5 Lennon 224 249 473 25 0.0529 0.8996

Approach 3

To open up the "name space" further towards generally more common, hence more interesting, names, we need to allow for a range in the normalized difference or, equivalently, in the count ratio.
If we include names with \(\min(F/M, M/F) > 0.8\), we get this new list:

top2013 <- filter(c2013, diff_norm <= rn) %>% arrange(desc(N_tot)) %>% as.data.frame(.) %>% head( . , N_show)
kable(top2013, row.names = TRUE)
Name N_female N_male N_tot diff diff_norm ratio
1 Charlie 1297 1531 2828 234 0.0827 0.8472
2 Dakota 1049 865 1914 184 0.0961 0.8246
3 Justice 669 543 1212 126 0.1040 0.8117
4 Milan 422 450 872 28 0.0321 0.9378
5 Lennon 224 249 473 25 0.0529 0.8996

This plot summarizes the analysis.

  • The green dashed diagonal lines mark the region allowed by \(\min(F/M, M/F) > 0.8\).
  • The green diamonds mark the top-5 most ambiguous names according to Approach 3, labeled with the corresponding names.
  • The blue diamonds mark the top-5 names with a count difference strictly equal to zero (Approach 1).

single_panel_mar <- c(3, 3, 2, 2)
single_panel_oma <- c(0, 0, 2, 0)
gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
                cex = 1.2, cex.axis = 1.0, cex.lab = 1.2, cex.main = 1.0,
                las = 0, mgp = c(1.75, 0.5, 0),
                tcl = 0.3)
par(gr_par)

# Coordinates and labels for the top-5 names from Approach 3
lab_x <- top2013[, "N_female"]
lab_y <- top2013[, "N_male"]
lab_s <- top2013[, "Name"]
lab_pos <- rep(c(4, 2), length.out = N_show)

# Diagonal boundaries of the accepted region in the (N_female, N_male) plane
x  <- seq(5, 30000, by = 100)
y1 <- (1 - rn)/(1 + rn)*x
y2 <- (1 + rn)/(1 - rn)*x

plot(c2013$N_female, c2013$N_male, log = "xy", 
     xlim = c(4, 20000), ylim = c(4, 20000), asp = 1,
     pch = 21, col = "orangered2", bg = rgb(0.8, 0.3, 0, 0.3),
     xlab = "N female",
     ylab = "N male",
     main = "Male/Female Counts of Ambiguous Names (2013)")
points(c2013$N_female[c2013$diff_norm <= abs(rn)], c2013$N_male[c2013$diff_norm <= abs(rn)], pch = 23, col = "darkgreen", bg = rgb(0, 0.9, 0, 0.3))
points(c2013$N_female[c2013$diff == 0], c2013$N_male[c2013$diff == 0], pch = 23, col = "blue2", bg = rgb(0, 0, 0.9, 0.3))
lines(x, y1, lty = 4, lwd = 1.5, col = "forestgreen")
lines(x, y2, lty = 4, lwd = 1.5, col = "forestgreen")
grid()
text(lab_x, lab_y, labels = lab_s, pos = lab_pos, cex = 0.9, adj = c(NA, 0))

[B] In 1945?

We skip the two less refined approaches and show directly the results of Approach 3, with table and figure.
The total number of bisex names in 1945 is 237.

top1945 <- filter(c1945, diff_norm <= rn) %>% arrange(desc(N_tot)) %>% as.data.frame(.) %>% head( . , N_show)
kable(top1945, row.names = TRUE)
Name N_female N_male N_tot diff diff_norm ratio
1 Leslie 1678 1977 3655 299 0.0818 0.8488
2 Jackie 1245 1473 2718 228 0.0839 0.8452
3 Jessie 1072 910 1982 162 0.0817 0.8489
4 Frankie 482 549 1031 67 0.0650 0.8780
5 Lavern 70 74 144 4 0.0278 0.9459
par(gr_par)
lab_x <- top1945[, "N_female"]
lab_y <- top1945[, "N_male"]
lab_s <- top1945[, "Name"]

plot(c1945$N_female, c1945$N_male, log = "xy", 
     xlim = c(4, 20000), ylim = c(4, 20000), asp = 1,
     pch = 21, col = "orangered2", bg = rgb(0.8, 0.3, 0, 0.3),
     xlab = "N female",
     ylab = "N male",
     main = "Male/Female Counts of Ambiguous Names (1945)")
points(c1945$N_female[c1945$diff_norm <= abs(rn)], c1945$N_male[c1945$diff_norm <= abs(rn)], 
       pch = 23, col = "darkgreen", bg = rgb(0, 0.9, 0, 0.3))
points(c1945$N_female[c1945$diff == 0], c1945$N_male[c1945$diff == 0], 
       pch = 23, col = "blue2", bg = rgb(0, 0, 0.9, 0.3))
lines(x, y1, lty = 4, lwd = 1.5, col = "forestgreen")
lines(x, y2, lty = 4, lwd = 1.5, col = "forestgreen")
grid()
text(lab_x, lab_y, labels = lab_s, pos = lab_pos, cex = 0.9, adj = c(NA, 0))


Question #4

all1980 <- filter(df, year == 1980) %>% mutate(nameg = paste0(name, "_", gender)) %>% 
                                        group_by(nameg) %>% 
                                        summarize(n1980 = sum(count))

all2013 <- filter(df, year == 2013) %>% mutate(nameg = paste0(name, "_", gender)) %>% 
                                        group_by(nameg) %>% 
                                        summarize(n2013 = sum(count))

#--------- names present in both 1980 and 2013, with percentage changes relative to 1980 (change_pct) and to 2013 (rev_pct)
inboth_namesG <- inner_join(all1980, all2013, by = "nameg") %>% mutate(ntot = n1980 + n2013, 
                                                                       change_pct = 100*(n2013 - n1980)/n1980,
                                                                       rev_pct = 100*(n1980 - n2013)/n2013,
                                                                       gender = str_sub(nameg, start = -1),
                                                                       name = str_replace(nameg, "_[FM]", "")) 
inboth_names <- inboth_namesG[, c(8, 7, 2:6)]
rm(inboth_namesG)

#--------- names present in only one of the two years
in2013_not1980 <- anti_join(all2013, all1980, by = "nameg") %>% mutate(change_if_1 = 100*(n2013 - 1),
                                                                       gender = str_sub(nameg, start = -1),
                                                                       name = str_replace(nameg, "_[FM]", ""))
in1980_not2013 <- anti_join(all1980, all2013, by = "nameg") %>% mutate(rev_change_if_1 = 100*(n1980 - 1),
                                                                       gender = str_sub(nameg, start = -1),
                                                                       name = str_replace(nameg, "_[FM]", ""))
in2013_not1980 <- in2013_not1980[, c(5, 4, 2, 3)]
in1980_not2013 <- in1980_not2013[, c(5, 4, 2, 3)]
#--------- largest observed changes, used as reference values in Question #5
max_increase <- max(inboth_names$change_pct)
max_rev_decrease <- max(inboth_names$rev_pct)

# how many names to show
N_show <- 5
names_in_both   <-  (all1980$nameg %in% all2013$nameg)
names_only_1980 <- !(all1980$nameg %in% all2013$nameg)
names_only_2013 <- !(all2013$nameg %in% all1980$nameg)

For this question we aggregate the data nationally, summing all occurrences of each name over all states. However, we want to keep separate track of the female and male counts of bisex names, so for simplicity we append _F or _M to the name itself, depending on whether the record refers to females or males (a name used for both genders therefore appears twice, once with each suffix).

After aggregation, the 1980 and 2013 data sets comprise 6185 and 10004 names, respectively, of which 3372 are in common.

[A] Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980.

Taking the straightforward approach of defining the percentage increase as the difference between the 2013 and 1980 counts divided by the 1980 count, i.e. \(100 \times (n_{2013} - n_{1980})/n_{1980}\), these are the names whose adoption increased the most between the two reference years:

tab4a <- arrange(inboth_names, desc(change_pct)) %>% dplyr::select( . , -ntot, -rev_pct) 
kable(head(tab4a, N_show), row.names = TRUE, digits = 0)
name gender n1980 n2013 change_pct
1 Avery F 5 9121 182320
2 Colton M 5 6439 128680
3 Aria F 5 5085 101600
4 Isabella F 23 17490 75943
5 Mila F 5 3661 73120

frac_up_F_20 <- sum(head(tab4a, 20)$gender == "F")/20
frac_up_F_50 <- sum(head(tab4a, 50)$gender == "F")/50

The majority of them barely made it into the 1980 data set, because their total count is 5, meaning that in that year they cleared the threshold in only one state (there are 238 names with a count of exactly 5 in 1980 in the 1980-2013 common data set, and 206 in 2013).
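
These counts can be verified directly on the joined data frame (a quick sketch using the inboth_names object built above):

# Names present in both years whose national total sits at the minimum possible value of 5
sum(inboth_names$n1980 == 5)   # 238 such names in 1980
sum(inboth_names$n2013 == 5)   # 206 such names in 2013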

[B] Largest decrease?

For the largest decrease we do the same calculation and obtain the 5 worst performers shown in the following table. To allow a more direct comparison with the changes reported for the increases, we added to this table a column with the percentage change computed with 2013 as baseline (“rev_pct”).

tab4b <- arrange(inboth_names, change_pct) %>% dplyr::select( . , -ntot)
digits_vec <- rep(0, ncol(tab4b)) 
digits_vec[which(colnames(tab4b)=="change_pct")] <- 2
kable(head(tab4b, N_show), row.names = TRUE, digits = digits_vec)
name gender n1980 n2013 change_pct rev_pct
1 Misty F 5534 12 -99.78 46017
2 Jill F 4553 13 -99.71 34923
3 Rhonda F 1515 5 -99.67 30200
4 Shawna F 1420 5 -99.65 28300
5 Tracey F 1237 5 -99.60 24640

frac_down_F_20 <- sum(head(tab4b, 20)$gender == "F")/20
frac_down_F_50 <- sum(head(tab4b, 50)$gender == "F")/50

Again, not surprisingly, several of the most forgotten names show up right at the minimum possible number of occurrences to enter the data set.

NOTE on gender and dramatic changes: it is interesting, though at this level little more than anecdotal, that the majority of the names exhibiting the most dramatic changes are female names: they comprise 60% of the 20 largest (measurable) increases and 95% of the 20 largest (measurable) decreases (60% and 84%, respectively, of the top-50 increases and decreases).


Question #5

Can you identify names that may have had an even larger increase or decrease in popularity?

There may be names that enjoyed (or suffered) an even larger increase (or decrease) in popularity in the two-point comparison between 1980 and 2013 and that do not appear in the preceding analysis because they were below the threshold for inclusion in the data set in either 1980 or 2013.

  • Names present in 2013 but not in 1980 could in principle have jumped up by more than Avery's 182320%. The largest increase one of these names could have experienced is if it started from a count of 1 in 1980.
    Of course, in principle some of the names recorded only in 2013 might truly not have "existed" in 1980, earning an "infinite" increase. However, we cannot properly handle the possibility that a 2013 name was not given to anybody at all in 1980, because we are blind to what happened below the 5-count threshold. Therefore we restrict our estimate to the scenario in which every such name is assigned a count of 1 for the year in which it is not present.

  • Vice versa, names present in 1980 but not in 2013 could have dropped by more than Misty's 46016.7% and gone below the threshold.

Assigning, as noted, a count = 1 for 1980 to names existing in 2013 and not in 1980, we compute their hypothetical percentage increase from 1 to their actual 2013 count.

tab5a <- filter(in2013_not1980, change_if_1 > max_increase) %>% arrange(desc(change_if_1))

kable(head(tab5a, N_show), row.names = TRUE, digits = 2)
name gender n2013 change_if_1
1 Jayden M 14656 1465500
2 Aiden M 13527 1352600
3 Madison F 10529 1052800
4 Harper F 8222 822100
5 Addison F 7677 767600
tab5b <- filter(in1980_not2013, rev_change_if_1 > max_rev_decrease) %>% arrange(desc(rev_change_if_1))

kable(head(tab5b, N_show), row.names = TRUE, digits = 0)
name gender n1980 rev_change_if_1
1 Tonya F 3073 307200
2 Beth F 2843 284200
3 Kristi F 2521 252000
4 Latoya F 2479 247800
5 Kelli F 2477 247600

NOTE: if we wanted to split hairs and be excruciatingly thorough in estimating the largest potential increase, we could add to the 2013 total count of a name a "free" \(+4\) contribution for every state in which it does not appear in the data, i.e. the maximum number of occurrences that could go unreported there. Vice versa, to push the estimate of the largest decrease we could add \(4\) free counts for each state in which a name does not appear in 1980.
With a little R magic this would not be difficult, but it is beyond the scope of this report. Moreover, considering that even in the most extreme scenario of a name appearing in only one state this would add about \(200\) counts, it does not look like it would change the rankings presented above substantially.
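
For completeness, here is a minimal sketch of how such an upper bound on the increase could be computed. This is not part of the analysis above; it reuses the df, inboth_names and N_show objects already defined, and the new column names (n_states_2013, n2013_max, change_pct_max) are illustrative:

# Number of states (including DC) represented in the data
n_states <- n_distinct(df$state)

# In how many states does each name+gender key appear in 2013?
states_2013 <- filter(df, year == 2013) %>% 
               mutate(nameg = paste0(name, "_", gender)) %>% 
               group_by(nameg) %>% 
               summarize(n_states_2013 = n_distinct(state))

# Optimistic 2013 count: a "free" +4 for every state where the name is absent
# (i.e. up to 4 unreported occurrences per missing state)
upper_bound <- inboth_names %>% 
               mutate(nameg = paste0(name, "_", gender)) %>% 
               inner_join(states_2013, by = "nameg") %>% 
               mutate(n2013_max = n2013 + 4*(n_states - n_states_2013), 
                      change_pct_max = 100*(n2013_max - n1980)/n1980) %>% 
               arrange(desc(change_pct_max))
head(upper_bound, N_show)

# The analogous bound for the largest decrease would add +4 per missing state
# to the 1980 counts instead.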


OTHER BITS OF ANALYSIS

This data set is extremely interesting and could be the source of endless stimulating exploration, but I will have to leave that for another time. In the remainder of this report I present some basic plots I made while playing with the data.

First, a few terse thoughts about ideas that crossed my mind while working with these data; they are little more than bullet points at this stage, and not in any particularly well-thought-out order:


COUNT DISTRIBUTIONS BY STATE, TIME, AND GENDER

I already shared some thoughts about the count distributions, and here I would like to focus on the results of fitting them with power laws, allowing for the effect of gender.

Examples

First, a few more examples of the count distributions. They are all for the period 2009-2013, for eight states in different regions and of different sizes. In all of them the distribution of female name counts is steeper than that of male name counts. Since, broadly speaking, their total integrals should be approximately similar, the different slopes mean that the two distributions cross over: there are many more high-count names among male names, and many more low-count names among female names.

We noted above that the steeper slope of the female name distributions, combined with the low-count threshold, may have something to do with the offset between the total numbers of female and male births counted from these data.

In a few of these plots there is a hint of a possible "knee" in the distribution of male names, between the power-law part (on the left, lower-count side) and the high-count tail.

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1)+0.1, cex.axis = 1.0)
single_panel_mar <- c(2, 2, 2, 1)
single_panel_oma <- c(0 ,0, 2, 0)
gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
                cex = 1.0, cex.axis = 0.8, cex.lab = 1.0, 
                las = 0, mgp = c(1.0, 0.0, 0),
                tcl = 0.3)

par(gr_par)

mat.layout <- matrix(1:8, nrow = 4, byrow = TRUE)
layout(mat.layout)

plot_hist_state_year_with_fit(data = df, STATE = "AZ", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "CA", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "CO", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "DE", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "OK", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "SC", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "TN", YEAR = 2009:2013, subset = quote(y > log10(25)))
plot_hist_state_year_with_fit(data = df, STATE = "TX", YEAR = 2009:2013, subset = quote(y > log10(25)))

par(fig = c(0, 1, 0, 1), mar = c(5, 4, 4, 1)+0.1, cex.axis = 1.0)
title("Distributions", cex.main = 1.25, outer = TRUE)

Summary Plots of Power Law Fits

We split the data by state, decade and gender, binned them, and fit a linear regression model (lm()) to each name count distribution. Note that the "decades" are partially overlapping: we used a 10-year-wide sliding window, moved in steps of 5 years, so consecutive data points/models are not fully independent, as their time intervals overlap by half.
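
The powerlaw_fits data frame used in the plots below is produced by the prepare_powerlaw_fits_data_frame() helper listed in the Appendix; in the compiled report this slow step is inactive (eval = FALSE) and the cached result is loaded from previously saved work instead. A sketch of the call (the .rds file name is illustrative):

# Slow step: fit every state x 10-year window (also writes one diagnostic PNG per window)
powerlaw_fits <- prepare_powerlaw_fits_data_frame(data = df)
saveRDS(powerlaw_fits, "powerlaw_fits.rds")         # cache the result ...
# powerlaw_fits <- readRDS("powerlaw_fits.rds")     # ... and reload it when recompiling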

The next three plots show the slope of the female name distributions, the slope of the male name distributions, and their difference. Points for each state are connected over time.
The thick, long-dashed grey line connects the mean values at each time point.

The main observations from these figures are that:

  • The power-law slope of the female name distributions has been steadily decreasing since the 1960s, shifting from about \(-0.75\) to \(-1.1\).
  • The slope of the male name distributions has remained fairly stable at around \(-0.75\).
  • Looking at the means, we can see that the slope for female names has been steeper than that for male names since the 1960s.
  • This is illustrated more directly in the third figure, which plots the difference between the two slopes: it has been increasing for almost a century.

ggplot(powerlaw_fits, aes(x = Time, y = slopeF, color = State)) + theme_bw() + 
        theme(legend.position = "top", legend.title = element_blank()) +
        guides(col = guide_legend(nrow = 4, byrow = TRUE)) + 
        ylim(-2.2, -0.3) +
        geom_line(aes(group = State)) + 
        geom_point() +
        stat_summary(fun.y = "mean", geom = "line", aes(group=1), size = 2.0, lty = 5) +
        labs(title = "Slope of Power Law Fit to Female Names Count Distributions")

ggplot(powerlaw_fits, aes(x = Time, y = slopeM, color = factor(State))) + theme_bw() + 
        theme(legend.position = "top", legend.title = element_blank()) +
        guides(col = guide_legend(nrow = 4, byrow = TRUE)) + 
        ylim(-2.2, -0.3) +
        geom_line(aes(group = State)) + 
        geom_point() +
        stat_summary(fun.y = "mean", geom = "line", aes(group=1), size = 2.0, lty = 5) +
        labs(title = "Slope of Power Law Fit to Male Names Count Distributions")

ggplot(powerlaw_fits, aes(x = Time, y = slopeM-slopeF, color = factor(State))) + theme_bw() + 
        theme(legend.position = "top", legend.title = element_blank()) +
        guides(col = guide_legend(nrow = 4, byrow = TRUE)) + 
        ylim(-0.3, 0.9) +
        geom_line(aes(group = State)) + 
        geom_point() + 
        stat_summary(fun.y = "mean", geom = "line", aes(group=1), size = 2.0, lty = 5) +
        geom_abline(slope = 0, lty = 2, col = "grey40", size = 1) +
        labs(title = "Difference b/w Male and Female Power Law Slopes")


APPENDIX


User Defined Functions

Source on GitHub or expand the following block of code:

#===================================================================================================
# Nothing yet

intersect_MF_names_1year <- function(data = NULL, 
                                     select_year = NULL) {

    tmp_f <- filter(data, year == select_year & gender == "F") %>% group_by(name) %>% summarize(n = sum(count)) 
    tmp_m <- filter(data, year == select_year & gender == "M") %>% group_by(name) %>% summarize(n = sum(count)) 

    tmp_c <- inner_join(tmp_f, tmp_m, by = "name") %>% mutate(ntot = n.x + n.y, 
                                                              diff = abs(n.x - n.y), 
                                                              diff_norm = round(abs(n.x - n.y)/ntot, 4), 
                                                              ratio = round(pmin(n.x/n.y, n.y/n.x), 4))
    colnames(tmp_c)[1:4] <- c("Name", "N_female", "N_male", "N_tot")
    return(list(female = tmp_f, male = tmp_m, common = tmp_c))
}

#===-----------------------------------------------------------
# 
get_coverage <- function(cnt = NULL, 
                         cutoff = NULL) {

    tot <- sum(cnt)
    cov <- cumsum(cnt)/tot*100

    Ncov <- sum(cov <= cutoff)
    Fcov <- Ncov/length(cnt)
    return( list(Ncov = Ncov, Fcov = Fcov) )

}

#===-----------------------------------------------------------
# from: https://github.com/cran/descr/blob/master/R/file.head.R
# 
file.head <- function(file, 
                      n = 6, 
                      truncate.cols = TRUE) {

    lns <- readLines(file, n = n)
    lns <- gsub("\t", "\\\\t", lns)
    if(truncate.cols){
        try(file.head.lns <- substr(lns, 1, getOption("width") - 1), silent = TRUE)
        if(!exists("file.head.lns")){
            Encoding(lns) <- "bytes"
            file.head.lns <- substr(lns, 1, getOption("width") - 1)
        }
    } else {
        # no truncation requested: print the lines as read
        file.head.lns <- lns
    }
    cat(file.head.lns, sep = "\n")
}

#===-----------------------------------------------------------
#  - subset must be passed enclosed in 'quote()'
#
plot_hist_state_year_with_fit <- function(data = NULL, 
                                          STATE = NULL, 
                                          YEAR = NULL, 
                                          subset = NULL,
                                          return_fit = FALSE, 
                                          debug = FALSE) {

    # require("dplyr")   
    if( is.null(STATE) &  is.null(YEAR)) { data_for_histo <- data }
    if( is.null(STATE) & !is.null(YEAR)) { data_for_histo <- subset(data, year %in% YEAR) }
    if(!is.null(STATE) &  is.null(YEAR)) { data_for_histo <- subset(data, state %in% STATE) }
    if(!is.null(STATE) & !is.null(YEAR)) { data_for_histo <- subset(data, state %in% STATE & year %in% YEAR) }
    
    # log_breaks <- seq(log10(5), log10(2*max(data_for_histo$count)), by = 0.25*log10(5))
    log_breaks <- seq(0, 4.1, by = 0.2*log10(5)) + log10(5)
    
    #---
    fit_df <- data_for_histo %>% mutate(log_c = log10(count)) %>% dplyr::select( . , gender, log_c)
    if( debug ) { print(nrow(fit_df)) }
    hh_F <- hist(subset(fit_df$log_c, fit_df$gender == "F"), breaks = log_breaks, plot = FALSE)
    hh_M <- hist(subset(fit_df$log_c, fit_df$gender == "M"), breaks = log_breaks, plot = FALSE)
    
    hh_F_df <- data.frame(x = hh_F$mids, y = log10(hh_F$counts), g = "F", stringsAsFactors = FALSE) 
    hh_M_df <- data.frame(x = hh_M$mids, y = log10(hh_M$counts), g = "M", stringsAsFactors = FALSE) 
    hh_df <- bind_rows(hh_F_df, hh_M_df) %>% filter( . , y != -Inf)
    hh_df$xy <- hh_df$y + hh_df$x
    
    #---
    # needed for understanding the passed subset expression
    x <- hh_df$x
    y <- hh_df$y
    #
    if( is.null(subset) ) {
        flag_subset <- rep(TRUE, nrow(hh_df)) 
    } else {
        flag_subset <- eval(subset)
    }
    #---

    mod <- lm(y ~ I(x-1)*g, data = hh_df, subset = flag_subset)
    cx <- summary(mod)$coeff
    ab <- cx[, 1]
    
    cx_at_1 <- c(ab[1], ab[2], ab[1]+ab[3], ab[2]+ab[4])
    names(cx_at_1) <- c("nF", "slopeF", "nM", "slopeM")
    
    ab_F <- c( ab[1] - ab[2], ab[2] )
    ab_M <- c( (ab[1]+ab[3]) - (ab[2]+ab[4]), ab[2]+ab[4])

    if( debug ) { print(cx[, -3]) }
    if( debug ) { print(cx_at_1) }
    if( debug ) { print(rbind(ab_F, ab_M)) }
    
    #---
    if( !is.null(STATE) ) {
        label_STATE <- deparse(substitute(STATE)) %>% gsub('"', '', .)
    } else {
        label_STATE <- ""
    }
    if( !is.null(YEAR) ) {
        label_YEAR  <- deparse(substitute(YEAR)) %>% gsub('"', '', .)
    } else {
        label_YEAR  <- ""
    }

    label_main <- paste0(label_STATE, " / ", label_YEAR)
    #---
    
    # single_panel_mar <- c(3, 4, 2, 1)
    # single_panel_oma <- c(0 ,0, 2, 0)
    # gr_par <- list( mar = single_panel_mar, oma = single_panel_oma, 
    #                 cex = 1.0, cex.axis = 1.0, cex.lab = 1.0, cex.main = 1.25,
    #                 las = 0, mgp = c(1.5, 0.5, 0),
    #                 tcl = 0.3)
    # par(gr_par)

    colors_FM <- c("pink2", "dodgerblue2")
    plot(hh_df$x, hh_df$y, 
         xlim = c(0, 4.1), 
         ylim = c(0, 4.5), 
         type = "n", 
         xlab = "log10(Counts)", 
         ylab = "log10(Frequency)",
         main = label_main)
    
    points(hh_df$x[hh_df$g == "F"], hh_df$y[hh_df$g == "F"], col = colors_FM[1], pch = 19, cex = 1.5)
    points(hh_df$x[hh_df$g == "M"], hh_df$y[hh_df$g == "M"], col = colors_FM[2], pch = 19, cex = 1.5)

    abline(ab_F, col = "red2", lty = 2)
    abline(ab_M, col = "blue2", lty = 2)
    grid()

    points(hh_df$x[!flag_subset], hh_df$y[!flag_subset], pch = 15, col = rgb(0,0,0,0.2), cex = 2.2)

    if( return_fit ) { return(cx_at_1) }

}

#===-----------------------------------------------------------
#
prepare_powerlaw_fits_data_frame <- function(data = NULL) {
    
    vec_states <- unique(data$state)
    # vec_states <- c("MA", "WY", "NM", "FL")
    
    vec_years <- unique(data$year) %>% sort( . , decreasing = FALSE)
    last_year <- 2013
    
    y_start_step <- 5
    start_years <- seq(1910, 2014, by = y_start_step)
    y_window <- 10
    
    store <- list()
    
    i <- 0
    for( S in vec_states ) {
        state_df <- filter(data, state == S)
        for( y1 in start_years ) {
            i <- i + 1
            # y2 <- min(y1 + y_start_step - 1, last_year)
            y2 <- min(y1 + y_window - 1, last_year)
            cat(" ", S, " : ", y1, " - ", y2, "\n")
            # name <- paste0(S, "_", y1, "_", y2)
            name <- paste0(S, y1)
            
            fname <- paste0("distr2_", S, "_", y1, "_", y2, ".png")
            png(file = fname, width = 900, height = 900, res = 96, bg = "white")
            cx <- plot_hist_state_year_with_fit(data = state_df, YEAR = y1:y2, subset = quote(y > log10(25)), return_fit = TRUE)
            # store[[i]] <- cx
            store[[name]] <- c(name, cx)
            
            dev.off()
        }
    }
    
    powerlaw_fits <- as.data.frame(t(as.data.frame(store, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
    colnames(powerlaw_fits)[1] <- "StateYear"
    
    powerlaw_fits[, 2:5] <- apply(powerlaw_fits[,2:5], 2, as.numeric)
    row.names(powerlaw_fits) <- NULL
    powerlaw_fits <- separate(powerlaw_fits, StateYear, c("State", "Time"), sep = 2)
    powerlaw_fits$Time <- as.integer(powerlaw_fits$Time) + 0.5*y_window
    
    return(powerlaw_fits)
    
}

#===================================================================================================

R Session Info

sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
# [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_1.0.1 stringr_1.0.0 magrittr_1.5  tidyr_0.2.0   dplyr_0.4.2   knitr_1.10.5 
# 
# loaded via a namespace (and not attached):
#  [1] DBI_0.3.1        MASS_7.3-41      R6_2.0.1         Rcpp_0.11.6      assertthat_0.1  
#  [6] colorspace_1.2-6 digest_0.6.8     evaluate_0.7     formatR_1.2      grid_3.1.3      
# [11] gtable_0.1.2     highr_0.5        htmltools_0.2.6  labeling_0.3     lazyeval_0.1.10 
# [16] munsell_0.4.2    parallel_3.1.3   plyr_1.8.3       proto_0.3-10     reshape2_1.4.1  
# [21] rmarkdown_0.7    scales_0.2.4     stringi_0.5-5    tools_3.1.3      yaml_2.1.13