Preamble

Report for the first assignment of the Reproducible Research course of the Coursera/JHSPH Data Science Specialization.

The source files are posted on GitHub.

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database records major weather-related events in the United States, tracking their times and locations as well as estimates of fatalities, injuries, property and agricultural damages that may be associated with them.

We seek to identify which types of storm events have the greatest impact on the population, both in terms of health and of economic consequences. To address these questions, our analysis aggregates the data by storm event type to identify the categories responsible for the largest impact as measured by fatalities, injuries, property damage and crop damage.

Findings

Looking at each harm/damage category separately, our analysis indicates that the most severe impact comes from the following types of events:

  • Agricultural Damage: hurricanes and drought, both in terms of total economic cost and fraction of events that cause significant damage.
    It is worth noting that, in addition to events directly classified as hurricanes, several of the most impactful event types are or can be related to hurricanes or other kinds of tropical storm activity (e.g. tropical storms, storm surge).

  • Property Damage: in absolute terms, hurricanes and floods dominate the economic impact (and here again one may want to consider a broader “tropical event” category). When we consider the likelihood that a storm causes “recordable” damages, wind and snow emerge at the top. While their cumulative cost may not be as substantial as that of major events like hurricanes, the damage caused by wind and snow can seriously affect the life and economic activity of a wider range of families and communities across the country, and it warrants attention and adequate preparation.

  • Human Fatalities/Injuries: tornadoes, floods (including flash floods) and excessive heat emerge as the most serious threats to human health in absolute terms. It is interesting to note, however, that in terms of the fraction of events with fatalities/injuries, rip currents and avalanches take the top spots (with a “likelihood” of fatalities >= 50% and of injuries of about 25% for a given event).
    This may suggest that people expose themselves to avoidable risks, given that rip currents and avalanches are typically associated with leisure activities.

Future Work

The scope and time for this investigation were limited, but there are clear directions worth further work, such as:

  • Geographic analysis, clearly necessary given the fairly localized impact of different types of events.
  • Variation over time, perhaps reflecting changes in behavior, preparation and policies.

THE DATA

The data structure and definition documentation is available from the NOAA National Climatic Data Center (NCDC).

The raw data comprise 37 variables with 902,297 observations from the years 1950-2011.

The Data and the Aim of Our Analysis

The objective of this research is to address two main questions:

  • Which events have the greatest health consequences, measured by fatalities and injuries?
  • Which events have the greatest economic consequences, measured by property damage, crop damage and total economic damage?

The main variables in the NOAA/NCDC Storm Event databases that we will consider are:

  • For storm event date and type:

    • BGN_DATE: begin date of event.
    • EVTYPE: event type (see discussion later).
  • For human and economic damages associated with each storm event:

    • FATALITIES: number of fatalities.
    • INJURIES: number of injuries.
    • PROPDMG: property damage assessment.
    • CROPDMG: agricultural damage assessment.

The values reported for the latter two have to be converted into actual US dollar amounts by multiplying them by a scaling factor coded in two other variables (PROPDMGEXP, CROPDMGEXP) as the power of 10 of the multiplier. These two auxiliary variables seem to suffer from some data-entry issues, which we will address below.
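The conversion itself is a simple scaling. A minimal sketch, using hypothetical values for a single entry:

```r
# sketch of the damage conversion: PROPDMG is scaled by 10^PROPDMGEXP,
# e.g. a PROPDMG of 25 with exponent code "K" (kilo, i.e. exponent 3)
propdmg  <- 25             # hypothetical PROPDMG value
exponent <- 3              # "K" decoded to its power of 10
propdmg * 10^exponent      # actual damage in US dollars
# [1] 25000
```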

Loading the Dataset

The data set comes as a compressed CSV file, which can be read directly with read.csv(). After reviewing the data, we chose the classes for the variables loaded into the data-frame and set them at load time via the colClasses option, using a character vector read from an auxiliary file.

NOTE: given the substantial size of the data set, we set the knitr option cache = TRUE to keep processing time manageable.

col.cl <- read.csv("data/readin_classes.csv", stringsAsFactors=FALSE, strip.white=TRUE)
col.cl$newclass2
#  [1] "numeric"   "character" "character" "factor"    "numeric"   "character" "factor"    "character"
#  [9] "numeric"   "character" "character" "character" "character" "numeric"   "character" "numeric"  
# [17] "character" "character" "numeric"   "numeric"   "factor"    "numeric"   "numeric"   "numeric"  
# [25] "numeric"   "character" "numeric"   "character" "character" "character" "character" "numeric"  
# [33] "numeric"   "numeric"   "numeric"   "character" "numeric"
data <- read.csv("data/StormData.csv.bz2", colClasses= col.cl$newclass2, nrows=1300000, strip.white=TRUE) 

This is the structure of the data-frame as loaded.

str(data)
# 'data.frame': 902297 obs. of  37 variables:
#  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
#  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
#  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
#  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 6 6 6 6 6 6 6 6 6 6 ...
#  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
#  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ BGN_AZI   : chr  "" "" "" "" ...
#  $ BGN_LOCATI: chr  "" "" "" "" ...
#  $ END_DATE  : chr  "" "" "" "" ...
#  $ END_TIME  : chr  "" "" "" "" ...
#  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ COUNTYENDN: chr  "" "" "" "" ...
#  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ END_AZI   : chr  "" "" "" "" ...
#  $ END_LOCATI: chr  "" "" "" "" ...
#  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
#  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
#  $ F         : Factor w/ 7 levels "","0","1","2",..: 5 4 4 4 4 4 4 3 5 5 ...
#  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
#  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
#  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
#  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
#  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: chr  "" "" "" "" ...
#  $ WFO       : chr  "" "" "" "" ...
#  $ STATEOFFIC: chr  "" "" "" "" ...
#  $ ZONENAMES : chr  "" "" "" "" ...
#  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
#  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
#  $ LATITUDE_E: num  3051 0 0 0 0 ...
#  $ LONGITUDE_: num  8806 0 0 0 0 ...
#  $ REMARKS   : chr  "" "" "" "" ...
#  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

DATA PROCESSING

Due to the way this dataset was developed, several variables suffer from inconsistent formatting. To make our analysis as robust as possible, we cleaned and reformatted some of the potentially useful variables and created new ones for convenience.

Cleaning

REMARKS and EVTYPE: tidying up spaces and letters’ case

data$REMARKS <- gsub("^ *", "", data$REMARKS, perl=TRUE)
data$REMARKS <- gsub(" *$", "", data$REMARKS, perl=TRUE)
data$REMARKS <- gsub("[ ]{2,}", " ", data$REMARKS, perl=TRUE)

data$EVTYPE <- gsub("^ *", "", data$EVTYPE, perl=TRUE)
data$EVTYPE <- gsub(" *$", "", data$EVTYPE, perl=TRUE)
data$EVTYPE <- gsub("[ ]{2,}", " ", data$EVTYPE, perl=TRUE)

data$COUNTYNAME <- toupper(data$COUNTYNAME)
data$EVTYPE     <- toupper(data$EVTYPE)
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)

Setting “missing” coordinates to NA

data$LATITUDE[data$LATITUDE <= 0]  <- NA
data$LONGITUDE[data$LONGITUDE <= 0]  <- NA

Fixing times

First, a small fix motivated by the discovery of an O instead of a 0 in the BGN_TIME variable:

data$BGN_TIME <- gsub("([0-9])O(.*)([^M])$","\\10\\2\\3", data$BGN_TIME, perl=TRUE)

It is better to work with properly constructed time variables, hence we convert BGN_DATE and END_DATE into POSIXlt class variables.

# making BGN and END dates dates
data$BGN_DATE.new <- strptime(as.character(data$BGN_DATE), "%m/%d/%Y %H:%M:%S")
data$END_DATE.new <- strptime(as.character(data$END_DATE), "%m/%d/%Y %H:%M:%S")

Finally, we define a new variable YEAR.

# create a 4-digit year variable
data$YEAR <- substr(as.character(data$BGN_DATE.new), 1, 4)

Subsetting the full data-frame to a good data-frame

For ease of processing it is practical to work with a reduced set of variables.
Our choice was the following:

sel.columns.n <- c( 37, 7, 5, 40, 38, 39, 8, 23, 24, 25, 26, 27, 28, 36)
colnames(data)[sel.columns.n]
#  [1] "REFNUM"       "STATE"        "COUNTY"       "YEAR"         "BGN_DATE.new" "END_DATE.new"
#  [7] "EVTYPE"       "FATALITIES"   "INJURIES"     "PROPDMG"      "PROPDMGEXP"   "CROPDMG"     
# [13] "CROPDMGEXP"   "REMARKS"

After creating the new data-frame, we rename the newly created time variables to take the original names, and sort the data-frame chronologically by BGN_DATE.

good <- data[, sel.columns.n]

colnames(good)[5] <- "BGN_DATE"
colnames(good)[6] <- "END_DATE"

# sorting data frame chronologically 
good <- good[order(good$BGN_DATE), ]

The EVTYPE problem!

We perform some more cleaning steps on this leaner data-frame to regularize the EVTYPE variable.
For historical reasons, namely the heterogeneous sources of the data compiled into the Storm Event database, the EVTYPE variable requires a significant amount of work to tidy it into a more usable form.

While NOAA has, since 1996, codified a set of 48 event types (see the NWS Directive) to be used in the classification of storm events, a quick review of the distinct EVTYPE values for each year shows that their number remained higher until much more recently.

EVTYPE.by.year <- tapply(good$EVTYPE, good$YEAR, function(x) table(x), simplify=TRUE)
sapply(EVTYPE.by.year, function(x) length(names(x)))
# 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
#    1    1    1    1    1    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
#    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
#    3    3    3  160  266  384  199  135  122  118  110  120   99   51   38   46   50   46   46   46 
# 2010 2011 
#   46   46

Until 1993 only “Tornado”, “Thunderstorm Wind” and “Hail” events were recorded. The 1993-1995 data were extracted from unformatted text files, which helps explain the high level of noise in EVTYPE during those years. It is less clear why, even after the 1996 Directive, the data took several more years to converge to the standard set of categories.

A quick census of the most frequent EVTYPE values can be extracted with table():

evt.census <- as.data.frame(table(data$EVTYPE))
head(evt.census[order(evt.census$Freq, decreasing=TRUE),],10)
#                   Var1   Freq
# 203               HAIL 288661
# 766          TSTM WIND 219946
# 672  THUNDERSTORM WIND  82564
# 745            TORNADO  60652
# 129        FLASH FLOOD  54278
# 145              FLOOD  25327
# 698 THUNDERSTORM WINDS  20850
# 309          HIGH WIND  20214
# 407          LIGHTNING  15755
# 265         HEAVY SNOW  15708

nrow(evt.census)
# [1] 883

The full database comprises 883 unique EVTYPE values!
The problem with inconsistent naming is immediately obvious: the top ten alone contain three entries for the same storm event type, Thunderstorm Wind: TSTM WIND, THUNDERSTORM WIND, THUNDERSTORM WINDS.

EVTYPE Regularization

After an extensive analysis of the EVTYPE values, we put together a lengthy set of substitutions via gsub() and grepl() to consolidate as many related types as possible. We run this by calling an external script to avoid clogging the document, but we include the source of the script in the Appendix.
The script creates a new character vector, which is added to the good data-frame as EVTYPE.

# Calling 'grepl.R' script to clean the EVTYPE entries
source("./scripts/evtype_regularization.R")
# # Entering 'evtype_regularization' script
# # Leaving 'evtype_regularization' script

# colnames(good)[7] <- "old.EVTYPE"
good$EVTYPE <- TESTvec  # 'TESTvec' is the regularized vector created by the script

After regularization the number of unique EVTYPE values is now 373.
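To give a flavor of the approach, here is an illustrative sketch only (the actual substitutions live in the script included in the Appendix), collapsing the three “Thunderstorm Wind” spellings seen in the top-10 census above:

```r
# illustrative sketch of the EVTYPE consolidation approach
ev <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS")
ev <- gsub("^TSTM", "THUNDERSTORM", ev)   # expand the TSTM abbreviation
ev <- gsub("WINDS$", "WIND", ev)          # drop the spurious plural
unique(ev)
# [1] "THUNDERSTORM WIND"
```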

The PROPDMGEXP and CROPDMGEXP problem

The variables PROPDMGEXP and CROPDMGEXP are supposed to encode power-of-10 multipliers for their respective economic damage variables.
They too suffer from inconsistent data entry in the transition years around 1995, as can be seen from this quick analysis of the number of unique values for each year, accompanied by the list of values for the year with the highest number of them:

PROPDMGEXP.by.year <- tapply(good$PROPDMGEXP, good$YEAR, function(x) table(x), simplify=TRUE)
sapply(PROPDMGEXP.by.year, function(x) length(names(x)))
# 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
#    2    2    2    2    2    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
#    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
#    3    3    3    7    9   17    3    4    4    4    4    4    3    4    4    4    4    2    3    2 
# 2010 2011 
#    3    4

PROPDMGEXP.by.year[which.max(sapply(PROPDMGEXP.by.year, function(x) length(names(x))))]
# $`1995`
# x
#           +     -     0     1     2     3     4     5     6     7     8     ?     B     H     K 
# 18331     4     1   186    25    13     4     3    26     3     5     1     5     5     6  8768 
#     M 
#   584

CROPDMGEXP.by.year <- tapply(good$CROPDMGEXP, good$YEAR, function(x) table(x), simplify=TRUE)
sapply(CROPDMGEXP.by.year, function(x) length(names(x)))
# 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
# 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
#    1    1    1    6    5    7    3    3    3    3    3    3    3    3    3    4    4    2    2    2 
# 2010 2011 
#    2    3

CROPDMGEXP.by.year[which.max(sapply(CROPDMGEXP.by.year, function(x) length(names(x))))]
# $`1995`
# x
#           0     2     ?     B     K     M 
# 27019     8     1     5     3   800   134

We corrected their values according to the following interpretation:

  • The values are interpreted as exponents of base 10.
  • Missing values (NA) or empty strings are interpreted as \(10^0=1\).
  • H, K, M, B are interpreted as: hecto (= 2), kilo (= 3), mega (= 6), billion (= 9).
  • Numeric values are taken at face value as the exponent.
  • All remaining, effectively undefined, values are considered spurious and set to 0.

After these conversions the PROPDMGEXP and CROPDMGEXP variables are converted to numeric class.

good$PROPDMGEXP[good$PROPDMGEXP == "?" | good$PROPDMGEXP == "+" | 
                good$PROPDMGEXP == "-" | good$PROPDMGEXP == ""] <- 0
good$PROPDMGEXP[good$PROPDMGEXP == "B"] <- 9
good$PROPDMGEXP[good$PROPDMGEXP == "M"] <- 6
good$PROPDMGEXP[good$PROPDMGEXP == "K"] <- 3
good$PROPDMGEXP[good$PROPDMGEXP == "H"] <- 2

good$CROPDMGEXP[good$CROPDMGEXP == "?" | good$CROPDMGEXP == "+" | 
                good$CROPDMGEXP == "-" | good$CROPDMGEXP == ""] <- 0
good$CROPDMGEXP[good$CROPDMGEXP == "B"] <- 9
good$CROPDMGEXP[good$CROPDMGEXP == "M"] <- 6
good$CROPDMGEXP[good$CROPDMGEXP == "K"] <- 3
good$CROPDMGEXP[good$CROPDMGEXP == "H"] <- 2

good$PROPDMGEXP <- as.numeric(good$PROPDMGEXP)
good$CROPDMGEXP <- as.numeric(good$CROPDMGEXP)

The good data frame structure at this point is the following:

str(good)
# 'data.frame': 902297 obs. of  14 variables:
#  $ REFNUM    : num  28479 28480 83316 114844 6348 ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 20 20 37 48 5 20 37 63 63 63 ...
#  $ COUNTY    : num  119 135 189 161 113 91 93 47 39 201 ...
#  $ YEAR      : chr  "1950" "1950" "1950" "1950" ...
#  $ BGN_DATE  : POSIXlt, format: "1950-01-03" "1950-01-03" "1950-01-03" ...
#  $ END_DATE  : POSIXlt, format: NA NA NA ...
#  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#  $ FATALITIES: num  0 0 0 0 1 0 0 0 0 1 ...
#  $ INJURIES  : num  0 3 3 1 1 0 5 2 0 12 ...
#  $ PROPDMG   : num  250 250 2.5 25 2.5 250 250 0 25 25 ...
#  $ PROPDMGEXP: num  3 3 6 3 3 3 3 3 3 3 ...
#  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ REMARKS   : chr  "" "" "" "" ...

The choice of limiting the analysis to recent years (>=1996)

After reviewing and cleaning the data, we choose to limit our analysis to the data since 1996-01-01.

There are two main motivations for this choice:

  • As noted above, before 1996 the NCDC database only tracked three event types: “Tornado”, “Thunderstorm Wind” and “Hail”. While it would certainly be interesting to perform an analysis of the longer time baseline for these three categories, here we are mostly interested in assessing the impact of storm events from a broader, more inclusive perspective.
  • Imposing this cut yields a significantly cleaner data set, for instance with respect to the DMGEXP and EVTYPE variables (but also the various date/time variables).
# subsetting 'good' to 'recent' taking only events since 1996
recent <- subset(good, BGN_DATE > as.POSIXlt("1996-01-01"))
row.names(recent) <- NULL
str(recent)
# 'data.frame': 653507 obs. of  14 variables:
#  $ REFNUM    : num  249653 251131 251416 252425 252426 ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 7 8 9 10 10 13 13 13 13 13 ...
#  $ COUNTY    : num  12 11 4 1 2 105 19 19 19 109 ...
#  $ YEAR      : chr  "1996" "1996" "1996" "1996" ...
#  $ BGN_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-02" ...
#  $ END_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-05" ...
#  $ EVTYPE    : chr  "HIGH WIND" "HIGH WIND" "HEAVY SNOW" "HEAVY SNOW" ...
#  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ PROPDMG   : num  0 0 0 0 0 50 1.5 2 5 1.5 ...
#  $ PROPDMGEXP: num  0 0 0 0 0 3 3 3 3 3 ...
#  $ CROPDMG   : num  0 1 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: num  0 6 0 0 0 0 3 3 3 3 ...
#  $ REMARKS   : chr  "Wind Damage in Flagstaff, AZ. Non-convective winds knocked down trees, large branches, and power lines. No estimated or measure"| __truncated__ "Strong Santa Ana winds caused widespread tree and crop damage and downed power lines. Premature fruit was knocked off trees, ca"| __truncated__ "Snowfall totals included: Vail 46 inches, Steamboat Springs 45 inches, Beaver Creek 38 inches, Arrowhead ski resort 30 inches, "| __truncated__ "A major winter storm developed over the Gulf coast states on January 2nd and tracked northeast along the eastern seaboard durin"| __truncated__ ...

Computing actual values for property and crop damages

The final step of the data processing is the computation of actual US dollar values for damages, combining the DMG and DMGEXP variables.

NOTE: before we compute them, we make an ad hoc fix to the value of PROPDMGEXP for one specific entry, a flood in Napa in 2005. More details about this can be found in this section of the Appendix.

recent$PROPDMGEXP[recent$REFNUM == 605943] <- 6
recent$PropDamage <- recent$PROPDMG * 10^recent$PROPDMGEXP
recent$CropDamage <- recent$CROPDMG * 10^recent$CROPDMGEXP
str(recent)
# 'data.frame': 653507 obs. of  16 variables:
#  $ REFNUM    : num  249653 251131 251416 252425 252426 ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 7 8 9 10 10 13 13 13 13 13 ...
#  $ COUNTY    : num  12 11 4 1 2 105 19 19 19 109 ...
#  $ YEAR      : chr  "1996" "1996" "1996" "1996" ...
#  $ BGN_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-02" ...
#  $ END_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-05" ...
#  $ EVTYPE    : chr  "HIGH WIND" "HIGH WIND" "HEAVY SNOW" "HEAVY SNOW" ...
#  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ PROPDMG   : num  0 0 0 0 0 50 1.5 2 5 1.5 ...
#  $ PROPDMGEXP: num  0 0 0 0 0 3 3 3 3 3 ...
#  $ CROPDMG   : num  0 1 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: num  0 6 0 0 0 0 3 3 3 3 ...
#  $ REMARKS   : chr  "Wind Damage in Flagstaff, AZ. Non-convective winds knocked down trees, large branches, and power lines. No estimated or measure"| __truncated__ "Strong Santa Ana winds caused widespread tree and crop damage and downed power lines. Premature fruit was knocked off trees, ca"| __truncated__ "Snowfall totals included: Vail 46 inches, Steamboat Springs 45 inches, Beaver Creek 38 inches, Arrowhead ski resort 30 inches, "| __truncated__ "A major winter storm developed over the Gulf coast states on January 2nd and tracked northeast along the eastern seaboard durin"| __truncated__ ...
#  $ PropDamage: num  0 0 0 0 0 50000 1500 2000 5000 1500 ...
#  $ CropDamage: num  0e+00 1e+06 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 ...

RESULTS

We aim to answer two questions concerning the impact of different types of events across the United States:

  1. Which types of events are most harmful with respect to population health?
  2. Which types of events have the greatest economic consequences?

Further data manipulation

We work with a data-frame comprising a subset of variables and summarize the data for human and economic impact in two final data-frames: human and economic. In each of them, we aggregate the data by EVTYPE and compute:

  • the number of events of that type.
  • the number of events with a given impact (fatalities, injuries, property and crop damages, each counted separately).
  • the fraction (%) of events with impact.
  • the cumulative impact (e.g. total fatalities, or the total dollar amount of damages).
  • the average impact (computed considering only the events with non-zero impact).

We keep only EVTYPE with at least 100 occurrences.

library(plyr)  # for ddply()

lean <- recent[c(1, 2, 4, 7, 8, 9, 14:16)]
human <- ddply(lean, .(EVTYPE), summarize, 
               N.tot = length(FATALITIES), 
               N.with.F = length(FATALITIES[FATALITIES>0]), 
               pct.with.F = length(FATALITIES[FATALITIES>0])/length(FATALITIES)*100.,
               F.tot = sum(FATALITIES), 
               F.avrg = sum(FATALITIES)/length(FATALITIES[FATALITIES>0]), 
               N.with.I = length(INJURIES[INJURIES>0]), 
               pct.with.I = length(INJURIES[INJURIES>0])/length(FATALITIES)*100., 
               I.tot = sum(INJURIES), 
               I.avrg = sum(INJURIES)/length(INJURIES[INJURIES>0]), 
               flag =  (sum(FATALITIES) + sum(INJURIES))>0 ) 
human <- subset(human, N.tot >100)

str(human)
# 'data.frame': 42 obs. of  11 variables:
#  $ EVTYPE    : chr  "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" "COASTAL FLOOD" ...
#  $ N.tot     : int  277 378 2634 795 643 618 2479 173 559 1851 ...
#  $ N.with.F  : int  0 173 46 8 93 14 0 3 9 565 ...
#  $ pct.with.F: num  0 45.77 1.75 1.01 14.46 ...
#  $ F.tot     : num  0 223 70 10 131 ...
#  $ F.avrg    : num  NaN 1.29 1.52 1.25 1.41 ...
#  $ N.with.I  : int  0 105 41 6 4 14 1 9 50 163 ...
#  $ pct.with.I: num  0 27.778 1.557 0.755 0.622 ...
#  $ I.tot     : num  0 156 385 10 24 ...
#  $ I.avrg    : num  NaN 1.49 9.39 1.67 6 ...
#  $ flag      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
economic <- ddply(lean, .(EVTYPE), summarize, 
                  N.tot = length(PropDamage), 
                  N.with.PrDmg = length(PropDamage[PropDamage>0]),
                  pct.with.PrDmg = length(PropDamage[PropDamage>0])/length(PropDamage)*100., 
                  PrDmg.tot = sum(PropDamage)/1.0e6, 
                  PrDmg.avrg = sum(PropDamage)/length(PropDamage[PropDamage>0])/1.0e6,
                  N.with.CrDmg = length(CropDamage[CropDamage>0]),
                  pct.with.CrDmg = length(CropDamage[CropDamage>0])/length(PropDamage)*100., 
                  CrDmg.tot = sum(CropDamage)/1.0e6, 
                  CrDmg.avrg = sum(CropDamage)/length(CropDamage[CropDamage>0])/1.0e6,
                  flag =  (sum(PropDamage) + sum(CropDamage))>0 ) 
economic <- subset(economic, N.tot >100)

str(economic)
# 'data.frame': 42 obs. of  11 variables:
#  $ EVTYPE        : chr  "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" "COASTAL FLOOD" ...
#  $ N.tot         : int  277 378 2634 795 643 618 2479 173 559 1851 ...
#  $ N.with.PrDmg  : int  10 52 193 197 19 193 53 66 150 23 ...
#  $ pct.with.PrDmg: num  3.61 13.76 7.33 24.78 2.95 ...
#  $ PrDmg.tot     : num  9.74 3.71 525.66 407.37 2.54 ...
#  $ PrDmg.avrg    : num  0.9745 0.0714 2.7236 2.0679 0.1339 ...
#  $ N.with.CrDmg  : int  0 0 3 0 6 2 243 6 4 2 ...
#  $ pct.with.CrDmg: num  0 0 0.114 0 0.933 ...
#  $ CrDmg.tot     : num  0 0 7.06 0 30.74 ...
#  $ CrDmg.avrg    : num  NaN NaN 2.35 NaN 5.12 ...
#  $ flag          : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

With the data summarized we can easily slice them in different ways.
In the next two sections we review the evidence emerging from the data collected and organized as described so far.
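For instance, the first fatalities table below can be reproduced along these lines (a sketch; the rendered tables use prettified column labels):

```r
# top ten event types by total fatalities, from the 'human' summary data-frame
top.fatal <- human[order(human$F.tot, decreasing=TRUE),
                   c("EVTYPE", "F.tot", "N.tot", "N.with.F", "pct.with.F")]
head(top.fatal, 10)
```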

Population Health: Injuries and Fatalities

Fatalities

Top ten event types by total cumulative number of fatalities.

   Event_Type               N_Fata  N_events  with_Fata  with_Fata(%)
1  EXCESSIVE HEAT             1799      1851        565         30.52
2  TORNADO                    1511     23154        459          1.98
3  FLASH FLOOD                 887     51005        584          1.14
4  LIGHTNING                   650     13203        608          4.61
5  RIP CURRENT                 542       734        480         65.40
6  FLOOD                       444     27761        294          1.06
7  THUNDERSTORM WIND           379    211197        327          0.15
8  EXTREME COLD/WIND CHILL     265      1892        195         10.31
9  HIGH WIND                   254     20240        194          0.96
10 HEAT                        237       829        133         16.04

Top 10 event types by the fraction (%) of events of that type that caused fatalities

   Event_Type               with_Fata(%)  N_events  with_Fata  N_Fata
1  RIP CURRENT                     65.40       734        480     542
2  AVALANCHE                       45.77       378        173     223
3  EXCESSIVE HEAT                  30.52      1851        565    1799
4  HEAT                            16.04       829        133     237
5  HURRICANE                       15.50       271         42     125
6  COLD/WIND CHILL                 14.46       643         93     131
7  EXTREME COLD/WIND CHILL         10.31      1892        195     265
8  HIGH SURF                        9.65      1047        101     142
9  LIGHTNING                        4.61     13203        608     650
10 TROPICAL STORM                   3.52       682         24      57

Injuries

Top ten event types by total cumulative number of injuries

   Event_Type         N_Inj  N_events  with_Inj  with_Inj(%)
1  TORNADO            20667     23154      1877         8.11
2  FLOOD               6838     27761       169         0.61
3  EXCESSIVE HEAT      6461      1851       163         8.81
4  THUNDERSTORM WIND   5129    211197      2108         1.00
5  LIGHTNING           4140     13203      2250        17.04
6  FLASH FLOOD         1674     51005       335         0.66
7  WILDFIRE            1458      4176       315         7.54
8  HURRICANE           1328       271        27         9.96
9  WINTER STORM        1292     11315       144         1.27
10 HEAT                1239       829        40         4.83

Top 10 event types by the fraction (%) of events of that type that caused injuries

   Event_Type      with_Inj(%)  N_events  with_Inj  N_Inj
1  AVALANCHE             27.78       378       105    156
2  RIP CURRENT           26.02       734       191    503
3  LIGHTNING             17.04     13203      2250   4140
4  HURRICANE              9.96       271        27   1328
5  DUST STORM             8.94       559        50    415
6  EXCESSIVE HEAT         8.81      1851       163   6461
7  TORNADO                8.11     23154      1877  20667
8  WILDFIRE               7.54      4176       315   1458
9  HIGH SURF              6.78      1047        71    238
10 FOG GENERAL            5.23      1777        93    855

Economic Impact: Property, Agricultural and Total Damages

As in the previous section, we report events ranked by total damage and by fraction of events causing a recorded amount of damages.

Property

Top ten event types by total cumulative property damages (in Millions of USD)

   Event_Type         Tot_PropDmg  N_events  with_PropDmg  with_PropDmg(%)
1  HURRICANE             81718.89       271           196            72.32
2  STORM SURGE           47834.72       401           212            52.87
3  FLOOD                 29244.58     27761          9838            35.44
4  TORNADO               24616.91     23154         11847            51.17
5  FLASH FLOOD           15222.27     51005         18650            36.57
6  HAIL                  14595.21    207767         20004             9.63
7  THUNDERSTORM WIND      7913.54    211197        103493            49.00
8  WILDFIRE               7760.45      4176          1026            24.57
9  TROPICAL STORM         7642.48       682           390            57.18
10 HIGH WIND              5250.67     20240          5258            25.98

Top 10 event types by the fraction (%) of events of that type that caused property damages

   Event_Type         with_PropDmg(%)  N_events  with_PropDmg  Tot_PropDmg
1  STRONG WIND                  86.22      3758          3240       176.99
2  LIGHT SNOW                   81.03       174           141         2.51
3  HURRICANE                    72.32       271           196     81718.89
4  LIGHTNING                    66.23     13203          8744       743.08
5  TROPICAL STORM               57.18       682           390      7642.48
6  STORM SURGE                  52.87       401           212     47834.72
7  TORNADO                      51.17     23154         11847     24616.91
8  THUNDERSTORM WIND            49.00    211197        103493      7913.54
9  DRY MICROBURST               38.15       173            66         1.73
10 FLASH FLOOD                  36.57     51005         18650     15222.27

Agricultural

Top ten event types by total cumulative crop damages (in Millions of USD)

   Event_Type               Tot_CropDmg  N_events  with_CropDmg  with_CropDmg(%)
1  DROUGHT                     13367.57      2479           243             9.80
2  HURRICANE                    5350.11       271            85            31.37
3  FLOOD                        5013.16     27761          1722             6.20
4  HAIL                         2496.82    207767          8109             3.90
5  FLASH FLOOD                  1334.90     51005          1845             3.62
6  EXTREME COLD/WIND CHILL      1326.02      1892            50             2.64
7  FROST/FREEZE                 1094.19      1343           103             7.67
8  THUNDERSTORM WIND            1016.94    211197          4453             2.11
9  HEAVY RAIN                    729.67     11546           124             1.07
10 TROPICAL STORM                677.71       682            59             8.65

Top 10 event types by the fraction (%) of events of that type that caused crop damages

   Event_Type      with_CropDmg(%)  N_events  with_CropDmg  Tot_CropDmg
1  HURRICANE                 31.37       271            85      5350.11
2  DROUGHT                    9.80      2479           243     13367.57
3  TROPICAL STORM             8.65       682            59       677.71
4  FROST/FREEZE               7.67      1343           103      1094.19
5  FLOOD                      6.20     27761          1722      5013.16
6  TORNADO                    5.42     23154          1254       283.43
7  HAIL                       3.90    207767          8109      2496.82
8  FLASH FLOOD                3.62     51005          1845      1334.90
9  DRY MICROBURST             3.47       173             6         0.02
10 WILDFIRE                   2.97      4176           124       402.26

# plotting uses ggplot2 (ggplotGrob) and grid (grid.draw)
library(ggplot2)
library(grid)

# 'eco20plot' is a subset of the 'economic' summary prepared for plotting
# (its definition is not shown here)
p <- ggplot(data=eco20plot, aes(x=EVTYPE)) + theme_bw()

ecoplot.p1 <- p + geom_bar(stat="identity", aes(y=PrDmg.tot), fill="skyblue") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(legend.position="none") + 
    scale_y_log10() + coord_cartesian(ylim=c(1e1, 3e5)) +
    xlab("") + ylab("Damages (Million of Dollars)") +
    ggtitle("Total Property Damages by Event Type") 

ecoplot.p2 <- p + geom_bar(stat="identity", aes(y=pct.with.PrDmg), fill="navyblue") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(legend.position="none") + 
    xlab("Event Type") + ylab("Percentage") + ylim(c(-5,105)) +
    ggtitle("Fraction of Events Causing Damages")

grid.draw(rbind(ggplotGrob(ecoplot.p1), ggplotGrob(ecoplot.p2),  size="last"))

The same logic identifies the events with the greatest economic impact: we aggregate the total property and crop damages per event type, sort the data, and extract the top ten event types for each damage category.


APPENDIX

[Back to the Top]

The “anomalous” 2005 Napa Valley Flood

Going by the “raw” data, the event that caused the highest property damage was a flood in Napa, California, at the end of 2005, reported at over 100 Billion USD. However, the REMARKS entry in the database itself raises doubts about this figure, and a USGS report assessing the impact of the late-2005 storms confirms those doubts.
Our best guess is that the value of the EXP parameter for this event should have been M instead of B, as this would bring the damage amount in line with the narrative remarks and the USGS report. We therefore adjusted the PROPDMGEXP for this specific event accordingly.
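A hedged sketch of that adjustment follows. The value 115 is illustrative (consistent with the “over 100 Billion” reported above when decoded with exponent B); the index lookup and the multiplier table are assumptions, not the report's actual code:

```r
# Suspect entry: a PROPDMG of 115 with exponent "B" decodes to 115 Billion USD.
napa <- data.frame(PROPDMG = 115, PROPDMGEXP = "B", stringsAsFactors = FALSE)

# exponent letters -> numeric multipliers (assumed decoding table)
exp2mult <- c(K = 1e3, M = 1e6, B = 1e9)

napa$PROPDMGEXP[1] <- "M"   # adjust B -> M as argued above
damage <- unname(napa$PROPDMG * exp2mult[napa$PROPDMGEXP])
damage
# [1] 1.15e+08   (i.e. 115 Million instead of 115 Billion)
```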

Here is the original data:

recent$REMARKS[which.max(tmp.PropDamage)]
# [1] "Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million."

After this correction, the single event with the highest property damage is the storm surge caused by Hurricane Katrina in Louisiana, at over 31 Billion USD.

recent$REMARKS[which.max(recent$PropDamage)]
# [1] "Storm surge damage in southeast Louisiana, especially in the New Orleans area and the coastal parishes, was catastrophic. Hurricane protection levees and floodwalls were overtopped and/or breached resulting in widespread and deep flooding of homes and businesses. Much of Orleans and Plaquemines Parishes and nearly all of St. Bernard Parish were flooded by storm surge. Approximately 80 percent of the city of New Orleans was flooded. Thousands of people were stranded by the flood waters in homes and buildings and on rooftops for several days and had to be rescued by boat and helicopter. In Jefferson Parish, levees were not compromised, however many homes were flooded by either heavy rain overwhelming limited pumping capacity or storm surge water moving through in-operable pumps into the parish. Severe storm surge damage also occurred along the north shore of Lake Pontchartrain from Mandeville to Slidell with storm surge water moving inland as far as Old Towne Slidell with water up to 6 feet deep in some locations\n\nPost storm high water surveys of the area conducted by FEMA indicated the following storm surge estimates: Orleans Parish - 12-15 feet in east New Orleans to 9 to 12 feet along the Lakefront; St. Bernard Parish - 14 to 17 feet; Jefferson Parish - 6 to 9 feet along the lakefront to 5 to 8 feet from Lafitte to Grand Isle; Plaquemines Parish - 15 to 17 feet; St. Tammany Parish - 11 to 16 feet in southeast portion to 7 to 10 feet in western portion. All storm surge heights are still water elevations referenced to NAVD88 datum."

recent$PropDamage[which.max(recent$PropDamage)]
# [1] 3.13e+10

EVTYPE regularization

This is the source of the evtype_regularization.R script:

cat("# Entering 'evtype_regularization' script\n")
TESTvec <- good$EVTYPE

TESTvec <- gsub("^SNOW$", "HEAVY SNOW", TESTvec, perl=TRUE)
TESTvec <- gsub("^WIND$", "HIGH WIND", TESTvec, perl=TRUE)
TESTvec <- gsub("^WINDS$", "HIGH WIND", TESTvec, perl=TRUE)
TESTvec <- gsub("^ICE$", "ICE STORM", TESTvec, perl=TRUE)
TESTvec <- gsub("^UNSEASONABLY WARM$", "HEAT", TESTvec, perl=TRUE)
TESTvec <- gsub("^TORNDAO$", "TORNADO", TESTvec, perl=TRUE)
TESTvec <- gsub("^WAYTERSPOUT$", "WATERSPOUT", TESTvec, perl=TRUE)
TESTvec <- gsub("^EXTREME HEAT$", "EXCESSIVE HEAT", TESTvec, perl=TRUE)
TESTvec <- gsub("^LIGHT FREEZING RAIN$", "SLEET", TESTvec, perl=TRUE)
TESTvec <- gsub("^UNSEASONABLY DRY$", "DROUGHT", TESTvec, perl=TRUE)
TESTvec <- gsub("^TSTM HEAVY RAIN$", "HEAVY RAIN", TESTvec, perl=TRUE)
TESTvec <- gsub("^SMALL HAIL$", "HAIL", TESTvec, perl=TRUE)
TESTvec <- gsub("^EXCESSIVE SNOW$", "HEAVY SNOW", TESTvec, perl=TRUE)

TESTvec[grepl("^ASTRONOMICAL", TESTvec, perl=TRUE)] <- "ASTRONOMICAL LOW TIDE"
TESTvec[grepl("^AVALA", TESTvec, perl=TRUE)] <- "AVALANCHE"
TESTvec[grepl("^BEACH", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^BITTER", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^BLIZZARD", TESTvec, perl=TRUE)] <- "BLIZZARD"
TESTvec[grepl("^BRUSH FIRE", TESTvec, perl=TRUE)] <- "WILDFIRE"
TESTvec[grepl("^COASTAL", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^DROUGHT", TESTvec, perl=TRUE)] <- "DROUGHT"
TESTvec[grepl("^DUST D", TESTvec, perl=TRUE)] <- "DUST DEVIL"
TESTvec[grepl("^EXCESSIVE H", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^EXTREME [CW]", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^FLASH", TESTvec, perl=TRUE)] <- "FLASH FLOOD"
TESTvec[grepl("^FLO.*FLA", TESTvec, perl=TRUE)] <- "FLASH FLOOD"
TESTvec[grepl("^HYP", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^LIGHT[A-Z]", TESTvec, perl=TRUE)] <- "LIGHTNING"
TESTvec[grepl("^LANDSL", TESTvec, perl=TRUE)] <- "DEBRIS FLOW"
TESTvec[grepl("^HAIL", TESTvec, perl=TRUE)] <- "HAIL"
TESTvec[grepl("^HEAT WAVE", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^HEAVY PR", TESTvec, perl=TRUE)] <- "HEAVY RAIN"
TESTvec[grepl("^HEAVY RAIN", TESTvec, perl=TRUE)] <- "HEAVY RAIN"
TESTvec[grepl("^HEAVY SNOW", TESTvec, perl=TRUE)] <- "HEAVY SNOW"
TESTvec[grepl("^HEAVY SURF", TESTvec, perl=TRUE)] <- "HIGH SURF"
TESTvec[grepl("^HIGH S[UW]", TESTvec, perl=TRUE)] <- "HIGH SURF"
TESTvec[grepl("^MARINE T", TESTvec, perl=TRUE)] <- "MARINE THUNDERSTORM WIND"
TESTvec[grepl("^MINOR FL", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^MIXED PR", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("^MUD", TESTvec, perl=TRUE)] <- "DEBRIS FLOW"
TESTvec[grepl("^RECORD C", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^RECORD H", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^RECORD WA", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^RIP", TESTvec, perl=TRUE)] <- "RIP CURRENT"
TESTvec[grepl("^R.*FLOOD", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^SEVERE TH", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"
TESTvec[grepl("^SLEET", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("^SMALL STR", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^STRONG W", TESTvec, perl=TRUE)] <- "STRONG WIND"
TESTvec[grepl("^TIDAL FL", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^TORRENTIAL", TESTvec, perl=TRUE)] <- "HEAVY RAIN"
TESTvec[grepl("^TROPICAL STORM", TESTvec, perl=TRUE)] <- "TROPICAL STORM"
TESTvec[grepl("^TSTM W", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"
TESTvec[grepl("^URBAN", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^UNSEASO.* CO", TESTvec, perl=TRUE)] <- "COLD/WIND CHILL"
TESTvec[grepl("^WINTER ST", TESTvec, perl=TRUE)] <- "WINTER STORM"
TESTvec[grepl("^WINTER W", TESTvec, perl=TRUE)] <- "WINTER WEATHER"
TESTvec[grepl("^WATER", TESTvec, perl=TRUE)] <- "WATERSPOUT"
TESTvec[grepl("^WILD", TESTvec, perl=TRUE)] <- "WILDFIRE"
TESTvec[grepl("^WIND CHILL", TESTvec, perl=TRUE)] <- "COLD/WIND CHILL"
TESTvec[grepl("^VOLCANIC", TESTvec, perl=TRUE)] <- "VOLCANIC ASH"


TESTvec[grepl("[VF]OG", TESTvec, perl=TRUE)] <- "FOG GENERAL"
TESTvec[grepl("TYPHOON", TESTvec, perl=TRUE)] <- "HURRICANE"
TESTvec[grepl("SMOKE", TESTvec, perl=TRUE)] <- "DENSE SMOKE"
TESTvec[grepl("FIRES", TESTvec, perl=TRUE)] <- "WILDFIRE"
TESTvec[grepl("SQUALL", TESTvec, perl=TRUE)] <- "HEAVY SNOW"
TESTvec[grepl("EROSION", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^SNOW.*ICE", TESTvec, perl=TRUE)] <- "ICE STORM"
TESTvec[grepl("^SNOW.*SLEET", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("^SNOW.*RAIN", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("LAKE.*SNOW", TESTvec, perl=TRUE)] <- "LAKE-EFFECT SNOW"
TESTvec[grepl("LAKE.*FLOOD", TESTvec, perl=TRUE)] <- "LAKESHORE FLOOD"
TESTvec[grepl("HURRICANE-GENERATED SWELLS", TESTvec, perl=TRUE)] <- "STORM SURGE"

TESTvec[grepl("^HURRICANE", TESTvec, perl=TRUE)] <- "HURRICANE"             # after |HURRICANE-GENERATED SWELLS

TESTvec[grepl("^ICE[ /S][A-EO-T]", TESTvec, perl=TRUE)] <- "ICE STORM"      # after |[VF]OG
TESTvec[grepl("^FREEZING [DR]", TESTvec, perl=TRUE)] <- "SLEET"             # after |[VF]OG
TESTvec[grepl("^WINT.* MIX", TESTvec, perl=TRUE)] <- "SLEET"                # after |^WINTER W
TESTvec[grepl("SURGE", TESTvec, perl=TRUE)] <- "STORM SURGE"                # after |^COASTAL
TESTvec[grepl("FUNNEL", TESTvec, perl=TRUE)] <- "FUNNEL CLOUD"              # after |^WATER
TESTvec[grepl("TORNADO", TESTvec, perl=TRUE)] <- "TORNADO"                  # after |^WATER
TESTvec[grepl("DUST", TESTvec, perl=TRUE)] <- "DUST STORM"                  # after |^DUST D
TESTvec[grepl("^HIGH WIN", TESTvec, perl=TRUE)] <- "HIGH WIND"              # after |DUST
TESTvec[grepl("^FLOOD", TESTvec, perl=TRUE)] <- "FLOOD"                     # after |^FLO.*FLA

TESTvec[grepl("^COLD", TESTvec, perl=TRUE)] <- "COLD/WIND CHILL"            # after |FUNNEL  , |TORNADO
TESTvec[grepl("^THUNDERSTORM", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"  # after |FUNNEL
TESTvec[grepl("^THU.*TOR.*WI", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"  # after |FUNNEL

cat("# Leaving 'evtype_regularization' script\n")

R Session Info

sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
# [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] grid      stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_1.0.1  reshape2_1.4.1 dplyr_0.4.2    plyr_1.8.3     knitr_1.10.5  
# 
# loaded via a namespace (and not attached):
#  [1] DBI_0.3.1        MASS_7.3-41      R6_2.0.1         Rcpp_0.11.6      assertthat_0.1  
#  [6] colorspace_1.2-6 digest_0.6.8     evaluate_0.7     formatR_1.2      gtable_0.1.2    
# [11] highr_0.5        htmltools_0.2.6  labeling_0.3     lazyeval_0.1.10  magrittr_1.5    
# [16] munsell_0.4.2    parallel_3.1.3   proto_0.3-10     rmarkdown_0.7    scales_0.2.4    
# [21] stringi_0.5-5    stringr_1.0.0    tools_3.1.3      yaml_2.1.13