Preamble

Report for the first assignment of the Reproducible Research course of the Coursera/JHSPH Data Science Specialization.

The source files are posted on GitHub.

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database records major weather-related events in the United States, tracking their times and locations as well as estimates of fatalities, injuries, property and agricultural damages that may be associated with them.

We seek to identify which types of storm events have the greatest impact on the population, both in terms of health and of economic consequences. To address these questions, our analysis aggregates the data by storm event type to identify the categories responsible for the largest impact as measured by fatalities, injuries, property damage and crop damage.

Findings

Looking at each harm/damage category separately, our analysis indicates that the most severe impact comes from the following types of events:

  • Agricultural Damage: hurricanes and drought, both in terms of total economic cost and fraction of events that cause significant damage.
    It is worth noting that, in addition to events directly classified as hurricanes, several of the most impactful event types are or can be related to hurricanes or other kinds of tropical storm activity (e.g. tropical storms, storm surge).

  • Property Damage: in absolute terms, hurricanes and floods dominate the economic impact (and here again one may want to consider a broader “tropical event” category). When we consider the likelihood that a storm causes “recordable” damages, wind and snow emerge at the top. While their cumulative cost may not be as substantial as that of major events like hurricanes, the damage caused by wind and snow can seriously affect the life and economic activity of a wider range of families and communities across the country, and it warrants attention and adequate preparation.

  • Human Fatalities/Injuries: tornadoes, floods (including flash floods) and excessive heat emerge as the most serious threats to human health in absolute terms. It is interesting to note, however, that in terms of the fraction of events with fatalities/injuries, rip currents and avalanches take the top spots (with a “likelihood” of fatalities >= 50% and of injuries of about 25% for a given event).
    This may suggest that people expose themselves to avoidable risks, given that rip currents and avalanches are typically associated with leisure activities.

Future Work

The scope and time for this investigation were limited, but there are clear directions worth further work, such as:

  • Geographic analysis, clearly necessary given the fairly localized impact of different types of events.
  • Variation over time, perhaps reflecting changes in behavior, preparation and policies.

THE DATA

The data structure and definition documentation is available from the NOAA National Climatic Data Center (NCDC).

The raw data comprise 37 variables with 902,297 observations from the years 1950-2011.

The Data and the Aim of Our Analysis

The objective of this research is to address two main questions:

  • Which events have the greatest health consequences, measured by fatalities and injuries?
  • Which events have the greatest economic consequences, measured by property damage, crop damage and total economic damage?

The main variables in the NOAA/NCDC Storm Event databases that we will consider are:

  • For storm event date and type:

    • BGN_DATE: begin date of event.
    • EVTYPE: event type (see discussion later).
  • For human and economic damages associated with each storm event:

    • FATALITIES: number of fatalities.
    • INJURIES: number of injuries.
    • PROPDMG: property damage assessment.
    • CROPDMG: agricultural damage assessment.

The values reported for the latter two have to be converted into actual US dollar amounts by multiplying them by a scaling factor coded in two other variables (PROPDMGEXP, CROPDMGEXP) as the power of 10 of the multiplier. These two auxiliary variables seem to suffer from some data-entry issues, which we will address below.
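The conversion itself is a simple scaling. A minimal sketch, using hypothetical values for a single entry:

```r
# sketch of the damage conversion: PROPDMG is scaled by 10^PROPDMGEXP,
# e.g. a PROPDMG of 25 with exponent code "K" (kilo, i.e. exponent 3)
propdmg  <- 25             # hypothetical PROPDMG value
exponent <- 3              # "K" decoded to its power of 10
propdmg * 10^exponent      # actual damage in US dollars
# [1] 25000
```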

Loading the Dataset

The data set comes as a compressed CSV file, which can be read directly with read.csv(). After reviewing the data, we chose the classes for the variables loaded into the data-frame and set them at load time via the colClasses option, using a character vector read from an auxiliary file.

NOTE: given the substantial size of the data set, we set the knitr option cache = TRUE to keep processing time manageable.

col.cl <- read.csv("data/readin_classes.csv", stringsAsFactors=FALSE, strip.white=TRUE)
col.cl$newclass2
#  [1] "numeric"   "character" "character" "factor"    "numeric"   "character" "factor"    "character"
#  [9] "numeric"   "character" "character" "character" "character" "numeric"   "character" "numeric"  
# [17] "character" "character" "numeric"   "numeric"   "factor"    "numeric"   "numeric"   "numeric"  
# [25] "numeric"   "character" "numeric"   "character" "character" "character" "character" "numeric"  
# [33] "numeric"   "numeric"   "numeric"   "character" "numeric"
data <- read.csv("data/StormData.csv.bz2", colClasses= col.cl$newclass2, nrows=1300000, strip.white=TRUE) 

This is the structure of the data-frame as loaded.

str(data)
# 'data.frame': 902297 obs. of  37 variables:
#  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
#  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
#  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
#  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 6 6 6 6 6 6 6 6 6 6 ...
#  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
#  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ BGN_AZI   : chr  "" "" "" "" ...
#  $ BGN_LOCATI: chr  "" "" "" "" ...
#  $ END_DATE  : chr  "" "" "" "" ...
#  $ END_TIME  : chr  "" "" "" "" ...
#  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ COUNTYENDN: chr  "" "" "" "" ...
#  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ END_AZI   : chr  "" "" "" "" ...
#  $ END_LOCATI: chr  "" "" "" "" ...
#  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
#  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
#  $ F         : Factor w/ 7 levels "","0","1","2",..: 5 4 4 4 4 4 4 3 5 5 ...
#  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
#  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
#  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
#  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
#  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: chr  "" "" "" "" ...
#  $ WFO       : chr  "" "" "" "" ...
#  $ STATEOFFIC: chr  "" "" "" "" ...
#  $ ZONENAMES : chr  "" "" "" "" ...
#  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
#  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
#  $ LATITUDE_E: num  3051 0 0 0 0 ...
#  $ LONGITUDE_: num  8806 0 0 0 0 ...
#  $ REMARKS   : chr  "" "" "" "" ...
#  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

DATA PROCESSING

Due to the way this dataset was developed, several variables suffer from inconsistent formatting. To make our analysis as robust as possible, we cleaned and reformatted some of the potentially useful variables and created new ones for convenience.

Cleaning

REMARKS and EVTYPE: tidying up spaces and letters’ case

data$REMARKS <- gsub("^ *", "", data$REMARKS, perl=TRUE)
data$REMARKS <- gsub(" *$", "", data$REMARKS, perl=TRUE)
data$REMARKS <- gsub("[ ]{2,}", " ", data$REMARKS, perl=TRUE)

data$EVTYPE <- gsub("^ *", "", data$EVTYPE, perl=TRUE)
data$EVTYPE <- gsub(" *$", "", data$EVTYPE, perl=TRUE)
data$EVTYPE <- gsub("[ ]{2,}", " ", data$EVTYPE, perl=TRUE)

data$COUNTYNAME <- toupper(data$COUNTYNAME)
data$EVTYPE     <- toupper(data$EVTYPE)
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)

Setting “missing” coordinates to NA

data$LATITUDE[data$LATITUDE <= 0]  <- NA
data$LONGITUDE[data$LONGITUDE <= 0]  <- NA

Fixing times

First, a small fix motivated by the discovery of an O instead of a 0 in the BGN_TIME variable:

data$BGN_TIME <- gsub("([0-9])O(.*)([^M])$","\\10\\2\\3", data$BGN_TIME, perl=TRUE)

It is better to work with properly constructed time variables, hence we convert BGN_DATE and END_DATE into POSIXlt class variables.

# making BGN and END dates dates
data$BGN_DATE.new <- strptime(as.character(data$BGN_DATE), "%m/%d/%Y %H:%M:%S")
data$END_DATE.new <- strptime(as.character(data$END_DATE), "%m/%d/%Y %H:%M:%S")

Finally, we define a new variable YEAR.

# create a 4-digit year variable
data$YEAR <- substr(as.character(data$BGN_DATE.new), 1, 4)

Subsetting the full data-frame to a good data-frame

For ease of processing it is practical to work with a reduced set of variables.
Our choice was the following:

sel.columns.n <- c( 37, 7, 5, 40, 38, 39, 8, 23, 24, 25, 26, 27, 28, 36)
colnames(data)[sel.columns.n]
#  [1] "REFNUM"       "STATE"        "COUNTY"       "YEAR"         "BGN_DATE.new" "END_DATE.new"
#  [7] "EVTYPE"       "FATALITIES"   "INJURIES"     "PROPDMG"      "PROPDMGEXP"   "CROPDMG"     
# [13] "CROPDMGEXP"   "REMARKS"

After creating the new data-frame, we rename the newly created time variables to take the original names, and sort the data-frame chronologically by BGN_DATE.

good <- data[, sel.columns.n]

colnames(good)[5] <- "BGN_DATE"
colnames(good)[6] <- "END_DATE"

# sorting data frame chronologically 
good <- good[order(good$BGN_DATE), ]

The EVTYPE problem!

We perform some more cleaning steps on this leaner data-frame to regularize the EVTYPE variable.
For historical reasons, namely the heterogeneous sources of the data compiled into the Storm Event database, the EVTYPE variable requires a significant amount of work to tidy it into a more usable form.

While NOAA has, since 1996, codified a set of 48 event types (see the NWS Directive) to be used in the classification of storm events, a quick review of the distinct EVTYPE values for each year shows that their number remained higher until much more recently.

EVTYPE.by.year <- tapply(good$EVTYPE, good$YEAR, function(x) table(x), simplify=TRUE)
sapply(EVTYPE.by.year, function(x) length(names(x)))
# 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
#    1    1    1    1    1    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
#    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
#    3    3    3  160  266  384  199  135  122  118  110  120   99   51   38   46   50   46   46   46 
# 2010 2011 
#   46   46

Until 1993 only “Tornado”, “Thunderstorm Wind” and “Hail” events were recorded. The 1993-1995 data were extracted from unformatted text files, which helps explain the high level of noise in EVTYPE during those years. It is less clear why, even after the 1996 Directive, the data took several more years to converge to the standard set of categories.

A quick census of the most frequent EVTYPE values can be extracted with table():

evt.census <- as.data.frame(table(data$EVTYPE))
head(evt.census[order(evt.census$Freq, decreasing=TRUE),],10)
#                   Var1   Freq
# 203               HAIL 288661
# 766          TSTM WIND 219946
# 672  THUNDERSTORM WIND  82564
# 745            TORNADO  60652
# 129        FLASH FLOOD  54278
# 145              FLOOD  25327
# 698 THUNDERSTORM WINDS  20850
# 309          HIGH WIND  20214
# 407          LIGHTNING  15755
# 265         HEAVY SNOW  15708

nrow(evt.census)
# [1] 883

The full database comprises 883 unique EVTYPE values!
The problem with inconsistent naming is immediately obvious: the top ten alone contain three entries for the same storm event type, Thunderstorm Wind: TSTM WIND, THUNDERSTORM WIND, THUNDERSTORM WINDS.

EVTYPE Regularization

After an extensive analysis of the EVTYPE values, we put together a lengthy set of substitutions via gsub() and grepl() to consolidate as many related types as possible. We run this by calling an external script to avoid clogging the document, but we include the source of the script in the Appendix.
The script creates a new character vector, which is added to the good data-frame as EVTYPE.

# Calling 'grepl.R' script to clean the EVTYPE entries
source("./scripts/evtype_regularization.R")
# # Entering 'evtype_regularization' script
# # Leaving 'evtype_regularization' script

# colnames(good)[7] <- "old.EVTYPE"
good$EVTYPE <- TESTvec  # 'TESTvec' is the regularized vector created by the script

After regularization the number of unique EVTYPE values is now 373.
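To give a flavor of the approach, here is an illustrative sketch only (the actual substitutions live in the script included in the Appendix), collapsing the three “Thunderstorm Wind” spellings seen in the top-10 census above:

```r
# illustrative sketch of the EVTYPE consolidation approach
ev <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS")
ev <- gsub("^TSTM", "THUNDERSTORM", ev)   # expand the TSTM abbreviation
ev <- gsub("WINDS$", "WIND", ev)          # drop the spurious plural
unique(ev)
# [1] "THUNDERSTORM WIND"
```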

The PROPDMGEXP and CROPDMGEXP problem

The variables PROPDMGEXP and CROPDMGEXP are supposed to encode power-of-10 multipliers for their respective economic damage variables.
They too suffer from inconsistent data entry in the transition years around 1995, as can be seen from this quick analysis of the number of unique values for each year, accompanied by the list of values for the year with the highest number of them:

PROPDMGEXP.by.year <- tapply(good$PROPDMGEXP, good$YEAR, function(x) table(x), simplify=TRUE)
sapply(PROPDMGEXP.by.year, function(x) length(names(x)))
# 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
#    2    2    2    2    2    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
#    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
#    3    3    3    7    9   17    3    4    4    4    4    4    3    4    4    4    4    2    3    2 
# 2010 2011 
#    3    4

PROPDMGEXP.by.year[which.max(sapply(PROPDMGEXP.by.year, function(x) length(names(x))))]
# $`1995`
# x
#           +     -     0     1     2     3     4     5     6     7     8     ?     B     H     K 
# 18331     4     1   186    25    13     4     3    26     3     5     1     5     5     6  8768 
#     M 
#   584

CROPDMGEXP.by.year <- tapply(good$CROPDMGEXP, good$YEAR, function(x) table(x), simplify=TRUE)
sapply(CROPDMGEXP.by.year, function(x) length(names(x)))
# 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
# 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
#    1    1    1    6    5    7    3    3    3    3    3    3    3    3    3    4    4    2    2    2 
# 2010 2011 
#    2    3

CROPDMGEXP.by.year[which.max(sapply(CROPDMGEXP.by.year, function(x) length(names(x))))]
# $`1995`
# x
#           0     2     ?     B     K     M 
# 27019     8     1     5     3   800   134

We corrected their values according to the following interpretation:

  • The values are interpreted as exponents of base 10.
  • Missing values (NA) or empty strings are interpreted as \(10^0=1\).
  • H, K, M, B are interpreted as: hecto (= 2), kilo (= 3), mega (= 6), billion (= 9).
  • Numeric values are taken at face value as the exponent.
  • All remaining, effectively undefined, values are considered spurious and set to 0.

After these conversions the PROPDMGEXP and CROPDMGEXP variables are converted to numeric class.

good$PROPDMGEXP[good$PROPDMGEXP == "?" | good$PROPDMGEXP == "+" | 
                good$PROPDMGEXP == "-" | good$PROPDMGEXP == ""] <- 0
good$PROPDMGEXP[good$PROPDMGEXP == "B"] <- 9
good$PROPDMGEXP[good$PROPDMGEXP == "M"] <- 6
good$PROPDMGEXP[good$PROPDMGEXP == "K"] <- 3
good$PROPDMGEXP[good$PROPDMGEXP == "H"] <- 2

good$CROPDMGEXP[good$CROPDMGEXP == "?" | good$CROPDMGEXP == "+" | 
                good$CROPDMGEXP == "-" | good$CROPDMGEXP == ""] <- 0
good$CROPDMGEXP[good$CROPDMGEXP == "B"] <- 9
good$CROPDMGEXP[good$CROPDMGEXP == "M"] <- 6
good$CROPDMGEXP[good$CROPDMGEXP == "K"] <- 3
good$CROPDMGEXP[good$CROPDMGEXP == "H"] <- 2

good$PROPDMGEXP <- as.numeric(good$PROPDMGEXP)
good$CROPDMGEXP <- as.numeric(good$CROPDMGEXP)

The good data frame structure at this point is the following:

str(good)
# 'data.frame': 902297 obs. of  14 variables:
#  $ REFNUM    : num  28479 28480 83316 114844 6348 ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 20 20 37 48 5 20 37 63 63 63 ...
#  $ COUNTY    : num  119 135 189 161 113 91 93 47 39 201 ...
#  $ YEAR      : chr  "1950" "1950" "1950" "1950" ...
#  $ BGN_DATE  : POSIXlt, format: "1950-01-03" "1950-01-03" "1950-01-03" ...
#  $ END_DATE  : POSIXlt, format: NA NA NA ...
#  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
#  $ FATALITIES: num  0 0 0 0 1 0 0 0 0 1 ...
#  $ INJURIES  : num  0 3 3 1 1 0 5 2 0 12 ...
#  $ PROPDMG   : num  250 250 2.5 25 2.5 250 250 0 25 25 ...
#  $ PROPDMGEXP: num  3 3 6 3 3 3 3 3 3 3 ...
#  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ REMARKS   : chr  "" "" "" "" ...

The choice of limiting the analysis to recent years (>=1996)

After reviewing and cleaning the data, we choose to limit our analysis to the data since 1996-01-01.

There are two main motivations for this choice:

  • As noted above, before 1996 the NCDC database only tracked three event types: “Tornado”, “Thunderstorm Wind” and “Hail”. While it would certainly be interesting to perform an analysis of the longer time baseline for these three categories, here we are mostly interested in assessing the impact of storm events from a broader, more inclusive perspective.
  • Imposing this cut yields a significantly cleaner data set, for instance with respect to the DMGEXP and EVTYPE variables (but also the various date/time variables).
# subsetting 'good' to 'recent' taking only events since 1996
recent <- subset(good, BGN_DATE > as.POSIXlt("1996-01-01"))
row.names(recent) <- NULL
str(recent)
# 'data.frame': 653507 obs. of  14 variables:
#  $ REFNUM    : num  249653 251131 251416 252425 252426 ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 7 8 9 10 10 13 13 13 13 13 ...
#  $ COUNTY    : num  12 11 4 1 2 105 19 19 19 109 ...
#  $ YEAR      : chr  "1996" "1996" "1996" "1996" ...
#  $ BGN_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-02" ...
#  $ END_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-05" ...
#  $ EVTYPE    : chr  "HIGH WIND" "HIGH WIND" "HEAVY SNOW" "HEAVY SNOW" ...
#  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ PROPDMG   : num  0 0 0 0 0 50 1.5 2 5 1.5 ...
#  $ PROPDMGEXP: num  0 0 0 0 0 3 3 3 3 3 ...
#  $ CROPDMG   : num  0 1 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: num  0 6 0 0 0 0 3 3 3 3 ...
#  $ REMARKS   : chr  "Wind Damage in Flagstaff, AZ. Non-convective winds knocked down trees, large branches, and power lines. No estimated or measure"| __truncated__ "Strong Santa Ana winds caused widespread tree and crop damage and downed power lines. Premature fruit was knocked off trees, ca"| __truncated__ "Snowfall totals included: Vail 46 inches, Steamboat Springs 45 inches, Beaver Creek 38 inches, Arrowhead ski resort 30 inches, "| __truncated__ "A major winter storm developed over the Gulf coast states on January 2nd and tracked northeast along the eastern seaboard durin"| __truncated__ ...

Computing actual values for property and crop damages

The final step of the data processing is the computation of actual US dollar values for damages, combining the DMG and DMGEXP variables.

NOTE: before we compute them, we make an ad hoc fix to the value of PROPDMGEXP for one specific entry, a flood in Napa in 2005. More details about this can be found in this section of the Appendix.

recent$PROPDMGEXP[recent$REFNUM == 605943] <- 6
recent$PropDamage <- recent$PROPDMG * 10^recent$PROPDMGEXP
recent$CropDamage <- recent$CROPDMG * 10^recent$CROPDMGEXP
str(recent)
# 'data.frame': 653507 obs. of  16 variables:
#  $ REFNUM    : num  249653 251131 251416 252425 252426 ...
#  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 7 8 9 10 10 13 13 13 13 13 ...
#  $ COUNTY    : num  12 11 4 1 2 105 19 19 19 109 ...
#  $ YEAR      : chr  "1996" "1996" "1996" "1996" ...
#  $ BGN_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-02" ...
#  $ END_DATE  : POSIXlt, format: "1996-01-02" "1996-01-02" "1996-01-05" ...
#  $ EVTYPE    : chr  "HIGH WIND" "HIGH WIND" "HEAVY SNOW" "HEAVY SNOW" ...
#  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
#  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ PROPDMG   : num  0 0 0 0 0 50 1.5 2 5 1.5 ...
#  $ PROPDMGEXP: num  0 0 0 0 0 3 3 3 3 3 ...
#  $ CROPDMG   : num  0 1 0 0 0 0 0 0 0 0 ...
#  $ CROPDMGEXP: num  0 6 0 0 0 0 3 3 3 3 ...
#  $ REMARKS   : chr  "Wind Damage in Flagstaff, AZ. Non-convective winds knocked down trees, large branches, and power lines. No estimated or measure"| __truncated__ "Strong Santa Ana winds caused widespread tree and crop damage and downed power lines. Premature fruit was knocked off trees, ca"| __truncated__ "Snowfall totals included: Vail 46 inches, Steamboat Springs 45 inches, Beaver Creek 38 inches, Arrowhead ski resort 30 inches, "| __truncated__ "A major winter storm developed over the Gulf coast states on January 2nd and tracked northeast along the eastern seaboard durin"| __truncated__ ...
#  $ PropDamage: num  0 0 0 0 0 50000 1500 2000 5000 1500 ...
#  $ CropDamage: num  0e+00 1e+06 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 ...

RESULTS

We aim to answer two questions concerning the impact of different types of events across the United States:

  1. Which types of events are most harmful with respect to population health?
  2. Which types of events have the greatest economic consequences?

Further data manipulation

We work with a data-frame comprising a subset of variables and summarize the data for human and economic impact in two final data-frames: human and economic. In each of them, we aggregate the data by EVTYPE and compute:

  • the number of events of that type.
  • the number of events with a given impact (fatalities, injuries, property and crop damages, each counted separately).
  • the fraction (%) of events with impact.
  • the cumulative impact (e.g. total fatalities, or the total dollar amount of damages).
  • the average impact (computed considering only the events with non-zero impact).

We keep only EVTYPE with at least 100 occurrences.

library(plyr)  # for ddply()

lean <- recent[c(1, 2, 4, 7, 8, 9, 14:16)]
human <- ddply(lean, .(EVTYPE), summarize, 
               N.tot = length(FATALITIES), 
               N.with.F = length(FATALITIES[FATALITIES>0]), 
               pct.with.F = length(FATALITIES[FATALITIES>0])/length(FATALITIES)*100.,
               F.tot = sum(FATALITIES), 
               F.avrg = sum(FATALITIES)/length(FATALITIES[FATALITIES>0]), 
               N.with.I = length(INJURIES[INJURIES>0]), 
               pct.with.I = length(INJURIES[INJURIES>0])/length(FATALITIES)*100., 
               I.tot = sum(INJURIES), 
               I.avrg = sum(INJURIES)/length(INJURIES[INJURIES>0]), 
               flag =  (sum(FATALITIES) + sum(INJURIES))>0 ) 
human <- subset(human, N.tot >100)

str(human)
# 'data.frame': 42 obs. of  11 variables:
#  $ EVTYPE    : chr  "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" "COASTAL FLOOD" ...
#  $ N.tot     : int  277 378 2634 795 643 618 2479 173 559 1851 ...
#  $ N.with.F  : int  0 173 46 8 93 14 0 3 9 565 ...
#  $ pct.with.F: num  0 45.77 1.75 1.01 14.46 ...
#  $ F.tot     : num  0 223 70 10 131 ...
#  $ F.avrg    : num  NaN 1.29 1.52 1.25 1.41 ...
#  $ N.with.I  : int  0 105 41 6 4 14 1 9 50 163 ...
#  $ pct.with.I: num  0 27.778 1.557 0.755 0.622 ...
#  $ I.tot     : num  0 156 385 10 24 ...
#  $ I.avrg    : num  NaN 1.49 9.39 1.67 6 ...
#  $ flag      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
economic <- ddply(lean, .(EVTYPE), summarize, 
                  N.tot = length(PropDamage), 
                  N.with.PrDmg = length(PropDamage[PropDamage>0]),
                  pct.with.PrDmg = length(PropDamage[PropDamage>0])/length(PropDamage)*100., 
                  PrDmg.tot = sum(PropDamage)/1.0e6, 
                  PrDmg.avrg = sum(PropDamage)/length(PropDamage[PropDamage>0])/1.0e6,
                  N.with.CrDmg = length(CropDamage[CropDamage>0]),
                  pct.with.CrDmg = length(CropDamage[CropDamage>0])/length(PropDamage)*100., 
                  CrDmg.tot = sum(CropDamage)/1.0e6, 
                  CrDmg.avrg = sum(CropDamage)/length(CropDamage[CropDamage>0])/1.0e6,
                  flag =  (sum(PropDamage) + sum(CropDamage))>0 ) 
economic <- subset(economic, N.tot >100)

str(economic)
# 'data.frame': 42 obs. of  11 variables:
#  $ EVTYPE        : chr  "ASTRONOMICAL LOW TIDE" "AVALANCHE" "BLIZZARD" "COASTAL FLOOD" ...
#  $ N.tot         : int  277 378 2634 795 643 618 2479 173 559 1851 ...
#  $ N.with.PrDmg  : int  10 52 193 197 19 193 53 66 150 23 ...
#  $ pct.with.PrDmg: num  3.61 13.76 7.33 24.78 2.95 ...
#  $ PrDmg.tot     : num  9.74 3.71 525.66 407.37 2.54 ...
#  $ PrDmg.avrg    : num  0.9745 0.0714 2.7236 2.0679 0.1339 ...
#  $ N.with.CrDmg  : int  0 0 3 0 6 2 243 6 4 2 ...
#  $ pct.with.CrDmg: num  0 0 0.114 0 0.933 ...
#  $ CrDmg.tot     : num  0 0 7.06 0 30.74 ...
#  $ CrDmg.avrg    : num  NaN NaN 2.35 NaN 5.12 ...
#  $ flag          : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

With the data summarized we can easily slice them in different ways.
In the next two sections we review the evidence emerging from the data collected and organized as described so far.
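For instance, the first fatalities table below can be reproduced along these lines (a sketch; the rendered tables use prettified column labels):

```r
# top ten event types by total fatalities, from the 'human' summary data-frame
top.fatal <- human[order(human$F.tot, decreasing=TRUE),
                   c("EVTYPE", "F.tot", "N.tot", "N.with.F", "pct.with.F")]
head(top.fatal, 10)
```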

Population Health: Injuries and Fatalities

Fatalities

Top ten event types by total cumulative number of fatalities.

   Event_Type               N_Fata  N_events  with_Fata  with_Fata(%)
1  EXCESSIVE HEAT             1799      1851        565         30.52
2  TORNADO                    1511     23154        459          1.98
3  FLASH FLOOD                 887     51005        584          1.14
4  LIGHTNING                   650     13203        608          4.61
5  RIP CURRENT                 542       734        480         65.40
6  FLOOD                       444     27761        294          1.06
7  THUNDERSTORM WIND           379    211197        327          0.15
8  EXTREME COLD/WIND CHILL     265      1892        195         10.31
9  HIGH WIND                   254     20240        194          0.96
10 HEAT                        237       829        133         16.04

Top 10 event types by the fraction (%) of events of that type that caused fatalities

   Event_Type               with_Fata(%)  N_events  with_Fata  N_Fata
1  RIP CURRENT                     65.40       734        480     542
2  AVALANCHE                       45.77       378        173     223
3  EXCESSIVE HEAT                  30.52      1851        565    1799
4  HEAT                            16.04       829        133     237
5  HURRICANE                       15.50       271         42     125
6  COLD/WIND CHILL                 14.46       643         93     131
7  EXTREME COLD/WIND CHILL         10.31      1892        195     265
8  HIGH SURF                        9.65      1047        101     142
9  LIGHTNING                        4.61     13203        608     650
10 TROPICAL STORM                   3.52       682         24      57

Injuries

Top ten event types by total cumulative number of injuries

   Event_Type         N_Inj  N_events  with_Inj  with_Inj(%)
1  TORNADO            20667     23154      1877         8.11
2  FLOOD               6838     27761       169         0.61
3  EXCESSIVE HEAT      6461      1851       163         8.81
4  THUNDERSTORM WIND   5129    211197      2108         1.00
5  LIGHTNING           4140     13203      2250        17.04
6  FLASH FLOOD         1674     51005       335         0.66
7  WILDFIRE            1458      4176       315         7.54
8  HURRICANE           1328       271        27         9.96
9  WINTER STORM        1292     11315       144         1.27
10 HEAT                1239       829        40         4.83

Top 10 event types by the fraction (%) of events of that type that caused injuries

   Event_Type      with_Inj(%)  N_events  with_Inj  N_Inj
1  AVALANCHE             27.78       378       105    156
2  RIP CURRENT           26.02       734       191    503
3  LIGHTNING             17.04     13203      2250   4140
4  HURRICANE              9.96       271        27   1328
5  DUST STORM             8.94       559        50    415
6  EXCESSIVE HEAT         8.81      1851       163   6461
7  TORNADO                8.11     23154      1877  20667
8  WILDFIRE               7.54      4176       315   1458
9  HIGH SURF              6.78      1047        71    238
10 FOG GENERAL            5.23      1777        93    855

Economic Impact: Property, Agricultural and Total Damages

As in the previous section, we report events ranked by total damage and by fraction of events causing a recorded amount of damages.

Property

Top ten event types by total cumulative property damages (in Millions of USD)

   Event_Type         Tot_PropDmg  N_events  with_PropDmg  with_PropDmg(%)
1  HURRICANE             81718.89       271           196            72.32
2  STORM SURGE           47834.72       401           212            52.87
3  FLOOD                 29244.58     27761          9838            35.44
4  TORNADO               24616.91     23154         11847            51.17
5  FLASH FLOOD           15222.27     51005         18650            36.57
6  HAIL                  14595.21    207767         20004             9.63
7  THUNDERSTORM WIND      7913.54    211197        103493            49.00
8  WILDFIRE               7760.45      4176          1026            24.57
9  TROPICAL STORM         7642.48       682           390            57.18
10 HIGH WIND              5250.67     20240          5258            25.98

Top 10 event types by the fraction (%) of events of that type that caused property damages

   Event_Type         with_PropDmg(%)  N_events  with_PropDmg  Tot_PropDmg
1  STRONG WIND                  86.22      3758          3240       176.99
2  LIGHT SNOW                   81.03       174           141         2.51
3  HURRICANE                    72.32       271           196     81718.89
4  LIGHTNING                    66.23     13203          8744       743.08
5  TROPICAL STORM               57.18       682           390      7642.48
6  STORM SURGE                  52.87       401           212     47834.72
7  TORNADO                      51.17     23154         11847     24616.91
8  THUNDERSTORM WIND            49.00    211197        103493      7913.54
9  DRY MICROBURST               38.15       173            66         1.73
10 FLASH FLOOD                  36.57     51005         18650     15222.27

Agricultural

Top ten event types by total cumulative crop damages (in Millions of USD)

   Event_Type               Tot_CropDmg  N_events  with_CropDmg  with_CropDmg(%)
1  DROUGHT                     13367.57      2479           243             9.80
2  HURRICANE                    5350.11       271            85            31.37
3  FLOOD                        5013.16     27761          1722             6.20
4  HAIL                         2496.82    207767          8109             3.90
5  FLASH FLOOD                  1334.90     51005          1845             3.62
6  EXTREME COLD/WIND CHILL      1326.02      1892            50             2.64
7  FROST/FREEZE                 1094.19      1343           103             7.67
8  THUNDERSTORM WIND            1016.94    211197          4453             2.11
9  HEAVY RAIN                    729.67     11546           124             1.07
10 TROPICAL STORM                677.71       682            59             8.65

Top 10 event types by the fraction (%) of events of that type that caused crop damages

   Event_Type      with_CropDmg(%)  N_events  with_CropDmg  Tot_CropDmg
1  HURRICANE                 31.37       271            85      5350.11
2  DROUGHT                    9.80      2479           243     13367.57
3  TROPICAL STORM             8.65       682            59       677.71
4  FROST/FREEZE               7.67      1343           103      1094.19
5  FLOOD                      6.20     27761          1722      5013.16
6  TORNADO                    5.42     23154          1254       283.43
7  HAIL                       3.90    207767          8109      2496.82
8  FLASH FLOOD                3.62     51005          1845      1334.90
9  DRY MICROBURST             3.47       173             6         0.02
10 WILDFIRE                   2.97      4176           124       402.26

# plotting uses ggplot2 (ggplotGrob) and grid (grid.draw)
library(ggplot2)
library(grid)

# 'eco20plot' is a subset of the 'economic' summary prepared for plotting
# (its definition is not shown here)
p <- ggplot(data=eco20plot, aes(x=EVTYPE)) + theme_bw()

ecoplot.p1 <- p + geom_bar(stat="identity", aes(y=PrDmg.tot), fill="skyblue") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(legend.position="none") + 
    scale_y_log10() + coord_cartesian(ylim=c(1e1, 3e5)) +
    xlab("") + ylab("Damages (Million of Dollars)") +
    ggtitle("Total Property Damages by Event Type") 

ecoplot.p2 <- p + geom_bar(stat="identity", aes(y=pct.with.PrDmg), fill="navyblue") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(legend.position="none") + 
    xlab("Event Type") + ylab("Percentage") + ylim(c(-5,105)) +
    ggtitle("Fraction of Events Causing Damages")

grid.draw(rbind(ggplotGrob(ecoplot.p1), ggplotGrob(ecoplot.p2),  size="last"))

The same logic identifies the events with the greatest economic impact: we aggregate the total property and crop damages per event type, sort the data, and extract the top ten event types for each damage category.


APPENDIX

[Back to the Top]

The “anomalous” 2005 Napa Valley Flood

Going by the “raw” data, the event that caused the highest property damage was a flood in Napa, California, at the end of 2005, reported at over 100 Billion USD. However, the REMARKS entry in the database itself raises doubts about this figure, and a USGS report assessing the impact of the late-2005 storms confirms those doubts.
Our best guess is that the value of the EXP parameter for this event should have been M instead of B, as this would bring the damage amount in line with the narrative remarks and the USGS report. We therefore adjusted the PROPDMGEXP for this specific event accordingly.
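A hedged sketch of that adjustment follows. The value 115 is illustrative (consistent with the “over 100 Billion” reported above when decoded with exponent B); the index lookup and the multiplier table are assumptions, not the report's actual code:

```r
# Suspect entry: a PROPDMG of 115 with exponent "B" decodes to 115 Billion USD.
napa <- data.frame(PROPDMG = 115, PROPDMGEXP = "B", stringsAsFactors = FALSE)

# exponent letters -> numeric multipliers (assumed decoding table)
exp2mult <- c(K = 1e3, M = 1e6, B = 1e9)

napa$PROPDMGEXP[1] <- "M"   # adjust B -> M as argued above
damage <- unname(napa$PROPDMG * exp2mult[napa$PROPDMGEXP])
damage
# [1] 1.15e+08   (i.e. 115 Million instead of 115 Billion)
```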

Here is the original data:

recent$REMARKS[which.max(tmp.PropDamage)]
# [1] "Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million."

After this correction, the single event with the highest property damage is the storm surge caused by Hurricane Katrina in Louisiana, at over 31 Billion USD.

recent$REMARKS[which.max(recent$PropDamage)]
# [1] "Storm surge damage in southeast Louisiana, especially in the New Orleans area and the coastal parishes, was catastrophic. Hurricane protection levees and floodwalls were overtopped and/or breached resulting in widespread and deep flooding of homes and businesses. Much of Orleans and Plaquemines Parishes and nearly all of St. Bernard Parish were flooded by storm surge. Approximately 80 percent of the city of New Orleans was flooded. Thousands of people were stranded by the flood waters in homes and buildings and on rooftops for several days and had to be rescued by boat and helicopter. In Jefferson Parish, levees were not compromised, however many homes were flooded by either heavy rain overwhelming limited pumping capacity or storm surge water moving through in-operable pumps into the parish. Severe storm surge damage also occurred along the north shore of Lake Pontchartrain from Mandeville to Slidell with storm surge water moving inland as far as Old Towne Slidell with water up to 6 feet deep in some locations\n\nPost storm high water surveys of the area conducted by FEMA indicated the following storm surge estimates: Orleans Parish - 12-15 feet in east New Orleans to 9 to 12 feet along the Lakefront; St. Bernard Parish - 14 to 17 feet; Jefferson Parish - 6 to 9 feet along the lakefront to 5 to 8 feet from Lafitte to Grand Isle; Plaquemines Parish - 15 to 17 feet; St. Tammany Parish - 11 to 16 feet in southeast portion to 7 to 10 feet in western portion. All storm surge heights are still water elevations referenced to NAVD88 datum."

recent$PropDamage[which.max(recent$PropDamage)]
# [1] 3.13e+10

EVTYPE regularization

This is the source of the evtype_regularization.R script:

cat("# Entering 'evtype_regularization' script\n")
TESTvec <- good$EVTYPE

TESTvec <- gsub("^SNOW$", "HEAVY SNOW", TESTvec, perl=TRUE)
TESTvec <- gsub("^WIND$", "HIGH WIND", TESTvec, perl=TRUE)
TESTvec <- gsub("^WINDS$", "HIGH WIND", TESTvec, perl=TRUE)
TESTvec <- gsub("^ICE$", "ICE STORM", TESTvec, perl=TRUE)
TESTvec <- gsub("^UNSEASONABLY WARM$", "HEAT", TESTvec, perl=TRUE)
TESTvec <- gsub("^TORNDAO$", "TORNADO", TESTvec, perl=TRUE)
TESTvec <- gsub("^WAYTERSPOUT$", "WATERSPOUT", TESTvec, perl=TRUE)
TESTvec <- gsub("^EXTREME HEAT$", "EXCESSIVE HEAT", TESTvec, perl=TRUE)
TESTvec <- gsub("^LIGHT FREEZING RAIN$", "SLEET", TESTvec, perl=TRUE)
TESTvec <- gsub("^UNSEASONABLY DRY$", "DROUGHT", TESTvec, perl=TRUE)
TESTvec <- gsub("^TSTM HEAVY RAIN$", "HEAVY RAIN", TESTvec, perl=TRUE)
TESTvec <- gsub("^SMALL HAIL$", "HAIL", TESTvec, perl=TRUE)
TESTvec <- gsub("^EXCESSIVE SNOW$", "HEAVY SNOW", TESTvec, perl=TRUE)

TESTvec[grepl("^ASTRONOMICAL", TESTvec, perl=TRUE)] <- "ASTRONOMICAL LOW TIDE"
TESTvec[grepl("^AVALA", TESTvec, perl=TRUE)] <- "AVALANCHE"
TESTvec[grepl("^BEACH", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^BITTER", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^BLIZZARD", TESTvec, perl=TRUE)] <- "BLIZZARD"
TESTvec[grepl("^BRUSH FIRE", TESTvec, perl=TRUE)] <- "WILDFIRE"
TESTvec[grepl("^COASTAL", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^DROUGHT", TESTvec, perl=TRUE)] <- "DROUGHT"
TESTvec[grepl("^DUST D", TESTvec, perl=TRUE)] <- "DUST DEVIL"
TESTvec[grepl("^EXCESSIVE H", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^EXTREME [CW]", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^FLASH", TESTvec, perl=TRUE)] <- "FLASH FLOOD"
TESTvec[grepl("^FLO.*FLA", TESTvec, perl=TRUE)] <- "FLASH FLOOD"
TESTvec[grepl("^HYP", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^LIGHT[A-Z]", TESTvec, perl=TRUE)] <- "LIGHTNING"
TESTvec[grepl("^LANDSL", TESTvec, perl=TRUE)] <- "DEBRIS FLOW"
TESTvec[grepl("^HAIL", TESTvec, perl=TRUE)] <- "HAIL"
TESTvec[grepl("^HEAT WAVE", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^HEAVY PR", TESTvec, perl=TRUE)] <- "HEAVY RAIN"
TESTvec[grepl("^HEAVY RAIN", TESTvec, perl=TRUE)] <- "HEAVY RAIN"
TESTvec[grepl("^HEAVY SNOW", TESTvec, perl=TRUE)] <- "HEAVY SNOW"
TESTvec[grepl("^HEAVY SURF", TESTvec, perl=TRUE)] <- "HIGH SURF"
TESTvec[grepl("^HIGH S[UW]", TESTvec, perl=TRUE)] <- "HIGH SURF"
TESTvec[grepl("^MARINE T", TESTvec, perl=TRUE)] <- "MARINE THUNDERSTORM WIND"
TESTvec[grepl("^MINOR FL", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^MIXED PR", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("^MUD", TESTvec, perl=TRUE)] <- "DEBRIS FLOW"
TESTvec[grepl("^RECORD C", TESTvec, perl=TRUE)] <- "EXTREME COLD/WIND CHILL"
TESTvec[grepl("^RECORD H", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^RECORD WA", TESTvec, perl=TRUE)] <- "EXCESSIVE HEAT"
TESTvec[grepl("^RIP", TESTvec, perl=TRUE)] <- "RIP CURRENT"
TESTvec[grepl("^R.*FLOOD", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^SEVERE TH", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"
TESTvec[grepl("^SLEET", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("^SMALL STR", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^STRONG W", TESTvec, perl=TRUE)] <- "STRONG WIND"
TESTvec[grepl("^TIDAL FL", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^TORRENTIAL", TESTvec, perl=TRUE)] <- "HEAVY RAIN"
TESTvec[grepl("^TROPICAL STORM", TESTvec, perl=TRUE)] <- "TROPICAL STORM"
TESTvec[grepl("^TSTM W", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"
TESTvec[grepl("^URBAN", TESTvec, perl=TRUE)] <- "FLOOD"
TESTvec[grepl("^UNSEASO.* CO", TESTvec, perl=TRUE)] <- "COLD/WIND CHILL"
TESTvec[grepl("^WINTER ST", TESTvec, perl=TRUE)] <- "WINTER STORM"
TESTvec[grepl("^WINTER W", TESTvec, perl=TRUE)] <- "WINTER WEATHER"
TESTvec[grepl("^WATER", TESTvec, perl=TRUE)] <- "WATERSPOUT"
TESTvec[grepl("^WILD", TESTvec, perl=TRUE)] <- "WILDFIRE"
TESTvec[grepl("^WIND CHILL", TESTvec, perl=TRUE)] <- "COLD/WIND CHILL"
TESTvec[grepl("^VOLCANIC", TESTvec, perl=TRUE)] <- "VOLCANIC ASH"


TESTvec[grepl("[VF]OG", TESTvec, perl=TRUE)] <- "FOG GENERAL"
TESTvec[grepl("TYPHOON", TESTvec, perl=TRUE)] <- "HURRICANE"
TESTvec[grepl("SMOKE", TESTvec, perl=TRUE)] <- "DENSE SMOKE"
TESTvec[grepl("FIRES", TESTvec, perl=TRUE)] <- "WILDFIRE"
TESTvec[grepl("SQUALL", TESTvec, perl=TRUE)] <- "HEAVY SNOW"
TESTvec[grepl("EROSION", TESTvec, perl=TRUE)] <- "COASTAL FLOOD"
TESTvec[grepl("^SNOW.*ICE", TESTvec, perl=TRUE)] <- "ICE STORM"
TESTvec[grepl("^SNOW.*SLEET", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("^SNOW.*RAIN", TESTvec, perl=TRUE)] <- "SLEET"
TESTvec[grepl("LAKE.*SNOW", TESTvec, perl=TRUE)] <- "LAKE-EFFECT SNOW"
TESTvec[grepl("LAKE.*FLOOD", TESTvec, perl=TRUE)] <- "LAKESHORE FLOOD"
TESTvec[grepl("HURRICANE-GENERATED SWELLS", TESTvec, perl=TRUE)] <- "STORM SURGE"

TESTvec[grepl("^HURRICANE", TESTvec, perl=TRUE)] <- "HURRICANE"             # after |HURRICANE-GENERATED SWELLS

TESTvec[grepl("^ICE[ /S][A-EO-T]", TESTvec, perl=TRUE)] <- "ICE STORM"      # after |[VF]OG
TESTvec[grepl("^FREEZING [DR]", TESTvec, perl=TRUE)] <- "SLEET"             # after |[VF]OG
TESTvec[grepl("^WINT.* MIX", TESTvec, perl=TRUE)] <- "SLEET"                # after |^WINTER W
TESTvec[grepl("SURGE", TESTvec, perl=TRUE)] <- "STORM SURGE"                # after |^COASTAL
TESTvec[grepl("FUNNEL", TESTvec, perl=TRUE)] <- "FUNNEL CLOUD"              # after |^WATER
TESTvec[grepl("TORNADO", TESTvec, perl=TRUE)] <- "TORNADO"                  # after |^WATER
TESTvec[grepl("DUST", TESTvec, perl=TRUE)] <- "DUST STORM"                  # after |^DUST D
TESTvec[grepl("^HIGH WIN", TESTvec, perl=TRUE)] <- "HIGH WIND"              # after |DUST
TESTvec[grepl("^FLOOD", TESTvec, perl=TRUE)] <- "FLOOD"                     # after |^FLO.*FLA

TESTvec[grepl("^COLD", TESTvec, perl=TRUE)] <- "COLD/WIND CHILL"            # after |FUNNEL  , |TORNADO
TESTvec[grepl("^THUNDERSTORM", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"  # after |FUNNEL
TESTvec[grepl("^THU.*TOR.*WI", TESTvec, perl=TRUE)] <- "THUNDERSTORM WIND"  # after |FUNNEL

cat("# Leaving 'evtype_regularization' script\n")

R Session Info

sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
# [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] grid      stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_1.0.1  reshape2_1.4.1 dplyr_0.4.2    plyr_1.8.3     knitr_1.10.5  
# 
# loaded via a namespace (and not attached):
#  [1] DBI_0.3.1        MASS_7.3-41      R6_2.0.1         Rcpp_0.11.6      assertthat_0.1  
#  [6] colorspace_1.2-6 digest_0.6.8     evaluate_0.7     formatR_1.2      gtable_0.1.2    
# [11] highr_0.5        htmltools_0.2.6  labeling_0.3     lazyeval_0.1.10  magrittr_1.5    
# [16] munsell_0.4.2    parallel_3.1.3   proto_0.3-10     rmarkdown_0.7    scales_0.2.4    
# [21] stringi_0.5-5    stringr_1.0.0    tools_3.1.3      yaml_2.1.13