### Synopsis

This report describes the impact of severe weather events on population health and the economy in the United States. The data come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which records fatalities and injuries as well as estimates of property and crop damage. We restrict the analysis to the years 1995 to 2011 because records from earlier years are largely incomplete. From the storm database, we extract the top 10 severe weather event types by their effect on population health and on the economy. Our results show that events occurring during warm months and storm seasons, such as flooding, excessive heat and tornadoes, have the greatest impact on both. We also found that the economic consequences are considerably higher for property than for crops, as expected.

### Data Processing

From the NOAA Storm Database, we obtained records for the years 1950 to 2011. The Storm Data Documentation and FAQ are also available and were consulted for this report.
We begin by loading the libraries required for the analysis. Next, we use read.csv to read the compressed .bz2 file directly. The file contains column names in its first row, and the result is held in memory as a new dataset variable, NOAA.

library(dplyr)
library(ggplot2)
library(gridExtra)

NOAA <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header=TRUE, stringsAsFactors=FALSE)
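As a small aside on the loading step, the sketch below suggests that read.csv can usually decompress a .bz2 file on its own, so the explicit bzfile() wrapper is optional. It uses a toy data frame and a temporary file (both made up for illustration) rather than the real storm data.

```r
# Toy stand-in for the storm data: write a small bz2-compressed CSV to a temp file.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical file name
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Reading via the plain path relies on R detecting the compression itself;
# reading via bzfile() states the compression explicitly, as the report does.
via_path   <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
via_bzfile <- read.csv(bzfile(tmp), header = TRUE, stringsAsFactors = FALSE)

identical(via_path, via_bzfile)
```

Either form should give the same data frame. Wrapping the path in bzfile() simply makes the compression format visible to the reader.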

##### Summarizing and Cleaning the Data

We look at the structure of the dataset to check for size, column names, classes and which columns are of interest to us.

str(NOAA)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...


As can be seen, the BGN_DATE field is of the character class and we would like it in the date class so that we can explore the data by year in a frequency table. We add this date class with the year as a new column.

NOAA$YEAR <- as.numeric(format(as.Date(NOAA$BGN_DATE, "%m/%d/%Y %H:%M:%S"), "%Y"))
table(NOAA$YEAR)

##
##  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961
##   223   269   272   492   609  1413  1703  2184  2213  1813  1945  2246
##  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973
##  2389  1968  2348  2855  2388  2688  3312  2926  3215  3471  2168  4463
##  1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985
##  5386  4975  3768  3728  3657  4279  6146  4517  7132  8322  7335  7979
##  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997
##  8726  7367  7257 10410 10946 12522 13534 12607 20631 27970 32270 28680
##  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
## 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 45817
##  2010  2011
## 48161 62174


Looking at the table, we see that record counts are small in the early years and rise sharply in the mid-1990s (20631 records in 1994 and 27970 in 1995). Therefore, we keep only the records from 1995 onwards and store them in a new filtered dataset.

NOAA.filtered <- NOAA %>%
filter(YEAR >= 1995)
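The year extraction and filtering steps can be tried on a handful of rows. The following is a minimal base R sketch on made-up BGN_DATE values in the same layout as the storm data. It uses plain subsetting instead of dplyr's filter, purely so that it runs without extra packages.

```r
# Toy BGN_DATE values in the same "m/d/Y H:M:S" layout as the storm data.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1994 0:00:00",
               "1/6/1995 0:00:00", "12/31/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# as.Date() parses the string; format(..., "%Y") pulls out the year as text,
# and as.numeric() makes it comparable with >=.
toy$YEAR <- as.numeric(format(as.Date(toy$BGN_DATE, "%m/%d/%Y %H:%M:%S"), "%Y"))

# Frequency of records per year, then keep 1995 onwards.
year_counts <- table(toy$YEAR)
toy_recent  <- toy[toy$YEAR >= 1995, ]
```

Here toy_recent keeps the 1995 and 2011 rows and drops the two earlier ones.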


We also check for missing data by computing the proportion of missing (NA) values across all cells of the filtered dataset.

mean(is.na(NOAA.filtered))

## [1] 0.0516963
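A single overall proportion can hide where the missing values sit. For example, COUNTYENDN shows NA in every value displayed in the structure listing above. The sketch below, on a made-up toy data frame, contrasts the overall figure with a per-column breakdown. The colMeans call is an optional extra check and not part of the original analysis.

```r
# Toy data frame (made up) with one column that is entirely NA.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "HEAT", "TORNADO"),
  FATALITIES = c(0, NA, 2, 1),
  COUNTYENDN = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

overall <- mean(is.na(toy))        # share of NA cells in the whole table
per_col <- colMeans(is.na(toy))    # share of NA values within each column
```

In this toy case the overall share is 5 out of 12 cells, while per_col shows that most of it comes from the single all-NA column.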


Since only about 5% of the values are missing (0.0516963), we can continue with our analysis. Looking at the EVTYPE column, we notice many typographical errors and several different labels for the same event type, so this column needs cleaning. We apply a series of pattern-matching replacements to fix this, following the Storm Data Documentation for the official event names and categories. The cleaned values replace the EVTYPE column in the filtered dataset.
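Before the full list of rules, it may help to see how this style of replacement behaves on a few labels. The sketch below applies three of the same patterns to a made-up vector of event labels. The labels are illustrative and not a sample drawn from the dataset.

```r
# Toy EVTYPE values (made up for illustration) with typical variants.
type <- c("TSTM WIND", "TSTM WIND/HAIL", "THUNDERSTORM WINDS", "HURRICANE", "TORNADO")

# "TSTM W" followed by one or more non-period characters is replaced as a
# whole, so any trailing variant collapses to the official name.
type <- gsub("TSTM W[^.]+", "THUNDERSTORM WIND", type, ignore.case = TRUE)
# Same idea for spelled-out variants such as the plural "WINDS".
type <- gsub("THUNDERSTORM W[^.]+", "THUNDERSTORM WIND", type, ignore.case = TRUE)
# Anchored patterns (^...$) only rename labels that match exactly.
type <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", type, ignore.case = TRUE)

table(type)
```

One consequence of the greedy `[^.]+` is that a compound label such as "TSTM WIND/HAIL" is counted entirely as thunderstorm wind. That is a simplification worth keeping in mind when reading the totals.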

type <- NOAA.filtered$EVTYPE
type <- gsub("TSTM W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("THUNDERSTORMS W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("^ THUNDERSTORM W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("THUNDERSTORM W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("thunderstorm wind", "THUNDERSTORM WIND", type)
type <- gsub("TORNADO[^.]+", "TORNADO", type, ignore.case=TRUE)
type <- gsub("FLASH FLOOD[^.]+", "FLASH FLOOD", type, ignore.case=TRUE)
type <- gsub("[^.]+FLASH FLOOD", "FLASH FLOOD", type, ignore.case=TRUE)
type <- gsub("[^.]+HAIL$", "HAIL", type, ignore.case=TRUE)
type <- gsub("^HAIL[^.]+", "HAIL", type, ignore.case=TRUE)
type <- gsub("[^.]+HAIL[^.]+", "HAIL", type, ignore.case=TRUE)
type <- gsub("^WILDFIRE$", "WILD/FOREST FIRE", type, ignore.case=TRUE)
type <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", type, ignore.case=TRUE)
type <- gsub("^TYPHOON$", "HURRICANE/TYPHOON", type, ignore.case=TRUE)
type <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", type, ignore.case=TRUE)
type <- gsub("HIGH WIND[^.]+", "HIGH WIND", type, ignore.case=TRUE)
type <- gsub("RIP CURRENT[^.]+", "RIP CURRENT", type, ignore.case=TRUE)
type <- gsub("[^.]+DROUGHT[^.]+", "DROUGHT", type, ignore.case=TRUE)
type <- gsub("River Flood[^.]+", "RIVER FLOOD", type, ignore.case=TRUE)
type <- gsub("URBAN/SML STREAM FLD", "RIVER FLOOD", type, ignore.case=TRUE)
type <- gsub("^RIVER FLOOD$", "FLOOD", type, ignore.case=TRUE)
type <- gsub("^COASTAL FLOOD$", "COASTAL FLOODING/EROSION", type, ignore.case=TRUE)
type <- gsub("COASTAL FLOOD[^.]+", "COASTAL FLOODING/EROSION", type, ignore.case=TRUE)
type <- gsub("EXTREME WIND[^.]+", "EXTREME COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("^EXTREME COLD$", "EXTREME COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("Heavy Rain[^.]+", "HEAVY RAIN", type, ignore.case=TRUE)
type <- gsub("Cold[^.]+", "COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("^Cold$", "COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("WINTER WEATHER[^.]+", "WINTER WEATHER", type, ignore.case=TRUE)
NOAA.filtered$EVTYPE <- type


##### Data Manipulation of the Population Health

To find the data on the population health, we use the dplyr package to extract the relevant columns and get the sum by each weather type. We then create new separate datasets, fatalities and injuries that store the top ten fatalities and injuries by weather type.

health <- NOAA.filtered %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarise_each(funs(sum))
fatalities <- health %>%
select(EVTYPE, FATALITIES) %>%
arrange(desc(FATALITIES)) %>%
top_n(10)
injuries <- health %>%
select(EVTYPE, INJURIES) %>%
arrange(desc(INJURIES)) %>%
top_n(10)


##### Data Manipulation of the Economy

Looking at the relevant columns for the property and crop damage, we notice small numbers. However, there are two additional columns, PROPDMGEXP and CROPDMGEXP that contain the magnitudes. The contents are below.

unique(NOAA.filtered$PROPDMGEXP)

##  [1] ""  "B" "M" "K" "m" "+" "0" "5" "6" "?" "4" "2" "3" "7" "H" "-" "1"
## [18] "8"


Referring to the Storm Data Documentation, we see that the letters denote magnitudes, such as h=1e2, k=1e3, etc. The numeric value of each letter is multiplied with the PROPDMG and CROPDMG columns to get the actual damage value. We do this but set the multiplier for all other characters to 0, since they are either digit carry-overs from other columns or not applicable characters.

economic <- NOAA.filtered %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
economic$PROPDMGEXP <- with(economic, ifelse(PROPDMGEXP=="h"|PROPDMGEXP=="H", 1e2,
ifelse(PROPDMGEXP=="k"|PROPDMGEXP=="K", 1e3,
ifelse(PROPDMGEXP=="m"|PROPDMGEXP=="M", 1e6,
ifelse(PROPDMGEXP=="b"|PROPDMGEXP=="B", 1e9, 0)))))
economic$CROPDMGEXP <- with(economic, ifelse(CROPDMGEXP=="h"|CROPDMGEXP=="H", 1e2,
ifelse(CROPDMGEXP=="k"|CROPDMGEXP=="K", 1e3,
ifelse(CROPDMGEXP=="m"|CROPDMGEXP=="M", 1e6,
ifelse(CROPDMGEXP=="b"|CROPDMGEXP=="B", 1e9, 0)))))
economic$PROPDMG <- economic$PROPDMG * economic$PROPDMGEXP
economic$CROPDMG <- economic$CROPDMG * economic$CROPDMGEXP
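The nested ifelse calls can also be written as a lookup in a named vector, which some readers may find easier to scan. The following is a sketch of that alternative, not the code used for the report. The helper name exp_to_multiplier and the toy values are made up for illustration.

```r
# Multipliers for the magnitude letters described in the text.
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map an exponent column to numeric multipliers.
exp_to_multiplier <- function(exp_code) {
  m <- unname(multiplier[toupper(exp_code)])  # NA for anything not H/K/M/B
  m[is.na(m)] <- 0                            # other codes count as 0, as in the text
  m
}

# Toy rows (made up) covering a letter, a lowercase letter, an empty code and a digit.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "", "5"),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
```

As with the original code, rows with an empty or non-letter exponent end up with a damage value of 0 under this rule.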


To find the data on the economy, just as we did for the population health, we extract the relevant columns and get the sum by each weather type. We then create new separate datasets, propertyDamage and cropDamage, that store the top ten property and crop damages by weather type.

economic <- economic %>%
select(EVTYPE, PROPDMG, CROPDMG) %>%
group_by(EVTYPE) %>%
summarise_each(funs(sum))
cropDamage <- economic %>%
select(EVTYPE, CROPDMG) %>%
arrange(desc(CROPDMG)) %>%
top_n(10)
propertyDamage <- economic %>%
select(EVTYPE, PROPDMG) %>%
arrange(desc(PROPDMG)) %>%
top_n(10)
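The group, sum and rank pattern used for both the health and the economic totals can be reproduced in base R. The sketch below does so on a made-up toy data frame. It is an equivalent formulation for illustration, not the dplyr code above, and the numbers are invented.

```r
# Toy damage records (made up), several rows per event type.
toy <- data.frame(
  EVTYPE  = c("FLOOD", "HAIL", "FLOOD", "DROUGHT", "HAIL", "TORNADO"),
  PROPDMG = c(5e6, 1e5, 2e6, 0, 3e5, 4e6),
  CROPDMG = c(1e5, 2e5, 0, 9e6, 1e5, 0),
  stringsAsFactors = FALSE
)

# Sum both damage columns within each event type.
totals <- aggregate(cbind(PROPDMG, CROPDMG) ~ EVTYPE, data = toy, FUN = sum)

# Sort by property damage, largest first, and keep the top n rows.
n <- 2
top_property <- head(totals[order(-totals$PROPDMG), c("EVTYPE", "PROPDMG")], n)
```

One difference to be aware of: head() returns exactly n rows, whereas top_n() keeps additional rows when values are tied at the cut-off.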


Finally, we make some cosmetic changes to the extracted top 10 values for our graphical plots. The weather event names are converted to title case (each word in lowercase with a leading capital letter), and the damage values are divided by 1e9 so that the results can be displayed in billions of USD.

fatalities$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", fatalities$EVTYPE, perl=TRUE)
injuries$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", injuries$EVTYPE, perl=TRUE)
cropDamage$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", cropDamage$EVTYPE, perl=TRUE)
propertyDamage$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", propertyDamage$EVTYPE, perl=TRUE)
cropDamage <- mutate(cropDamage, CROPDMG=CROPDMG/1e9)
cropDamage$CROPDMG <- as.numeric(format(round(cropDamage$CROPDMG, 2)))
propertyDamage <- mutate(propertyDamage, PROPDMG=PROPDMG/1e9)
propertyDamage$PROPDMG <- as.numeric(format(round(propertyDamage$PROPDMG, 2)))
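To see what these two cosmetic steps do, the sketch below runs the same title-case pattern and the billions conversion on a few made-up inputs. The dollar amounts are invented.

```r
# Title case: \\U upper-cases the first captured letter, \\L lower-cases the rest
# of each word (these case operators require perl = TRUE).
events <- c("FLASH FLOOD", "WILD/FOREST FIRE", "EXTREME COLD/WIND CHILL")
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", events, perl = TRUE)

# Convert made-up USD amounts to billions, rounded to two decimals.
damage_usd <- c(1234567890, 98765432100)
damage_bn  <- round(damage_usd / 1e9, 2)
```

Because round() already returns a numeric vector, the extra as.numeric(format(...)) wrapper in the code above should not change the values.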


### Results

##### Impact on the Population Health

The top 10 weather events leading to the largest number of fatalities and injuries from 1995 to 2011 are listed below, followed by a plot.

| EVTYPE | FATALITIES |
|---|---|
| Excessive Heat | 1903 |
| Flash Flood | 951 |
| Heat | 924 |
| Lightning | 729 |
| Rip Current | 569 |
| Flood | 454 |
| Thunderstorm Wind | 418 |
| Extreme Cold/Wind Chill | 270 |
| High Wind | 252 |

| EVTYPE | INJURIES |
|---|---|
| Flood | 6850 |
| Excessive Heat | 6525 |
| Thunderstorm Wind | 5603 |
| Lightning | 4631 |
| Heat | 2030 |
| Flash Flood | 1739 |
| Wild/Forest Fire | 1456 |
| Hurricane/Typhoon | 1326 |
| Winter Storm | 1298 |

As we can see, excessive heat is the number one cause of fatalities, followed by tornadoes. The counts drop somewhat for the events that follow. What can be said, however, is that the top 5 events are all associated with warm weather, which is when more deaths occur. Injuries tell a similar story, with tornadoes taking the top position by a large margin. Again, the top 5 events causing injuries are associated with warm weather, although we should note that flooding can also occur in winter after large snowstorms.

##### Impact on the Economy

The top 10 weather events leading to the largest property and crop damage from 1995 to 2011 are listed below, followed by a plot. The damage values are in billions of USD.

| EVTYPE | PROPDMG (billion USD) |
|---|---|
| Flood | 144.24 |
| Hurricane/Typhoon | 81.72 |
| Storm Surge/Tide | 47.83 |
| Flash Flood | 15.53 |
| Hail | 15.29 |
| Thunderstorm Wind | 8.65 |
| Wild/Forest Fire | 7.76 |
| Tropical Storm | 7.65 |
| High Wind | 5.35 |

| EVTYPE | CROPDMG (billion USD) |
|---|---|
| Drought | 13.92 |
| Flood | 5.48 |
| Hurricane/Typhoon | 5.35 |
| Hail | 2.64 |
| Flash Flood | 1.45 |
| Extreme Cold/Wind Chill | 1.33 |
| Frost/Freeze | 1.09 |
| Thunderstorm Wind | 1.09 |
| Heavy Rain | 0.73 |
| Tropical Storm | 0.68 |

Looking at the figure, we can clearly see that floods cause the greatest property damage, by a wide margin. In fact, the top 5 events are all wind- and water-related events associated with warm weather. For crop damage, the leading cause is the opposite extreme, drought. Still, the top 5 events again appear to be associated with warm weather, and flooding remains a major issue. We conclude that more resources should be made available during warmer months and storm seasons to limit damage, and that people should be more vigilant about their safety during those times.