Synopsis
This report describes the harmful impact of severe weather events on population health and the economy in the United States. To identify the most damaging event types, we obtained data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which records fatalities and injuries as well as estimates of property and crop damage. We restrict the analysis to the years 1995 to 2011 because records for earlier years are largely incomplete. From the storm database, we extract the top 10 severe weather event types affecting population health and the economy. Our results show that severe events during warm months and storm seasons have the greatest impact on both; flooding, excessive heat and tornadoes are examples. We also found that the economic consequences are considerably higher for property than for crops, as expected.
Data Processing
Loading and Processing the Raw Data
We obtained the NOAA Storm Database, which covers the years 1950 to 2011. The Storm Data Documentation and FAQ are also available and were used for this report.
We begin by loading the libraries required for the analysis. Next we use read.csv to read the compressed .bz2 file directly; the file includes a header row of column names, and the result is stored in memory as a new dataset variable, NOAA.
library(dplyr)
library(ggplot2)
library(gridExtra)
NOAA <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header=TRUE, stringsAsFactors=FALSE)
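The loading step above can be tried without the real file. The following is a minimal sketch using a small made-up data frame (sample_data and the temporary file are illustrative assumptions, not part of the NOAA data): it writes a bzip2-compressed CSV and reads it back both through bzfile() and by passing the path directly, since R's file connections detect compression when reading.

```r
# Toy data standing in for the storm database (made-up values).
sample_data <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(1, 0),
  stringsAsFactors = FALSE
)

# Write a bzip2-compressed CSV to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(sample_data, con, row.names = FALSE)
close(con)

# Read it back explicitly through bzfile(), as in the report ...
via_bzfile <- read.csv(bzfile(tmp), header = TRUE, stringsAsFactors = FALSE)
# ... and by passing the path, letting R detect the compression.
via_path <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

same_result <- identical(via_bzfile, via_path)
unlink(tmp)
```

Either form should give the same data frame; the explicit bzfile() call simply makes the compression visible to the reader.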
Summarizing and Cleaning the Data
We look at the structure of the dataset to check for size, column names, classes and which columns are of interest to us.
str(NOAA)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
As can be seen, the BGN_DATE field is of the character class. To explore the data by year in a frequency table, we parse it as a date, extract the year, and store the year as a new numeric column, YEAR.
NOAA$YEAR <- as.numeric(format(as.Date(NOAA$BGN_DATE, "%m/%d/%Y %H:%M:%S"), "%Y"))
table(NOAA$YEAR)
## 
##  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961 
##   223   269   272   492   609  1413  1703  2184  2213  1813  1945  2246 
##  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973 
##  2389  1968  2348  2855  2388  2688  3312  2926  3215  3471  2168  4463 
##  1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985 
##  5386  4975  3768  3728  3657  4279  6146  4517  7132  8322  7335  7979 
##  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997 
##  8726  7367  7257 10410 10946 12522 13534 12607 20631 27970 32270 28680 
##  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009 
## 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 45817 
##  2010  2011 
## 48161 62174
Looking at the table, we see comparatively few records for the earlier years, with a marked increase in the mid-1990s (from 12607 records in 1993 to 27970 in 1995). Therefore, we keep only the records from 1995 onwards and store them in a new filtered dataset.
NOAA.filtered <- NOAA %>%
  filter(YEAR >= 1995)
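The year extraction and the 1995 cut-off can be checked on a few made-up date strings in the same format as BGN_DATE. This is only a sketch: the toy data frame and the base R subsetting (used here instead of dplyr's filter so the example has no package dependencies) are assumptions for illustration.

```r
# Made-up dates in the same "month/day/year hour:min:sec" layout as BGN_DATE.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1995 0:00:00", "12/31/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse the string as a date, format it as a four-digit year, convert to numeric.
toy$YEAR <- as.numeric(format(as.Date(toy$BGN_DATE, "%m/%d/%Y %H:%M:%S"), "%Y"))

# Keep only records from 1995 onwards (base R equivalent of filter(YEAR >= 1995)).
recent <- toy[toy$YEAR >= 1995, , drop = FALSE]
```

Here toy$YEAR holds 1950, 1995 and 2011, and recent keeps the last two rows.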
We also check for missing data by computing the proportion of all values in the filtered dataset that are NA.
mean(is.na(NOAA.filtered))
## [1] 0.0516963
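To make clear what this single number measures, here is a small sketch on a made-up data frame (toy is an assumption for illustration). Calling is.na on a data frame yields a logical matrix, so mean gives the share of all cells that are missing, while colMeans breaks the same figure down by column.

```r
# Made-up data with 3 missing cells out of 12.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Proportion of missing cells over the whole table.
overall <- mean(is.na(toy))

# Proportion of missing values in each column.
per_column <- colMeans(is.na(toy))
```

A per-column view like per_column can show whether the missing values are concentrated in a few columns that the analysis does not use.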
Since this proportion (0.0516963, about 5%) is low, we can continue with our analysis. Looking at the EVTYPE column, we notice many typographical errors and duplicated categories, so this column needs cleaning. We apply a series of pattern-matching replacements to fix this, following the Storm Data Documentation for the official event names and categories. The cleaned values replace the EVTYPE column in the filtered dataset.
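# In the patterns below, [^.]+ matches one or more characters other than a literal dot,
# so any suffix (or prefix) attached to a known event name is absorbed into that name.
type <- NOAA.filtered$EVTYPE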
type <- gsub("TSTM W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("THUNDERSTORMS W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("^ THUNDERSTORM W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("THUNDERSTORM W[^.]+", "THUNDERSTORM WIND", type, ignore.case=TRUE)
type <- gsub("thunderstorm wind", "THUNDERSTORM WIND", type)
type <- gsub("TORNADO[^.]+", "TORNADO", type, ignore.case=TRUE)
type <- gsub("FLASH FLOOD[^.]+", "FLASH FLOOD", type, ignore.case=TRUE)
type <- gsub("[^.]+FLASH FLOOD", "FLASH FLOOD", type, ignore.case=TRUE)
type <- gsub("[^.]+HAIL$", "HAIL", type, ignore.case=TRUE)
type <- gsub("^HAIL[^.]+", "HAIL", type, ignore.case=TRUE)
type <- gsub("[^.]+HAIL[^.]+", "HAIL", type, ignore.case=TRUE)
type <- gsub("^WILDFIRE$", "WILD/FOREST FIRE", type, ignore.case=TRUE)
type <- gsub("^HURRICANE$", "HURRICANE/TYPHOON", type, ignore.case=TRUE)
type <- gsub("^TYPHOON$", "HURRICANE/TYPHOON", type, ignore.case=TRUE)
type <- gsub("^STORM SURGE$", "STORM SURGE/TIDE", type, ignore.case=TRUE)
type <- gsub("HIGH WIND[^.]+", "HIGH WIND", type, ignore.case=TRUE)
type <- gsub("RIP CURRENT[^.]+", "RIP CURRENT", type, ignore.case=TRUE)
type <- gsub("[^.]+DROUGHT[^.]+", "DROUGHT", type, ignore.case=TRUE)
type <- gsub("River Flood[^.]+", "RIVER FLOOD", type, ignore.case=TRUE)
type <- gsub("URBAN/SML STREAM FLD", "RIVER FLOOD", type, ignore.case=TRUE)
type <- gsub("^RIVER FLOOD$", "FLOOD", type, ignore.case=TRUE)
type <- gsub("^COASTAL FLOOD$", "COASTAL FLOODING/EROSION", type, ignore.case=TRUE)
type <- gsub("COASTAL FLOOD[^.]+", "COASTAL FLOODING/EROSION", type, ignore.case=TRUE)
type <- gsub("EXTREME WIND[^.]+", "EXTREME COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("^EXTREME COLD$", "EXTREME COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("Heavy Rain[^.]+", "HEAVY RAIN", type, ignore.case=TRUE)
type <- gsub("Cold[^.]+", "COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("^Cold$", "COLD/WIND CHILL", type, ignore.case=TRUE)
type <- gsub("WINTER WEATHER[^.]+", "WINTER WEATHER", type, ignore.case=TRUE)
NOAA.filtered$EVTYPE <- type
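The effect of these replacements is easiest to see on a handful of raw labels. The sketch below applies three of the patterns from above to made-up input; the rules vector and the clean_evtype helper are hypothetical names introduced only for this illustration and are not part of the original analysis.

```r
# A few of the report's patterns, stored as pattern = replacement pairs.
rules <- c(
  "TSTM W[^.]+" = "THUNDERSTORM WIND",
  "^HURRICANE$" = "HURRICANE/TYPHOON",
  "[^.]+HAIL$"  = "HAIL"
)

# Hypothetical helper: apply each rule in order, ignoring case.
clean_evtype <- function(x, rules) {
  for (pattern in names(rules)) {
    x <- gsub(pattern, rules[[pattern]], x, ignore.case = TRUE)
  }
  x
}

raw <- c("TSTM WIND", "Tstm Wind/Hail", "HURRICANE", "SMALL HAIL", "HAIL")
cleaned <- clean_evtype(raw, rules)
```

With these rules, cleaned is "THUNDERSTORM WIND", "THUNDERSTORM WIND", "HURRICANE/TYPHOON", "HAIL", "HAIL". The order of the rules matters: if the hail rule ran first, "Tstm Wind/Hail" would be classified as "HAIL" instead.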
Data Manipulation of the Population Health
To summarize the impact on population health, we use the dplyr package to select the relevant columns and sum them by weather type. We then create two separate datasets, fatalities and injuries, that store the top ten weather types by fatalities and by injuries.
health <- NOAA.filtered %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise_all(sum)  # replaces the deprecated summarise_each(funs(sum))
fatalities <- health %>%
  select(EVTYPE, FATALITIES) %>%
  arrange(desc(FATALITIES)) %>%
  top_n(10, FATALITIES)  # rank explicitly by FATALITIES
injuries <- health %>%
  select(EVTYPE, INJURIES) %>%
  arrange(desc(INJURIES)) %>%
  top_n(10, INJURIES)  # rank explicitly by INJURIES
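The group, sum and rank steps above can be sketched in base R on made-up numbers, which may help readers who do not have dplyr installed. The toy data and the choice of aggregate, order and head are assumptions for illustration, not the method used in the report.

```r
# Made-up event records.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(1, 0, 2, 5, 1),
  INJURIES   = c(10, 3, 4, 0, 2),
  stringsAsFactors = FALSE
)

# Sum both outcome columns within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank event types by total fatalities and keep the top two.
top_fatalities <- head(totals[order(-totals$FATALITIES), c("EVTYPE", "FATALITIES")], 2)
```

In this toy case top_fatalities lists HEAT (5) and then TORNADO (3); the report keeps ten rows rather than two.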
Data Manipulation of the Economy
Looking at the relevant columns for property and crop damage, we notice small numbers. However, there are two additional columns, PROPDMGEXP and CROPDMGEXP, that contain the magnitudes. The distinct values found in PROPDMGEXP are shown below.
unique(NOAA.filtered$PROPDMGEXP)
##  [1] ""  "B" "M" "K" "m" "+" "0" "5" "6" "?" "4" "2" "3" "7" "H" "-" "1"
## [18] "8"
Referring to the Storm Data Documentation, we see that the letters denote multipliers: h = 1e2, k = 1e3, m = 1e6 and b = 1e9. Multiplying the PROPDMG and CROPDMG columns by the corresponding multiplier gives the actual damage value. All other characters are given a multiplier of 0, since they are either digits carried over from other columns or not applicable characters, so those records contribute nothing to the damage totals.
economic <- NOAA.filtered %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
economic$PROPDMGEXP <- with(economic, ifelse(PROPDMGEXP=="h"|PROPDMGEXP=="H", 1e2,
                                      ifelse(PROPDMGEXP=="k"|PROPDMGEXP=="K", 1e3,
                                      ifelse(PROPDMGEXP=="m"|PROPDMGEXP=="M", 1e6,
                                      ifelse(PROPDMGEXP=="b"|PROPDMGEXP=="B", 1e9, 0)))))
economic$CROPDMGEXP <- with(economic, ifelse(CROPDMGEXP=="h"|CROPDMGEXP=="H", 1e2,
                                      ifelse(CROPDMGEXP=="k"|CROPDMGEXP=="K", 1e3,
                                      ifelse(CROPDMGEXP=="m"|CROPDMGEXP=="M", 1e6,
                                      ifelse(CROPDMGEXP=="b"|CROPDMGEXP=="B", 1e9, 0)))))
economic$PROPDMG <- economic$PROPDMG * economic$PROPDMGEXP
economic$CROPDMG <- economic$CROPDMG * economic$CROPDMGEXP
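The same letter-to-multiplier mapping can also be written as a named lookup vector, which some readers may find easier to audit than nested ifelse calls. This is an alternative sketch, not the code used in the report; exp_lookup and to_multiplier are hypothetical names, and the damage figures are made up.

```r
# Multipliers for the documented exponent letters.
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to multipliers, case-insensitively.
# Any code that is not H, K, M or B (including "") becomes 0, as in the report.
to_multiplier <- function(code) {
  m <- exp_lookup[toupper(code)]
  m[is.na(m)] <- 0
  unname(m)
}

codes  <- c("K", "m", "", "+", "B", "5", "h")
values <- c(25, 2.5, 10, 1, 0.5, 3, 2)   # made-up damage mantissas
damage <- values * to_multiplier(codes)
```

Here damage works out to 25000, 2500000, 0, 0, 5e8, 0 and 200, so the records with unrecognized codes drop out of the totals exactly as they do with the ifelse version.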
To summarize the impact on the economy, just as we did for population health, we select the relevant columns and sum them by weather type. We then create two separate datasets, cropDamage and propertyDamage, that store the top ten weather types by crop damage and by property damage.
economic <- economic %>%
  select(EVTYPE, PROPDMG, CROPDMG) %>%
  group_by(EVTYPE) %>%
  summarise_all(sum)  # replaces the deprecated summarise_each(funs(sum))
cropDamage <- economic %>%
  select(EVTYPE, CROPDMG) %>%
  arrange(desc(CROPDMG)) %>%
  top_n(10, CROPDMG)  # rank explicitly by CROPDMG
propertyDamage <- economic %>%
  select(EVTYPE, PROPDMG) %>%
  arrange(desc(PROPDMG)) %>%
  top_n(10, PROPDMG)  # rank explicitly by PROPDMG
Finally, we make some cosmetic changes to the extracted top 10 values for our plots. The weather event names are converted to title case (lowercase words with leading capital letters), and the damage values are divided by 1e9 so the results can be displayed in billions of USD.
fatalities$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", fatalities$EVTYPE, perl=TRUE)
injuries$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", injuries$EVTYPE, perl=TRUE)
cropDamage$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", cropDamage$EVTYPE, perl=TRUE)
propertyDamage$EVTYPE <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", propertyDamage$EVTYPE, perl=TRUE)
# Convert to billions of USD and round to two decimals
cropDamage <- mutate(cropDamage, CROPDMG = round(CROPDMG / 1e9, 2))
propertyDamage <- mutate(propertyDamage, PROPDMG = round(PROPDMG / 1e9, 2))
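As a quick check of the two cosmetic steps, the sketch below runs the same title-casing pattern and the billion-unit scaling on made-up inputs (labels and usd are assumptions for illustration). In the replacement string, \\U upper-cases the first captured letter and \\L lower-cases the rest of the word; both require perl = TRUE.

```r
# Event names as they look after cleaning.
labels <- c("WILD/FOREST FIRE", "EXTREME COLD/WIND CHILL", "FLOOD")

# Capitalize the first letter of every word and lower-case the remainder.
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", labels, perl = TRUE)

# Made-up damage totals in USD, converted to billions with two decimals.
usd <- c(1.23456e9, 5e8)
billions <- round(usd / 1e9, 2)
```

Because "/" is not a word character, each side of a slash is capitalized separately, giving "Wild/Forest Fire" and "Extreme Cold/Wind Chill".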
Results
Impact on the Population Health
The top 10 weather event types by number of fatalities and by number of injuries from 1995 to 2011 are shown in the tables below, followed by a plot.
EVTYPE | FATALITIES |
---|---|
Excessive Heat | 1903 |
Tornado | 1545 |
Flash Flood | 951 |
Heat | 924 |
Lightning | 729 |
Rip Current | 569 |
Flood | 454 |
Thunderstorm Wind | 418 |
Extreme Cold/Wind Chill | 270 |
High Wind | 252 |
EVTYPE | INJURIES |
---|---|
Tornado | 21783 |
Flood | 6850 |
Excessive Heat | 6525 |
Thunderstorm Wind | 5603 |
Lightning | 4631 |
Heat | 2030 |
Flash Flood | 1739 |
Wild/Forest Fire | 1456 |
Hurricane/Typhoon | 1326 |
Winter Storm | 1298 |
As we can see, excessive heat is the number one cause of fatalities, followed by tornadoes. After these two, the numbers fall to 951 for flash floods and decline gradually from there. The top 5 events are all associated with warm weather, which is when more deaths occur. Injuries tell a similar story, with tornadoes taking the top position by a large margin (21783 injuries, against 6850 for floods). Again, the top 5 events causing injuries are associated with warm weather, although we should note that flooding can also occur during winter after large snow storms.
Impact on the Economy
The top 10 weather event types by property damage and by crop damage from 1995 to 2011 are shown in the tables below, followed by a plot. The damage values are in billions of USD.
EVTYPE | PROPDMG |
---|---|
Flood | 144.24 |
Hurricane/Typhoon | 81.72 |
Storm Surge/Tide | 47.83 |
Tornado | 24.93 |
Flash Flood | 15.53 |
Hail | 15.29 |
Thunderstorm Wind | 8.65 |
Wild/Forest Fire | 7.76 |
Tropical Storm | 7.65 |
High Wind | 5.35 |
EVTYPE | CROPDMG |
---|---|
Drought | 13.92 |
Flood | 5.48 |
Hurricane/Typhoon | 5.35 |
Hail | 2.64 |
Flash Flood | 1.45 |
Extreme Cold/Wind Chill | 1.33 |
Frost/Freeze | 1.09 |
Thunderstorm Wind | 1.09 |
Heavy Rain | 0.73 |
Tropical Storm | 0.68 |
Looking at the figure, we can clearly see that floods cause the greatest property damage by a significant margin (144.24 billion USD, against 81.72 billion for hurricanes and typhoons). In fact, the top 5 events are all warm-weather events involving wind and water. For crop damage, the leading cause is the opposite extreme, with drought being number one. Still, the top 5 events also seem to be related to warm weather, and flooding remains a major issue. We conclude that more resources should be made available during warmer months and storm seasons to limit damage, and that people should be more vigilant about their safety during those times.