Impact of weather events in the US
Synopsis
Using the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database I explore which weather events are most harmful to population health and economy in the US.
Regarding the population health, the most frequent events are usually the most harmful ones such as tornadoes, floods, thunderstorm winds, hurricanes and lightnings. An exception is the excessive heat event which is the 20th most frequent, but the deadliest with 88 deaths per year on average during the 2000-2011 period. Regarding the economy, the most frequent events are not necessarily the most harmful. The first four most harmful economically events are hydro-meteorological events such as floods, hail, storms, hurricanes. During 2000-2011 damage done to property was 14 times greater than the damage done to crops.Below is a detailed analysis.
Data processing
I start with downloading the data, unzipping it, and reading it in.
if(!file.exists("data.csv")) {
if(!file.exists("data.bz2")){
#Download
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url,"data.bz2",method="curl")
}
# Unzip
zz <- readLines(gzfile("data.bz2"))
zz <- iconv(zz, "latin1", "ASCII", sub="")
writeLines(zz, "data.csv")
rm(zz)
}
## Read data in
data<-read.csv("data.csv", sep=",", quote = "\"", header=TRUE)
The data set has 902297 observations and 37 variables (columns). Below are the first six rows of the first 5 columns of the data set I am working with as well as the names of all columns.
head(data[,1:5])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 1 1 4/18/1950 0:00:00 0130 CST 97
## 2 1 4/18/1950 0:00:00 0145 CST 3
## 3 1 2/20/1951 0:00:00 1600 CST 57
## 4 1 6/8/1951 0:00:00 0900 CST 89
## 5 1 11/15/1951 0:00:00 1500 CST 43
## 6 1 11/15/1951 0:00:00 2000 CST 77
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
For my analysis I needed to extract the year information from the BGN_DATE
column.
# Add a new year column to the dataset
data<-data.frame(Year=format(as.Date(data$BGN_DATE,format="%m/%d/%Y"),"%Y"),data)
head(data[,1:5])
## Year STATE__ BGN_DATE BGN_TIME TIME_ZONE
## 1 1950 1 4/18/1950 0:00:00 0130 CST
## 2 1950 1 4/18/1950 0:00:00 0145 CST
## 3 1951 1 2/20/1951 0:00:00 1600 CST
## 4 1951 1 6/8/1951 0:00:00 0900 CST
## 5 1951 1 11/15/1951 0:00:00 1500 CST
## 6 1951 1 11/15/1951 0:00:00 2000 CST
After looking at the most frequent events I noticed duplicates THUNDERSTORM WIND, THUNDERSTORM WINDS, TSTM WIND, MARINE TSTM WIND. These are the same thing, so I replaced them with TSTM WIND.
# Replace duplicate THUDNDERSTORM WIND, etc. with TTSM WIND
data$EVTYPE = sapply(data$EVTYPE,function(x) gsub("THUNDERSTORM WINDS|MARINE TSTM WIND|THUNDERSTORM WIND","TSTM WIND",x))
The data set has scaling factors for two variables which are needed for analysis. Namely, property damage PROPDMG
and crop damage CROPDMG
. They have corresponding scaling columns PROPDMGEXP
and CROPDMGEXP
. The scaling columns contain information about how to scale the values in the columns PROPDMG
and CROPDMG
. For example, a value in the PROPDMG
column that has a scaling factor “k” in the PROPDMGEXP
column should be multiplied by $10^3$. I use the following scheme for the scaling factors.
## Scaling exponent Occurences Scaling factor
## 1 465934 10^0
## 2 - 1 10^0
## 3 ? 8 10^0
## 4 + 5 10^0
## 5 0 216 10^0
## 6 1 25 10^1
## 7 2 13 10^2
## 8 3 4 10^3
## 9 4 4 10^4
## 10 5 28 10^5
## 11 6 4 10^6
## 12 7 5 10^7
## 13 8 1 10^8
## 14 B 40 10^9
## 15 h 1 10^2
## 16 H 6 10^2
## 17 K 424665 10^3
## 18 m 7 10^6
## 19 M 11330 10^6
Some values in the PROPDMG
and CROPDMG
did not have scaling factors specified or had symbols like +,-,?
. I decided not to scale the values that had those symbols. The rest is intuitive, namely, “b” and “B” stand for billion and etc. In the code below I create two new columns, PROPDMGSCALE
and CROPDMGSCALE
, with property damage and crop damage scaled.
scale.func <- function(x) {
# Function that replaces a factor with a numeric
if(x %in% 0:8) {x<-as.numeric(x)}
else if(x %in% c("b","B")) {x<-10^9} # billion
else if(x %in% c("m","M")) {x<-10^6} # million/mega
else if(x %in% c("k","K")) {x<-10^3} # kilo
else if(x %in% c("h","H")) {x<-10^2} # hundred
else x<-10^0
}
# Apply scale.func with sapply
data$PROPDMGSCALE <- sapply(data$PROPDMGEXP,scale.func) * data$PROPDMG
data$CROPDMGSCALE <- sapply(data$CROPDMGEXP,scale.func) * data$CROPDMG
I also created a plotting function for horizontal barplots, so I dont have to repeat code:
plot.k<- function(df,title){
# A function that plots a barplot with presets. Arguments are a matrix-like dataset and a plot title
barplot(main=title,sort(df,decreasing=FALSE),las=1,horiz=TRUE,cex.names=0.75,col=c("lightblue"))
}
Results
According to the project instructions, with time the recording of the weather events improved. In the multiline graph I show the number of occurences of 10 most frequent weather events by year during 1950-2011. The number of occurences of the events increased drastically due to improvements in the process of recording those events.
# Get table of counts of events by year
dat = as.data.frame(table(data[,c("Year","EVTYPE")]))
# 10 most frequent events
a = sort(apply(table(data[,c("Year","EVTYPE")]),2,sum),decreasing=TRUE)[1:10]
dat = dat[dat$EVTYPE %in% names(a),]
# Modify year column to be in the unambiguos date format %Y-%m-%d
dat$Year = paste0(dat$Year,"-01-01")
dat$Year = as.numeric(as.POSIXct(dat$Year))
# Create Rickshaw graph
# require(devtools)
# install_github('ramnathv/rCharts')
require(rCharts)
r2 <- Rickshaw$new()
r2$layer(
Freq ~ Year,
data = dat,
type = "line",
groups = "EVTYPE",
height = 340,
width = 700
)
# turn on built in features
r2$set(
hoverDetail = TRUE,
xAxis = TRUE,
yAxis = TRUE,
shelving = FALSE,
legend = TRUE,
slider = TRUE,
highlight = TRUE
)
r2$show('iframesrc', cdn = TRUE)
Below is the list of average occurences per year during 1950-2011 of 20 most frequent events.
# List of average occurences per year
avg_occ = sort(apply(table(data$Year,data$EVTYPE),2,sum),decreasing=TRUE)[1:20]/61
data.frame(`Avg Occurences` = avg_occ)
## Avg.Occurences
## TSTM WIND 5401.98361
## HAIL 4732.14754
## TORNADO 994.29508
## FLASH FLOOD 889.78689
## FLOOD 415.18033
## HIGH WIND 331.34426
## LIGHTNING 258.26230
## HEAVY SNOW 257.50820
## HEAVY RAIN 192.18033
## WINTER STORM 187.42623
## WINTER WEATHER 115.18033
## FUNNEL CLOUD 112.11475
## MARINE TSTM WIND 95.27869
## WATERSPOUT 62.22951
## STRONG WIND 58.45902
## URBAN/SML STREAM FLD 55.60656
## WILDFIRE 45.26230
## BLIZZARD 44.57377
## DROUGHT 40.78689
## ICE STORM 32.88525
In the table below I show the annual averages of property damage, crop damage, total damage, number of injuries, deaths and occurences during 1950-2011. I highlight the time period because later in my analysis I focus on the 2000-2011 period. The table is initially sorted by the average number of occurences per year. You may play around with the table.
# http://rstudio.github.io/DT/
# require(devtools)
# install_github('ramnathv/htmlwidgets')
# install_github("rstudio/DT")
require(htmlwidgets)
require(DT)
# Create data frame for datatable
datatab <- data.frame(
`Property Damage` = tapply(data$PROPDMGSCALE,list(data$EVTYPE),sum)/61,
`Crop Damage` = tapply(data$CROPDMGSCALE,list(data$EVTYPE),sum)/61,
`Total Damage` =
tapply(data$PROPDMGSCALE,list(data$EVTYPE),sum)/61 + tapply(data$CROPDMGSCALE,list(data$EVTYPE),sum)/61,
Deaths = tapply(data$FATALITIES,list(data$EVTYPE),sum)/61,
Injuries = tapply(data$INJURIES,list(data$EVTYPE),sum)/61,
Occurences = matrix(table(data[,"EVTYPE"])/61)
)
# Create datatable initially sorted by avg number of occurences
datatable(format(datatab,big.mark = ',',scientific = F, digits = 1),
colnames = c('Property Damage,$' = 2, 'Crop Damage,$' = 3, 'Total Damage,$' = 4),
caption = 'Table: Average annual values, 1950-2011',
options = list(order = list(list(6, 'desc'))),
extensions = 'Responsive'
)
Let’s see whether the most frequent events have a significant effect on the health of population and the infrastructure.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
I summarize the number of total injuries and deaths by the event type over the period 1950 to 2011.
# Table of total deaths and injuries by event type during 1950-2011
Tdeaths<-sort(tapply(data$FATALITIES,list(data$EVTYPE),sum),decreasing=TRUE)
Tinj<-sort(tapply(data$INJURIES,list(data$EVTYPE),sum),decreasing=TRUE)
# Total deaths by event type
head(Tdeaths)
## TORNADO EXCESSIVE HEAT FLASH FLOOD HEAT LIGHTNING
## 5633 1903 978 937 816
## TSTM WIND
## 710
# Total injuries by event type
head(Tinj)
## TORNADO TSTM WIND FLOOD EXCESSIVE HEAT LIGHTNING
## 91346 9361 6789 6525 5230
## HEAT
## 2100
Next, I calculate the average number of deaths and injuries for years 2000-2011. I chose the period 2000-2011 because in the earlier years there are generally fewer events recorded due to poor recording process as was stated in the project instructions. Recent years have more accurate records.
# Boolean to subset data
years<-data$Year %in% 2000:2010
# Average deaths and injuries per year
TdeathsAve<-sort(tapply(data[years,]$FATALITIES,list(data[years,]$EVTYPE),sum),decreasing=TRUE)/11
TinjAve<-sort(tapply(data[years,]$INJURIES,list(data[years,]$EVTYPE),sum),decreasing=TRUE)/11
# Average deaths per year by event type
head(TdeathsAve)
## EXCESSIVE HEAT TORNADO FLASH FLOOD LIGHTNING RIP CURRENT
## 88.81818 55.09091 48.36364 40.00000 28.27273
## FLOOD
## 18.90909
# Average injuries per year by event type
head(TinjAve)
## TORNADO EXCESSIVE HEAT LIGHTNING TSTM WIND
## 822.72727 324.54545 254.45455 253.45455
## HURRICANE/TYPHOON WILDFIRE
## 115.90909 72.27273
Let’s plot the averages.
par(mfcol=c(1,2))
par(mar=c(3,7,2,3))
rowChart = plot.k(TdeathsAve[1:10],'Avg # of deaths per year')
text(x= sort(TdeathsAve[1:10])+20, y= rowChart, labels=as.character(round(sort(TdeathsAve[1:10]),2)),xpd=TRUE)
rowChart = plot.k(TinjAve[1:10],'Avg # of injuries per year')
text(x= sort(TinjAve[1:10])+200, y= rowChart, labels=as.character(round(sort(TinjAve[1:10]),2)),xpd=TRUE)
par(mfcol=c(1,1))
par(mar=c(5.1,4.1,4.1,2.1))
On average tornados and excessive heat cause the most number of deaths and injuries per year for the period 2000-2011. The most frequent events, tornado, flash flood and thundestorm winds have a significant effect on the population health. However, hail, the second most frequent event, has an insignificant effect on population health. Excessive heat which is 20th most frequent event with $27.06452$ occurences per year is the most deadly. Thus, in general, the events that are most frequent are significantly harmful to the population health. Therefore, countermeasures should be provided on a constant basis. This means that the amortisation and review of the countermeasure infrastructure needs to be frequent as well.
Across the United States, which types of events have the greatest economic consequences?
I calculate the total damage over the period 1950-2011 and average damage for years 2000 to 2011. The total damage is the sum of property damage and the crop damage.
# Total damage table by event type
TtotalDmg<-sort(tapply(data$PROPDMGSCALE,list(data$EVTYPE),sum)
+ tapply(data$CROPDMGSCALE,list(data$EVTYPE),sum)
,decreasing=TRUE)
totalDmg = data.frame(row.names = 1:length(TtotalDmg), event = rownames(TtotalDmg),cost = TtotalDmg,`pretty_cost` = format(TtotalDmg,big.mark = ',',scientific = F,digits = 4))
# Show events whose total damage is more than 1bil$ during 1950-2011
totalDmg[totalDmg$cost > 1000000000,c(1,3)]
## event pretty_cost
## 1 FLOOD 150,319,678,257.00
## 2 HURRICANE/TYPHOON 71,913,712,800.00
## 3 TORNADO 57,352,115,822.50
## 4 STORM SURGE 43,323,541,000.00
## 5 HAIL 18,758,223,268.50
## 6 FLASH FLOOD 17,562,131,144.30
## 7 DROUGHT 15,018,672,000.00
## 8 HURRICANE 14,610,229,010.00
## 9 TSTM WIND 10,868,962,514.50
## 10 RIVER FLOOD 10,148,404,500.00
## 11 ICE STORM 8,967,041,560.00
## 12 TROPICAL STORM 8,382,236,550.00
## 13 WINTER STORM 6,715,441,255.00
## 14 HIGH WIND 5,908,617,595.00
## 15 WILDFIRE 5,060,586,800.00
## 16 STORM SURGE/TIDE 4,642,038,000.00
## 17 HURRICANE OPAL 3,191,846,000.00
## 18 WILD/FOREST FIRE 3,108,626,330.00
## 19 HEAVY RAIN/SEVERE WEATHER 2,500,000,000.00
## 20 TORNADOES, TSTM WIND, HAIL 1,602,500,000.00
## 21 HEAVY RAIN 1,427,647,890.00
## 22 EXTREME COLD 1,360,710,400.00
## 23 SEVERE THUNDERSTORM 1,205,560,000.00
## 24 FROST/FREEZE 1,103,566,000.00
## 25 HEAVY SNOW 1,067,242,257.00
Let’s look at the average total economic damage per year during 2000-2011.
# Average damage per year table for the period 2000-2011 by event type
TtotalDmgAve<-sort(tapply(data[years,]$PROPDMGSCALE,list(data[years,]$EVTYPE),sum)
+ tapply(data[years,]$CROPDMGSCALE,list(data[years,]$EVTYPE),sum),
decreasing=TRUE)/11
# Turn into a dataframe with pretty cost column
totalDmgAve = data.frame(row.names = 1:length(TtotalDmgAve), event = rownames(TtotalDmgAve),cost = TtotalDmgAve,`pretty_cost` = format(TtotalDmgAve,big.mark = ',',scientific = F,digits = 4))
# Show events whose average total damage is more than 1mil$ per year during 2000-2011
totalDmgAve[totalDmgAve$cost > 1000000,c(1,3)]
## event pretty_cost
## 1 FLOOD 11,912,769,002.73
## 2 HURRICANE/TYPHOON 6,537,610,254.55
## 3 STORM SURGE 3,924,630,454.55
## 4 HAIL 1,203,609,870.00
## 5 FLASH FLOOD 1,028,091,082.73
## 6 DROUGHT 904,622,090.91
## 7 TORNADO 894,962,851.82
## 8 TROPICAL STORM 676,727,122.73
## 9 TSTM WIND 516,427,969.09
## 10 HIGH WIND 487,030,401.82
## 11 STORM SURGE/TIDE 418,303,909.09
## 12 WILDFIRE 399,638,581.82
## 13 HURRICANE 315,616,910.00
## 14 ICE STORM 265,541,210.91
## 15 WILD/FOREST FIRE 214,942,875.45
## 16 WINTER STORM 124,723,290.91
## 17 FROST/FREEZE 98,601,454.55
## 18 HEAVY RAIN 76,104,003.64
## 19 LIGHTNING 50,392,913.64
## 20 EXCESSIVE HEAT 45,060,181.82
## 21 LANDSLIDE 29,403,818.18
## 22 HEAVY SNOW 28,213,493.64
## 23 STRONG WIND 18,882,801.82
## 24 COASTAL FLOOD 17,231,505.45
## 25 EXTREME COLD 14,428,000.00
## 26 BLIZZARD 11,092,090.91
## 27 TSUNAMI 8,229,818.18
## 28 HIGH SURF 7,571,136.36
## 29 FREEZE 4,850,000.00
## 30 TSTM WIND/HAIL 4,450,095.45
## 31 LAKE-EFFECT SNOW 3,569,272.73
## 32 WINTER WEATHER 3,086,545.45
## 33 COASTAL FLOODING 1,553,909.09
## 34 EXTREME WINDCHILL 1,545,454.55
Below I plot the average total economic damage per year during 2000-2011 of events whose average economic damage is greater than 1 million.
par(mar=c(2,8,2,6.4))
# Plot average economic damage > than 1mil$ per year during 2000-2011
rowChart = plot.k(totalDmgAve[totalDmgAve$cost > 1000000,]$cost,'Avg Damage per Year in $, 2000-2011')
text(x = sort(totalDmgAve[totalDmgAve$cost > 1000000,2])+1780000000,
y = rowChart,
labels= format(sort(totalDmgAve[totalDmgAve$cost > 1000000,2]),big.mark = ',',scientific = F,digits = 4),
xpd=TRUE)
Lastly, let’s look at the average economic damage per year by damage type during 2000-2011. The average annual damage done to property by weather events is 14 times greater than the average annual damage done to crops.
# Average damage per year during 2000-2011 by damage type
format(data.frame(
prop_dmg=sum(data[years,]$PROPDMGSCALE/11),
crop_dmg=sum(data[years,]$CROPDMGSCALE/11),
row.names = '$ annualy'),
big.mark = ',',digits = 4, scientific = F)
## prop_dmg crop_dmg
## $ annualy 28,173,521,423 2,084,491,549
The fact that floods have had the biggest damage in economic terms makes sense. It is a devastating force of nature. Water can crush, water can flow. In the 2000s a bunch of storms and hurricanes took place in the US whose appearences are correlated the floods. However, hurricanes, storms, droughts are not even in the top 16 most frequent events. They are just most devastating to the infrastracture and crops. During 2000-2011 the damage done to property was 14 times greater than the damage done to crops. Therefore, the countermeasures need to focus on ensuring the security of the property.