Data Analytics Colloquium: Dr. Brandt on Statistics and Time Series
Notes on Methods
Changepoint Models
Basic time series intervention or inference
Does the data change/break at a given point?
When researchers looked at the change made by the mandatory seatbelt law, they found that gasoline prices changed around the same time, meaning that there were covariates not originally accounted for.
The chicken farmer salmonella vaccination intervention problem shows a pre-intervention effect: farmers began vaccinating chickens before the policy was implemented.
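The covariate problem above can be sketched as a regression with a step indicator at a known intervention date. This is only an illustration on a made-up series (not the seatbelt or salmonella data); the names `post` and `covariate`, the break at t = 20, and all coefficients are assumptions chosen for the example.

```r
# Made-up series for illustration only: a step of -2 at t = 20,
# plus a covariate that drifts upward around the same time.
n <- 40
time <- seq_len(n)
post <- as.integer(time > 20)             # 1 after the hypothetical intervention
covariate <- sin(time / 3) + 0.1 * time   # hypothetical covariate (e.g. a price index)
wiggle <- 0.3 * cos(1.7 * time)           # deterministic stand-in for noise
y <- 10 - 2 * post + 1.5 * covariate + wiggle

# Intervention model without and with the covariate
fit_naive <- lm(y ~ post)
fit_adjusted <- lm(y ~ post + covariate)

coef(fit_naive)["post"]      # step estimate that ignores the covariate
coef(fit_adjusted)["post"]   # step estimate after adjusting for the covariate
```

In this setup the two step estimates differ noticeably, because the unadjusted model attributes the covariate's drift to the intervention.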
Binary segmentation
Will there be a break soon given the data today?
Does not infer causes
Each segment needs a minimum number of observations.
Keep the number of breakpoints small: adding breakpoints will always improve the fit to the data, but the extra breaks may not represent a true difference in the data distribution
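A minimal base R sketch of binary segmentation for changes in mean, assuming a squared-error segment cost. The function names, `min_len`, and `penalty` are hypothetical and chosen for illustration; this is not the speaker's implementation.

```r
# Cost of a segment: sum of squared deviations from the segment mean
seg_cost <- function(y) sum((y - mean(y))^2)

# Recursive binary segmentation for changes in mean.
# min_len: minimum observations per segment (hypothetical tuning parameter)
# penalty: cost reduction a split must exceed to be kept (hypothetical)
# offset:  position of this sub-series within the full series
binary_segmentation <- function(y, min_len = 3, penalty = 1, offset = 0) {
  n <- length(y)
  if (n < 2 * min_len) return(numeric(0))      # too short to split
  candidates <- min_len:(n - min_len)          # last index of the left segment
  split_cost <- vapply(
    candidates,
    function(k) seg_cost(y[1:k]) + seg_cost(y[(k + 1):n]),
    numeric(1)
  )
  best <- which.min(split_cost)
  gain <- seg_cost(y) - split_cost[best]
  if (gain <= penalty) return(numeric(0))      # split does not pay for itself
  k <- candidates[best]
  c(
    binary_segmentation(y[1:k], min_len, penalty, offset),
    offset + k,
    binary_segmentation(y[(k + 1):n], min_len, penalty, offset + k)
  )
}

# Made-up series with mean shifts after positions 8 and 16
y <- c(rep(0, 8), rep(3, 8), rep(1, 8)) + 0.1 * sin(1:24)
breaks_found <- binary_segmentation(y, min_len = 3, penalty = 1)
breaks_none <- binary_segmentation(y, min_len = 3, penalty = 100)
```

On this made-up series, `breaks_found` holds the two shift positions, while the much larger penalty in `breaks_none` rejects every split.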
L1/L0 & lasso regularization
Where are the optimal breaks, and how many are there?
Used for causal inference
Can either iteratively search for each changepoint, or use a penalized regression model to estimate all changepoints at once
A cost function trades off better fit against the number of changepoints, so each added changepoint has to be justified
Lasso regression tends to overstate the number of changepoints; the cost function is then used to prune them
Multivariate versions are easier to implement in comparison to other techniques
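The trade-off between fit and number of changepoints can be sketched with an L0-style penalty: total cost = sum of segment costs + penalty × number of changepoints. The example below uses an exact dynamic-programming search (optimal partitioning) rather than the lasso approach; `optimal_partition` and `penalty` are hypothetical names for illustration.

```r
# Cost of a segment: sum of squared deviations from the segment mean
seg_cost <- function(y) sum((y - mean(y))^2)

# Minimize: sum of segment costs + penalty * (number of changepoints)
optimal_partition <- function(y, penalty) {
  n <- length(y)
  best_cost <- c(-penalty, rep(Inf, n))  # best_cost[t + 1]: optimal cost of y[1:t]
  last_break <- integer(n)               # last changepoint before t in that optimum
  for (t in seq_len(n)) {
    for (s in 0:(t - 1)) {
      # cost if the final segment is y[(s + 1):t]
      cand <- best_cost[s + 1] + seg_cost(y[(s + 1):t]) + penalty
      if (cand < best_cost[t + 1]) {
        best_cost[t + 1] <- cand
        last_break[t] <- s
      }
    }
  }
  # Walk back through the stored breaks to recover all changepoints
  breaks <- integer(0)
  t <- n
  while (last_break[t] > 0) {
    t <- last_break[t]
    breaks <- c(t, breaks)
  }
  breaks
}

# Made-up series with mean shifts after positions 8 and 16
y <- c(rep(0, 8), rep(3, 8), rep(1, 8)) + 0.1 * sin(1:24)
breaks_pen1 <- optimal_partition(y, penalty = 1)    # moderate penalty
breaks_pen0 <- optimal_partition(y, penalty = 0)    # no penalty
```

With no penalty the search places a break after every observation, which is the overfitting the notes warn about; a moderate penalty keeps only the breaks that reduce the cost enough to pay for themselves.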
Bayesian methods
Have to specify full probability model
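To illustrate what specifying a full probability model involves, here is a sketch for the simplest case. It assumes exactly one changepoint, a uniform prior on its location, Normal observations with known standard deviation `sigma`, and a Normal(`mu0`, `tau0`^2) prior on each segment mean. All of these choices and names are assumptions made for the example.

```r
# Log marginal likelihood of one segment under
#   y_i ~ Normal(mu, sigma^2), mu ~ Normal(mu0, tau0^2), sigma known.
# Built up from one-step-ahead predictive densities.
log_marginal <- function(y, mu0, tau0, sigma) {
  total <- 0
  post_mean <- mu0
  post_var <- tau0^2
  for (yi in y) {
    # predictive density of the next point given the points so far
    total <- total + dnorm(yi, post_mean, sqrt(post_var + sigma^2), log = TRUE)
    # conjugate update of the belief about the segment mean
    new_var <- 1 / (1 / post_var + 1 / sigma^2)
    post_mean <- new_var * (post_mean / post_var + yi / sigma^2)
    post_var <- new_var
  }
  total
}

# Posterior probability that the single changepoint falls after observation k
changepoint_posterior <- function(y, mu0 = 0, tau0 = 10, sigma = 1) {
  n <- length(y)
  log_post <- vapply(
    1:(n - 1),
    function(k) log_marginal(y[1:k], mu0, tau0, sigma) +
      log_marginal(y[(k + 1):n], mu0, tau0, sigma),
    numeric(1)
  )
  post <- exp(log_post - max(log_post))   # subtract max for numerical stability
  post / sum(post)
}

# Made-up series with a mean shift after position 10
y <- c(rep(0, 10), rep(4, 10)) + 0.1 * sin(1:20)
post <- changepoint_posterior(y)
which.max(post)   # most probable changepoint location
```

Unlike the cost-based searches, the result is a probability for every candidate location, but each prior and likelihood choice above has to be stated explicitly.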
Exercise: Recreate Plots
Load in time series test data with average rainfall per year
library(tidyverse) # provides read_csv() and ggplot()
data <- read_csv("dataVisualization_assignment10_timeSeriesData.csv")
Rows: 23 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): Year, average_rainfall
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rownames(data) <- data$Year
Warning: Setting row names on a tibble is deprecated.
head(data, 3)
# A tibble: 3 × 2
   Year average_rainfall
  <dbl>            <dbl>
1  2000            0.85
2  2001            0.284
3  2002            0.136
Line plot showing change in the dependent variable (average rainfall) over time
ggplot(data, aes(Year, average_rainfall)) +
  geom_line() + # create line plot
  # customize look of plot and labels
  labs(title = "Average Rainfall over Time") +
  xlab("Year") +
  ylab("Average Rainfall Per Day (Inches)") +
  theme_light() +
  theme(text = element_text(family = "Avenir"))
Line plot showing change in the dependent variable (average rainfall) over time, with y-axis labels shown on both left and right
ggplot(data, aes(Year, average_rainfall)) +
  geom_line() +
  scale_y_continuous(sec.axis = sec_axis(trans = ~ . * 1)) + # create second y axis
  labs(title = "Average Rainfall over Time") +
  xlab("Year") +
  ylab("Average Rainfall Per Day (Inches)") +
  theme_light() +
  theme(text = element_text(family = "Avenir"))
Line plot showing y variable over time with vertical lines showing change points
ggplot(data, aes(Year, average_rainfall)) +
  geom_line() +
  # add lines at potential change points
  geom_vline(xintercept = 2006, color = "firebrick", linetype = "dashed") +
  geom_vline(xintercept = 2008, color = "firebrick", linetype = "dashed") +
  geom_vline(xintercept = 2010, color = "firebrick", linetype = "dashed") +
  geom_vline(xintercept = 2018, color = "firebrick", linetype = "dashed") +
  labs(title = "Average Rainfall over Time") +
  xlab("Year") +
  ylab("Average Rainfall Per Day (Inches)") +
  theme_light() +
  theme(text = element_text(family = "Avenir"))