Course: MA5781 - Time Series Analysis
Professor: Yeonwoo Rho
Project Report:
Time Series Analysis of Domestic Flight Delays in the United States
By:
To Analyze and Forecast Airline Flight Delays Across Seasons
Overview of the Dataset:
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The Question:
Is there any seasonal pattern in Airline Flight delays, and can we predict it?
Approach:
The approach for the solution follows the textbook closely; the approximate steps are outlined in Section 4 below.
Project team:
- Dhairya Kothari
- Bharath Inti
- Ling Zhang
Faculty Advisor:
- Dr. Yeonwoo Rho
Resources:
- R
- Kaggle
- Time series textbook (Cryer and Chan)
- Various R Packages
References:
- Textbook: Cryer and Chan, "Time Series Analysis with Applications in R," 2nd Edition
- Kaggle: https://www.kaggle.com/
- R Packages: http://cran.r-project.org
1. Project Outline
Our aim is to analyze and forecast airline flight delays across seasons. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on all flight-delay statistics is available on the BTS website, and we have data from 1988 to 2016. Raw data are made available to the public when the Air Travel Consumer Report is released. We plan to analyze flight delays in the USA and find patterns in them, so that we can potentially forecast flight delays.
The Question:
Is there any seasonal pattern in airline flight delays, and can we predict it?
2. Introduction (Background)
Flight delays are becoming a bigger problem as the number of flights and air travelers grows. An all-time-high number of people are flying within the country, leaving it, and arriving into it. Any frequent flier knows that delayed flights have not only become a common problem, but also disrupt schedules and cause many travelers to miss connecting flights.
Our goal is to find if there exists any patterns in the flight delay and understand the distribution of delayed flights.
3. Data
We obtained more than 20 years of data on the performance of domestic flights operated by all carriers in the USA, acquired from the Bureau of Transportation Statistics. The data covers daily flights from 1988 to 2016. Because it spans more than 20 years, the dataset is larger than 10 GB, and it took us more than one hour to compile and preprocess it.
Our main interest is arrival delay, so we treat flights that arrived late as our delay data. Here is an example to explain:
Figure 1. Data obtained from Bureau of Transportation Statistics
As shown in Figure 1, we count only air carrier delay, weather delay, national aviation system delay, security delay, and late-arriving aircraft as flight delays. The pie chart summarizes part of our data: about 21% of total flights from June 2003 to December 2008 were delayed.
Before analyzing the delay data, we also consider the influence of total traffic on delayed flights, so we first compute the percentage of delayed flights among all flights. After obtaining that percentage, we compute the average monthly delay time of the delayed flights and check whether any patterns exist there as well.
Finally, we analyze both quantities of interest: the percentage of flights delayed (Part 1) and the average monthly delay time of delayed flights (Part 2).
In our project, we use the data from 1988 to 2013 as training data and the data from 2014 to 2016 as testing data.
4. Approach
The final approach stays close to the scope of our textbook and to the slides provided by Dr. Rho. The textbook's approximate steps are as follows:
Our approach and the outline of our modelling are as follows:
5. Collaboration Plan
Workflow:
6. Model Specifications
6.1 - Percentage of flights delayed
Part 1 of our analysis deals with the ratio of delayed flights to total flight operations, aggregated monthly. We express this ratio as a percentage and plot the resulting time series below.
The above time series plot covers the years 1988 to 2013. There is clear seasonality, with three peaks each year: one at the beginning of the year in January and February, one in the summer months of May and June (likely family vacation travel), and one around Christmas. Apart from the seasonality there is no apparent trend; the percentage of delayed flights ranges from roughly 10% to 30% over these 26 years. Since the seasonality looks stable, deterministic models seem sensible here; if they do not work out well, we can later try a SARIMA model.
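For reference, the series is built from the monthly data as in the appendix (the file name tms1.csv and its column layout are as assumed there); a minimal sketch:
PD.raw <- read.csv("tms1.csv", header = TRUE)     # monthly data, 1988-2016
PD.raw <- PD.raw[, 2:3]                           # keep month and percent-delayed columns
PD.raw <- PD.raw[-(313:348), ]                    # drop 2014-2016, held out for testing
PD.ts <- ts(PD.raw[, 2], start = c(1988, 1), end = c(2013, 12), frequency = 12)
plot(PD.ts, type = "o", main = "Percent of Flights Delayed by Operations")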
6.1.1 Seasonal means model
We initially applied a cosine trend model to see how well it fits our data, but since it did not capture all the peaks appropriately, we moved on to the seasonal means model. Setting the season function on our monthly data gave a far better fit, with the following p-value and R².
| Seasonal Means Statistics | | |
| --- | --- | --- |
| p-value | < 2.2e-16 | Model is statistically significant |
| R² | 0.35 | Explains about 35% of the variance (adjusted R²) |
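The fit behind these numbers uses the season() helper from the TSA package, as in the appendix:
library(TSA)                  # provides season()
sp <- season(PD.ts)           # month-of-year factor (January, ..., December)
PD.sm <- lm(PD.ts ~ sp)       # one mean level per calendar month
summary(PD.sm)                # F-test p-value < 2.2e-16, adjusted R-squared = 0.35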
6.1.2 Residual Analysis
Now we examine the residual plot to check whether the data satisfies our conditions of zero mean and homoscedasticity. Below is the residual plot from the seasonal means model: the residuals oscillate near zero and their variance stays roughly constant, so the data is homoscedastic.
6.1.3 Verify Normality Condition
The next condition to check is normality, which can be done with a QQ plot and the Shapiro-Wilk test. (The Ljung-Box test, by contrast, checks the residuals for autocorrelation rather than normality.)
The QQ plot fairly indicates normality, and the Shapiro-Wilk test gives a p-value of 0.52, so we cannot reject the null hypothesis that the residuals are normal. With these conditions met, only two more conditions remain before constructing a time series model: stationarity and independence.
6.1.4 Verify Stationarity Condition
We verify stationarity with the Augmented Dickey-Fuller test (ADF) and the Phillips-Perron unit root test (PP); these tests determine whether we can proceed further. Although stationarity should be checked at the start of an analysis, non-stationarity is usually visible in the time series plot, and ours showed no such signs, so we proceeded under the assumption of stationarity and now verify it. Both tests support stationarity, with p-values of 0.02 (ADF) and 0.01 (PP), rejecting the null hypothesis of a unit root (non-stationarity).
6.1.5 Verify Independence Condition
Finally, we check the last condition, independence. To build a time series model, we actually want the data to depend on its previous lags. Dependence can be assessed with the runs test and the ACF plot. The runs test produced a p-value of 1.88e-17, so we reject the null hypothesis that the data are independent. The ACF plot supports this: the autocorrelations decay slowly, strongly suggesting an autoregressive model in which each lag depends on the previous ones.
Below is a table summarizing the results of all the above tests; a condensed version of the underlying R calls follows the table. As our seasonal means model satisfies our initial assumptions and conditions, we proceed to build a time series model.
| Seasonal Means Model - Residual Analysis | | |
| --- | --- | --- |
| Residual plot | Close to zero mean | Zero-mean condition met |
| Residual plot | Almost constant variance | Homoscedasticity met |
| QQ plot | Mostly normal, tails deviating slightly | Normality condition met |
| Shapiro-Wilk test | p-value = 0.52 | Normality satisfied |
| ADF test | p-value = 0.02 | Stationarity satisfied |
| PP test | p-value = 0.01 | Data is stationary |
| Runs test | p-value = 1.88e-17 | Data is dependent |
| ACF | Lags significant, decaying slowly | Data is dependent |
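The table condenses the following calls (adf.test() and pp.test() come from the tseries package, which the appendix session loads alongside TSA):
library(TSA)                     # runs()
library(tseries)                 # adf.test(), pp.test()
res <- PD.sm$residuals
qqnorm(res); qqline(res)         # QQ plot: points close to the reference line
shapiro.test(res)                # p = 0.52 -> normality not rejected
adf.test(res); pp.test(res)      # p = 0.02 / 0.01 -> stationarity holds
runs(res)                        # p = 1.88e-17 -> residuals are dependent
acf(res)                         # slow decay -> autoregressive structure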
Part 2 --- Average Monthly Delay time of delayed flights.
6.2 Average Monthly Delay in Flights
From Part 1 we know the proportion of delayed flights and its pattern. Here, we look at the amount of delay, averaged over each month. ATC defines a delayed flight as one arriving more than 15 minutes after its scheduled arrival time.
The monthly average of the amount of delay is converted into a time series, plotted below.
The average delay during 1988 was 35-40 minutes; by the end of 2012 it had grown to 60-65 minutes. We also see a seasonal high every December, which can be explained by the increase in total traffic during the holiday season. Immediately after September 11, 2001, there is a noticeable dip in flight delays, explained by the huge number of cancellations following the attacks.
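The series is built the same way as in Part 1 (appendix code; tms2.csv holds the monthly averages):
library(TSA); library(xts)
AD.raw <- read.csv("tms2.csv")                    # monthly average delay, 1988-2016
AD.raw <- AD.raw[, 2:3]                           # keep month and average-delay columns
AD.raw <- AD.raw[-(313:348), ]                    # hold out 2014-2016 for testing
AD.ts <- ts(AD.raw[, 2], start = c(1988, 1), end = c(2013, 12), frequency = 12)
plot(as.xts(AD.ts), main = "Average Flight Delay (in minutes)", type = "o")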
6.2.1 Seasonal Means + Linear Trend Model
The data show some seasonality plus an underlying linear trend. We fit various trends and models and finally settled on the seasonal means + linear trend model, as it explains the seasonality and the upward linear trend better than the other models.
The assumptions of the model all hold true with the model statistics as:
| Seasonal Means + Linear Trend Model Statistics | | |
| --- | --- | --- |
| p-value | < 2.2e-16 | Model is statistically significant |
| R² | 0.8329 | Explains 83.3% of the variance (adjusted R²) |
So we move ahead with seasonal + linear trend:
lm(AvgDelay.ts ~ season(AvgDelay.ts) + time(AvgDelay.ts))
6.2.2 Model Residual Analysis
Now we check the model residual assumptions:
| Tests on the Model Residuals | | |
| --- | --- | --- |
| Shapiro-Wilk test | p-value = 0.9359 | Normality holds |
| Runs test | p-value = 1.9e-09 | Residuals are dependent |
| ADF test | p-value < 0.01 | Stationary; no unit root |
| PP test | p-value < 0.01 | Stationary; no unit root |
Checking the ACF plot of the residuals for model independence:
The model residuals do not look independent in the ACF plot; that is, they carry serial dependence that an ARMA model can capture. All other conditions are satisfied, so we move ahead with model fitting and diagnostics for the seasonal means + linear trend model.
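The table and ACF check above correspond to the following calls (condensed from the appendix; adf.test() and pp.test() are from the tseries package):
library(TSA); library(tseries)
AD.smln <- lm(AD.ts ~ season(AD.ts) + time(AD.ts))       # seasonal means + linear trend
shapiro.test(AD.smln$residuals)                          # p = 0.9359 -> normal
runs(AD.smln$residuals)                                  # p = 1.9e-09 -> dependent
adf.test(AD.smln$residuals); pp.test(AD.smln$residuals)  # both p < 0.01 -> stationary
acf(AD.smln$residuals)                                   # significant lags -> not independent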
7. Model fitting and Diagnostics
7.1 Percentage of flights delayed
For this section we identify the probable candidate models through the ACF, PACF, EACF, and auto.arima functions. These may suggest similar or different models, among which we compare the AIC, AICc, and BIC scores and select the model with the lowest scores. Let us start the model selection with the ACF plot.
7.1.1 ACF plot for percentage of flights delayed
An ACF plot is typically used to identify a moving average model, but in our case the ACF strongly suggests an autoregressive (AR) model: the successive lags decay gradually, which happens when each lag depends on the previous lags. We therefore examine the PACF plot to identify which lags are significant.
7.1.2 PACF plot for percentage of flights delayed
In the PACF plot, lags 1 and 2 are significant, pointing to an AR(2) model; lags 9 and 22 are exceptions but can be ignored for building our model. Since lags 3 and 4 also touch the boundary line, AR(3) or AR(4) could be considered if need be, but for now we move ahead with AR(2) from the PACF plot.
The AR(2) equation is $Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t$, with $\{e_t\} \sim \mathrm{WN}(0, \sigma_e^2)$.
7.1.3 EACF plot for percentage of flights delayed
The EACF plot suggests at least an ARMA(1,1). This could be a more viable model, but let us also check auto.arima to finalize the model selection. The ARMA(1,1) equation is
$Y_t = \phi Y_{t-1} + e_t - \theta e_{t-1}$, with $\{e_t\} \sim \mathrm{WN}(0, \sigma_e^2)$.
7.1.4 Auto Arima for percentage of flights delayed
As seen above, auto.arima also suggests an ARMA(1,1) model. We could proceed with the ARMA model directly, but we verify the AIC, AICc, and BIC scores anyway.
7.1.5 AIC, AICc and BIC scores evaluation
Below is a table of the scores obtained for the two candidate models; they clearly indicate that we should select ARMA(1,1). We therefore proceed to model diagnostics for the ARMA(1,1) model. (The underlying fits are condensed after the table.)
| Time Series Model | AIC score | AICc score | BIC score |
| --- | --- | --- | --- |
| AR(2) model | -1401.65 | -1401.57 | -1390.42 |
| ARMA(1,1) model | -1407.31 | -1407.23 | -1396.08 |
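The scores come from fitting both candidates on the seasonal means residuals with forecast::Arima(), as in the appendix:
library(forecast)
ar2    <- Arima(PD.sm$residuals, order = c(2, 0, 0), include.mean = FALSE)
arma11 <- Arima(PD.sm$residuals, order = c(1, 0, 1), include.mean = FALSE)
ar2     # AIC = -1401.65, AICc = -1401.57, BIC = -1390.42
arma11  # AIC = -1407.31, AICc = -1407.23, BIC = -1396.08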
7.1.6 Model diagnostics for ARMA(1,1) model
Above are the plots of the standardized residuals, their ACF, and the Ljung-Box test. The standardized residuals look mostly random, stay close to zero mean, and have roughly constant variance, indicating homoscedasticity. The ACF plot is also satisfactory, with almost all lags lying within the bounds, and the Ljung-Box p-values are above the 0.05 significance level, indicating no significant residual autocorrelation. Therefore ARMA(1,1) is our final model and we can move to the last step, forecasting. The fitted ARMA(1,1) with coefficients is:
$Y_t = 0.87\,Y_{t-1} + e_t - 0.45\,e_{t-1}$, with $\{e_t\} \sim \mathrm{WN}(0, \sigma_e^2)$.
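The three diagnostic panels above come from a single call, as in the appendix:
tsdiag(arma11)   # standardized residuals, residual ACF, Ljung-Box p-values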
Part 2 --- Average Monthly Delay time of delayed flights.
7.2.1 Model determination
We determine the ARIMA(p,d,q) orders with the ACF, PACF, EACF, and auto.arima().
ACF
The ACF tails off gradually rather than cutting off, which puts more weight on an AR component than an MA component, so we check the PACF.
PACF
The PACF suggests AR(2), and perhaps AR(1) could also be considered, since lags after 2 are within bounds. A few lags drift slightly out of bounds, which can be explained by the trend not being perfectly linear.
EACF
EACF suggests ARMA(1,1).
Auto.arima() Outputs:
Auto ARIMA suggests ARMA(1,1) as well.
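The identification calls mirror Part 1, applied to the residuals of the seasonal means + linear trend model (condensed from the appendix):
library(TSA); library(forecast)
acf(AD.smln$residuals)          # tails off -> AR component
pacf(AD.smln$residuals)         # cuts off near lag 2 -> AR(2) candidate
eacf(AD.smln$residuals)         # vertex at (1,1) -> ARMA(1,1) candidate
auto.arima(AD.smln$residuals)   # ARIMA(1,0,1) with zero mean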
7.2.2 Checking Candidate Models
Based on the above plots we test the following candidate models:
| Model | AIC | AICc | BIC |
| --- | --- | --- | --- |
| AR(2) | 1473.47 | 1473.55 | 1484.7 |
| ARMA(1,1) | 1470.24 | 1470.31 | 1481.47 |
ARMA(1,1) is the best fit by AIC, AICc and BIC scores.
7.2.3 Diagnostics of ARMA (1,1) Model
Residuals are evenly distributed across zero mean with no apparent patterns.
ACF lags are within the interval bounds.
The Ljung-Box p-values sit just above the 0.05 significance line; a few touch it or dip slightly below, but on balance the fit is acceptable. We can safely forecast with this model.
8. Forecasting
8.1 Forecasting the Percentage of Flights Delayed
We forecast three years, from January 2014 to December 2016. For this purpose we initially set aside the actual data for these years; we now use it to observe how accurately our model forecasts relative to the original data.
In the above figure, the black line is the original data, the red line is the forecast, and the green lines are the upper and lower bounds. Our model's forecast fits the original data well and is especially accurate for the year 2015.
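The forecast adds the deterministic seasonal component to the ARMA(1,1) forecast of the residuals; a condensed version of the appendix code, using the objects fitted above (sp, PD.sm, arma11, full1.ts):
newdata  <- data.frame(sp = as.factor(sp[1:36]))   # season labels for the next 36 months
predxreg <- predict(PD.sm, newdata)                # seasonal means component
predx    <- predict(arma11, n.ahead = 36)          # ARMA(1,1) residual component
pr  <- ts(predx$pred + predxreg, start = c(2014, 1), frequency = 12)                 # point forecast
uci <- ts(predx$pred + predxreg + 2 * predx$se, start = c(2014, 1), frequency = 12)  # upper bound
lci <- ts(predx$pred + predxreg - 2 * predx$se, start = c(2014, 1), frequency = 12)  # lower bound
plot(full1.ts, ylab = "Percentage"); lines(pr, col = 2); lines(uci, col = 3); lines(lci, col = 3)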
8.2 Forecasting Average Monthly Delay time
As the graph above shows, our forecasts match the real values; there is some drift from the mean forecast (red line), but the values stay within the green interval bounds.
9. Conclusion
9.1 Conclusion for percentage of flights delayed
9.2 Conclusions on Delay Time:
10. References
Resources:
11. Appendix (R Codes)
Part 1 --- Percentage of Flights Delayed
TS1 data import:
ts1 <- read.csv("tms1.csv", header = T)   # monthly percent-delayed series, 1988-2016
ts1 <- ts1[,2:3]                          # keep the month and percent columns
tst1 <- ts1[313:348,]                     # test set: the 36 months of 2014-2016
ts1 <- ts1[-(313:348),]                   # training set: 1988-2013
full1 <- rbind(ts1,tst1)                  # full series, used when plotting forecasts
library(TSA)
library(xts)
PD.ts <- ts(ts1[,2], start = c(1988,1), end = c(2013,12), frequency = 12)
full1.ts <- ts(full1$prcntdelay, start = c(1988,1), frequency = 12)
plot(as.xts(PD.ts), main="Percent of Flights Delayed by Operations", major.format = "%Y-%m", type = "o")
We can see a seasonal pattern here: a year typically has three peaks - at the beginning in January and February, then during the summer vacation, and again toward the end around Christmas.
COSINE TREND:
## Cosine Trends
har=harmonic(PD.ts,6)
PD.cs = lm(PD.ts ~ har)
summary(PD.cs)
##
## Call:
## lm(formula = PD.ts ~ har)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.092308 -0.019904 -0.001538 0.022692 0.085385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1838782 0.0019076 96.390 < 2e-16 ***
## harcos(2*pi*t) 0.0097039 0.0026978 3.597 0.000376 ***
## harcos(4*pi*t) 0.0274359 0.0026978 10.170 < 2e-16 ***
## harcos(6*pi*t) -0.0075000 0.0026978 -2.780 0.005779 **
## harcos(8*pi*t) -0.0022436 0.0026978 -0.832 0.406277
## harcos(10*pi*t) 0.0004885 0.0026978 0.181 0.856446
## harcos(12*pi*t) -0.0071474 0.0019076 -3.747 0.000215 ***
## harsin(2*pi*t) 0.0085086 0.0026978 3.154 0.001774 **
## harsin(4*pi*t) -0.0084382 0.0026978 -3.128 0.001934 **
## harsin(6*pi*t) -0.0038462 0.0026978 -1.426 0.155007
## harsin(8*pi*t) -0.0083272 0.0026978 -3.087 0.002213 **
## harsin(10*pi*t) -0.0087009 0.0026978 -3.225 0.001398 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0337 on 300 degrees of freedom
## Multiple R-squared: 0.3757, Adjusted R-squared: 0.3528
## F-statistic: 16.41 on 11 and 300 DF, p-value: < 2.2e-16
plot(PD.ts)
cs.ts <- ts(fitted(PD.cs), start = c(1988,1), frequency = 12)
lines(cs.ts,col=2, lty=2)
SEASONAL MEANS:
# Seasonal Means model
sp = season(PD.ts)
PD.sm = lm(PD.ts~sp)
summary(PD.sm)
##
## Call:
## lm(formula = PD.ts ~ sp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.092308 -0.019904 -0.001538 0.022692 0.085385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.204615 0.006608 30.964 < 2e-16 ***
## spFebruary -0.009231 0.009345 -0.988 0.32408
## spMarch -0.013077 0.009345 -1.399 0.16276
## spApril -0.039615 0.009345 -4.239 2.99e-05 ***
## spMay -0.038077 0.009345 -4.074 5.91e-05 ***
## spJune 0.003846 0.009345 0.412 0.68096
## spJuly -0.005385 0.009345 -0.576 0.56493
## spAugust -0.017308 0.009345 -1.852 0.06501 .
## spSeptember -0.068077 0.009345 -7.284 2.86e-12 ***
## spOctober -0.046923 0.009345 -5.021 8.84e-07 ***
## spNovember -0.042692 0.009345 -4.568 7.19e-06 ***
## spDecember 0.027692 0.009345 2.963 0.00329 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0337 on 300 degrees of freedom
## Multiple R-squared: 0.3757, Adjusted R-squared: 0.3528
## F-statistic: 16.41 on 11 and 300 DF, p-value: < 2.2e-16
#sm.ts <- ts(PD.sm$residuals, start = c(1988,1), frequency = 12)
#plot(sm.ts)
plot(PD.sm$residuals)
abline(h=0,col=2)
Residuals are close to mean zero and the variance also doesn't vary too much, hence homoskedastic.
We will plot QQplot to check normality:
qqnorm(PD.sm$residuals)
qqline(PD.sm$residuals)
We will use Shapiro Wilk test to check normality:
shapiro.test(PD.sm$residuals)
##
## Shapiro-Wilk normality test
##
## data: PD.sm$residuals
## W = 0.99556, p-value = 0.5189
runs(PD.sm$residuals)
## $pvalue
## [1] 1.88e-17
##
## $observed.runs
## [1] 83
##
## $expected.runs
## [1] 156.9744
##
## $n1
## [1] 158
##
## $n2
## [1] 154
##
## $k
## [1] 0
Box.test(PD.sm$residuals, lag = 1, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: PD.sm$residuals
## X-squared = 118.4, df = 1, p-value < 2.2e-16
Shapiro-Wilk satisfied normality, and the runs test declared that our residuals are not independent.
Next ACF plot:
acf(PD.sm$residuals, main="ACF plot of seasonal means residuals")
So, it is not independent.
PACF plot:
pacf(PD.sm$residuals, main="PACF plot of seasonal means residuals")
PACF says that model is AR(2) - can also consider AR(3), AR(4).
EACF plot:
eacf(PD.sm$residuals)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x x x x
## 1 x o o o o o o x o o o o o o
## 2 x x o o o o o x o x o o o o
## 3 x x x o o o o o o x o o o o
## 4 x x x o o o o x o x o o o o
## 5 x o x o o o o o o x o o o o
## 6 x x o o o o o o o x o o o o
## 7 x x x o x o o o o x o o o o
EACF suggests ARMA(1,1).
Stationarity:
adf.test(PD.sm$residuals)
##
## Augmented Dickey-Fuller Test
##
## data: PD.sm$residuals
## Dickey-Fuller = -3.7983, Lag order = 6, p-value = 0.01946
## alternative hypothesis: stationary
pp.test(PD.sm$residuals)
## Warning in pp.test(PD.sm$residuals): p-value smaller than printed p-value
##
## Phillips-Perron Unit Root Test
##
## data: PD.sm$residuals
## Dickey-Fuller Z(alpha) = -130.13, Truncation lag parameter = 5,
## p-value = 0.01
## alternative hypothesis: stationary
Both the ADF and PP tests support stationarity (both reject the unit-root null at the 5% level).
auto.arima() function:
library(forecast)
##
## Attaching package: 'forecast'
## The following objects are masked from 'package:TSA':
##
## fitted.Arima, plot.Arima
## The following object is masked from 'package:nlme':
##
## getResponse
auto.arima(PD.sm$residuals)
## Series: PD.sm$residuals
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.8685 -0.4457
## s.e. 0.0421 0.0803
##
## sigma^2 estimated as 0.0006341: log likelihood=706.65
## AIC=-1407.31 AICc=-1407.23 BIC=-1396.08
auto.arima suggested ARIMA(1,0,1), which is nothing but ARMA(1,1) suggested by EACF.
Check AIC, AICC and BIC scores:
# Candidate models: AR(2) and ARMA(1,1); note that ARIMA(1,0,1) is the same model as ARMA(1,1)
ar2=Arima(PD.sm$residuals,order=c(2,0,0),include.mean=F)
arma11=Arima(PD.sm$residuals,order=c(1,0,1),include.mean=F)
arima101=Arima(PD.sm$residuals,order=c(1,0,1),include.mean=F) # identical fit to arma11
ar2
## Series: PD.sm$residuals
## ARIMA(2,0,0) with zero mean
##
## Coefficients:
## ar1 ar2
## 0.4754 0.2293
## s.e. 0.0553 0.0553
##
## sigma^2 estimated as 0.0006458: log likelihood=703.83
## AIC=-1401.65 AICc=-1401.57 BIC=-1390.42
arma11
## Series: PD.sm$residuals
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.8685 -0.4457
## s.e. 0.0421 0.0803
##
## sigma^2 estimated as 0.0006341: log likelihood=706.65
## AIC=-1407.31 AICc=-1407.23 BIC=-1396.08
arima101
## Series: PD.sm$residuals
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.8685 -0.4457
## s.e. 0.0421 0.0803
##
## sigma^2 estimated as 0.0006341: log likelihood=706.65
## AIC=-1407.31 AICc=-1407.23 BIC=-1396.08
The best model is ARMA(1,1), i.e., ARIMA(1,0,1).
Model Diagnostics:
tsdiag(arma11)
Forecasting:
newdata=data.frame(sp=as.factor(sp[1:36]))
PD.prd <- PD.sm$coefficients
predxreg=predict(PD.sm,newdata)
predx=predict(arma11,n.ahead=36)
pr=predx$pred+predxreg
uci=pr+2*predx$se
lci=pr-2*predx$se
pr=ts(pr,start=2014,freq=12)
uci=ts(uci,start=2014,freq=12)
lci=ts(lci,start=2014,freq=12)
ymin=min(c(as.vector(lci),PD.ts))-.1
ymax=max(c(as.vector(uci),PD.ts))+.1
plot(full1.ts,ylab="Percentage",xlim=c(1998,2016),ylim=c(ymin,ymax),main="Forecast of Percent Flights Delayed")
lines(pr,col=2)
lines(uci,col=3)
lines(lci,col=3)
Part 2 --- Average Monthly Delay time of delayed flights.
TS2 in months
ts2 <- read.csv("tms2.csv")               # monthly average delay series, 1988-2016
ts2 <- ts2[,2:3]                          # keep the month and average-delay columns
full2 <- ts2                              # full series, used when plotting forecasts
tst2 <- ts2[313:348,]                     # test set: the 36 months of 2014-2016
ts2 <- ts2[-(313:348),]                   # training set: 1988-2013
library(TSA)
library(xts)
AD.ts <- ts(ts2[,2], start = c(1988,1), end = c(2013,12), frequency = 12)
full2.ts <- ts(full2$avgdelay, start = c(1988,1), frequency = 12)
plot(as.xts(AD.ts), main="Average Flight Delay (in minutes)", major.format = "%Y-%m", type = "o")
LINEAR TREND:
AD.ln=lm(AD.ts~time(AD.ts))
summary(AD.ln)
##
## Call:
## lm(formula = AD.ts ~ time(AD.ts))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0408 -2.7173 -0.2001 2.4522 11.4454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.701e+03 5.728e+01 -29.69 <2e-16 ***
## time(AD.ts) 8.737e-01 2.863e-02 30.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.795 on 310 degrees of freedom
## Multiple R-squared: 0.7503, Adjusted R-squared: 0.7495
## F-statistic: 931.7 on 1 and 310 DF, p-value: < 2.2e-16
plot(AD.ln)
ln.ts=ts(resid(AD.ln),start=c(1988,1),freq=12)
plot(ln.ts)
abline(h=0,col=2)
SEASONAL MEANS:
# Seasonal Means model
sp = season(AD.ts)
AD.sm = lm(AD.ts~sp)
summary(AD.sm)
##
## Call:
## lm(formula = AD.ts ~ sp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2020 -6.6133 0.6631 5.8316 15.4784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.9367 1.4492 33.079 <2e-16 ***
## spFebruary -0.7920 2.0494 -0.386 0.6994
## spMarch -0.2152 2.0494 -0.105 0.9164
## spApril -1.3576 2.0494 -0.662 0.5082
## spMay -0.2594 2.0494 -0.127 0.8994
## spJune 3.6049 2.0494 1.759 0.0796 .
## spJuly 3.7126 2.0494 1.811 0.0711 .
## spAugust 1.6295 2.0494 0.795 0.4272
## spSeptember -2.1650 2.0494 -1.056 0.2916
## spOctober -3.5712 2.0494 -1.743 0.0824 .
## spNovember -2.9101 2.0494 -1.420 0.1567
## spDecember 0.3739 2.0494 0.182 0.8554
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.389 on 300 degrees of freedom
## Multiple R-squared: 0.08397, Adjusted R-squared: 0.05038
## F-statistic: 2.5 on 11 and 300 DF, p-value: 0.005098
sm.ts <- ts(AD.sm$residuals, start = c(1988,1), frequency = 12)
plot(sm.ts)
#plot(AD.sm$residuals)
abline(h=0,col=2)
SEASONAL MEANS + LINEAR:
sp = season(AD.ts)
tm = time(AD.ts)
AD.smln = lm(AD.ts~sp+tm)
summary(AD.smln)
##
## Call:
## lm(formula = AD.ts ~ sp + tm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1783 -2.0925 -0.1185 2.0488 8.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1707.0889 46.8185 -36.462 < 2e-16 ***
## spFebruary -0.8651 0.8598 -1.006 0.315141
## spMarch -0.3614 0.8598 -0.420 0.674546
## spApril -1.5769 0.8598 -1.834 0.067655 .
## spMay -0.5518 0.8599 -0.642 0.521539
## spJune 3.2394 0.8599 3.767 0.000199 ***
## spJuly 3.2739 0.8599 3.807 0.000170 ***
## spAugust 1.1177 0.8599 1.300 0.194687
## spSeptember -2.7499 0.8600 -3.198 0.001534 **
## spOctober -4.2292 0.8600 -4.918 1.45e-06 ***
## spNovember -3.6412 0.8600 -4.234 3.06e-05 ***
## spDecember -0.4303 0.8601 -0.500 0.617223
## tm 0.8773 0.0234 37.489 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1 on 299 degrees of freedom
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8329
## F-statistic: 130.1 on 12 and 299 DF, p-value: < 2.2e-16
plot(AD.smln)
smln.ts <- ts(AD.smln$residuals, start = c(1988,1), frequency = 12)
plot(smln.ts)
abline(h=0,col=2)
We will use Shapiro Wilk test to check normality:
shapiro.test(AD.smln$residuals)
##
## Shapiro-Wilk normality test
##
## data: AD.smln$residuals
## W = 0.99763, p-value = 0.9359
runs(AD.smln$residuals)
## $pvalue
## [1] 1.9e-09
##
## $observed.runs
## [1] 104
##
## $expected.runs
## [1] 156.9423
##
## $n1
## [1] 159
##
## $n2
## [1] 153
##
## $k
## [1] 0
Box.test(AD.sm$residuals, lag = 1, type = "Ljung-Box") # note: tests the seasonal-means-only residuals; the model under study is AD.smln
##
## Box-Ljung test
##
## data: AD.sm$residuals
## X-squared = 261.75, df = 1, p-value < 2.2e-16
Shapiro-Wilk satisfied normality, and the runs test declared that our residuals are not independent.
Next ACF plot:
acf(AD.smln$residuals, main="ACF plot of seasonal means + linear trend residuals")
So, it is not independent.
PACF plot:
pacf(AD.smln$residuals, main="PACF plot of seasonal means + linear trend residuals")
PACF suggests AR(2), though a few lags just cross the boundary line.
EACF plot:
eacf(AD.smln$residuals)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x x x x
## 1 x o o o o o o o o o o o o o
## 2 x x o o o o o o o o o o o o
## 3 x x o o o o o o o o o o o o
## 4 x x o o o o o o o o o o o o
## 5 x x o o o o o o o o o o o o
## 6 x x o x o o o o o o o o o o
## 7 x x x x x x o o o o o o o o
EACF suggests ARMA(1,1).
Stationarity:
adf.test(AD.smln$residuals)
## Warning in adf.test(AD.smln$residuals): p-value smaller than printed p-
## value
##
## Augmented Dickey-Fuller Test
##
## data: AD.smln$residuals
## Dickey-Fuller = -4.1068, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
pp.test(AD.smln$residuals)
## Warning in pp.test(AD.smln$residuals): p-value smaller than printed p-value
##
## Phillips-Perron Unit Root Test
##
## data: AD.smln$residuals
## Dickey-Fuller Z(alpha) = -170.8, Truncation lag parameter = 5,
## p-value = 0.01
## alternative hypothesis: stationary
Both ADF and PP satisfy stationarity.
auto.arima() function:
library(forecast)
##
## Attaching package: 'forecast'
## The following objects are masked from 'package:TSA':
##
## fitted.Arima, plot.Arima
## The following object is masked from 'package:nlme':
##
## getResponse
auto.arima(AD.smln$residuals)
## Series: AD.smln$residuals
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.8414 -0.4781
## s.e. 0.0561 0.0955
##
## sigma^2 estimated as 6.425: log likelihood=-732.12
## AIC=1470.24 AICc=1470.31 BIC=1481.47
auto.arima also suggests ARMA(1,1).
Check AIC, AICC and BIC scores:
# Candidate models could be AR(2) & ARMA(1,1)
ar2=Arima(AD.smln$residuals,order=c(2,0,0),include.mean=F) #
arma11=Arima(AD.smln$residuals,order=c(1,0,1),include.mean=F)
ar2
## Series: AD.smln$residuals
## ARIMA(2,0,0) with zero mean
##
## Coefficients:
## ar1 ar2
## 0.3959 0.2292
## s.e. 0.0551 0.0553
##
## sigma^2 estimated as 6.493: log likelihood=-733.74
## AIC=1473.47 AICc=1473.55 BIC=1484.7
arma11
## Series: AD.smln$residuals
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.8414 -0.4781
## s.e. 0.0561 0.0955
##
## sigma^2 estimated as 6.425: log likelihood=-732.12
## AIC=1470.24 AICc=1470.31 BIC=1481.47
ARMA(1,1) is the best model by AIC and BIC scores.
Model Diagnostics:
tsdiag(arma11)
Forecasting:
# let us do the forecasting for the next three years
newtm=seq(from=2014,to=2016.917,length=36)
newdata=data.frame(sp=as.factor(sp[1:36]),tm=newtm)
predxreg=predict(AD.smln,newdata)
predx=predict(arma11,n.ahead=36)
pr=predx$pred+predxreg
uci=pr+2*predx$se
lci=pr-2*predx$se
# Code the predicted values as time series to plot them as prediction intervals
pr=ts(pr,start=c(2014,1),end = c(2016,12),freq=12)
uci=ts(uci,start=c(2014,1),end = c(2016,12),freq=12)
lci=ts(lci,start=c(2014,1),end = c(2016,12),freq=12)
ymin=min(c(as.vector(lci),AD.ts))-.1
ymax=max(c(as.vector(uci),AD.ts))+.1
plot(full2.ts,ylab="Delay Time (in minutes)",xlim=c(1998,2016),ylim=c(ymin,ymax),main="Forecast of Average Delay Time of Flights")
lines(pr,col=2)
lines(uci,col=3)
lines(lci,col=3)