Course: MA5781 - Time Series Analysis

Professor: Yeonwoo Rho

Project Report:

Time Series Analysis of Domestic Flight Delays in the United States

By:

To Analyse and Forecast Airline Flight delays based on various Seasons

Overview of the Dataset:

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.

The Question:

Is there any seasonal pattern in Airline Flight delays, and can we predict it?

Approach:

The approach for the solution largely follows the textbook; the approximate steps are as follows:

  1. Check for stationarity and remove any non-stationarity
  2. Identify trends and seasonality, and remove them
  3. Model selection and specification: evaluate multiple models and compare them to find the best fit
  4. Parameter estimation
  5. Model diagnostics and Evaluation
  6. Forecasting Results and Visualization

Project team:

- Dhairya Kothari

- Bharath Inti

- Ling Zhang

 

Faculty Advisor:

- Dr. Yeonwoo Rho

Resources:

- R

- Kaggle

- Time Series Text Book

- Various R Packages

References:

- Textbook: Cryer and Chan, "Time Series Analysis with Applications in R," 2nd Edition

- Kaggle: https://www.kaggle.com/

- R Packages: http://cran.r-project.org

Project Outline

  1.  Abstract
  2.  Introduction
  3.  Data
  4.  Approach
  5.  Collaboration Plan
  6.  Model Specifications
  7.  Model fitting and Diagnostics
  8.  Forecasting
  9.  Conclusion
  10.  References
  11.  Appendix (R codes) 

  1. Abstract

Our aim is to analyse and forecast airline flight delays across the seasons. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on all flight delay statistics is available on the BTS website; we have data from 1988 to 2016. Raw data are made available to the public when the Air Travel Consumer Report is released. We plan to analyse flight delays in the USA and find patterns in them, so that we can potentially forecast flight delays.

The Question:

Is there any seasonal pattern in airline flight delays, and can we predict it?

2. Introduction (Background)

Flight delays are becoming a bigger problem as the number of flights and air travellers keeps increasing. An all-time-high number of people are flying within the country, leaving the country, and arriving into it. Any frequent flier knows that delayed flights have not only become a big and frequent problem, but also disrupt schedules and cause many passengers to miss their connecting flights.

Our goal is to find whether any patterns exist in flight delays and to understand the distribution of delayed flights.

3. Data

We obtained more than 20 years of data on the performance of domestic flights operated by all carriers in the USA, acquired from the Bureau of Transportation Statistics. The data covers daily flights from 1988 to 2016. Because it spans so many years, the dataset is more than 10 GB in size, and it took us more than an hour to compile and preprocess it.

Our main interest is arrival delay, so we treat flights that arrived late as the delay data. Here is an example to explain:

WechatIMG1.jpeg

Figure 1. Data obtained from Bureau of Transportation Statistics

As shown in Figure 1, we count only air carrier delay, weather delay, national aviation system delay, security delay and late-arriving aircraft as flight delays. The pie chart covers part of our data: in this example, about 21% of total flights from June 2003 to December 2008 were delayed.

Before analysing the delayed-flight data, we also consider the influence of total traffic on delays, so we first compute the percentage of delayed flights among all flights. After obtaining this percentage, we examine the average monthly delay time of the delayed flights to see whether any patterns exist there as well.

Finally, we analyse both of these parts of interest.

In our project, we use the data from 1988 to 2013 as training data and the data from 2014 to 2016 as testing data.
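This train/test split can be sketched with base R's window() (a minimal sketch on simulated values; the appendix builds the real series from tms1.csv):

```r
# Minimal sketch of the train/test split; simulated values stand in for the
# real monthly delay percentages built in the appendix.
set.seed(1)
y <- ts(runif(348, min = 0.10, max = 0.30), start = c(1988, 1), frequency = 12)

train <- window(y, end = c(2013, 12))    # 1988-2013: used to fit models
test  <- window(y, start = c(2014, 1))   # 2014-2016: held out for evaluation

length(train)   # 312 monthly observations
length(test)    # 36 monthly observations
```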

4. Approach

The final approach closely follows the scope of our textbook and the slides provided by Dr. Rho. The textbook approach, with its approximate steps, is as follows:

  1. Check for Stationarity (Ch.2)
  2. Trends & Seasonality, and their removal (Ch.3)
  3. Model selection and specification, Evaluate multiple models and possibly compare those for finding the best fit (Ch.4 & 5)
  4. Parameter estimation (Ch.7)
  5. Model diagnostics and Evaluation (Ch.8)
  6. Forecasting Results and Visualization (Ch.9)

Our approach and the outline of our modelling follow the same steps.

5. Collaboration Plan

Workflow:

6. Model Specifications 

6.1 - Percentage of flights delayed

Part 1 of our analysis deals with the number of delayed flights relative to total flight operations, aggregated monthly. We took this ratio as a percentage and created the time series plot seen below.

tms1_analysis_files/figure-docx/unnamed-chunk-2-1.png

The above time series plot covers the years 1988 to 2013. We can observe clear seasonality, with three peaks within each year: one at the beginning in January and February, a second during the summer months of May and June, probably due to family vacation time, and a third around Christmas. Apart from the seasonality there is no apparent trend in the data; the mean percentage of flights delayed varies between 10% and 30% over all 26 years. Since the time series plot shows stable seasonality, deterministic models seem sensible here; if they do not work out well, we can later use a SARIMA model.

6.1.1 Seasonal means model

We initially applied a cosine trend model to see how well it fit our data, but as it did not cover all the peaks appropriately, we moved on to the seasonal means model. Applying the season function to our monthly data gave a far better fit, with the following p-value and R² score.

Seasonal Means Statistics

| Statistic | Value | Interpretation |
|-----------|-------|----------------|
| p-value | < 2.2e-16 | Model is statistically significant |
| R² | 0.35 | Explains about 35% of the variance |
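The seasonal means fit can be sketched in base R without the TSA package by regressing on the month factor (a minimal sketch on simulated data; the actual fit in the appendix uses TSA::season()):

```r
# Seasonal means model sketched in base R: one mean per calendar month.
# Simulated data stands in for the real delay-percentage series.
set.seed(2)
y <- ts(rnorm(312, mean = 0.18, sd = 0.03), start = c(1988, 1), frequency = 12)

month <- factor(cycle(y), labels = month.abb)  # cycle() gives the month index 1-12
fit <- lm(y ~ month)                           # 12 monthly means, January as baseline

length(coef(fit))        # 12 coefficients: intercept + 11 month effects
summary(fit)$r.squared   # share of variance explained by the seasonal means
```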

6.1.2 Residual Analysis

Now we observe the residual plot to determine whether the data satisfies our conditions of zero mean and homoscedasticity. Below is a plot of the residuals from the seasonal means model: the values mostly oscillate near zero and the variance does not change much, so the residuals are homoscedastic.

tms1_analysis_files/figure-docx/unnamed-chunk-4-1.png 

6.1.3 Verify Normality Condition

The next condition we need to check is normality. This can be done using a QQ plot and the Shapiro-Wilk test. The Ljung-Box test, by contrast, checks for remaining autocorrelation rather than normality, so we use it later for the dependence check.

tms1_analysis_files/figure-docx/unnamed-chunk-5-1.png

The QQ plot fairly indicates normality, and the Shapiro-Wilk test confirms this with a p-value of 0.52, so we cannot reject the null hypothesis that the data are normal. With these conditions met, only two more need to be verified before we construct a time series model: stationarity and independence.
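A minimal sketch of these normality checks, using simulated residuals in place of the seasonal means residuals:

```r
# Normality checks sketched on simulated residuals (stand-ins for the
# seasonal means residuals from the report).
set.seed(3)
res <- rnorm(312, sd = 0.03)

qqnorm(res); qqline(res)   # points hugging the line indicate normality

st <- shapiro.test(res)
st$p.value                 # a large p-value means we cannot reject normality
```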

6.1.4 Verify Stationarity Condition

We can verify stationarity with the Augmented Dickey-Fuller (ADF) test and the Phillips-Perron (PP) unit root test; these determine whether we can proceed further. Although stationarity should be verified at the beginning of the analysis, non-stationarity is usually visible in the time series plot. As there were no such indications in our plot, we assumed the data is stationary and then verified this assumption with the ADF and PP tests. Both tests support stationarity, with p-values of 0.02 (ADF) and 0.01 (PP), rejecting the null hypothesis that the data is not stationary.
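Base R also ships a Phillips-Perron test as stats::PP.test (the appendix uses adf.test and pp.test from the tseries ecosystem); a sketch on a simulated stationary AR(1) series:

```r
# Stationarity check sketched with base R's Phillips-Perron test. The
# simulated AR(1) series below is stationary, so the unit-root null
# (non-stationarity) should be rejected.
set.seed(4)
x <- arima.sim(model = list(ar = 0.5), n = 312)

pp <- PP.test(x)
pp$p.value   # a small p-value rejects the unit-root (non-stationary) null
```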

6.1.5 Verify Independence Condition

We then check the last condition, independence. To build a time series model, we actually need our data to depend on its previous lags. Independence can be assessed with the runs test and the ACF plot. The runs test produced a p-value of 1.88e-17, so we reject the null hypothesis that the data are independent. This is supported by the ACF plot, which decays slowly and strongly suggests an autoregressive model in which each lag depends on the previous ones.

tms1_analysis_files/figure-docx/unnamed-chunk-7-1.png
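The dependence checks can be sketched in base R; since TSA::runs() has no base equivalent, the Ljung-Box test stands in for it here, on simulated AR(1)-type residuals:

```r
# Dependence check sketched in base R. The simulated residuals carry AR(1)
# dependence, mimicking the slowly decaying ACF seen in the report.
set.seed(5)
res <- arima.sim(model = list(ar = 0.8), n = 312)

acf(res, main = "ACF of residuals")               # slow decay points to an AR model
lb <- Box.test(res, lag = 12, type = "Ljung-Box")
lb$p.value   # a tiny p-value: the residuals are dependent, as an ARMA model needs
```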

Below is a table summarising the results of the above tests. As our seasonal means model satisfies the initial assumptions and conditions, we proceed to build a time series model.

Seasonal Means Model - Residual Analysis

| Check | Result | Conclusion |
|-------|--------|------------|
| Residual plot | Close to zero mean | Zero-mean condition met |
| Residual plot | Almost constant variance | Homoscedasticity met |
| QQ plot | Mostly normal, tails deviating slightly | Normality condition met |
| Shapiro-Wilk test | p-value = 0.52 | Normality satisfied |
| ADF test | p-value = 0.02 | Stationarity satisfied |
| PP test | p-value = 0.01 | Data is stationary |
| Runs test | p-value = 1.88e-17 | Data is dependent |
| ACF plot | Lags are related, decaying slowly | Data is dependent |

Part 2 --- Average Monthly Delay time of delayed flights.

6.2 Average Monthly Delay in Flights

From part one we know the ratio of delayed flights to total flights, and its pattern. Here, we want to look at the amount of delay (averaged over each month) of the delayed flights. ATC defines a delayed flight as one that arrives more than 15 minutes after its scheduled arrival time.

The monthly average of the amount of delay is converted into a time series, the plot of which is seen below.

We see that the average delay during 1988 was 35-40 minutes, rising to 60-65 minutes by the end of 2012. We also see a seasonal high every December, which can be explained by the increase in total traffic during the holiday season. On and after September 11, 2001, we see a noticeable dip in flight delays, explained by the huge number of cancellations following the attacks.

6.2.1 Seasonal Means + Linear Trend Model

From the data we observe some seasonality as well as an underlying linear trend. We fit various trends and models, finally settling on the seasonal means + linear trend model, as it explains the seasonality and the upward linear trend better than the other models.

The assumptions of the model all hold true with the model statistics as:

Seasonal Means + Linear Trend Model Statistics

| Statistic | Value | Interpretation |
|-----------|-------|----------------|
| p-value | < 2.2e-16 | Model is statistically significant |
| R² (adjusted) | 0.8329 | Explains 83.3% of the variance |

So we move ahead with seasonal + linear trend:

lm(AvgDelay.ts ~ season(AvgDelay.ts) + time(AvgDelay.ts))

6.2.2 Model Residual Analysis

Now we check the model residual assumptions.

Tests on the Model Residuals

| Test | p-value | Conclusion |
|------|---------|------------|
| Shapiro-Wilk | 0.9359 | Consistent with normality |
| Runs | 1.9e-09 | Residuals are dependent |
| ADF | < 0.01 | Stationary; no unit root |
| PP | < 0.01 | Stationary; no unit root |

Checking the ACF plot of the residuals for model independence:

From the ACF, the model residuals do not look independent, which is exactly what we need for ARMA modelling. Since all the conditions are satisfied, we move ahead to model fitting and diagnostics with the seasonal means + linear trend model.

7. Model fitting and Diagnostics

7.1 Percentage of flights delayed

For this section we identify the probable candidate models through the ACF, PACF, EACF and auto.arima functions. These functions may suggest similar or different models, for which we compare the AIC, AICc and BIC scores and select the model with the lowest scores. Let us start our model selection from the ACF plot.

7.1.1 ACF plot for percentage of flights delayed

An ACF plot is typically used to identify a moving average (MA) model, but in our case it strongly suggests an autoregressive (AR) model: the successive lags decay gradually, which happens with an AR model because each lag depends on the previous ones. We therefore analyse the PACF plot to identify which lags are significant.

tms1_analysis_files/figure-docx/unnamed-chunk-7-1.png

7.1.2 PACF plot for percentage of flights delayed

In the PACF plot, lags 1 and 2 are significant, which suggests an AR(2) model; lags 9 and 22 are exceptions, but they can be ignored in building our model.

tms1_analysis_files/figure-docx/unnamed-chunk-8-1.png

As lags 3 and 4 also touch the boundary line, we could consider AR(3) and AR(4) if need be. For now we move ahead with AR(2) from the PACF plot:

The AR(2) equation is given as Yₜ = φ₁Yₜ₋₁ + φ₂Yₜ₋₂ + eₜ, where {eₜ} ∼ WN(0, σₑ²).

7.1.3 EACF plot for percentage of flights delayed

The EACF plot suggests using at least ARMA(1,1). This could be a more viable model, but let us also check auto.arima before finalising our model selection. The ARMA(1,1) equation is:

Yₜ = φYₜ₋₁ + eₜ − θeₜ₋₁, where {eₜ} ∼ WN(0, σₑ²).

7.1.4 Auto Arima for percentage of flights delayed

As seen above, auto.arima also suggests an ARMA(1,1) model, so we could proceed with the ARMA model directly, but we verify the AIC and BIC scores anyway.

7.1.5 AIC, AICc and BIC scores evaluation

Below is a table of the scores obtained for these two models, and they clearly indicate that we should select the ARMA(1,1) model. We therefore proceed to model diagnostics on the ARMA(1,1) model.

| Time Series Model | AIC | AICc | BIC |
|-------------------|-----|------|-----|
| AR(2) | -1401.65 | -1401.57 | -1390.42 |
| ARMA(1,1) | -1407.31 | -1407.23 | -1396.08 |
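The comparison can be sketched with base R's arima() and AIC()/BIC() (the report's actual fits use forecast::Arima, as in the appendix; simulated ARMA(1,1) data stands in for the residual series):

```r
# Candidate-model comparison sketched with base R's arima(); simulated
# ARMA(1,1) data stands in for the seasonal means residuals.
set.seed(6)
res <- arima.sim(model = list(ar = 0.87, ma = -0.45), n = 312, sd = 0.025)

ar2    <- arima(res, order = c(2, 0, 0), include.mean = FALSE)
arma11 <- arima(res, order = c(1, 0, 1), include.mean = FALSE)

# Lower AIC/BIC is better; with ARMA(1,1) data the ARMA(1,1) fit should win.
c(AIC(ar2), AIC(arma11))
c(BIC(ar2), BIC(arma11))
```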

7.1.6 Model diagnostics for ARMA(1,1) model

tms1_analysis_files/figure-docx/unnamed-chunk-13-1.png

Above are the plots of the standardized residuals, their ACF, and the Ljung-Box test. In the first plot, the values are mostly random, close to zero mean, and do not vary much, indicating homoscedasticity. The ACF plot is also satisfactory, with almost all lags lying within the bounds, and the Ljung-Box p-values sit above the 0.05 significance level, indicating no remaining autocorrelation in the residuals. Therefore ARMA(1,1) is our final model and we can proceed to the last step, forecasting. The ARMA(1,1) model with its coefficients is:

Yₜ = 0.87Yₜ₋₁ + eₜ − 0.45eₜ₋₁, where {eₜ} ∼ WN(0, σₑ²).

Part 2 --- Average Monthly Delay time of delayed flights.

7.2.1 Model determination

We determine the ARIMA(p,d,q) orders with the ACF, PACF, EACF and auto.arima().

ACF

The ACF plot points more toward an AR model than an MA model: the lags tail off gradually rather than cutting off, so we examine the PACF.

PACF

The PACF suggests AR(2), and AR(1) could also be considered, as lags after 2 are within bounds. A few lags come slightly out of bounds (drifting behaviour), which can be explained by the linear trend not being perfectly linear.

EACF

eacf.PNG

EACF suggests ARMA(1,1).

Auto.arima() Outputs:

auto.arima.PNG

Auto ARIMA suggests ARMA(1,1) as well.

7.2.2 Checking Candidate Models

Based on the above plots we test the following candidate models:

| Model | AIC | AICc | BIC |
|-------|-----|------|-----|
| AR(2) | 1473.47 | 1473.55 | 1484.70 |
| ARMA(1,1) | 1470.24 | 1470.31 | 1481.47 |

ARMA(1,1) is the best fit by AIC, AICc and BIC scores.

7.2.3 Diagnostics of ARMA (1,1) Model

Residuals are evenly distributed across zero mean with no apparent patterns.

ACF lags are within the interval bounds.

The Ljung-Box p-values are mostly above the 0.05 significance line; some sit on it and a few dip just below, but these deviations are minor. We can safely forecast with this model.

8. Forecasting

8.1 Forecasting Percentage of Flights Delayed

We have chosen to forecast three years, from January 2014 to December 2016. For this purpose we initially set aside the actual data for these years from our dataset; we now use it to observe how accurately our model forecasts compared with the original data.

tms1_analysis_files/figure-docx/unnamed-chunk-14-1.png

In the above figure, the black line is the original data, the red line is the forecast, and the green lines are the upper and lower bounds. Our model's forecast fits the original data well, and it is particularly accurate for 2015.
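The forecasting mechanics can be sketched with base R's predict() (simulated data stands in for the real series; the appendix additionally adds back the seasonal means prediction):

```r
# Forecast sketch with base R's predict(); 36 steps ahead cover
# January 2014 through December 2016.
set.seed(8)
y <- arima.sim(model = list(ar = 0.87, ma = -0.45), n = 312, sd = 0.025)
fit <- arima(y, order = c(1, 0, 1), include.mean = FALSE)

fc <- predict(fit, n.ahead = 36)
upper <- fc$pred + 2 * fc$se   # approximate 95% upper bound
lower <- fc$pred - 2 * fc$se   # approximate 95% lower bound
length(fc$pred)   # 36 forecasts
```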

8.2 Forecasting Average Monthly Delay time

As we can see in the above graph, our forecasts match the real values; there is some drift from the mean forecast (red line), but the values stay within the green interval bounds.

9. Conclusion

9.1 Conclusion for percentage of flights delayed

9.2 Conclusions on Delay Time:

10. References

 


11. Appendix (R Codes)

Part 1 --- Percentage of flights delayed.

TS1 data import:

ts1 <- read.csv("tms1.csv", header = T)
ts1 <- ts1[,2:3]
tst1 <- ts1[313:348,]
ts1 <- ts1[-(313:348),]
full1 <- rbind(ts1,tst1)

library(TSA)
library(xts)
PD.ts <- ts(ts1[,2], start = c(1988,1), end = c(2013,12), frequency = 12)
full1.ts <- ts(full1$prcntdelay,  start = c(1988,1), frequency = 12)
plot(as.xts(PD.ts), main="Percent of Flights Delayed by Operations", major.format = "%Y-%m", type = "o")

tms1_analysis_files/figure-docx/unnamed-chunk-2-1.png

We can see something like a seasonal pattern here, where a year typically has three peaks: at the beginning in January-February, then in the summer vacation months, and again towards the end during Christmas.

COSINE TREND:

## Cosine Trends
har=harmonic(PD.ts,6)
PD.cs = lm(PD.ts ~ har)
summary(PD.cs)

##
## Call:
## lm(formula = PD.ts ~ har)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.092308 -0.019904 -0.001538  0.022692  0.085385
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.1838782  0.0019076  96.390  < 2e-16 ***
## harcos(2*pi*t)   0.0097039  0.0026978   3.597 0.000376 ***
## harcos(4*pi*t)   0.0274359  0.0026978  10.170  < 2e-16 ***
## harcos(6*pi*t)  -0.0075000  0.0026978  -2.780 0.005779 **
## harcos(8*pi*t)  -0.0022436  0.0026978  -0.832 0.406277    
## harcos(10*pi*t)  0.0004885  0.0026978   0.181 0.856446    
## harcos(12*pi*t) -0.0071474  0.0019076  -3.747 0.000215 ***
## harsin(2*pi*t)   0.0085086  0.0026978   3.154 0.001774 **
## harsin(4*pi*t)  -0.0084382  0.0026978  -3.128 0.001934 **
## harsin(6*pi*t)  -0.0038462  0.0026978  -1.426 0.155007    
## harsin(8*pi*t)  -0.0083272  0.0026978  -3.087 0.002213 **
## harsin(10*pi*t) -0.0087009  0.0026978  -3.225 0.001398 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0337 on 300 degrees of freedom
## Multiple R-squared:  0.3757, Adjusted R-squared:  0.3528
## F-statistic: 16.41 on 11 and 300 DF,  p-value: < 2.2e-16

plot(PD.ts)
cs.ts <- ts(fitted(PD.cs), start = c(1988,1), frequency = 12)
lines(cs.ts,col=2, lty=2)

tms1_analysis_files/figure-docx/unnamed-chunk-3-1.png

SEASONAL MEANS:

# Seasonal Means model
sp = season(PD.ts)
PD.sm = lm(PD.ts~sp)    
summary(PD.sm)

##
## Call:
## lm(formula = PD.ts ~ sp)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.092308 -0.019904 -0.001538  0.022692  0.085385
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.204615   0.006608  30.964  < 2e-16 ***
## spFebruary  -0.009231   0.009345  -0.988  0.32408    
## spMarch     -0.013077   0.009345  -1.399  0.16276    
## spApril     -0.039615   0.009345  -4.239 2.99e-05 ***
## spMay       -0.038077   0.009345  -4.074 5.91e-05 ***
## spJune       0.003846   0.009345   0.412  0.68096    
## spJuly      -0.005385   0.009345  -0.576  0.56493    
## spAugust    -0.017308   0.009345  -1.852  0.06501 .  
## spSeptember -0.068077   0.009345  -7.284 2.86e-12 ***
## spOctober   -0.046923   0.009345  -5.021 8.84e-07 ***
## spNovember  -0.042692   0.009345  -4.568 7.19e-06 ***
## spDecember   0.027692   0.009345   2.963  0.00329 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0337 on 300 degrees of freedom
## Multiple R-squared:  0.3757, Adjusted R-squared:  0.3528
## F-statistic: 16.41 on 11 and 300 DF,  p-value: < 2.2e-16

#sm.ts <- ts(PD.sm$residuals, start = c(1988,1), frequency = 12)
#plot(sm.ts)
plot(PD.sm$residuals)
abline(h=0,col=2)

tms1_analysis_files/figure-docx/unnamed-chunk-4-1.png

Residuals are close to mean zero and the variance also doesn't vary too much, hence homoskedastic.

We will plot QQplot to check normality:

qqnorm(PD.sm$residuals)
qqline(PD.sm$residuals)

tms1_analysis_files/figure-docx/unnamed-chunk-5-1.png

We will use Shapiro Wilk test to check normality:

shapiro.test(PD.sm$residuals)

##
##  Shapiro-Wilk normality test
##
## data:  PD.sm$residuals
## W = 0.99556, p-value = 0.5189

runs(PD.sm$residuals)

## $pvalue
## [1] 1.88e-17
##
## $observed.runs
## [1] 83
##
## $expected.runs
## [1] 156.9744
##
## $n1
## [1] 158
##
## $n2
## [1] 154
##
## $k
## [1] 0

Box.test(PD.sm$residuals, lag = 1, type = "Ljung-Box")

##
##  Box-Ljung test
##
## data:  PD.sm$residuals
## X-squared = 118.4, df = 1, p-value < 2.2e-16

Shapiro-Wilk satisfied normality, and the runs test showed that our residuals are not independent.

Next ACF plot:

acf(PD.sm$residuals, main="ACF plot of seasonal means residuals")

tms1_analysis_files/figure-docx/unnamed-chunk-7-1.png

So, it is not independent.

PACF plot:

pacf(PD.sm$residuals, main="PACF plot of seasonal means residuals")

tms1_analysis_files/figure-docx/unnamed-chunk-8-1.png 

PACF says that model is AR(2) - can also consider AR(3), AR(4).

EACF plot:

eacf(PD.sm$residuals)

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x
## 1 x o o o o o o x o o o  o  o  o
## 2 x x o o o o o x o x o  o  o  o
## 3 x x x o o o o o o x o  o  o  o
## 4 x x x o o o o x o x o  o  o  o
## 5 x o x o o o o o o x o  o  o  o
## 6 x x o o o o o o o x o  o  o  o
## 7 x x x o x o o o o x o  o  o  o

EACF suggests ARMA(1,1).

Stationarity:

adf.test(PD.sm$residuals)

##
##  Augmented Dickey-Fuller Test
##
## data:  PD.sm$residuals
## Dickey-Fuller = -3.7983, Lag order = 6, p-value = 0.01946
## alternative hypothesis: stationary

pp.test(PD.sm$residuals)

## Warning in pp.test(PD.sm$residuals): p-value smaller than printed p-value

##
##  Phillips-Perron Unit Root Test
##
## data:  PD.sm$residuals
## Dickey-Fuller Z(alpha) = -130.13, Truncation lag parameter = 5,
## p-value = 0.01
## alternative hypothesis: stationary

Both the ADF test (p = 0.019) and the PP test (p = 0.01) support stationarity.

auto.arima() function:

library(forecast)

##
## Attaching package: 'forecast'

## The following objects are masked from 'package:TSA':
##
##     fitted.Arima, plot.Arima

## The following object is masked from 'package:nlme':
##
##     getResponse

auto.arima(PD.sm$residuals)

## Series: PD.sm$residuals
## ARIMA(1,0,1) with zero mean    
##
## Coefficients:
##          ar1      ma1
##       0.8685  -0.4457
## s.e.  0.0421   0.0803
##
## sigma^2 estimated as 0.0006341:  log likelihood=706.65
## AIC=-1407.31   AICc=-1407.23   BIC=-1396.08

auto.arima suggested ARIMA(1,0,1), which is the same model as the ARMA(1,1) suggested by the EACF.

Check AIC, AICC and BIC scores:

# Candidate models: AR(2) and ARMA(1,1); note ARIMA(1,0,1) is the same model as ARMA(1,1)
ar2=Arima(PD.sm$residuals,order=c(2,0,0),include.mean=F)
arma11=Arima(PD.sm$residuals,order=c(1,0,1),include.mean=F)
arima101=arma11   # identical specification, kept for comparison below
ar2

## Series: PD.sm$residuals
## ARIMA(2,0,0) with zero mean    
##
## Coefficients:
##          ar1     ar2
##       0.4754  0.2293
## s.e.  0.0553  0.0553
##
## sigma^2 estimated as 0.0006458:  log likelihood=703.83
## AIC=-1401.65   AICc=-1401.57   BIC=-1390.42

arma11

## Series: PD.sm$residuals
## ARIMA(1,0,1) with zero mean    
##
## Coefficients:
##          ar1      ma1
##       0.8685  -0.4457
## s.e.  0.0421   0.0803
##
## sigma^2 estimated as 0.0006341:  log likelihood=706.65
## AIC=-1407.31   AICc=-1407.23   BIC=-1396.08

arima101

## Series: PD.sm$residuals
## ARIMA(1,0,1) with zero mean    
##
## Coefficients:
##          ar1      ma1
##       0.8685  -0.4457
## s.e.  0.0421   0.0803
##
## sigma^2 estimated as 0.0006341:  log likelihood=706.65
## AIC=-1407.31   AICc=-1407.23   BIC=-1396.08

The best model is ARMA(1,1), i.e., ARIMA(1,0,1).

Model Diagnostics:

tsdiag(arma11)

tms1_analysis_files/figure-docx/unnamed-chunk-13-1.png

Forecasting:

newdata=data.frame(sp=as.factor(sp[1:36]))
PD.prd <- PD.sm$coefficients
predxreg=predict(PD.sm,newdata)
predx=predict(arma11,n.ahead=36)
pr=predx$pred+predxreg
uci=pr+2*predx$se
lci=pr-2*predx$se

pr=ts(pr,start=2014,freq=12)
uci=ts(uci,start=2014,freq=12)
lci=ts(lci,start=2014,freq=12)

ymin=min(c(as.vector(lci),PD.ts))-.1
ymax=max(c(as.vector(uci),PD.ts))+.1

plot(full1.ts,ylab="Percentage",xlim=c(1998,2016),ylim=c(ymin,ymax),main="Forecast of Percent Flights Delayed")
lines(pr,col=2)
lines(uci,col=3)
lines(lci,col=3)

tms1_analysis_files/figure-docx/unnamed-chunk-14-1.png

Part 2 --- Average Monthly Delay time of delayed flights.

TS2 in months

ts2 <- read.csv("tms2.csv")
ts2 <- ts2[,2:3]
full2 <- ts2
tst2 <- ts2[313:348,]
ts2 <- ts2[-(313:348),]

library(TSA)
library(xts)
AD.ts <- ts(ts2[,2], start = c(1988,1), end = c(2013,12), frequency = 12)
full2.ts <- ts(full2$avgdelay,  start = c(1988,1), frequency = 12)
plot(as.xts(AD.ts), main="Average Flight Delay (in minutes)", major.format = "%Y-%m", type = "o")

tms2_analysis_files/figure-docx/unnamed-chunk-2-1.png

LINEAR TREND:

AD.ln=lm(AD.ts~time(AD.ts))
summary(AD.ln)

##
## Call:
## lm(formula = AD.ts ~ time(AD.ts))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.0408 -2.7173 -0.2001  2.4522 11.4454
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.701e+03  5.728e+01  -29.69   <2e-16 ***
## time(AD.ts)  8.737e-01  2.863e-02   30.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.795 on 310 degrees of freedom
## Multiple R-squared:  0.7503, Adjusted R-squared:  0.7495
## F-statistic: 931.7 on 1 and 310 DF,  p-value: < 2.2e-16

plot(AD.ln)

tms2_analysis_files/figure-docx/unnamed-chunk-3-1.png
tms2_analysis_files/figure-docx/unnamed-chunk-3-2.png
tms2_analysis_files/figure-docx/unnamed-chunk-3-3.png
tms2_analysis_files/figure-docx/unnamed-chunk-3-4.png

ln.ts=ts(resid(AD.ln),start=c(1988,1),freq=12)
plot(ln.ts)
abline(h=0,col=2)

tms2_analysis_files/figure-docx/unnamed-chunk-3-5.png

SEASONAL MEANS:

# Seasonal Means model
sp = season(AD.ts)
AD.sm =  lm(AD.ts~sp)  
summary(AD.sm)

##
## Call:
## lm(formula = AD.ts ~ sp)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -16.2020  -6.6133   0.6631   5.8316  15.4784
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.9367     1.4492  33.079   <2e-16 ***
## spFebruary   -0.7920     2.0494  -0.386   0.6994    
## spMarch      -0.2152     2.0494  -0.105   0.9164    
## spApril      -1.3576     2.0494  -0.662   0.5082    
## spMay        -0.2594     2.0494  -0.127   0.8994    
## spJune        3.6049     2.0494   1.759   0.0796 .  
## spJuly        3.7126     2.0494   1.811   0.0711 .  
## spAugust      1.6295     2.0494   0.795   0.4272    
## spSeptember  -2.1650     2.0494  -1.056   0.2916    
## spOctober    -3.5712     2.0494  -1.743   0.0824 .  
## spNovember   -2.9101     2.0494  -1.420   0.1567    
## spDecember    0.3739     2.0494   0.182   0.8554    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.389 on 300 degrees of freedom
## Multiple R-squared:  0.08397,    Adjusted R-squared:  0.05038
## F-statistic:   2.5 on 11 and 300 DF,  p-value: 0.005098

sm.ts <- ts(AD.sm$residuals, start = c(1988,1), frequency = 12)
plot(sm.ts)
#plot(AD.sm$residuals)
abline(h=0,col=2)

tms2_analysis_files/figure-docx/unnamed-chunk-4-1.png

SEASONAL MEANS + LINEAR:

sp = season(AD.ts)
tm = time(AD.ts)
AD.smln =  lm(AD.ts~sp+tm)  
summary(AD.smln)

##
## Call:
## lm(formula = AD.ts ~ sp + tm)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1783 -2.0925 -0.1185  2.0488  8.3213
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1707.0889    46.8185 -36.462  < 2e-16 ***
## spFebruary     -0.8651     0.8598  -1.006 0.315141    
## spMarch        -0.3614     0.8598  -0.420 0.674546    
## spApril        -1.5769     0.8598  -1.834 0.067655 .  
## spMay          -0.5518     0.8599  -0.642 0.521539    
## spJune          3.2394     0.8599   3.767 0.000199 ***
## spJuly          3.2739     0.8599   3.807 0.000170 ***
## spAugust        1.1177     0.8599   1.300 0.194687    
## spSeptember    -2.7499     0.8600  -3.198 0.001534 **
## spOctober      -4.2292     0.8600  -4.918 1.45e-06 ***
## spNovember     -3.6412     0.8600  -4.234 3.06e-05 ***
## spDecember     -0.4303     0.8601  -0.500 0.617223    
## tm              0.8773     0.0234  37.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1 on 299 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8329
## F-statistic: 130.1 on 12 and 299 DF,  p-value: < 2.2e-16

plot(AD.smln)

tms2_analysis_files/figure-docx/unnamed-chunk-5-1.png
tms2_analysis_files/figure-docx/unnamed-chunk-5-2.png
tms2_analysis_files/figure-docx/unnamed-chunk-5-3.png
tms2_analysis_files/figure-docx/unnamed-chunk-5-4.png

smln.ts <- ts(AD.smln$residuals, start = c(1988,1), frequency = 12)
plot(smln.ts)
abline(h=0,col=2)

tms2_analysis_files/figure-docx/unnamed-chunk-5-5.png

We will use Shapiro Wilk test to check normality:

shapiro.test(AD.smln$residuals)

##
##  Shapiro-Wilk normality test
##
## data:  AD.smln$residuals
## W = 0.99763, p-value = 0.9359

runs(AD.smln$residuals)

## $pvalue
## [1] 1.9e-09
##
## $observed.runs
## [1] 104
##
## $expected.runs
## [1] 156.9423
##
## $n1
## [1] 159
##
## $n2
## [1] 153
##
## $k
## [1] 0

Box.test(AD.sm$residuals, lag = 1, type = "Ljung-Box")

##
##  Box-Ljung test
##
## data:  AD.sm$residuals
## X-squared = 261.75, df = 1, p-value < 2.2e-16

Shapiro-Wilk satisfied normality, and the runs test showed that our residuals are not independent.

Next ACF plot:

acf(AD.smln$residuals, main="ACF plot of seasonal means residuals")

tms2_analysis_files/figure-docx/unnamed-chunk-7-1.png

So, it is not independent.

PACF plot:

pacf(AD.smln$residuals, main="PACF plot of seasonal means residuals")

tms2_analysis_files/figure-docx/unnamed-chunk-8-1.png 

PACF suggests AR(2), though a few lags just cross the boundary line.

EACF plot:

eacf(AD.smln$residuals)

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x
## 1 x o o o o o o o o o o  o  o  o
## 2 x x o o o o o o o o o  o  o  o
## 3 x x o o o o o o o o o  o  o  o
## 4 x x o o o o o o o o o  o  o  o
## 5 x x o o o o o o o o o  o  o  o
## 6 x x o x o o o o o o o  o  o  o
## 7 x x x x x x o o o o o  o  o  o

EACF suggests ARMA(1,1).

Stationarity:

adf.test(AD.smln$residuals)

## Warning in adf.test(AD.smln$residuals): p-value smaller than printed p-
## value

##
##  Augmented Dickey-Fuller Test
##
## data:  AD.smln$residuals
## Dickey-Fuller = -4.1068, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

pp.test(AD.smln$residuals)

## Warning in pp.test(AD.smln$residuals): p-value smaller than printed p-value

##
##  Phillips-Perron Unit Root Test
##
## data:  AD.smln$residuals
## Dickey-Fuller Z(alpha) = -170.8, Truncation lag parameter = 5,
## p-value = 0.01
## alternative hypothesis: stationary

Both ADF and PP satisfy stationarity.

auto.arima() function:

library(forecast)

##
## Attaching package: 'forecast'

## The following objects are masked from 'package:TSA':
##
##     fitted.Arima, plot.Arima

## The following object is masked from 'package:nlme':
##
##     getResponse

auto.arima(AD.smln$residuals)

## Series: AD.smln$residuals
## ARIMA(1,0,1) with zero mean    
##
## Coefficients:
##          ar1      ma1
##       0.8414  -0.4781
## s.e.  0.0561   0.0955
##
## sigma^2 estimated as 6.425:  log likelihood=-732.12
## AIC=1470.24   AICc=1470.31   BIC=1481.47

auto.arima also suggests ARMA(1,1).

Check AIC, AICC and BIC scores:

# Candidate models could be AR(2) & ARMA(1,1)
ar2=Arima(AD.smln$residuals,order=c(2,0,0),include.mean=F)  #
arma11=Arima(AD.smln$residuals,order=c(1,0,1),include.mean=F)
ar2

## Series: AD.smln$residuals
## ARIMA(2,0,0) with zero mean    
##
## Coefficients:
##          ar1     ar2
##       0.3959  0.2292
## s.e.  0.0551  0.0553
##
## sigma^2 estimated as 6.493:  log likelihood=-733.74
## AIC=1473.47   AICc=1473.55   BIC=1484.7

arma11

## Series: AD.smln$residuals
## ARIMA(1,0,1) with zero mean    
##
## Coefficients:
##          ar1      ma1
##       0.8414  -0.4781
## s.e.  0.0561   0.0955
##
## sigma^2 estimated as 6.425:  log likelihood=-732.12
## AIC=1470.24   AICc=1470.31   BIC=1481.47

ARMA(1,1) is the best model by AIC and BIC scores.

Model Diagnostics:

tsdiag(arma11)

tms2_analysis_files/figure-docx/unnamed-chunk-13-1.png

Forecasting:

# let us do the forecasting for the next three years
newtm=seq(from=2014,to=2016.917,length=36)
newdata=data.frame(sp=as.factor(sp[1:36]),tm=newtm)
predxreg=predict(AD.smln,newdata)

predx=predict(arma11,n.ahead=36)
pr=predx$pred+predxreg
uci=pr+2*predx$se
lci=pr-2*predx$se

# Code the predicted values as time series to plot them as prediction intervals
pr=ts(pr,start=c(2014,1),end = c(2016,12),freq=12)
uci=ts(uci,start=c(2014,1),end = c(2016,12),freq=12)
lci=ts(lci,start=c(2014,1),end = c(2016,12),freq=12)

ymin=min(c(as.vector(lci),AD.ts))-.1
ymax=max(c(as.vector(uci),AD.ts))+.1

plot(full2.ts,ylab="Delay Time (in minutes)",xlim=c(1998,2016),ylim=c(ymin,ymax),main="Forecast of Average Delay Time of Flights")
lines(pr,col=2)
lines(uci,col=3)
lines(lci,col=3)

tms2_analysis_files/figure-docx/unnamed-chunk-14-1.png