- Presentation of the project
- Description of the time series and data preparation
- Stationarity analysis
- Box-Jenkins methodology to build a SARIMA process
- Prediction
- Conclusion
Edgar Jullien, Antoine Settelen, Simon Weiss
03/11/2020
Data : For this project, we used a dataset hosted by the city of Los Angeles.
The organization has an open data platform and they update their information according the amount of data that is brought in. We found the dataset trough an open dataset in Kaggle website. We download the data and upload it in our working github repository. The dataset has only 3 notebooks written in Python and only one notebook has implemented a prediction using SARIMA model.
Thus, it seemed relevant to us to write this script in R by applying the Box Jenkins method in order to make predictions on this dataset using the methods seen during the Time Series and Business Data course by Anne VANHEMS. If this script doesn’t contain a lot of errors, we hope to be able to publish it on the platform.
Lax Airport
Lax_init <- read.csv("data/los-angeles-international-airport-passenger-traffic-by-terminal.csv") Lax_init %>% head(n=3)
## DataExtractDate ReportPeriod Terminal ## 1 2014-05-01T00:00:00.000 2006-01-01T00:00:00.000 Imperial Terminal ## 2 2014-05-01T00:00:00.000 2006-01-01T00:00:00.000 Imperial Terminal ## 3 2014-05-01T00:00:00.000 2006-01-01T00:00:00.000 Misc. Terminal ## Arrival_Departure Domestic_International Passenger_Count ## 1 Arrival Domestic 490 ## 2 Departure Domestic 498 ## 3 Arrival Domestic 753
Lax_init %>% str
## 'data.frame': 5870 obs. of 6 variables: ## $ DataExtractDate : Factor w/ 60 levels "2014-05-01T00:00:00.000",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ ReportPeriod : Factor w/ 159 levels "2006-01-01T00:00:00.000",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ Terminal : Factor w/ 11 levels "Imperial Terminal",..: 1 1 2 2 3 3 3 4 4 4 ... ## $ Arrival_Departure : Factor w/ 2 levels "Arrival","Departure": 1 2 1 2 1 2 2 1 1 2 ... ## $ Domestic_International: Factor w/ 2 levels "Domestic","International": 1 1 1 1 1 1 2 1 2 1 ... ## $ Passenger_Count : int 490 498 753 688 401535 389745 561 98991 163067 93672 ...
The row data we have loaded contains 5870 observation for the 8 following features:
- DataExtractDate
- ReportPeriod
- Terminal
- Arrival_Departure - Domestic_International
- Passenger_Count - year and month
Lax_init %>% anyNA ## [1] FALSE
We don’t have NA value.
In order to use ts package in R, we have to do some Data preparation.
Since we want to predict the number of passenger with our model, we will only keep the feature Passenger Count and Report Period associated to each observation.
So first we convert the Report Perdiod feature into a Date type.
Lax_init[,2]<-Lax_init[,2] %>% as.Date()
Lax_init$year<-as.numeric(format(Lax_init[,2],'%Y')) Lax_init$months<-months(Lax_init[,2]) Lax_init %>% setDT
-We create a new Lax dataset with only the data we are interested in for the prediction.
Lax_data=Lax_init[, lapply(.SD, sum, na.rm=TRUE), by=list(year,months),.SDcols=c("Passenger_Count")]
sum(Lax_init[months=="janvier"&year==2006][,6])
## [1] 4756159
Lax_data[1]
## year months Passenger_Count ## 1: 2006 janvier 4756159
It is the same ! We can pass to load our data our ts package and do our first plot.
traf <- ts(Lax_data[,3], start=c(2006,1), frequency=12) str(traf)
## Time-Series [1:159, 1] from 2006 to 2019: 4756159 4250155 5090294 5087874 5177408 5510023 5919723 5745243 4707983 4898421 ... ## - attr(*, "dimnames")=List of 2 ## ..$ : NULL ## ..$ : chr "Passenger_Count"
From this first plot, we can first notice that this time series has for the most part of it a positive linear trend, a multiplicative seasonal patterns, and some irregular patterns. This plot suggests to use what we cal a SARIMA model to do our forecasting.
We find the same conclusion from visual observation of time series : i.e a mainly positive linar trend over time, a seasonal patterns with some irregular patterns in the random component.
We can notice that in average, passenger traffic tends to decrease between January and February then increases between march and august to decrease in the month of September and then remain stable between October and December.
This concludes our description part. Let’s move to stationary analysis and transformation.
- Blue : confidence interval. If the values were inside the interval, it would mean that there enough close to zeros to be non significant.
- From the ACF, we see that values are outside the confidence interval represented by the blue lines. It means that they are all different from zero and so significant, i.e they depend on time. We can have the same observation from pacf.
acf(ts(traf, frequency=1))
pacf(ts(traf, frequency=1))
We perform a logarithmic transformation to remove the cycle amplification. The range cycle has become stable
ltraf <- log(traf) plot.ts(ltraf)
There is still the presence of a trend and a seasonal component. This trend seems to be deterministic, linear and non-stochastic (but we will use a stochastic modelisation for the trend using integration within ARIMA)
The serie is always a non-stationary series. let’s confirm it with Autocorrelograms
acf(ts(ltraf, frequency=1))
pacf(ts(ltraf, frequency=1))
dltraf_1 <- diff(ltraf, 1) plot(dltraf_1)
The trend component seems to have disappeared, not the seasonality at t=12
acf(ts(dltraf_1, frequency=1))
A linear regression of degree 1 seems to be sufficient. However, the lm1 model cannot be used to make predictions. Because there is still self-correction in the residuals. Let’s refine the model by modeling the residuals.
We will use a stochastic modelisation for the trend using integration within SARIMA
pacf(ts(dltraf_1, frequency=1), lag.max = 60)
dltraf_12 <- diff(dltraf_1, 12) library(tseries) kpss.test(dltraf_12)
## Warning in kpss.test(dltraf_12): p-value greater than printed p-value
## ## KPSS Test for Level Stationarity ## ## data: dltraf_12 ## KPSS Level = 0.043809, Truncation lag parameter = 4, p-value = 0.1
stationnary, we accept H0 and we choose as seasonality s=12. Now we have found a series which is stationary. We can move the identification step
In this part, we will apply our first identification to build our SARIMA (p,d,q)(P,D,Q)[S] model
We will apply those rules :
Identifying the order of difference:
Identifying the seasonal part of the model:
plot(dltraf_12)
acf(ts(dltraf_12, frequency=1), lag.max = 40)
pacf(ts(dltraf_12, frequency=1), lag.max = 40)
First estimation :
q=1 #MA because for k>(q=1), acf(ts) = 0 (taking ac count of the seasonality)
p=2 #AR because for k>(q=2), pacf(ts) = 0 (taking ac count of the seasonality)
d=1 #we removed trend by taking the diff of order 1
Q=1 #MA because some significant coefficients in the ACF remain at order 12 (but not at order 24)
P=2 #AR because some significant coefficients in the PACF remain at order 12 & at order 24
D=1 #(seasonality diff) we removed seasonality diff by taking the diff or order 1
Now that we have identified our parameters, we can build our arima model. We use P=1 as we got an error with P=2 using method = ‘ML’.
mod1 <- arima(ltraf, c(2, 1, 1),seasonal = list(order = c(1, 1, 1), period = 12), method='ML') mod1
## ## Call: ## arima(x = ltraf, order = c(2, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12), ## method = "ML") ## ## Coefficients: ## ar1 ar2 ma1 sar1 sma1 ## -0.4109 -0.1961 -0.0842 0.0665 -0.8480 ## s.e. 0.2256 0.1185 0.2195 0.1139 0.1182 ## ## sigma^2 estimated as 0.0004553: log likelihood = 347.52, aic = -685.05
We have an Akaike Information Criteria (AIC) of -685. AIC is widely used measure of a statistical model. It basically quantifies 1) the goodness of fit, and 2) the simplicity/parsimony, of the model into a single statistic which will be compared with other AIC of different model (the lower the better).
Let’s fit our model with our data and plot it compared to real data (black line for real data and red line for model estimation)
This first plot is rather reassuring as for the good identification of our parameters. The model seems to follow our actual data well. However, we can observe that some parts of the model do not fit well data based on the real data for instance, at year 2009.
fit1 <- fitted(mod1) plot.ts(cbind(ltraf,fit1),plot.type='single',col=c('black','red'))
We should now pass to test the significance of our coefficient. First call our coef from our model
mod1$coef
## ar1 ar2 ma1 sar1 sma1 ## -0.41093841 -0.19609459 -0.08416079 0.06645508 -0.84798167
mod1$var.coef
## ar1 ar2 ma1 sar1 sma1 ## ar1 0.0509079363 0.0204434779 -0.0461009866 -0.0006709274 -0.0015930694 ## ar2 0.0204434779 0.0140409957 -0.0190395404 -0.0004260127 0.0002612597 ## ma1 -0.0461009866 -0.0190395404 0.0481805529 0.0003237873 0.0004914196 ## sar1 -0.0006709274 -0.0004260127 0.0003237873 0.0129827419 -0.0086908526 ## sma1 -0.0015930694 0.0002612597 0.0004914196 -0.0086908526 0.0139725898
Then compute our T-stat
tstat <- mod1$coef/sqrt(diag(mod1$var.coef)) tstat
## ar1 ar2 ma1 sar1 sma1 ## -1.8213105 -1.6548806 -0.3834192 0.5832369 -7.1737793
And then compute our p value.
pvalue <- 2*(1-pnorm(abs(tstat))) pvalue
## ar1 ar2 ma1 sar1 sma1 ## 6.855967e-02 9.794873e-02 7.014089e-01 5.597339e-01 7.296386e-13
sma1 is significant (pvalue<5) ar1, ar2, sar1, ma1 are not significant in this first model (pvalue>5%) We will remove each by each the parameters not significant and re-estimate the model without them.
First, we choose to remove ar1 and ar2 (q=0)
mod1 <- arima(ltraf, c(0, 1, 1),seasonal = list(order = c(1, 1, 1), period = 12), method='ML') mod1
## ## Call: ## arima(x = ltraf, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12), ## method = "ML") ## ## Coefficients: ## ma1 sar1 sma1 ## -0.4438 0.0604 -0.8524 ## s.e. 0.0647 0.1132 0.1184 ## ## sigma^2 estimated as 0.0004627: log likelihood = 346.17, aic = -686.34
We can notice that our AIC is slightly better than before (which was -685). That means that our model is fitting better with the data.
We check now our the p-value from this second model (pvalue2)
mod1$coef
## ma1 sar1 sma1 ## -0.4438192 0.0604057 -0.8524315
mod1$var.coef
## ma1 sar1 sma1 ## ma1 0.0041848527 -0.0001940421 -0.0004592256 ## sar1 -0.0001940421 0.0128138621 -0.0087010351 ## sma1 -0.0004592256 -0.0087010351 0.0140145734
tstat <- mod1$coef/sqrt(diag(mod1$var.coef)) pvalue2 <- 2*(1-pnorm(abs(tstat))) pvalue2
## ma1 sar1 sma1 ## 6.854295e-12 5.935996e-01 5.995204e-13
We have pvalue<5% for sma1 and ma1 We remove sar1 and re-train our model (p=0 and P=O)
mod1 <- arima(ltraf, c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12), method='ML') mod1
## ## Call: ## arima(x = ltraf, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), ## method = "ML") ## ## Coefficients: ## ma1 sma1 ## -0.4431 -0.8135 ## s.e. 0.0648 0.0831 ## ## sigma^2 estimated as 0.0004675: log likelihood = 346.03, aic = -688.05
Our AIC is better than before, meaning that our model fit better with our data.
We check p-value 3
mod1$coef
## ma1 sma1 ## -0.4431101 -0.8135224
mod1$var.coef
## ma1 sma1 ## ma1 0.0041980244 -0.0005626495 ## sma1 -0.0005626495 0.0069058423
tstat <- mod1$coef/sqrt(diag(mod1$var.coef)) pvalue3 <- 2*(1-pnorm(abs(tstat))) pvalue3
## ma1 sma1 ## 7.977841e-12 0.000000e+00
All our coefficient are significant here. So we have here SARIMA (0,1,1)(0,1,1)[12]
Test auto.arima
We check with auto arima our final coefficient. The auto.arima function returns best ARIMA model according to either AIC, AICc or BIC value (default value is AIC). The function conducts a search over possible model within the order constraints provided.
Conclusion from auto.arima : We have here the same model from auto.arima ! We can pass to the residual diagnostic.
auto.arima(ltraf,stationary =FALSE, seasonal = TRUE,trace=TRUE)
## ## Fitting models using approximations to speed things up... ## ## ARIMA(2,1,2)(1,1,1)[12] : -582.5078 ## ARIMA(0,1,0)(0,1,0)[12] : -501.5074 ## ARIMA(1,1,0)(1,1,0)[12] : -552.73 ## ARIMA(0,1,1)(0,1,1)[12] : -599.6856 ## ARIMA(0,1,1)(0,1,0)[12] : -536.6893 ## ARIMA(0,1,1)(1,1,1)[12] : -585.1416 ## ARIMA(0,1,1)(0,1,2)[12] : -597.6106 ## ARIMA(0,1,1)(1,1,0)[12] : -557.6347 ## ARIMA(0,1,1)(1,1,2)[12] : -583.9723 ## ARIMA(0,1,0)(0,1,1)[12] : -570.1869 ## ARIMA(1,1,1)(0,1,1)[12] : -597.46 ## ARIMA(0,1,2)(0,1,1)[12] : -599.3585 ## ARIMA(1,1,0)(0,1,1)[12] : -593.8621 ## ARIMA(1,1,2)(0,1,1)[12] : -598.282 ## ## Now re-fitting the best model(s) without approximations... ## ## ARIMA(0,1,1)(0,1,1)[12] : -685.8835 ## ## Best model: ARIMA(0,1,1)(0,1,1)[12]
## Series: ltraf ## ARIMA(0,1,1)(0,1,1)[12] ## ## Coefficients: ## ma1 sma1 ## -0.4431 -0.8135 ## s.e. 0.0648 0.0831 ## ## sigma^2 estimated as 0.0004954: log likelihood=346.03 ## AIC=-686.05 AICc=-685.88 BIC=-677.1
Difference between fit data and raw data : every time there is a pic, it means that the model has not fitted well the data.
res1<- mod1$residuals
What we expect is that residuals are close to 0 as possible We want to check the white noise assumption and the normality of the residuals (Gaussian WN)
plot(res1)
Autocorrelation of the residuals (White noise assumption). What we expect is that all the coefficients are none significant, that is the deal with white noise. All coefficient seem to be none significant
acf(ts(res1, frequency=1))
What we expect is that all the coefficients are non significant, that is the deal with white noise it is not perfect but what would be like an issue if the first or the second autocorellation coefficient would have been significant. Here, 1 significant coefficient at lag 4
Same results for pacf.
pacf(ts(res1, frequency=1),lag.max = 60)
Ljung-box test to check for the significance of ACF
Box.test(res1,lag=20,type="Ljung-Box")
## ## Box-Ljung test ## ## data: res1 ## X-squared = 17.835, df = 20, p-value = 0.5983
pvalue = 59% larger than 5% so we accept H0: white noise assumption.
Conclusion : we have a white noise.
We compute standardized residuals (residuals divided by the standard deviation)
res_stand <- res1/sqrt(mod1$sigma2) summary(res_stand)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -3.38079 -0.44527 0.05866 0.08838 0.60990 3.18516
if the normality assumption is satisfied, the standardized residuals should lie in between -2 and 2 (with 95% of chance)
It should be 95% of chance but here obviously there are some pics around 2007, 2009, 2011, 2014. We can identify outliers in the dataset it corresponds to the lowest and highest values of the residuals.
plot(res_stand) abline(a=2,b=0,col="red") abline(a=-2,b=0,col="red")
2 options with this kind of situation :
We will choose second option. But, before doing that, we need to check if it is necessary by testing the normality of the residuals.
Here, from the qq plot we can notice that our model data does not follow qqline which follows normal distribution. Let’s confirm this observation with Shapiro-wilk normality test.
qqnorm(res1) qqline(res1)
Shapiro-test
shapiro.test(res1)
## ## Shapiro-Wilk normality test ## ## data: res1 ## W = 0.96687, p-value = 0.0007304
jarque.bera.test(res1)
## ## Jarque Bera Test ## ## data: res1 ## X-squared = 27.518, df = 2, p-value = 1.058e-06
pvalue <<<< 5%, we reject normality (H0 : normal, H1 non normal). We need to add a dummy in the model (second option)
Constance variance with a plot if the variance is constant, the variance should not depend on the difference of the date (should not depend on t)
sq.res1 <- (res1)^2 acf(ts(sq.res1,frequency=1),lag.max = 60)
We have an issue : the first order coef of the ACF and the 22 are significant. So there is a link between the value of the residuals at date t, and the square value of the residuals at date 1 and date 22
=> It may be due to the outlier we identified
=> Or maybe we should add a model more complex.
We choose to assume that it may be due to the outlier we identified below.
library(TSA) Htest <- McLeod.Li.test(mod1, plot=FALSE) Htest
## $p.values ## [1] 0.02595353 0.08293222 0.12233251 0.20306255 0.11050012 0.16076819 ## [7] 0.15305892 0.17554340 0.22384896 0.28577922 0.27898877 0.35242940 ## [13] 0.42173875 0.46126810 0.53682497 0.56148771 0.62341534 0.62078869 ## [19] 0.54792599 0.51664544 0.46847274
The McLeod.Li test confirms our first conclusion.
So, we will now build a SARIMAX model which is a SARIMA model with “dummyfied” variables.
We can identify outliers in our dataset
- 2009 : 1 min , 2011 : 1 max , 2014 : 1 min and max
plot(res_stand) abline(a=2,b=0,col="red") abline(a=-2,b=0,col="red")
min(res_stand)
## [1] -3.380792
max(res_stand)
## [1] 3.185162
We decide to cut at [-3,3]
out <- which(res_stand < -3 | res_stand > 3) out
## [1] 35 65 100 101
#split all outliers into 4 groups to create 4 dummies out1 <- out[1] out1
## [1] 35
out2 <- out[2] out2
## [1] 65
out3 <- out[3] out3
## [1] 100
out4 <- out[4] out4
## [1] 101
It corresponds to the observations 35, 65, 100,11
With the zoo package we associate with this observation the date.
library(zoo)
## ## Attaching package: 'zoo'
## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric
index(res_stand)[out1] #date of the outliers novembre 2008, Mai 2011, Mars 2014, Avril 2014
## [1] 2008.833
res_stand[out1] # value of the outlier.
## [1] -3.344748
We split all outliers into 4 groups to create 4 dummies
#first dummy Lax_data$dum_1 <- 0 Lax_data$dum_1[out1] <- 1 length(Lax_data$dum_1)
## [1] 159
#second dummy Lax_data$dum_2 <- 0 Lax_data$dum_2[out2] <- 1 #third dummy Lax_data$dum_3 <- 0 Lax_data$dum_3[out3] <- 1 #fourth dummy Lax_data$dum_4 <- 0 Lax_data$dum_4[out4] <- 1
For notice, in order to incorporate multiple dummies variable, we had to create a cbind matrix into xreg parameters.
mod2 <- arima(ltraf, c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12), method='ML',xreg=cbind(unlist(Lax_data$dum_1), unlist(Lax_data$dum_2), unlist(Lax_data$dum_3), unlist(Lax_data$dum_4) )) ncol(cbind(unlist(Lax_data$dum_1), unlist(Lax_data$dum_2), unlist(Lax_data$dum_3), unlist(Lax_data$dum_4)))
## [1] 4
mod2
## ## Call: ## arima(x = ltraf, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), ## xreg = cbind(unlist(Lax_data$dum_1), unlist(Lax_data$dum_2), unlist(Lax_data$dum_3), ## unlist(Lax_data$dum_4)), method = "ML") ## ## Coefficients: ## ma1 sma1 xreg1 xreg2 xreg3 xreg4 ## -0.3622 -0.6969 -0.0318 0.0445 0.0643 -0.0465 ## s.e. 0.0716 0.0792 0.0148 0.0145 0.0154 0.0154 ## ## sigma^2 estimated as 0.0003598: log likelihood = 367.66, aic = -723.31
xreg : additional variable I want to take into account
By adding xreg, we will be able to control the misfit at the years written below. We have even gain small AIC !
Plot of the fitted value…From this plot, we can notice that our final model fits way better than better !
Let’s re-run our residuals analysis in this model.
fit2 <- fitted(mod2) plot.ts(cbind(ltraf,fit2),plot.type='single',col=c('black','red'))
What we expect is that residuals are close to 0 as possible
res2<- mod2$residuals plot(res2)
res_stand_2 <- res2/sqrt(mod2$sigma2) summary(res_stand_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.84628 -0.45063 0.05699 0.05435 0.57510 3.18416
plot(res_stand_2) abline(a=2,b=0,col="red") abline(a=-2,b=0,col="red")
qqnorm(res2) qqline(res2)
With our dummies, now the residuals follow a normal distribution. Is it always a white noise ?
shapiro.test(res2)
## ## Shapiro-Wilk normality test ## ## data: res2 ## W = 0.98972, p-value = 0.301
jarque.bera.test(res2)
## ## Jarque Bera Test ## ## data: res2 ## X-squared = 2.7449, df = 2, p-value = 0.2535
# Ljung-box test to check for the significance of ACF Box.test(res2,lag=20,type="Ljung-Box")
## ## Box-Ljung test ## ## data: res2 ## X-squared = 14.529, df = 20, p-value = 0.8027
#pvalue = 59% larger than 5% so we accept H0: white noise assumption
Conclusion => We have a Gaussian white noise now.
acf(ts(res2, frequency=1))
No, it seems that there is no more significant coefficients which would say that we have homoscedaticity (don’t need to use garch model)
pacf(ts(res2, frequency=1),lag.max = 60)
Htest <- McLeod.Li.test(mod2, plot=FALSE) Htest
## $p.values ## [1] 0.5394025 0.2348160 0.3195137 0.4196310 0.2527477 0.3339364 0.1219653 ## [8] 0.1173913 0.1656550 0.1694314 0.1164518 0.1479106 0.1682702 0.2058382 ## [15] 0.1457133 0.1678925 0.2017905 0.1765099 0.1946852 0.2091475 0.2556971
H0 is always accepted : homoscedaticity here !
Now that we have a model built, we want to use it to make forecasts. But first we should validate the model with an in-sample analysis.
First, let’s assess the quality of the fit
cb80 <- mod2$sigma2^.5*qnorm(0.9) plot(cbind(ltraf,fit2-cb80,fit2+cb80),plot.type='single',lty=c(1,2,2))
and the Proportion of points in the confidence bound
indi <- (ltraf-(fit2-cb80))>0&(fit2+cb80-ltraf)>0 prop <- 100*sum(indi)/length(indi) prop
## [1] 81.13208
Here, the proportion is larger than 80%, then the fit is considered good.
Let’s confirm our model with an out-of-sample analysis.
In order to do that, we will split our Lax data into two samples : one train dataset and a test dataset.
We will re-run our model on train datasets, forecast results and use accuracy analysis which will compare forecast results and test dataset in order to identify the best accuracy of model. We will use the forecast library to forecast models.
library(forecast)
Split train and test lax dataset.
train.traf <- window(traf,start=2006,end=2017) test.traf <- window(traf,start=2018,end=2019)
Log transformation
train.ltraf<-train.traf %>% log test.ltraf<-test.traf %>% log
Run our different models on train.ltraf
train.mod1 <- forecast::Arima(train.ltraf, c(2, 1, 1),seasonal = list(order = c(1, 1, 1), period = 12), method='ML') train.mod2 <- forecast::Arima(train.ltraf, c(0, 1, 1),seasonal = list(order = c(1, 1, 1), period = 12), method='ML') train.mod3 <- forecast::Arima(train.ltraf, c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12), method='ML')
Since we are only interresed here in accuracy, we will not use SARIMAX and dummy variable. But we keep in mind that SARIMAX had better AIC.
Let’s plot the fitted value compared to raw data…
plot(train.ltraf) lines(fitted(train.mod1), col='red')
…and do our first forecast with train dataset.
We can now plot accuracy of our mod1 by comparing forecast results and test dataset.
tsforecasts1 <- forecast(train.mod1)
a1<-accuracy(tsforecasts1,test.ltraf) a1
## ME RMSE MAE MPE MAPE ## Training set 0.002665415 0.02182828 0.01564985 0.01695709 0.10103583 ## Test set -0.012314044 0.01864545 0.01443758 -0.07782445 0.09129407 ## MASE ACF1 Theil's U ## Training set 0.3062591 -0.01690299 NA ## Test set 0.2825356 0.41918976 0.2209551
Mod2
fitted value
plot(train.ltraf) lines(fitted(train.mod2), col='red')
Forecast and accuracy
tsforecasts2 <- forecast(train.mod2) a2<-accuracy(tsforecasts2,test.ltraf) a2
## ME RMSE MAE MPE MAPE MASE ## Training set 0.00272312 0.02216916 0.01580817 0.01732124 0.10208006 0.3093574 ## Test set -0.01358437 0.01966256 0.01523886 -0.08586378 0.09635771 0.2982164 ## ACF1 Theil's U ## Training set -0.07248321 NA ## Test set 0.42861768 0.2328797
We can notice that mod2 has poorer performance metric as it gained 0,01 point from previous model.
Mod3 fitted value
plot(train.ltraf) lines(fitted(train.mod3), col='red')
tsforecasts3 <- forecast(train.mod3) a3<-accuracy(tsforecasts3,test.ltraf) a3
## ME RMSE MAE MPE MAPE MASE ## Training set 0.002738695 0.02224358 0.01585625 0.01742216 0.1023913 0.3102983 ## Test set -0.017114488 0.02263862 0.01766283 -0.10820970 0.1116790 0.3456521 ## ACF1 Theil's U ## Training set -0.07211241 NA ## Test set 0.45917165 0.2678647
Let’s use performance metrics Root Mean Squared Error and Mean Absolute Error. The closer to Zero those 2 metrics are, the better. We can notice that first, all our model have good performance metrics, they are enough close to zero to best called as good model.
What we can notice here is that we have an opposition between AIC conclusion on performance of model and performance metrics conclusion.
Indeed, we can notice that our RMSE and MAE are slightly increasing as we passed to mod1 to mod3, i.e, mod3 are doing less good than mod1. The explanation of this fact remains open.
Nonetheless, we choose to keep SARIMAX mod4 as our best model by choosing AIC our unique information criteria.
Let’s finally pass to our prediction part.
First, we need to supply new values of our regression. Newxreg should indeed be a matrix that contains values of the regressors we want to predict.
trick <- matrix(rep(0, 12), ncol=4) length(trick)
## [1] 12
mod2
## ## Call: ## arima(x = ltraf, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12), ## xreg = cbind(unlist(Lax_data$dum_1), unlist(Lax_data$dum_2), unlist(Lax_data$dum_3), ## unlist(Lax_data$dum_4)), method = "ML") ## ## Coefficients: ## ma1 sma1 xreg1 xreg2 xreg3 xreg4 ## -0.3622 -0.6969 -0.0318 0.0445 0.0643 -0.0465 ## s.e. 0.0716 0.0792 0.0148 0.0145 0.0154 0.0154 ## ## sigma^2 estimated as 0.0003598: log likelihood = 367.66, aic = -723.31
pred <- predict(mod2, n.ahead = 12, newxreg = trick)
Let’s plot our forecast.
ts.plot(traf,2.718^pred$pred, log = "y", lty = c(1,3))
We plot just from 2010.
ts.plot(traf,xlim=c(2010,2020.500)) lines(2.718^(pred$pred), col="red") lines(2.718^(pred$pred-1.96*pred$se),col=4,lty=2) lines(2.718^(pred$pred+1.96*pred$se),col=4,lty=2)
Using data for LAX passengers we built a SARIMA and SARIMAX model to predict number of passengers one year ahead.
We saw that the model suits the testing data pretty well. To see how well our model does, it will be fascinating to track the real number of passengers for the next few months.
Nonetheless, the model doesn’t take into account external events such as Covid-19 outbreak which dramatically changed the airports traffic.