An AR(p) model is a regression on its own lags and is estimated by OLS, provided the errors are white noise and the process is stationary. MA and ARMA models cannot use OLS directly because the lagged shocks are unobserved, so they are estimated iteratively. The order is chosen with information criteria — AIC and BIC — which add a penalty for the number of parameters; BIC penalises more heavily and so tends to choose smaller models.
How to read these notes
These notes follow the estimation and model-selection part of a time-series course. They assume you know what AR, MA and ARMA models are and how to read a correlogram. The correlogram suggests which models are plausible; these notes are about turning that into estimated coefficients and a final choice of order.
1. Estimating an AR(p) by OLS
The key observation is that an autoregressive model is just a linear regression in disguise. The AR(p) is
\[ Y_t = \alpha + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \varepsilon_t . \]
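The dependent variable is \(Y_t\) and the regressors are a constant and the lagged values \(Y_{t-1}, \dots, Y_{t-p}\). These lags are all observed, so we can run an ordinary OLS regression to estimate \(\alpha, \phi_1, \dots, \phi_p\). For the AR(1) the slope estimate is the familiar formula
\[ \hat\phi_1 = \frac{\sum_{t=2}^{T} (Y_{t-1} - \bar Y_{-1})(Y_t - \bar Y)}{\sum_{t=2}^{T} (Y_{t-1} - \bar Y_{-1})^2}, \]
where \(\bar Y\) and \(\bar Y_{-1}\) are the sample means of \(Y_t\) and \(Y_{t-1}\) over \(t = 2, \dots, T\).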
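As a minimal sketch of this regression, the Python below simulates an AR(1) and recovers the slope as the sample covariance of \(Y_t\) and \(Y_{t-1}\) divided by the sample variance of \(Y_{t-1}\). The function names, the simulated data and the parameter values (\(\alpha = 1\), \(\phi_1 = 0.6\)) are illustrative assumptions, not part of the course material.

```python
import random


def simulate_ar1(alpha, phi, n, seed=0, burn_in=200):
    """Simulate Y_t = alpha + phi * Y_{t-1} + eps_t with standard normal shocks."""
    rng = random.Random(seed)
    y = alpha / (1.0 - phi)  # start at the stationary mean
    path = []
    for t in range(n + burn_in):
        y = alpha + phi * y + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the burn-in draws
            path.append(y)
    return path


def ols_ar1(y):
    """Regress Y_t on a constant and Y_{t-1}; return (alpha_hat, phi_hat)."""
    x = y[:-1]  # regressor: Y_{t-1}
    z = y[1:]   # dependent variable: Y_t
    x_bar = sum(x) / len(x)
    z_bar = sum(z) / len(z)
    s_xz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    phi_hat = s_xz / s_xx
    alpha_hat = z_bar - phi_hat * x_bar
    return alpha_hat, phi_hat


# Hypothetical example: 2000 observations from an AR(1) with phi = 0.6
series = simulate_ar1(alpha=1.0, phi=0.6, n=2000, seed=42)
alpha_hat, phi_hat = ols_ar1(series)
print(f"alpha_hat = {alpha_hat:.3f}, phi_hat = {phi_hat:.3f}")
```

With a long simulated series the estimates should land close to the values used to generate the data, which is the consistency property in action. One observation is lost in forming the lag, so the regression runs on \(T-1\) points.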
For OLS on an AR model to be consistent, with the usual large-sample inference, we need (i) the errors to be serially uncorrelated: if they were correlated with their own past, they would also be correlated with the lagged dependent variable and exogeneity would fail; and (ii) the process to be stationary (\(|\phi_1|<1\) in the AR(1) case). With a unit root the variance of \(Y_t\) grows without bound over time, so there is no finite unconditional variance and the usual OLS distribution theory breaks down.
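A short worked comparison may help show why the unit-root case is different. It assumes an AR(1) without intercept, a fixed starting value \(Y_0\) and shocks with constant variance \(\sigma^2\); these simplifications are for illustration only.
\[ |\phi_1|<1:\quad \operatorname{Var}(Y_t) = \frac{\sigma^2}{1-\phi_1^2}, \qquad\qquad \phi_1 = 1:\quad Y_t = Y_0 + \sum_{s=1}^{t}\varepsilon_s \;\Rightarrow\; \operatorname{Var}(Y_t) = t\,\sigma^2 . \]
In the stationary case the variance is a fixed number. With a unit root it grows linearly in \(t\), so no finite long-run variance exists for the standard results to rely on.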
One practical point: forming \(p\) lags costs the first \(p\) observations, so an AR(p) is estimated on \(T-p\) observations. When comparing models of different orders for the same series, estimate them all over the same sample period so the comparison is fair.
2. Why MA models need iteration
Moving-average models cannot be estimated by OLS, and the reason is instructive. Consider the MA(1):
\[ Y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} . \]
The "regressor" here is \(\varepsilon_{t-1}\) — but unlike a lagged value of \(Y\), the lagged shock is not observed, so it cannot be plugged into a regression. The standard work-around assumes \(\varepsilon_0 = 0\) and rewrites the model as \(\varepsilon_t = Y_t - \theta_1 \varepsilon_{t-1}\). Starting from a guess for \(\theta_1\), you recursively build a series of pseudo-residuals, use them as a proxy for the unobserved shocks in an artificial regression, obtain an improved estimate, and repeat.
The procedure cycles between (a) constructing a residual series from the current coefficient estimate and (b) re-estimating the coefficient using those residuals, stopping when two consecutive iterations give essentially the same value. The same idea extends to any MA(q) and to full ARMA models. Modern software does this automatically.
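The cycle of steps (a) and (b) can be sketched as a simple fixed-point iteration for a zero-mean MA(1). This is only one plain reading of the procedure: the artificial regression here is a no-intercept regression of \(Y_t\) on the lagged pseudo-residual, and the function names, starting value, tolerance and simulated data are all assumptions made for illustration. Statistical packages typically use more refined optimisers, so their estimates may differ slightly.

```python
import random


def simulate_ma1(theta, n, seed=0):
    """Simulate Y_t = eps_t + theta * eps_{t-1} with standard normal shocks."""
    rng = random.Random(seed)
    eps_prev = rng.gauss(0.0, 1.0)
    y = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        y.append(eps + theta * eps_prev)
        eps_prev = eps
    return y


def pseudo_residuals(y, theta):
    """Step (a): build e_t = Y_t - theta * e_{t-1}, starting from e_0 = 0."""
    e_prev = 0.0
    resid = []
    for y_t in y:
        e_t = y_t - theta * e_prev
        resid.append(e_t)
        e_prev = e_t
    return resid


def estimate_ma1(y, theta_start=0.0, tol=1e-8, max_iter=200):
    """Alternate steps (a) and (b) until theta stops changing."""
    theta = theta_start
    for iteration in range(1, max_iter + 1):
        e = pseudo_residuals(y, theta)
        # Step (b): no-intercept regression of Y_t on the proxy e_{t-1}
        num = sum(y[t] * e[t - 1] for t in range(1, len(y)))
        den = sum(e[t - 1] ** 2 for t in range(1, len(y)))
        theta_new = num / den
        if abs(theta_new - theta) < tol:
            return theta_new, iteration
        theta = theta_new
    raise RuntimeError("iteration did not converge")


# Hypothetical example: 2000 observations from an MA(1) with theta = 0.5
series = simulate_ma1(theta=0.5, n=2000, seed=3)
theta_hat, n_iter = estimate_ma1(series)
print(f"theta_hat = {theta_hat:.3f} after {n_iter} iterations")
```

Starting from \(\theta_1 = 0\) the first pass simply regresses \(Y_t\) on \(Y_{t-1}\); later passes replace that crude proxy with residuals built from the improved estimate.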
Despite needing iteration to estimate, MA and ARMA models are tested much like AR models: the estimators are asymptotically normal, so a single coefficient is tested with its t-ratio and joint restrictions with an F (or Wald) test, with the usual critical values justified in large samples.
3. The model-selection problem
The estimation theory assumes you already know the correct orders \(p\) and \(q\). In reality you do not; you have to infer them from the data. The correlogram gives a first indication (an ACF that dies away gradually rather than cutting off hints at an AR component, for instance), but it rarely pins down the order precisely. The modern approach uses information criteria.
You cannot just pick the model with the smallest residual variance or the highest R-squared: adding lags always improves in-sample fit, even when the extra coefficients are truly zero. A good criterion must charge for that extra complexity.
4. AIC and BIC
The two criteria in widespread use are the Akaike Information Criterion (AIC) and the Schwarz (Bayesian) Information Criterion (SIC or BIC). Write \(\hat\sigma^2 = \text{RSS}/T^{*}\) for the estimated error variance, where \(T^{*}\) is the number of observations used, and count \(p+q+1\) parameters including the intercept. Then
\[ \text{AIC} = \log\hat\sigma^2 + \frac{2\,(p+q+1)}{T^{*}}, \qquad \text{BIC} = \log\hat\sigma^2 + \frac{(p+q+1)\,\log T^{*}}{T^{*}} . \]
Each is the sum of two parts. The first term, \(\log\hat\sigma^2\), rewards goodness of fit — it falls as the model fits better. The second term is a penalty that rises with the number of parameters. You compute the criterion across a grid of candidate orders and choose the \((p,q)\) that minimises it.
Both criteria balance fit against parsimony. Adding a lag lowers \(\hat\sigma^2\) (better fit) but raises the penalty (more parameters). The criterion only falls if the improvement in fit outweighs the extra cost — so it avoids rewarding lags that merely chase noise.
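The trade-off can be made concrete with a few lines of Python. The sketch assumes the per-observation forms \(\log\hat\sigma^2 + 2k/T^{*}\) for AIC and \(\log\hat\sigma^2 + k\log T^{*}/T^{*}\) for BIC, with \(k = p+q+1\); software packages often report rescaled versions, so the levels may differ even when the ranking agrees. The RSS figures are hypothetical numbers chosen to show the two criteria disagreeing.

```python
import math


def aic(rss, n_obs, n_params):
    """Fit term log(RSS / T*) plus a penalty of 2 per parameter, scaled by T*."""
    sigma2_hat = rss / n_obs
    return math.log(sigma2_hat) + 2.0 * n_params / n_obs


def bic(rss, n_obs, n_params):
    """Same fit term, but each parameter costs log(T*) instead of 2."""
    sigma2_hat = rss / n_obs
    return math.log(sigma2_hat) + n_params * math.log(n_obs) / n_obs


# Hypothetical numbers: one extra lag lowers RSS from 100 to 97 on 100 observations
small = {"rss": 100.0, "n_obs": 100, "n_params": 2}
large = {"rss": 97.0, "n_obs": 100, "n_params": 3}

for name, model in (("small", small), ("large", large)):
    print(f"{name}: AIC = {aic(**model):.4f}, BIC = {bic(**model):.4f}")
```

Here the 3% fall in RSS is enough to offset the AIC penalty but not the heavier BIC penalty, so AIC prefers the larger model and BIC the smaller one.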
5. AIC versus BIC: which to trust
The only difference between the two formulas is the weight on each parameter: \(2\) for AIC versus \(\log T^{*}\) for BIC. Since \(\log T^{*} > 2\) whenever \(T^{*} \ge 8\) (because \(e^2 \approx 7.39\)), BIC penalises extra parameters more heavily in any realistic sample. The consequence, for AR model order, is
\[ \hat p_{\text{BIC}} \le \hat p_{\text{AIC}}, \]
so AIC selects an order at least as large as BIC's, and often larger. The criteria need not agree.
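A grid search over AR orders shows the comparison in practice. The sketch below fits AR(0) to AR(6) by OLS on a common sample (every model starts at observation `max_p`, so all use the same \(T^{*}\)) and reports the order each criterion picks. It assumes the per-observation forms \(\log\hat\sigma^2 + 2k/T^{*}\) and \(\log\hat\sigma^2 + k\log T^{*}/T^{*}\); the helper names, the hand-written linear solver and the simulated AR(1) data are illustrative assumptions.

```python
import math
import random


def simulate_ar1(phi, n, seed=0, burn_in=200):
    """Simulate a zero-mean AR(1) with standard normal shocks."""
    rng = random.Random(seed)
    y, path = 0.0, []
    for t in range(n + burn_in):
        y = phi * y + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            path.append(y)
    return path


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ar_rss(y, p, start):
    """RSS from OLS of Y_t on a constant and p lags, for t = start, ..., len(y) - 1."""
    rows = [[1.0] + [y[t - j] for j in range(1, p + 1)] for t in range(start, len(y))]
    target = y[start:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * z for r, z in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((z - sum(b * v for b, v in zip(beta, r))) ** 2 for r, z in zip(rows, target))


def select_order(y, max_p):
    """Return (p_aic, p_bic, table) with table[p] = (AIC, BIC) on a common sample."""
    n_obs = len(y) - max_p  # same observations for every candidate order
    table = {}
    for p in range(max_p + 1):
        k = p + 1  # p lag coefficients plus the intercept
        log_s2 = math.log(ar_rss(y, p, start=max_p) / n_obs)
        table[p] = (log_s2 + 2.0 * k / n_obs, log_s2 + k * math.log(n_obs) / n_obs)
    p_aic = min(table, key=lambda p: table[p][0])
    p_bic = min(table, key=lambda p: table[p][1])
    return p_aic, p_bic, table


# Hypothetical example: 500 observations from an AR(1) with phi = 0.6
series = simulate_ar1(phi=0.6, n=500, seed=1)
p_aic, p_bic, table = select_order(series, max_p=6)
print(f"AIC picks p = {p_aic}, BIC picks p = {p_bic}")
```

Because the candidate models are nested and share one sample, the order chosen by BIC cannot exceed the order chosen by AIC whenever \(\log T^{*} > 2\).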
If the true process is an AR of order \(p_0\) and \(p_0\) is among the candidate orders, then BIC is consistent: the probability that it selects \(p_0\) tends to one as the sample grows. AIC, by contrast, is not consistent: even asymptotically it chooses too many lags with positive probability, although it does not choose too few. For this reason some practitioners prefer BIC.
That said, in finite samples both criteria work well, and the asymptotic over-fitting of AIC matters less when data are limited. The sensible advice is to combine the criteria with hypothesis tests on individual coefficients and, crucially, with a serial-correlation test on the residuals.
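The notes do not name a particular serial-correlation test, so the sketch below uses the Ljung-Box Q statistic as one common choice; treat that choice, the function names and the simulated series as assumptions. It compares Q for a white-noise series with Q for a series that still carries AR(1) dependence, as the residuals of an under-specified model would.

```python
import random


def autocorrelations(x, max_lag):
    """Sample autocorrelations at lags 1, ..., max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def ljung_box_q(x, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k rho_k^2 / (n - k)."""
    n = len(x)
    rho = autocorrelations(x, max_lag)
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(rho, start=1))


rng = random.Random(7)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Stand-in for "bad" residuals: the same shocks with AR(1) dependence left in
correlated, prev = [], 0.0
for shock in white_noise:
    prev = 0.6 * prev + shock
    correlated.append(prev)

q_clean = ljung_box_q(white_noise, max_lag=5)
q_bad = ljung_box_q(correlated, max_lag=5)

# 95% critical value of a chi-squared distribution with 5 degrees of freedom
CHI2_95_DF5 = 11.07
print(f"Q (white noise) = {q_clean:.2f}, Q (correlated) = {q_bad:.2f}")
print("reject no-serial-correlation for correlated series:", q_bad > CHI2_95_DF5)
```

When the statistic is applied to the residuals of a fitted ARMA(p,q), the degrees of freedom are usually reduced to the number of lags minus \(p+q\); the raw series above stand in for residuals, so 5 degrees of freedom are used.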
6. Putting it together
A complete model-selection routine for a single series looks like this:
- Inspect the correlogram to narrow down plausible orders.
- Estimate a small range of candidate ARMA(p,q) models over a common sample.
- Compare AIC and BIC across them, noting that AIC may favour a larger model.
- Check that coefficients are significant and — decisively — that the residuals show no serial correlation.
For quarterly UK GDP growth, for example, the declining correlogram points to an AR(1); estimating it gives a significant AR coefficient, both AIC and BIC favour it over a competing MA(2), and its residuals pass the serial-correlation test while the MA(2)'s do not. A model is only adequate when it survives all of these checks together, not on the strength of one number alone.