What is an instrumental variable?

An instrumental variable z is a variable that is uncorrelated with the regression error u (exogeneity) but correlated with the endogenous regressor x (relevance or identification). These two conditions let us recover a consistent estimate of the slope even when OLS is biased by endogeneity.

Why is OLS inconsistent under endogeneity?

OLS converges in probability to the true coefficient plus a bias term equal to E[x x']^{-1} E[x u]. If the regressor is correlated with the error, E[x u] is non-zero, so this bias term does not vanish even as the sample size grows. The estimator is therefore inconsistent.

What is a weak instrument?

A weak instrument is one that is only slightly correlated with the endogenous regressor. The identification condition technically holds, but the correlation is so small that the IV/2SLS estimator has a very large variance and its sampling distribution is poorly approximated by a normal distribution in finite samples. A common rule of thumb is that the first-stage F-statistic should exceed 10.

Instrumental Variables & Weak Instruments

When a regressor is correlated with the error term, OLS is biased and inconsistent. Instrumental variables and two-stage least squares give a consistent alternative, but only when the instrument is exogenous and sufficiently correlated with the endogenous regressor; a weak instrument produces a high-variance, badly behaved estimator.

How to read these notes

These notes follow the structure of the lecture material on endogeneity and instrumental variables. They begin from the failure of ordinary least squares under endogeneity, build up the instrumental variables (IV) and two-stage least squares (2SLS) estimators, and finish with the weak instrument problem and the classic quarter-of-birth example. The earlier sections assume only basic regression and probability; the later sections are pitched at advanced undergraduate and master's level.

1. The problem: endogeneity and the failure of OLS

Consider the linear model written in the usual notation,

\[ y_i = x_i'\beta_0 + u_i \]

OLS relies on the exogeneity assumption that the regressors are uncorrelated with the error, \(E[x_i u_i]=0\). When this fails we say there is endogeneity, \(E[x_i u_i]\neq 0\). The consequence is immediate once we look at where OLS converges:

\[ \hat\beta_{OLS} \to^p \beta_0 + E[x_i x_i']^{-1} E[x_i u_i] \]

Core idea

Under exogeneity the bias term \(E[x_ix_i']^{-1}E[x_iu_i]\) is zero, so OLS is consistent. Under endogeneity it is non-zero and does not disappear as the sample grows. OLS then estimates some pseudo-true value \(\beta_*\neq\beta_0\) rather than the parameter of interest.

The lecture notes group the common causes of endogeneity into three recurring cases:

Cause	What happens
Omitted variable bias	A variable that belongs in the model is left out and is correlated with an included regressor, so its effect is absorbed into the error.
Reverse causality	The dependent variable also affects the regressor, so cause and effect run both ways.
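Measurement error	The regressor is observed with noise. Under classical measurement error the noise is uncorrelated with the true value, but it enters the regression error and is correlated with the observed regressor, pulling the estimate toward zero (attenuation bias).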
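
The measurement error row can be illustrated with a small simulation. This is only a sketch: it assumes classical measurement error (noise independent of the true regressor), and the slope, variances, seed and variable names are all hypothetical choices made for illustration. Under that assumption the OLS slope should shrink toward zero by the factor \(\mathrm{Var}(x^*)/(\mathrm{Var}(x^*)+\mathrm{Var}(\text{noise}))\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta0 = 2.0                                  # hypothetical true slope

x_true = rng.normal(scale=1.0, size=n)       # unobserved true regressor x*
noise = rng.normal(scale=1.0, size=n)        # classical measurement error (assumed)
x_obs = x_true + noise                       # the regressor we actually observe
y = beta0 * x_true + rng.normal(size=n)

# OLS slope from regressing y on the mismeasured regressor (with intercept)
b_ols = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Attenuation factor Var(x*) / (Var(x*) + Var(noise)) = 1 / (1 + 1)
attenuation = 1.0 / (1.0 + 1.0)

print("OLS slope on noisy regressor:", b_ols)
print("beta0 times attenuation     :", beta0 * attenuation)
```

With equal variances for the true regressor and the noise, the estimated slope settles near half of the true slope, however large the sample.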

The canonical example in the notes is the returns to schooling regression of wages on years of education:

\[ w_i = \alpha_0 + \beta_0 s_i + u_i \]

Natural ability raises wages but is not measured, so it sits inside the error: \(u_i=\delta_0 a_i+v_i\). Because ability is also positively correlated with schooling, the OLS bias \(\delta_0\,\mathrm{Cov}(s_i,a_i)/\mathrm{Var}(s_i)\) is positive, and OLS overestimates the true return to education.
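
To make the ability-bias formula concrete, here is a minimal simulation sketch. All numbers (the return \(\beta_0=0.08\), \(\delta_0=0.5\), the way schooling depends on ability, the seed) are hypothetical and not taken from the notes; the helper `ols_slope` is also just an illustrative name. The point is only that the gap between the OLS slope and \(\beta_0\) matches \(\delta_0\,\mathrm{Cov}(s_i,a_i)/\mathrm{Var}(s_i)\).

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(0)
n = 200_000
alpha0, beta0, delta0 = 1.0, 0.08, 0.5       # hypothetical parameter values

ability = rng.normal(size=n)                 # unobserved by the econometrician
schooling = 12.0 + 1.5 * ability + rng.normal(size=n)
wage = alpha0 + beta0 * schooling + delta0 * ability + rng.normal(scale=0.5, size=n)

b_ols = ols_slope(schooling, wage)           # ability is omitted from the regression
bias_formula = delta0 * np.cov(schooling, ability)[0, 1] / np.var(schooling, ddof=1)

print("OLS slope         :", b_ols)
print("true return beta0 :", beta0)
print("OLS minus beta0   :", b_ols - beta0)
print("delta0*Cov/Var    :", bias_formula)
```

In this design the population bias is \(0.5\times 1.5/3.25\approx 0.23\), so OLS overstates the return by a wide margin, and increasing `n` does not help.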

2. The instrumental variables estimator

An instrumental variable is a way of isolating variation in the endogenous regressor that is unrelated to the error. We look for a variable \(z_i\) that satisfies two conditions:

The two IV conditions

IV1 — Exogeneity: \(E[z_i u_i]=0\). The instrument is uncorrelated with the error.
IV2 — Identification (relevance): \(E[x_i z_i']\) is full rank. The instrument is correlated with the endogenous regressor.

Together these two conditions imply the population relationship \(\beta_0 = E[z_i x_i']^{-1}E[z_i y_i]\). Replacing the population moments with their sample averages gives the IV estimator,

\[ \hat\beta_{IV} = \Big( \sum_{i=1}^{N} z_i x_i' \Big)^{-1} \sum_{i=1}^{N} z_i y_i \]

In the simplest case of one endogenous regressor and one instrument, the two conditions reduce to clean covariance statements: exogeneity becomes \(\mathrm{Cov}(z_{1i},u_i)=0\) and identification becomes \(\mathrm{Cov}(z_{1i},x_{1i})\neq 0\). The instrument must move with the regressor but not with the error.

The estimator is consistent under these assumptions. Writing it as the truth plus a noise term,

\[ \hat\beta_{IV} = \beta_0 + \Big( N^{-1} \sum_{i=1}^{N} z_i x_i' \Big)^{-1} \Big( N^{-1} \sum_{i=1}^{N} z_i u_i \Big) \]

By the law of large numbers the first bracket converges to \(E[z_ix_i']^{-1}\) (which exists precisely because IV2 holds), while the second converges to \(E[z_iu_i]=0\) by IV1. So \(\hat\beta_{IV}\to^p\beta_0\), and under further regularity conditions \(\sqrt{N}(\hat\beta_{IV}-\beta_0)\) is asymptotically normal.
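
The consistency argument can be checked numerically. The sketch below is an illustration under assumed values (first-stage coefficient 0.8, endogeneity loading 0.6, \(\beta_0=1\), fixed seed), and `iv_estimator` is a hypothetical helper name. It computes the just-identified IV estimator in matrix form, with the constant acting as its own instrument, and compares it with OLS and with the covariance ratio \(\mathrm{Cov}(z,y)/\mathrm{Cov}(z,x)\).

```python
import numpy as np

def iv_estimator(Z, X, y):
    """Just-identified IV estimator: solves (Z'X) b = Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

rng = np.random.default_rng(2)
n = 100_000
beta0 = 1.0                                   # hypothetical true slope

u = rng.normal(size=n)                        # structural error
z = rng.normal(size=n)                        # instrument, independent of u (IV1)
x = 1.0 + 0.8 * z + 0.6 * u + rng.normal(size=n)   # relevant (IV2) and endogenous
y = 2.0 + beta0 * x + u

const = np.ones(n)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = iv_estimator(Z, X, y)
cov_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print("OLS slope      :", b_ols[1])
print("IV slope       :", b_iv[1])
print("Cov(z,y)/Cov(z,x):", cov_ratio)
```

The IV slope and the covariance ratio coincide, and both sit close to \(\beta_0\), while the OLS slope is pulled up by roughly \(\mathrm{Cov}(x,u)/\mathrm{Var}(x)=0.3\) in this design.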

What makes a credible instrument? The notes use the returns-to-schooling literature: distance to college (Card, 1995), quarter of birth (Angrist & Krueger, 1991) and the Vietnam draft lottery (Angrist & Krueger, 1992) have all been proposed as instruments that may be uncorrelated with ability (IV1) but related to schooling (IV2). The Acemoglu, Johnson & Robinson (2001) study of institutions and GDP uses settler mortality in the 17th-19th centuries as an instrument for the quality of institutions.

3. Two-stage least squares with more instruments than regressors

When there are more instruments than endogenous regressors, the model is over-identified and we use two-stage least squares (2SLS). The recipe is exactly what its name suggests:

2SLS in two steps

Stage 1: regress the endogenous regressor \(x_i\) on the full instrument set \(z_i\) and keep the fitted values \(\hat x_i\).
Stage 2: regress \(y_i\) on the fitted values \(\hat x_i\).

Using the projection matrix \(P_Z = Z(Z'Z)^{-1}Z'\), the estimator can be written compactly as

\[ \hat\beta_{2SLS} = (X'P_ZX)^{-1}X'P_Zy \]

The first stage replaces the endogenous regressor with the part of it that is explained by the (exogenous) instruments, stripping out the component correlated with the error. When the number of instruments equals the number of endogenous regressors, 2SLS collapses back to the simple IV estimator. Under the 2SLS assumptions the estimator is consistent and asymptotically normal with variance \(\sigma^2 Q_{XP_ZX}^{-1}\), where \(Q_{XP_ZX}=Q_{XZ}Q_{ZZ}^{-1}Q_{XZ}'\).
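
As a rough sketch of the two routes to the same estimator, the code below runs the two stages literally and also evaluates \((X'P_ZX)^{-1}X'P_Zy\) without forming the \(N\times N\) matrix \(P_Z\). The data-generating values, the seed and the function names `two_sls` and `two_sls_projection` are assumptions made for illustration. Both \(X\) and \(Z\) include a constant, and there are two instruments for one endogenous regressor.

```python
import numpy as np

def two_sls(Z, X, y):
    """2SLS by running the two stages explicitly."""
    # Stage 1: regress every column of X on Z and keep the fitted values
    first_stage_coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first_stage_coef
    # Stage 2: regress y on the fitted values
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta

def two_sls_projection(Z, X, y):
    """2SLS via (X'P_Z X)^{-1} X'P_Z y, using only small cross-product matrices."""
    ZtX = Z.T @ X
    ZtZ_inv_ZtX = np.linalg.solve(Z.T @ Z, ZtX)      # (Z'Z)^{-1} Z'X
    XPX = ZtX.T @ ZtZ_inv_ZtX                        # X'P_Z X
    XPy = ZtZ_inv_ZtX.T @ (Z.T @ y)                  # X'P_Z y
    return np.linalg.solve(XPX, XPy)

rng = np.random.default_rng(4)
n = 50_000
beta0 = 2.0                                          # hypothetical true slope

u = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)      # two exogenous instruments
x = 0.5 + 0.7 * z1 + 0.5 * z2 + 0.6 * u + rng.normal(size=n)
y = 1.0 + beta0 * x + u

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

b_stages = two_sls(Z, X, y)
b_proj = two_sls_projection(Z, X, y)
print("two-stage slope :", b_stages[1])
print("projection slope:", b_proj[1])
```

The point estimates agree. Note that the standard errors printed by an ordinary second-stage regression would not be the correct 2SLS standard errors, which is one reason to use the closed form or a dedicated routine in applied work.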

4. The weak instrument problem

The identification condition IV2/2SLS3 is not a yes/no matter in practice. It can hold technically while being almost false, and this is where the trouble starts.

Weak instrument

An instrument is weak when its correlation with the endogenous regressor is non-zero but small. Identification just about holds, but the estimator behaves badly: its variance is large and its sampling distribution is poorly approximated by the normal distribution in finite samples.

To see how serious this is, compare the asymptotic variances of OLS and 2SLS in the single-regressor case. The notes show that the ratio is

\[ \frac{\mathrm{AVar}(\hat\beta_{2SLS})}{\mathrm{AVar}(\hat\beta_{OLS})} = \frac{1}{\rho^2} \]

where \(\rho=\mathrm{Corr}(z_{1i},x_{1i})\) is the correlation between instrument and regressor. As the instrument gets weaker, \(\rho\to 0\) and this ratio explodes. The price of using IV instead of OLS is paid in variance, and a weak instrument makes that price enormous.
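
A short derivation sketch of where the \(1/\rho^2\) comes from, assuming homoskedastic errors with variance \(\sigma^2\) and a single regressor and a single instrument (the homoskedasticity assumption is mine here; it is what the variance formula \(\sigma^2 Q_{XP_ZX}^{-1}\) suggests). In that case

\[
\mathrm{AVar}(\hat\beta_{OLS}) = \frac{\sigma^2}{\mathrm{Var}(x_{1i})},
\qquad
\mathrm{AVar}(\hat\beta_{2SLS}) = \frac{\sigma^2\,\mathrm{Var}(z_{1i})}{\mathrm{Cov}(z_{1i},x_{1i})^2},
\]

so that

\[
\frac{\mathrm{AVar}(\hat\beta_{2SLS})}{\mathrm{AVar}(\hat\beta_{OLS})}
= \frac{\mathrm{Var}(z_{1i})\,\mathrm{Var}(x_{1i})}{\mathrm{Cov}(z_{1i},x_{1i})^2}
= \frac{1}{\rho^2}.
\]

For example, \(\rho=0.5\) gives a variance four times that of OLS, while \(\rho=0.1\) gives a variance one hundred times larger, that is, standard errors ten times as wide.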

The fully unidentified case is even starker. If \(E[z_ix_i]=0\) exactly, then \(N^{-1}Z'X\to^p 0\). Both \(N^{-1/2}Z'X\) and \(N^{-1/2}Z'u\) converge to correlated normal random variables, and the 2SLS estimator converges in distribution to

\[ \hat\beta_{2SLS} \to^d \beta_0 + \frac{\zeta_{zu}}{\zeta_{zx}} \]

a ratio of two normals. The limit is a non-degenerate random variable rather than the constant \(\beta_0\), and in general it is not centred at \(\beta_0\), so 2SLS is inconsistent; its variance is in fact infinite. A weak instrument sits on the spectrum between this disaster and the well-behaved strongly-identified case.
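
A small Monte Carlo sketch can show the contrast between a strong and a completely irrelevant instrument. Everything here is an assumption for illustration: the sample size, the number of replications, the first-stage coefficients (1 and 0), the seed and the helper names `iv_slopes` and `iqr`. Because the irrelevant-instrument estimator has such heavy tails, the spread is summarised by the interquartile range rather than the sample variance.

```python
import numpy as np

def iv_slopes(pi, n=500, reps=2000, beta0=1.0, seed=3):
    """Simulate `reps` samples and return the simple IV slope from each one.

    pi is the first-stage coefficient on the instrument; pi = 0 means the
    instrument is irrelevant and the model is unidentified.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(reps, n))
    z = rng.normal(size=(reps, n))
    x = pi * z + 0.6 * u + rng.normal(size=(reps, n))   # endogenous regressor
    y = beta0 * x + u
    zc = z - z.mean(axis=1, keepdims=True)              # demeaned instrument
    # Sample Cov(z, y) / Cov(z, x), one ratio per replication
    return (zc * y).sum(axis=1) / (zc * x).sum(axis=1)

def iqr(a):
    """Interquartile range, a spread measure that exists even without moments."""
    q25, q75 = np.percentile(a, [25, 75])
    return q75 - q25

strong = iv_slopes(pi=1.0)
irrelevant = iv_slopes(pi=0.0)

print("strong:     median", np.median(strong), " IQR", iqr(strong))
print("irrelevant: median", np.median(irrelevant), " IQR", iqr(irrelevant))
```

With the strong instrument the estimates cluster tightly around \(\beta_0=1\). With the irrelevant instrument they are spread over a far wider range, and drawing more observations per sample does not tighten them.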

5. Testing the identification assumption: the first-stage F

Because weak instruments are so damaging, we test for them. With one endogenous regressor, one instrument and no controls, the first stage is

\[ x_{1i} = \pi_{00} + \pi_{01} z_{1i} + v_i \]

and identification is equivalent to \(\pi_{01}\neq 0\), because \(\mathrm{Cov}(x_{1i},z_{1i})=\pi_{01}\sigma_{z1}^2\). We simply estimate the first stage by OLS and test \(\pi_{01}=0\) with a t-test. With several instruments we run a joint F-test that all first-stage coefficients are zero.

Rule of thumb

The first-stage F-statistic is used as a measure of instrument strength. A widely used rule of thumb (Staiger & Stock, 1997) is that \(F>10\) is needed for 2SLS to be reasonably well behaved. A small F is a warning that the instruments may be weak.
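
The first-stage F can be computed by hand from restricted and unrestricted residual sums of squares. The sketch below assumes homoskedastic errors and no control variables, matching the simplest case above; `first_stage_F`, the coefficients 0.5 and 0.01, and the seed are hypothetical choices for illustration only.

```python
import numpy as np

def first_stage_F(x, Z_excl):
    """Homoskedastic F-statistic for H0: all coefficients on the instruments are zero.

    x is the endogenous regressor (length n); Z_excl holds the instruments,
    either as a length-n vector or an (n, q) array. A constant is added here.
    """
    n = x.shape[0]
    Z_excl = Z_excl.reshape(n, -1)
    q = Z_excl.shape[1]
    W = np.column_stack([np.ones(n), Z_excl])
    coef, *_ = np.linalg.lstsq(W, x, rcond=None)
    rss_unrestricted = np.sum((x - W @ coef) ** 2)
    rss_restricted = np.sum((x - x.mean()) ** 2)        # constant-only model
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / (n - q - 1))

rng = np.random.default_rng(6)
n = 1_000
z = rng.normal(size=n)

x_strong = 0.5 * z + rng.normal(size=n)     # instrument clearly relevant
x_weak = 0.01 * z + rng.normal(size=n)      # instrument barely relevant

F_strong = first_stage_F(x_strong, z)
F_weak = first_stage_F(x_weak, z)
print("first-stage F, strong instrument:", F_strong, "-> exceeds 10:", F_strong > 10)
print("first-stage F, weak instrument  :", F_weak, "-> exceeds 10:", F_weak > 10)
```

With a single instrument this F equals the square of the first-stage t-statistic, so the t-test and the F-test lead to the same conclusion.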

6. A cautionary tale: quarter of birth

The lecture material closes with Angrist & Krueger's (1991) famous use of quarter of birth as an instrument for years of schooling. Because children must turn six by a fixed cut-off date to start school, and because compulsory schooling laws set a minimum leaving age, those born in different quarters end up with slightly different amounts of compulsory education. Quarter of birth is plausibly exogenous — it is hard to argue that the season you were born in directly affects your wage other than through schooling.

The problem is relevance. The effect of quarter of birth on schooling is tiny, so the instrument is weak. Bound, Baker & Jaeger (1995) showed that the first-stage F-statistic is small, and — strikingly — that replacing the real quarter-of-birth instrument with randomly simulated quarter-of-birth data produces very similar 2SLS estimates. That is the signature of a weak instrument: the procedure manufactures plausible-looking results from essentially no identifying variation. The lesson is that a credible exogeneity story is necessary but not sufficient; the instrument must also be strong.
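
The placebo idea can be mimicked in a stylised simulation. This is not a replication of Bound, Baker & Jaeger (1995): the data are synthetic, and the choice of 100 pure-noise instruments, the sample size, the seed and the coefficients are all assumptions made for illustration. The sketch simply shows that 2SLS run on instruments with no identifying variation still returns a tidy-looking number, and that this number sits near the biased OLS estimate rather than the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 5_000, 100                 # k placebo instruments (hypothetical choice)
beta0 = 1.0                       # hypothetical true slope

u = rng.normal(size=n)
x = 0.6 * u + rng.normal(size=n)  # endogenous regressor; no instrument enters it
y = beta0 * x + u

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # pure-noise instruments

# Stage 1: fitted values of X from the placebo instruments
first_stage_coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_hat = Z @ first_stage_coef

# Stage 2 and plain OLS for comparison
b_2sls, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("true slope          :", beta0)
print("OLS slope           :", b_ols[1])
print("placebo 2SLS slope  :", b_2sls[1])
```

Nothing in the second-stage output by itself flags the problem, which is why the first-stage F has to be inspected separately.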

