Ordinary least squares (OLS) is one of the most widely used methods in econometrics and statistics. If you've taken any quantitative economics or data science course, you've encountered it. But most textbooks explain it in a way that makes it seem more mysterious than it is. This guide explains it clearly.
What does OLS actually do?
OLS fits a straight line through a scatter of data points. Specifically, it finds the line that minimises the sum of the squared vertical distances between each data point and the line itself. Those vertical distances are called residuals.
The "least squares" in the name refers to this minimisation: you're finding the line that makes the squared residuals as small as possible in total.
If you have a dependent variable Y (say, wages) and an independent variable X (say, years of education), OLS finds the values of the intercept and slope that best fit the data in this squared-residual sense.
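To make this concrete, here is a minimal sketch of the one-variable case in plain Python. The function names (ols_fit, sum_squared_residuals) and the education and wage figures are made up for illustration. The slope and intercept formulas are the standard closed-form solution for a single X.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line for paired lists x, y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-movement of X and Y around their means, scaled by the spread of X.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through the point of means (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Total squared vertical distance between the data and a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up numbers for illustration only: years of education and hourly wages.
education = [10, 12, 12, 14, 16, 16, 18, 20]
wages = [11.0, 14.5, 13.0, 17.0, 19.5, 22.0, 24.0, 27.5]

intercept, slope = ols_fit(education, wages)
best = sum_squared_residuals(education, wages, intercept, slope)
# Any other line, here one with a slightly steeper slope, gives a larger total.
nudged = sum_squared_residuals(education, wages, intercept, slope + 0.1)

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"SSR at the OLS line = {best:.3f}, SSR with a nudged slope = {nudged:.3f}")
```

Nudging the slope (or the intercept) in either direction always increases the sum of squared residuals, which is exactly what "best fit in the squared-residual sense" means.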
Why square the residuals?
Two reasons. First, squaring makes all residuals positive, so positive and negative errors don't cancel out. Second, squaring penalises large errors more heavily than small ones, which is usually what we want.
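Both reasons can be checked with a few lines of arithmetic. This is a toy sketch with made-up numbers: the data lie exactly on the line y = 2x, and we compare that line with a flat line drawn through the mean of y.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # lies exactly on the line y = 2x


def residuals(intercept, slope):
    """Vertical gaps between the data and the line intercept + slope * x."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


flat = residuals(6, 0)   # horizontal line at the mean of y: clearly a poor fit
exact = residuals(0, 2)  # the line the data actually lie on

# Reason 1: raw residuals cancel, so their sum cannot tell the two lines apart.
print(sum(flat), sum(exact))  # 0 0
# Squared residuals do tell them apart.
print(sum(r ** 2 for r in flat), sum(r ** 2 for r in exact))  # 40 0

# Reason 2: one miss of 4 counts for more than four misses of 1.
one_big = [4, 0, 0, 0]
four_small = [1, 1, 1, 1]
print(sum(abs(r) for r in one_big), sum(abs(r) for r in four_small))  # 4 4
print(sum(r ** 2 for r in one_big), sum(r ** 2 for r in four_small))  # 16 4
```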
The Gauss-Markov assumptions
OLS has some remarkable properties, but only when certain conditions hold. These are the Gauss-Markov assumptions:
- Linearity: The true relationship between Y and X is linear (or linear in parameters).
- Random sampling: The data are a random sample from the population.
- No perfect multicollinearity: The independent variables aren't exact linear combinations of each other.
- Zero conditional mean: The expected value of the error term, given X, is zero. This is the crucial one.
- Homoskedasticity: The variance of the error term is constant across all values of X.
When these hold, the Gauss-Markov theorem says OLS is BLUE, the Best Linear Unbiased Estimator. In plain English: among all linear estimators that don't systematically over- or underestimate, OLS has the smallest variance. That's a powerful guarantee.
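A small simulation can illustrate (not prove) what "best" and "unbiased" mean here. The sketch below assumes a made-up true model, Y = 1 + 2X + u, with errors that satisfy the assumptions above. It compares OLS with one rival linear unbiased estimator chosen purely for illustration: the slope of the line joining the first and last data points.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0  # assumed values for the simulation
x = list(range(1, 21))                 # X values held fixed across samples
x_bar = statistics.mean(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)

ols_slopes, endpoint_slopes = [], []
for _ in range(2000):
    # Errors have mean zero and the same variance at every x (homoskedastic).
    y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, 1) for xi in x]
    y_bar = statistics.mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ols_slopes.append(s_xy / s_xx)
    # Rival estimator: slope of the line through the first and last points only.
    endpoint_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))

mean_ols = statistics.mean(ols_slopes)
mean_endpoint = statistics.mean(endpoint_slopes)
var_ols = statistics.variance(ols_slopes)
var_endpoint = statistics.variance(endpoint_slopes)

print(f"OLS:      mean slope = {mean_ols:.3f}, variance = {var_ols:.5f}")
print(f"Endpoint: mean slope = {mean_endpoint:.3f}, variance = {var_endpoint:.5f}")
```

Both estimators average out close to the true slope of 2, so both are unbiased, but the OLS estimates are much more tightly clustered. Gauss-Markov says no linear unbiased rival can beat OLS on that score.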
When does OLS break down?
The zero conditional mean assumption, E(u|X) = 0, is the one that causes the most problems in practice. It fails when:
- Omitted variable bias: A variable that affects Y is correlated with X but left out of the model.
- Reverse causality: Y also causes X, creating a circular relationship.
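- Measurement error: X is measured with error. Even purely random noise in the recorded X makes the observed regressor correlated with the error term, which typically biases the slope towards zero.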
When this assumption fails, OLS estimates are biased: they systematically get the wrong answer, no matter how much data you collect. This is why econometrics spends so much time on identification strategies such as instrumental variables, difference-in-differences and regression discontinuity, all designed to make E(u|X) = 0 plausible.
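Omitted variable bias is the easiest of these to see in a simulation. The sketch below assumes a made-up world in which an unobserved "ability" variable raises both education and wages. The true effect of education is set to 2, and all names and numbers are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

n = 5000
ability = [random.gauss(0, 1) for _ in range(n)]
# Education depends partly on ability, so the two are correlated.
education = [a + random.gauss(0, 1) for a in ability]
# True model: wages depend on education (effect 2) and on ability (effect 3).
wages = [1 + 2 * e + 3 * a + random.gauss(0, 1)
         for e, a in zip(education, ability)]


def slope(x, y):
    """OLS slope from regressing y on a single x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Regress wages on education alone, leaving ability out of the model.
short_slope = slope(education, wages)
# Standard bias formula: effect of the omitted variable on Y, multiplied by
# the slope from regressing the omitted variable on the included X.
implied_bias = 3 * slope(education, ability)

print(f"true effect of education: 2.000")
print(f"estimated slope without ability: {short_slope:.3f}")
print(f"bias implied by the formula: {implied_bias:.3f}")
```

In this setup the estimated slope settles near 3.5 rather than 2, because education is picking up part of the effect of ability. A bigger sample does not help: the estimate just becomes more precisely wrong.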
Heteroskedasticity: a common problem, an easy fix
If the variance of the errors isn't constant (heteroskedasticity), OLS is still unbiased but no longer efficient, and the usual standard errors are wrong, so t-tests and confidence intervals based on them can't be trusted. The fix is simple: use heteroskedasticity-robust standard errors (sometimes called "White standard errors"). Most software does this with a single option.
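To show what that option is doing under the hood, here is a rough sketch that computes both the conventional and the White (HC0) standard error of the slope by hand for a one-variable regression. The data are simulated with an error spread that grows with X, and every number is an assumption chosen for illustration.

```python
import math
import random

random.seed(2)  # fixed seed so the run is repeatable

n = 2000
x = [random.uniform(0, 10) for _ in range(n)]
# Error spread grows with x, so the errors are heteroskedastic by construction.
y = [1 + 2 * xi + random.gauss(0, 0.2 * xi ** 2) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
intercept = y_bar - slope * x_bar
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Conventional standard error: assumes one common error variance for all x.
sigma2 = sum(u ** 2 for u in resid) / (n - 2)
conventional_se = math.sqrt(sigma2 / s_xx)

# White (HC0) standard error: each observation contributes its own squared
# residual, weighted by how far its x is from the mean.
meat = sum((xi - x_bar) ** 2 * u ** 2 for xi, u in zip(x, resid))
robust_se = math.sqrt(meat) / s_xx

print(f"slope = {slope:.3f}")
print(f"conventional SE = {conventional_se:.4f}, robust SE = {robust_se:.4f}")
```

In this simulation the conventional formula understates the uncertainty, so the robust standard error comes out larger. In practice you would let your software do this. In Python's statsmodels, for example, it is a matter of passing cov_type="HC1" to the fit method of an OLS model.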
A note on interpretation
The OLS slope coefficient tells you the average change in Y associated with a one-unit increase in X, holding all other included variables constant. The "holding constant" part is crucial and often misunderstood. It means you're comparing observations (people, in the wage example) that are identical in all the other variables included in the model. Variables you left out are not held constant.
OLS gives you association, not causation, unless you have a credible identification strategy that rules out confounding.
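One way to see what "holding constant" does mechanically is the Frisch-Waugh-Lovell result: the coefficient on X in a regression with a control equals the slope you get after first stripping the control's influence out of both Y and X. The sketch below uses simulated data with hypothetical variables (education, experience, wages) and an assumed true education effect of 2.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

n = 5000
experience = [random.gauss(10, 3) for _ in range(n)]
# More schooling tends to mean less time in work, so the two are correlated.
education = [14 - 0.3 * (ex - 10) + random.gauss(0, 2) for ex in experience]
wages = [5 + 2.0 * ed + 0.5 * ex + random.gauss(0, 1)
         for ed, ex in zip(education, experience)]


def fit(x, y):
    """Return (intercept, slope) from regressing y on a single x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def residualise(y, x):
    """Remove from y the part that is linearly explained by x."""
    a, b = fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Strip the influence of experience out of both wages and education...
wage_resid = residualise(wages, experience)
educ_resid = residualise(education, experience)
# ...then relate what is left. This slope equals the education coefficient
# from a regression of wages on education and experience together.
_, partial_slope = fit(educ_resid, wage_resid)
_, naive_slope = fit(education, wages)

print(f"slope ignoring experience: {naive_slope:.3f}")
print(f"slope holding experience constant: {partial_slope:.3f}")
```

The second slope lands close to 2, the value built into the simulation, while the first does not. Note that this only works because experience is observed and included. It does nothing about confounders you haven't measured.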
Want to understand this more deeply?
If you're studying econometrics at university and finding it difficult, or preparing for an exam, I offer one-to-one online tuition that goes from the basics through to advanced topics. Get in touch for a free consultation.