What does R-squared measure?

R-squared, the coefficient of determination, is the proportion of the total variation in the dependent variable that is explained by the regression. It equals the explained sum of squares divided by the total sum of squares and, for an OLS regression with an intercept, lies between 0 and 1. An R-squared of 0.6 means the model accounts for 60% of the variation in the outcome.

Why is adjusted R-squared better than R-squared?

Ordinary R-squared never falls when you add a regressor, even an irrelevant one, so it cannot fairly compare models with different numbers of variables. Adjusted R-squared applies a penalty for the number of parameters, so it rises only if a new variable improves fit by more than a useless variable would be expected to. That makes it more honest for model comparison.

Does a high R-squared mean a good model?

No. A high R-squared only means the model fits the sample data closely; it says nothing about whether the model is correctly specified or whether the coefficients are unbiased. A model can have a high R-squared and still suffer from omitted-variable bias, and a causally valid model can have a low R-squared. Fit and validity are separate questions.

Goodness of Fit: R-squared and Adjusted R-squared

R-squared (the coefficient of determination) is the share of the variation in the dependent variable that the regression explains: explained sum of squares over total sum of squares, between 0 and 1. Because it never falls when you add a regressor, the adjusted R-squared applies a penalty for the number of parameters. A high R-squared means good in-sample fit — not that the model is correctly specified.

How to read these notes

These notes are for a student who has met OLS and wants to interpret the goodness-of-fit numbers that every regression package reports. We build R-squared from the sum-of-squares decomposition, explain the adjusted version, and — most importantly — explain what these numbers do and do not tell you.

1. Splitting the variation: TSS, ESS and RSS

Goodness of fit asks a simple question: how much of the variation in the dependent variable \(y\) has the regression explained? To answer it we split the total variation in \(y\) about its mean into two pieces. Writing \(\hat y_i\) for the fitted value and \(\bar y\) for the sample mean:

\underbrace{\sum_i (y_i - \bar y)^2}_{\text{TSS}} = \underbrace{\sum_i (\hat y_i - \bar y)^2}_{\text{ESS}} + \underbrace{\sum_i (y_i - \hat y_i)^2}_{\text{RSS}}

The three sums of squares

TSS (total sum of squares) is the total variation in \(y\). ESS (explained sum of squares) is the part the model captures. RSS (residual sum of squares) is the leftover — the part the model fails to explain, the sum of squared residuals that OLS minimises.

This decomposition holds exactly for an OLS regression that includes an intercept. It says the variation we want to explain is neatly partitioned into "explained" and "unexplained" components.
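The decomposition can be checked numerically. The following is a minimal sketch in Python with NumPy on simulated data; the function name `sums_of_squares`, the simulated coefficients and the random seed are illustrative assumptions, not part of these notes.

```python
import numpy as np

def sums_of_squares(X, y):
    """Return (TSS, ESS, RSS) for an OLS fit of y on X plus an intercept."""
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the regression includes an intercept.
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    tss = np.sum((y - y.mean()) ** 2)       # total variation about the mean
    ess = np.sum((fitted - y.mean()) ** 2)  # variation captured by the fit
    rss = np.sum((y - fitted) ** 2)         # sum of squared residuals
    return tss, ess, rss

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

tss, ess, rss = sums_of_squares(x, y)
print("TSS      :", tss)
print("ESS + RSS:", ess + rss)
```

The two printed numbers should agree up to floating-point error. Dropping the column of ones from `design` generally breaks the equality, which is why the decomposition is stated for regressions that include an intercept.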

2. R-squared: the coefficient of determination

R-squared is the explained share of the total:

R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}

It lies between 0 and 1. An \(R^2\) of 0 means the regressors explain none of the variation in \(y\) (the model does no better than the mean); an \(R^2\) of 1 means the model fits the data perfectly, with every residual zero. An \(R^2\) of 0.6 means the model accounts for 60% of the variation in the outcome.

Interpretation

R-squared is a measure of in-sample fit: how closely the fitted line tracks the data points you used to estimate it. It is not a measure of how correct the model is, nor of how well it will predict new data.
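As a rough illustration of the definition, the sketch below computes R-squared both ways, as ESS/TSS and as 1 - RSS/TSS, on simulated data. The helper name `r_squared`, the data-generating values and the seed are assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """Return (ESS/TSS, 1 - RSS/TSS) for OLS of y on X with an intercept."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    tss = np.sum((y - y.mean()) ** 2)
    ess = np.sum((fitted - y.mean()) ** 2)
    rss = np.sum((y - fitted) ** 2)
    return ess / tss, 1.0 - rss / tss

rng = np.random.default_rng(0)
x = rng.normal(size=300)

# A noisy linear relationship: R-squared lies strictly between 0 and 1.
y_noisy = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=300)
r2_from_ess, r2_from_rss = r_squared(x, y_noisy)

# An exact linear relationship: every residual is zero, so R-squared is 1.
y_exact = 1.0 + 2.0 * x
r2_exact, _ = r_squared(x, y_exact)

print(r2_from_ess, r2_from_rss, r2_exact)
```

Both versions of the formula return the same number because the regression includes an intercept, so the sum-of-squares decomposition holds.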

3. The problem: R-squared never falls

Here is the catch that makes raw R-squared dangerous for comparing models. Adding any regressor to a model, even a completely irrelevant one, can only reduce the residual sum of squares or leave it unchanged, because OLS can always set the new coefficient to zero if the variable does not help. Since \(R^2 = 1 - \text{RSS}/\text{TSS}\) and TSS depends only on \(y\), not on the regressors, this means:

\text{adding a regressor} \ \Rightarrow\ R^2 \text{ cannot decrease}

So you can always inflate R-squared by throwing in more variables, whether or not they belong in the model. This makes raw R-squared useless for choosing between models with different numbers of regressors: the bigger model will always look at least as good, even if its extra variables are pure noise.

The same logic appears in time-series model selection: a richer ARMA model that nests a simpler one never fits the sample worse, which is precisely why information criteria penalise the number of parameters. R-squared has no such penalty.
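The claim that R-squared cannot fall when a regressor is added can be checked with a small simulation. This is a sketch under assumed settings (100 observations, ten pure-noise regressors, an arbitrary seed); the helper `r_squared` is defined here for the example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared from OLS of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

X = x[:, None]              # start with the one relevant regressor
path = [r_squared(X, y)]
for _ in range(10):
    noise = rng.normal(size=(n, 1))  # unrelated to y by construction
    X = np.hstack([X, noise])
    path.append(r_squared(X, y))

for k, r2 in enumerate(path, start=1):
    print(f"{k:2d} regressors: R^2 = {r2:.4f}")
```

Each printed value should be at least as large as the one before it, even though every added column is noise.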

4. Adjusted R-squared: charging for parameters

The adjusted R-squared fixes this by deflating the fit by the number of parameters used. With \(n\) observations and \(k\) regressors (excluding the intercept):

\bar R^2 = 1 - \frac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}

RSS and TSS are now each divided by their degrees of freedom. The effect is a trade-off: adding a variable lowers the RSS (which raises \(\bar R^2\)) but also uses up a degree of freedom (which lowers it). The adjusted R-squared therefore rises only if the new variable improves fit by more than a useless variable would be expected to; for a single added regressor, this happens exactly when its t-statistic exceeds 1 in absolute value.
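A short rearrangement, offered as a worked check rather than a new result, makes the penalty explicit. Substituting \(\text{RSS}/\text{TSS} = 1 - R^2\) from the definition of R-squared into the formula above gives:

\bar R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}

Since \((n-1)/(n-k-1) \ge 1\), the adjusted value never exceeds \(R^2\), and the gap widens as \(k\) grows relative to \(n\). With illustrative numbers (assumed, not taken from any dataset) \(R^2 = 0.6\), \(n = 50\) and \(k = 3\), this gives \(\bar R^2 = 1 - 0.4 \times 49/46 \approx 0.574\).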

Key differences

Unlike \(R^2\), the adjusted \(\bar R^2\) can fall when you add a variable, and it can even be negative for a very poor model. Because it penalises complexity, it is the more appropriate goodness-of-fit measure when comparing models with different numbers of regressors, provided they share the same dependent variable and sample.
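Both properties can be seen with plain arithmetic. The sketch below implements the adjusted R-squared formula from this section; the function name `adjusted_r_squared` and all the input numbers are made up for illustration.

```python
def adjusted_r_squared(rss, tss, n, k):
    """Adjusted R-squared with n observations and k regressors (excluding the intercept)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for positive residual degrees of freedom")
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))

# Baseline model (assumed numbers): R^2 = 1 - 40/100 = 0.600.
base = adjusted_r_squared(rss=40.0, tss=100.0, n=50, k=3)

# Add one regressor that barely lowers RSS: R^2 rises to 0.601,
# but the lost degree of freedom outweighs the gain.
bigger = adjusted_r_squared(rss=39.9, tss=100.0, n=50, k=4)

# A very poor model with many regressors relative to n: R^2 = 0.02.
poor = adjusted_r_squared(rss=98.0, tss=100.0, n=20, k=5)

print(f"baseline        : {base:.4f}")
print(f"one more column : {bigger:.4f}")
print(f"poor model      : {poor:.4f}")
```

In this example the second value is lower than the first even though the raw R-squared went up, and the third is below zero.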

5. Why a high R-squared is not the goal

Students often treat maximising R-squared as the objective of empirical work. It is not. R-squared measures fit, not correctness, and the two come apart in both directions.

High R-squared, bad model. A regression can fit the sample beautifully and still suffer from omitted-variable bias, reverse causality or measurement error — the coefficients can be biased no matter how high R-squared is. Goodness of fit says nothing about whether the exogeneity assumption holds.
Low R-squared, good model. In cross-sectional microeconomic data, an R-squared of 0.1 or 0.2 is common and perfectly acceptable. Individual behaviour is noisy; a regression that credibly identifies a causal effect can still have a low R-squared because much of the variation in the outcome is genuinely idiosyncratic.

What you should care about is whether the model is correctly specified and the coefficients are credibly identified — questions addressed by the diagnostic tests and identification strategies, not by R-squared. R-squared is a useful descriptive summary; it is a poor objective.
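The two cases above can be illustrated with a small simulation. This is a sketch under assumed data-generating processes: the true effect of 1.0, the omitted variable `z`, the noise scales and the seed are all invented for the example.

```python
import numpy as np

def fit_simple(x, y):
    """OLS of y on a constant and x; return (slope, R-squared)."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[1], r2

rng = np.random.default_rng(2)
n = 5000
true_effect = 1.0  # assumed causal effect of x on y in both cases

# Case A: z affects y and is correlated with x, but is left out of the regression.
x_a = rng.normal(size=n)
z = x_a + 0.3 * rng.normal(size=n)
y_a = true_effect * x_a + 3.0 * z + rng.normal(size=n)
slope_a, r2_a = fit_simple(x_a, y_a)

# Case B: x is exogenous, but the outcome is dominated by idiosyncratic noise.
x_b = rng.normal(size=n)
y_b = true_effect * x_b + 5.0 * rng.normal(size=n)
slope_b, r2_b = fit_simple(x_b, y_b)

print(f"Case A (omitted variable): slope = {slope_a:.2f}, R^2 = {r2_a:.2f}")
print(f"Case B (noisy, exogenous): slope = {slope_b:.2f}, R^2 = {r2_b:.2f}")
```

Under these assumptions the slope in case A settles near 4 rather than the true 1, with an R-squared of roughly 0.9, while case B recovers a slope near 1 with an R-squared of roughly 0.04. The better-fitting regression is the one with the badly biased coefficient.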

6. A note on time series and forecasting

R-squared values tend to be much higher in time-series regressions than in cross-sections, because economic time series are persistent and trending — a model can track them closely without capturing anything causal. A "spurious regression" of one trending series on another can produce an R-squared near 1 with no genuine relationship at all. For model choice in time series, lean on the information criteria and residual diagnostics rather than R-squared.
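A spurious regression is easy to reproduce. The sketch below regresses one random walk with drift on another, independent one, and compares the average R-squared with that from independent white-noise series. The drift of 1.0, the sample length, the number of replications and the seed are assumptions chosen for the example.

```python
import numpy as np

def r_squared(x, y):
    """R-squared from OLS of y on a constant and x."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n, reps, drift = 200, 500, 1.0

r2_trending, r2_noise = [], []
for _ in range(reps):
    # Two independent random walks with drift: persistent and trending,
    # but unrelated to each other by construction.
    x = np.cumsum(drift + rng.normal(size=n))
    y = np.cumsum(drift + rng.normal(size=n))
    r2_trending.append(r_squared(x, y))
    # Two independent white-noise series as a benchmark.
    r2_noise.append(r_squared(rng.normal(size=n), rng.normal(size=n)))

mean_trending = float(np.mean(r2_trending))
mean_noise = float(np.mean(r2_noise))
print(f"mean R^2, independent trending series: {mean_trending:.3f}")
print(f"mean R^2, independent white noise    : {mean_noise:.3f}")
```

The trending pairs produce an average R-squared close to 1 despite having no relationship, while the white-noise pairs produce one close to 0.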

