Demystifying and Disenchanting Statistical Significance

Reframing t tests as cheap data budgeting advice

statistics
hypothesis testing
p values
t tests
f tests
Author

Jon Minton

Published

October 19, 2024

Background

In the long series of statistical posts I’ve produced so far, one thing I’ve not discussed much is hypothesis testing, the concept of statistical significance, and the related concept of the P value. If you’re familiar with how statistics is often used and reported within many academic fields, this lack of focus on statistical significance may appear an oversight. Indeed, in my new role as a statistician working with clinical trials data, hypothesis tests and P values are essential: arguably the main purpose of going to the great expense and effort involved in conducting the trials.

In the posts so far, the idea of hypothesis testing has perhaps been most salient within the series on bootstrapping and resampling methods. There we discussed the idea that, if we compare a summary statistic observed in our data with the distribution of that same statistic we would expect to see if there were no real relationship between two or more variables, we can estimate how probable it would be to see a value as extreme as, or more extreme than, the one we actually observed, were there really no relationship between the variables of interest.

Whether using resampling and bootstrapping methods, or more conventional methods, the underlying aim and approach in hypothesis testing is the same. The same elements are always in play, even if the approaches used to derive, estimate or represent some of these elements differ between methods. These elements comprise:

  • Observation: An observed summary statistic;
  • Null Distribution: A distribution of hypothetical observed summary statistics which would be expected under the Null Hypothesis
  • Decision Rule: A pre-established rule for comparing the Observation against Null Distribution, from which we make one of two binary declarations:
    • That we reject the Null Hypothesis;
    • That we fail to reject the Null Hypothesis.

Aim

The aim of this post is to reintroduce, demystify and disenchant the contents of statistical model summaries of the type that show P values and stars against different coefficients in a model. My underlying argument is that these tables are often grossly overinterpreted, especially by many academic and clinical practitioners, as answering the hypotheses we are actually interested in. Once we understand what they contain, and can make them ourselves, we can take a more realistic view of the information they hold and of what we should use them for. The specific role of such tables, I argue, is mainly to provide some ‘quick-and-dirty’ information about whether we should look at simplifying the model we’re using to represent the data, and not to dispositively ‘prove’ or ‘disprove’ that one factor of substantive interest influences another. All too often, statistical model summary tables are treated as the last stage of a piece of analysis, when I suggest they should really be considered one of the first stages.

Getting started

Let’s get started by running a regression on a very well known dataset:

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Let’s say we want to create a linear regression model to predict Petal.Length based on all other available variables:

mod_full <- lm(
   Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + Species, 
   data = iris
)

The R Linear Regression Summary Table

The standard summary output for this model is as follows:

summary(mod_full)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + 
    Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78396 -0.15708  0.00193  0.14730  0.65418 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *  
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared:  0.9786,    Adjusted R-squared:  0.9778 
F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16

The Coefficients section of the summary contains the following columns:

  • Name of coefficient (implicitly)
  • Estimate
  • Std. Error
  • t value
  • Pr(>|t|)
  • Number of stars (implicitly)

Cultish Conventions to avoid and deprogram yourself of

If you’ve been taught to (mis)use statistics in the standard way, you might be drawn to the right-most column, the number of stars, and to looking for those rows with at least two stars. You’ll notice that the smaller the value in the column to its left, Pr(>|t|), which is the P value, the more stars there tend to be. In this case the row for the coefficient Sepal.Width has the largest P value, and so just has a single star next to it; all the other coefficients have much smaller P values, and have three stars next to them.

The standard misuse of statistics in many academic fields encourages a fixation with the number of stars next to tables of coefficients like the one above, and in particular with looking for at least as many stars as correspond to a P value of 0.05 or lower (one star or more, in this case). This learned pathology is known as ‘star gazing’, and the fixation on finding stars on coefficients which correspond to P values of less than 0.05 has been termed the sizeless stare in the book The Cult of Statistical Significance (McCloskey and Ziliak 2008). The rule of thumb, which all too often becomes a de facto rule of academic (and clinical) life, is that if the P value on a coefficient is less than 0.05 then the corresponding variable’s effect on the outcome being modelled has been shown to be ‘statistically significant’ - meaning it merits publication and promotion, and the research has been a ‘success’ - and if it is above 0.05 then it is not ‘statistically significant’ - meaning it does not merit publication and promotion, and the research has been a ‘failure’.

But what do the coefficients, the t values, the P values, and the stars, actually mean? Where do they come from? How do they relate to hypothesis testing, with its decision rules, null distributions and P values? Which hypotheses are being tested? Are they hypotheses actually worth testing?

Reframing statistical models

To start to answer these questions, it is worth taking a few steps back, and entertaining the following propositions:

  1. A statistical model is a data reduction tool;
  2. The number of observations in our dataset is a fixed budget we can choose to spend in different ways through different statistical model specifications.

What do I mean by this? Well, say the dataset contains 100 observations. This is our data budget, that we spend whenever we fit a model based on the dataset. And say our statistical model contains five coefficients. What this means is that we are trying to represent - in a simplified and stylised way - the kinds of relationships observed between variables in our data through a much smaller number of coefficients than we had observations in our data. We are trying to give the gist of what we’re seeing in our data, which required 100 observations to express, through a model that only takes five coefficients to express. The model, therefore, is a representation of the data that’s only one-twentieth (i.e. 100 divided by five) as big. This is what I mean by proposition 1), A statistical model is a data reduction tool. If we were to save the model, it would take less space on our hard drive than the data from which it’s derived; if we were to send the model, it would take less time to send than our data.

And what do I mean by a fixed data budget? Well, if we are going to try and represent a lot of observations with a small number of model coefficients, it’s important to use as much information as is available to us from the dataset to select the best possible coefficients. We could try to represent our 100 observation dataset through a model with four coefficients, or five coefficients, or six coefficients, for example. Of these three model specifications, the four coefficient model is the simplest, and the six coefficient model is the most complicated. But for all three model specifications the data budget available to calibrate it is the same: 100 observations. So, if using the five or six coefficient models, our fixed data budget has to be stretched a bit further compared with the simplest, four coefficient model.

Data Budgets and how to spend them

So, what is the data budget, and what’s the number of coefficients, in our dataset and model respectively? First the data budget:

nrow(iris)
[1] 150

150 observations. That’s the budget we’re spending to produce our model.

Now what’s the number of coefficients of our model?

length(coefficients(mod_full))
[1] 6

Six coefficients. If we take a ratio of the number of observations (our data budget) to the number of model coefficients, i.e. divide 150 by six, we end up with a factor of 25. We could call this quantity something like our ‘compression ratio’ in going from the data to our representation of the data (i.e. the model).
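In code, this is just the following (a one-line sketch; ‘compression ratio’ is simply my own label for this quantity):

# observations per coefficient: the model's 'compression ratio'
nrow(iris) / length(coefficients(mod_full))
[1] 25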

However a more common metric to see is the difference between our data budget and our number of model coefficients, i.e. 150 minus six rather than 150 divided by six. In this example 150 - 6 = 144. This value, 144, is reported somewhere in the model summary. Let’s look at it again:

summary(mod_full)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + 
    Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78396 -0.15708  0.00193  0.14730  0.65418 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *  
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared:  0.9786,    Adjusted R-squared:  0.9778 
F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16

The number 144 is shown on the third line from the bottom, and the bottom line:

  • 144 degrees of freedom
  • 144 DF

So, we know from this that DF stands for degrees of freedom, and can maybe infer from this that, if this number is reported twice, it’s probably quite important.

What are degrees of freedom? Continuing the analogy of the number of observations being our ‘data budget’, we can think of the degrees of freedom as the maximum amount of our remaining data budget we could spend on our model before going ‘bankrupt’. What does ‘bankruptcy’ look like in modelling terms? Well, given a model is a data reduction tool, we obviously can’t have a model with more coefficients than there are observations in our dataset. This wouldn’t be data reduction; it would be data expansion: trying to get something from nothing.
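For our full model, this remaining budget is easy to compute directly (a one-line sketch that simply restates what the summary already reports):

# degrees of freedom: observations minus estimated coefficients
nrow(iris) - length(coefficients(mod_full))
[1] 144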

Data Bankruptcy

Let’s explore this a bit further by looking at a slightly simpler model, but using much less data: We’ll fit four coefficients, but using only six, five, four or three observations from the data:

set.seed(11)
iris_6 <- iris[sample(1:nrow(iris),6, replace=FALSE),]
iris_5 <- iris[sample(1:nrow(iris),5, replace=FALSE),]
iris_4 <- iris[sample(1:nrow(iris),4, replace=FALSE),]
iris_3 <- iris[sample(1:nrow(iris),3, replace=FALSE),]


mod_6 <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris_6)
mod_5 <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris_5)
mod_4 <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris_4)
mod_3 <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris_3)

summary(mod_6)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris_6)

Residuals:
      34      144      113       37       60      118 
-0.08918 -0.27835 -0.19963 -0.02312  0.27089  0.31938 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -1.6225     1.4939  -1.086    0.391  
Petal.Width    2.1564     0.6215   3.470    0.074 .
Sepal.Length   0.3062     0.5510   0.556    0.634  
Sepal.Width    0.2372     0.6914   0.343    0.764  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3881 on 2 degrees of freedom
Multiple R-squared:  0.9889,    Adjusted R-squared:  0.9722 
F-statistic: 59.38 on 3 and 2 DF,  p-value: 0.01661
summary(mod_5)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris_5)

Residuals:
      140        62        85        93        39 
 0.026739 -0.063968  0.019746  0.006685  0.010799 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)    6.5524     0.8724   7.511   0.0843 .
Petal.Width    2.7874     0.1612  17.294   0.0368 *
Sepal.Length  -0.4326     0.1218  -3.552   0.1747  
Sepal.Width   -1.3058     0.2018  -6.470   0.0976 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0732 on 1 degrees of freedom
Multiple R-squared:  0.9994,    Adjusted R-squared:  0.9977 
F-statistic: 588.7 on 3 and 1 DF,  p-value: 0.03029
summary(mod_4)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris_4)

Residuals:
ALL 4 residuals are 0: no residual degrees of freedom!

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)      12.3        NaN     NaN      NaN
Petal.Width       1.5        NaN     NaN      NaN
Sepal.Length     -7.0        NaN     NaN      NaN
Sepal.Width       6.5        NaN     NaN      NaN

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 3 and 0 DF,  p-value: NA
summary(mod_3)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, 
    data = iris_3)

Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -3.0012        NaN     NaN      NaN
Petal.Width    2.4198        NaN     NaN      NaN
Sepal.Length   0.6914        NaN     NaN      NaN
Sepal.Width        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 2 and 0 DF,  p-value: NA
  • When there are six observations, the four coefficient model has two degrees of freedom.
  • When there are five observations, the four coefficient model has one degree of freedom.
  • When there are four observations, the four coefficient model has no degrees of freedom.
  • When there are three observations, the four coefficient model is converted into a three coefficient model (See the ‘NA’ value reported for the coefficient Sepal.Width) and has no degrees of freedom.

So, it’s clear that number of observations = number of model coefficients represents the hard limit on the level of complexity of a model that can be fit from a given dataset. To reiterate: a model cannot make something out of nothing.

And if we look at the coefficients themselves, we can see they seem quite ‘wild’: the coefficient for the intercept, for example, changes from -1.6 to 6.6 to 12.3 to -3.0 with each reduction of just a single observation. This means that, ideally, the number of observations should be much greater than the number of coefficients, otherwise there’s not much information available to fit (choose the best value of) each coefficient well. When we have more information to go on, the model fit improves.
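We can line the intercept estimates up directly, using the four models we have just fit (a small sketch of my own):

# intercept estimates as the sample shrinks from six to three observations
round(sapply(
   list(mod_6, mod_5, mod_4, mod_3),
   function(m) coef(m)[["(Intercept)"]]
), 1)
[1] -1.6  6.6 12.3 -3.0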

Getting away with it

Given more observations for each model coefficient tends to be better, we now have a third proposition:

  3. If we can get away with it, a simpler model is preferable to a more complicated model.

So, can we? Get away with it, I mean? What does ‘get away with it’ even mean?

To a large extent, the whole hypothesis testing framework can be thought of as a way of formalising this concept of ‘getting away with’ using a simpler model instead of a more complex model. When can we decide that the simpler model is good enough, and that we can use it instead of the more complex model, allowing us to spread our data budget a bit further?

Restrictions and Nests

Consider the following two model specifications:

    A. Y ~ 4 * intercept + 0.3 * thing_one + 0.0 * thing_two
    B. Y ~ 4 * intercept + 0.3 * thing_one

The numbers in the above represent the coefficients, which we can imagine stacking into a single vector object. For specification A these are \(\beta = \{ 4, 0.3, 0 \}\), and for specification B these are \(\beta = \{4, 0.3 \}\). It should be clear that in practice these two model specifications are really exactly the same: because the coefficient on thing_two is zero, the predicted value, Y, won’t change no matter how much the value of thing_two changes, because zero times anything is still zero.
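A tiny numerical check of this point (thing_one and thing_two here are made-up example values, not variables from any dataset used in this post):

# a zero coefficient contributes nothing to the prediction
thing_one <- c(1, 2, 3)
thing_two <- c(10, 20, 30)
y_spec_a <- 4 + 0.3 * thing_one + 0.0 * thing_two
y_spec_b <- 4 + 0.3 * thing_one
all(y_spec_a == y_spec_b)
[1] TRUE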

Now consider this slight variation of the above:

    A. Y ~ around 4 * intercept + around 0.3 * thing_one + around 0.1 * thing_two
    B. Y ~ around 4 * intercept + around 0.3 * thing_one + exactly 0 * thing_two

Here I’m using the word ‘around’ to indicate that there’s uncertainty about the true value of the coefficients for the intercept, thing_one and thing_two. This uncertainty comes from having a finite number of observations from which to fit the coefficients in the model, i.e. a limited data budget to spend on calibrating our model’s coefficients. The same data budget could either: i) be used to fit \(k\) parameters less precisely; or ii) be used to fit \(k-1\) parameters more precisely.

Statistical Tests in the context of Data Budgeting

The fundamental role of the standard hypothesis tests reported in a statistical model summary table is (or ought to be) to help us choose between these two model options: i) a more complicated model fit less precisely; or ii) a less complicated model fit more precisely. This trade-off exists, and always exists, because of the fixed data budget available to us.

Restricted and Unrestricted Models

Within the terminology of model testing and comparison, the model of specification A is referred to as the unrestricted model, and the model of specification B is referred to as the restricted model. Additionally, model B is said to be nested within model A. For model A, there are three parameters whose coefficient values are determined by the data: the data budget is spent three ways. For model B, there are two parameters whose coefficient values are determined by the data: the data budget is spent two ways. Model B is known as a restricted model because the coefficient on the third variable, thing_two, is not determined by (fitted using) the data, but instead imposed - forced to be zero - by whoever or whatever specified the model.

Note that, whenever we leave a variable from a dataset out of the model specification used to represent that dataset, we are automatically, implicitly, applying a model restriction of the form above: not including a term in a model is always equivalent to assuming the value of the model coefficient on that term is exactly zero. Because of this, we can almost always think of a given model as both a restricted version of a more complicated model, and as an unrestricted version of a less complicated model.
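In R, a convenient way of writing down such an implicitly restricted model is the update() function, which refits an existing model with a term removed from its formula. As a quick sketch (we build the same restricted model explicitly later in the post):

# refit mod_full without Sepal.Width, i.e. with that coefficient restricted to zero
mod_without_sepal_width <- update(mod_full, . ~ . - Sepal.Width)
coefficients(mod_without_sepal_width)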

T scores as cheap model pruning advice

How does this line of argument relate to the kinds of hypothesis-test-based information commonly displayed as part of a statistical model summary? Well, the P values reported for each coefficient in a model summary are in effect tests of a pairwise Null hypothesis that the true value of that coefficient is zero. They are also approximately, but not exactly, equivalent to comparing the model’s fit with that of a pairwise restricted version of the model in which the coefficient of interest is constrained to zero.

Rebuilding statistical model summary tables

Where does the information required to conduct these pairwise Null hypothesis tests come from? Let’s look again at the main model we fit and see if we can derive the summary table information ourselves:

summary(mod_full)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + 
    Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78396 -0.15708  0.00193  0.14730  0.65418 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *  
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared:  0.9786,    Adjusted R-squared:  0.9778 
F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16

The only information we actually need from the model is the set of values in the Estimate and Std. Error columns. As we’ll see, the values in the corresponding t value and Pr(>|t|) columns are derived from these first two columns. (And the stars are derived from the Pr(>|t|) column values.)

Finding the estimate

Firstly, we can get the value of the Estimate column by applying the coefficients() function to the model:

our_estimates <- coefficients(mod_full)

our_estimates
      (Intercept)       Petal.Width      Sepal.Length       Sepal.Width 
       -1.1109888         0.6022215         0.6080058        -0.1805236 
Speciesversicolor  Speciesvirginica 
        1.4633709         1.9742229 

We can see that these values are identical to those in the Estimate column of summary.

Finding the standard errors

What about the contents of the standard error column? Getting this quantity is a bit more cryptic and involved. Firstly, we can use the function vcov() to get the variance-covariance matrix of the model:

mod_full |>
   vcov()
                   (Intercept)   Petal.Width  Sepal.Length  Sepal.Width
(Intercept)        0.072831592  0.0128158873 -0.0063645539 -0.012468737
Petal.Width        0.012815887  0.0147483563 -0.0009380611 -0.003427085
Sepal.Length      -0.006364554 -0.0009380611  0.0025245414 -0.001762701
Sepal.Width       -0.012468737 -0.0034270855 -0.0017627006  0.006457374
Speciesversicolor -0.017507064 -0.0173108502 -0.0024945745  0.009589516
Speciesvirginica  -0.019784873 -0.0263239584 -0.0031243418  0.011820452
                  Speciesversicolor Speciesvirginica
(Intercept)            -0.017507064     -0.019784873
Petal.Width            -0.017310850     -0.026323958
Sepal.Length           -0.002494574     -0.003124342
Sepal.Width             0.009589516      0.011820452
Speciesversicolor       0.030086597      0.040493882
Speciesvirginica        0.040493882      0.059926863

The on-diagonal terms (running from the top left to the bottom right corner of the matrix) give the variances of the coefficient estimates, i.e. the amount of uncertainty around the point estimates (the Estimate column of the model summary output, and the result of calling the coefficients() function on the model). Off-diagonal values in the matrix, by contrast, show the covariance between any two coefficient estimates, i.e. a measure of the strength and direction of association between any two terms.

As we are only interested in the on-diagonal terms from this matrix, however, we can use the diag() function to extract them, turning the matrix we started with into a vector:

mod_full |>
   vcov() |>
   diag()
      (Intercept)       Petal.Width      Sepal.Length       Sepal.Width 
      0.072831592       0.014748356       0.002524541       0.006457374 
Speciesversicolor  Speciesvirginica 
      0.030086597       0.059926863 

This now returns only the variances. The relationship between a variance and a standard error is simple: the standard error is the square root of the variance:

our_ses <- mod_full |>
   vcov() |>
   diag() |>
   sqrt()


our_ses
      (Intercept)       Petal.Width      Sepal.Length       Sepal.Width 
       0.26987329        0.12144281        0.05024482        0.08035779 
Speciesversicolor  Speciesvirginica 
       0.17345489        0.24479964 

Let’s now compare the estimates and standard errors we’ve extracted with those reported in the summary table:

our_summary_table <- data.frame(
   estimate = our_estimates,
   std_err = our_ses
)

our_summary_table
                    estimate    std_err
(Intercept)       -1.1109888 0.26987329
Petal.Width        0.6022215 0.12144281
Sepal.Length       0.6080058 0.05024482
Sepal.Width       -0.1805236 0.08035779
Speciesversicolor  1.4633709 0.17345489
Speciesvirginica   1.9742229 0.24479964
summary(mod_full)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + 
    Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78396 -0.15708  0.00193  0.14730  0.65418 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *  
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared:  0.9786,    Adjusted R-squared:  0.9778 
F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16

Both the estimates and standard errors appear identical.

Calculating the t values

How do we then produce the t values shown in the summary tables? These are simply the number (and direction) of standard errors that the estimate is away from zero. We can produce them by dividing one column by the other:

our_t_values <- our_estimates / our_ses

our_summary_table <- data.frame(
   estimate = our_estimates,
   std_err = our_ses,
   t_value = our_t_values
)

our_summary_table
                    estimate    std_err   t_value
(Intercept)       -1.1109888 0.26987329 -4.116705
Petal.Width        0.6022215 0.12144281  4.958890
Sepal.Length       0.6080058 0.05024482 12.100867
Sepal.Width       -0.1805236 0.08035779 -2.246498
Speciesversicolor  1.4633709 0.17345489  8.436608
Speciesvirginica   1.9742229 0.24479964  8.064648

Again, the t values we have calculated are identical to those reported in the summary.

Deriving the p values

Finally, how do we calculate the associated P value? This is the probability of a t distribution generating a t value as extreme as, or more extreme than, the t value we have observed for that coefficient. Calculating it requires two inputs: the observed t value itself, and the number of degrees of freedom of the t distribution, which, as discussed previously, is the difference between the number of observations in the dataset and the number of parameters in the model.

The default type of test reported in the summary table is a two-sided test, meaning that t values much lower than zero, and t values much higher than zero, both count as deviations potentially inconsistent with the implied Null hypothesis that the true value of the coefficient is zero. Because of this, the P value involves doubling the complement of (1 minus) the cumulative probability of a t distribution up to and including the absolute value of the t statistic observed…

Got that? Don’t worry! The code required to do this is shown below:

n_obs <- nrow(iris)
n_coeff <- length(our_estimates)

df <- n_obs - n_coeff

# Our magical formula for the P value below:
our_p_values <- 2 * (1 - pt(abs(our_t_values), df = df))

our_summary_table <- data.frame(
   estimate = our_estimates,
   std_err = our_ses,
   t_value = our_t_values,
   p_value = our_p_values
)

our_summary_table
                    estimate    std_err   t_value      p_value
(Intercept)       -1.1109888 0.26987329 -4.116705 6.445916e-05
Petal.Width        0.6022215 0.12144281  4.958890 1.968679e-06
Sepal.Length       0.6080058 0.05024482 12.100867 0.000000e+00
Sepal.Width       -0.1805236 0.08035779 -2.246498 2.619373e-02
Speciesversicolor  1.4633709 0.17345489  8.436608 3.153033e-14
Speciesvirginica   1.9742229 0.24479964  8.064648 2.600142e-13

Again, other than the p values being reported to slightly more significant figures, the results for all values in our bespoke table are identical to those reported in the statistical model summary report.
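One small aside on the Sepal.Length row, which our table shows as exactly zero: the expression 1 - pt(...) loses precision when the cumulative probability is extremely close to 1. A numerically safer but otherwise equivalent formulation (a sketch, using the same t values and degrees of freedom as above) works directly in the upper tail:

# equivalent two-sided P value, avoiding the loss of precision from 1 - (almost 1)
our_p_values_precise <- 2 * pt(abs(our_t_values), df = df, lower.tail = FALSE)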

QED: As we have been able to reconstruct all values in the statistical summary table from scratch, we have now confirmed we understand exactly what this summary table reports, and what steps and processes were involved in producing these values.
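For completeness, we can also regenerate the stars themselves, using the same cut points shown in the Signif. codes legend. This is a small sketch using the symnum() function, which is essentially how R’s own printing code maps P values to stars:

# map P values to the same star symbols as the summary print-out
our_summary_table$stars <- as.character(symnum(
   our_p_values,
   corr = FALSE,
   cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
   symbols = c("***", "**", "*", ".", " ")
))

our_summary_table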

Reflection

Does what the statistical summary table contains, and the way it’s derived, surprise you? Are the hypotheses it tests hypotheses that actually interest you from a substantive point of view? The P values are really just rules of thumb about whether you might ‘get away with’ using a slightly simpler model, rather than tests of carefully constructed hypotheses about the world as it might be if a relationship you think you see between two variables were not there (i.e. a Null distribution derived from first principles). Because of this, what’s usually tested in statistical summary tables, and the P values reported, are often not as meaningful or relevant as many (ab)users of statistical model outputs tend to assume.

One thing the summary model does seem to be suggesting is that, maybe, we can ‘get away’ with a model that drops the term Sepal.Width; this is because it is the only term in the coefficients table with a P value of more than 0.01. (If we were to use the standard P value threshold of 0.05 then there would be no candidate terms for dropping, as every P value is below 0.05.) A model without Sepal.Width would be an example of a restricted model, excluding the least statistically significant term; the unrestricted model would then be the model we currently have. Let’s construct this restricted model:

mod_restricted <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Species, data = iris)


summary(mod_restricted)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Species, 
    data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.76508 -0.15779  0.01102  0.13378  0.66548 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.45957    0.22387  -6.520 1.09e-09 ***
Petal.Width        0.50641    0.11528   4.393 2.15e-05 ***
Sepal.Length       0.55873    0.04583  12.191  < 2e-16 ***
Speciesversicolor  1.73146    0.12762  13.567  < 2e-16 ***
Speciesvirginica   2.30468    0.19839  11.617  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2664 on 145 degrees of freedom
Multiple R-squared:  0.9778,    Adjusted R-squared:  0.9772 
F-statistic:  1600 on 4 and 145 DF,  p-value: < 2.2e-16

Let’s compare the summary from the restricted model with the summary from the full model again:

summary(mod_full)

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + 
    Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78396 -0.15708  0.00193  0.14730  0.65418 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *  
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared:  0.9786,    Adjusted R-squared:  0.9778 
F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16

One thing we notice is that, as a result of dropping the term Sepal.Width, all the other coefficients change: the standard errors of these coefficients tend to reduce a bit, as we might expect now the data budget can be shared a bit more generously, but at the same time the values in the Estimate column, which give the point estimates for each coefficient, have also changed. So, dropping one term doesn’t just improve the precision of the estimates of the other terms; it shifts the centres (point estimates) of those terms as well.

The fact that the P value on the t test for the Sepal.Width term was not statistically significant at ‘p < 0.01’ gave a signal that maybe that term was a candidate for dropping from the model. But it didn’t actually demonstrate that the restricted model, which excludes this term, is better than the unrestricted model that contains it.

What might be a better comparison between the models, now we’ve gone to the trouble of specifying and fitting the restricted model, and have seen that the coefficients’ point values as well as standard errors have changed? Well, we can compare how well the restricted and unrestricted models each fit the observed data. The unrestricted model will almost always fit slightly better, simply because more coefficients give a model more ways to ‘flex’ towards the observed data (even if the extra coefficients are attached to variables that are just random numbers, providing no real predictive information about the response). But a statistically significant difference in model fit between the unrestricted and restricted models suggests the unrestricted model may be worth keeping, even though it stretches the data budget further.

Statistically significant? Doesn’t this imply another hypothesis test, with another test statistic to be compared against another Null distribution? Yes it does. When comparing two linear regression models in this way, the test statistic - which is a summary measure of the difference in model fit between the two models - is known as the F statistic, and the Null distribution is an F distribution. For better or worse, we can elide the algebraic derivations and reasoning behind this particular test, and just be aware that, if we pass an unrestricted and a restricted model as the first two arguments to R’s anova() (Analysis of Variance) function, then the F test comparing the quality of fit of the restricted and unrestricted models is performed:

f_test_results <- anova(mod_full, mod_restricted) 

f_test_results
Analysis of Variance Table

Model 1: Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + Species
Model 2: Petal.Length ~ Petal.Width + Sepal.Length + Species
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1    144  9.9397                              
2    145 10.2880 -1  -0.34835 5.0468 0.02619 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the anova() function:

  • reports that two models are being compared;
  • numbers these models model 1 and model 2;
  • reports the residual sum of squares (RSS) for both models (i.e. a summary statistic of the error between observed and predicted response variables);
  • reports the residual degrees of freedom for both models;
  • notes the difference in degrees of freedom between the two models;
  • notes the difference between the RSS of the two models (this is reported in the column Sum of Sq);
  • calculates an observed F statistic given the above information;
  • performs an F test with this information, and reports a P value.

Note the P value reported from this F test. Let’s compare it with the P value on the Sepal.Width coefficient from the t test in the full model’s summary table. We can extract and compare these two P values as follows:

broom::tidy(summary(mod_full))[4, 'p.value'] |> unlist() -> p_from_t

broom::tidy(anova(mod_full, mod_restricted))[2, 'p.value'] |> unlist() -> p_from_f

p_from_t
   p.value 
0.02619373 
p_from_f
   p.value 
0.02619373 

It looks like the P values from the F test and from the t test are identical. Let’s check this:

identical(p_from_t, p_from_f)
[1] FALSE

They’re not exactly identical (most likely differing only by floating point rounding), but they agree to the number of decimal places presented, and would lead to the same conclusion about whether to exclude the Sepal.Width variable: if we are using a P < 0.05 threshold as our decision rule about whether to keep or drop a variable, then both approaches lead to the same answer: keep this variable.

So, in this example, the row-wise t tests on the coefficients of the full model provide a useful shortcut means of assessing whether to keep or drop a variable, without having to build as many restricted models as there are coefficients, and then running F tests on the full model against each of the restricted models. Back when the level of computing time required to fit each model was substantial - such as when it took teams of people employed as computers days to perform the calculations necessary - the t test on the full model would have helped save a lot of resource.
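As an aside (a sketch of my own, not something used in the workflow above): base R’s drop1() function automates exactly this series of single-term comparisons, fitting each restricted model and running the corresponding F test in one call:

# each row compares mod_full against the model with that single term dropped;
# note drop1() works term-by-term, so the two Species coefficients are tested together
drop1(mod_full, test = "F")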

For at least the last two or three decades, however, the practical savings involved in relying on t tests of one model, rather than actually performing a series of pairwise restricted/unrestricted model comparisons, are tiny: maybe a couple of minutes of code-writing time, and a second or two of computing time. However, the standard statistical model outputs have been around long enough that, for better or worse (hint: for worse), star gazing at P values based on t tests of coefficients of the unrestricted model has assumed a kind of totemic status in statistical analysis. It has become seen as the way of doing robust quantitative scientific research on all manner of topics, in a way that provides dispositive proof for or against a particular claim of significance. As mentioned previously, this is the cult of statistical significance.

Stepping with AIC: A different way of assessing whether we can prune a model

There are some other ways of trying to work out whether we can trim terms from a model specification, which don’t carry the kind of baggage of ritualised overinterpretation that the t tests of the standard statistical model summary have acquired. One approach is to use the step() function, which runs an algorithm for searching through model specifications on our behalf. Let’s see what happens when we ask the step() function whether it thinks it can shave off any terms without significantly harming the model fit:

step(mod_full)
Start:  AIC=-395.12
Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + Species

               Df Sum of Sq     RSS     AIC
<none>                       9.9397 -395.12
- Sepal.Width   1    0.3484 10.2880 -391.95
- Petal.Width   1    1.6974 11.6371 -373.47
- Species       2    4.9133 14.8529 -338.87
- Sepal.Length  1   10.1075 20.0472 -291.88

Call:
lm(formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width + 
    Species, data = iris)

Coefficients:
      (Intercept)        Petal.Width       Sepal.Length        Sepal.Width  
          -1.1110             0.6022             0.6080            -0.1805  
Speciesversicolor   Speciesvirginica  
           1.4634             1.9742  

What the step() function did is largely equivalent to running a series of F tests, comparing the unrestricted model mod_full against a series of partially restricted models, each of which drops a single term from the original specification. step() isn’t exactly equivalent, however, because rather than the F test it uses a different metric to compare the quality and efficiency of fit of the models: AIC, which stands for Akaike’s Information Criterion (or ‘An Information Criterion’). The purposes of AIC include:

    1. to quantify the trade-off between model fit and model complexity in a single value, a penalised model fit score;
    2. to generalise the intuition of F tests beyond standard linear regression, by using a summary statistic based on log-likelihood;
    3. to allow for comparison between models that are not nested.

The AIC is a summary measure of how well the model fits the data, combined with a penalty for model complexity. Given a single dataset \(D\), an ensemble of potentially viable model specifications \(\{M_1, M_2, M_3, \dots, M_K\}\) could be fit to it. Each of these models will differ both in terms of model fit and model complexity. The formula for AIC is \(\mathrm{AIC} = -2 \log \hat{L} + 2k\), where \(\log \hat{L}\) is the maximised log-likelihood and \(k\) is the number of parameters in the model. The key observation here is that better fit of the model to the data, as represented by the log-likelihood, counts negatively towards the AIC value, whereas more model complexity, represented by the number of parameters, counts positively towards the score. What this means is that the quality of the fit, and the complexity of the model, pull the overall AIC score in opposite directions.
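As a quick check, here is a sketch (my own, not part of the original output) of computing AIC by hand for the full iris model. Note that for a linear model the parameter count includes the estimated residual variance as well as the coefficients, and that step() reports AIC on a slightly different scale (via extractAIC()), so its printed values differ from AIC() by a constant that is the same for every model compared on the same data:

ll <- logLik(mod_full)
k <- attr(ll, "df")            # estimated parameters: six coefficients plus the residual variance
-2 * as.numeric(ll) + 2 * k    # AIC computed by hand
AIC(mod_full)                  # the built-in helper, which should give the same value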

Note that, whereas with the F test the decision rule is based on a Null distribution and a P value below a critical threshold, and so is a form of statistical significance test, when using AIC the decision rule is usually simpler: pick the smallest number.

What the step() function has done is:

  • Fit the model with all specified parameters, and returned the AIC (-395.12); this is shown in the <none> row.
  • Fit the model with the term Sepal.Width removed, and returned the AIC for this slightly simpler model (-391.95). It has also reported (in the Df column) that dropping this term frees up one degree of freedom, because one fewer coefficient has to be estimated from the data.
  • Done the same for each of the other three coefficients in the model: Petal.Width, then Species, then Sepal.Length.

After doing this, the step() procedure has determined that the original model, with all four predictor terms included, has superior (lower) AIC than all four alternative candidate models.

As a result of this, the step() procedure has decided to stop: it cannot identify any single deletion from the model that results in a better (lower) AIC than the original, unrestricted model.

Discussion

In this post, we’ve shown how the standard statistical summary outputs for a linear regression model are derived, and presented a framework for thinking about the role of such model outputs in helping us consider whether we can ‘get away’ with a slightly simpler model specification. The role of statistical model outputs within this framework - in which they constitute just one step in an iterative process - is a far cry from the way such outputs are often interpreted: as an authoritative source of ‘truth’ as to whether one thing is or is not (substantively) ‘significant’. In some cases, such as within a well designed and well conducted randomised control trial, the statistical significance of a coefficient may well answer the question of interest authoritatively; but in many cases the results from these statistical model summaries are vastly (and routinely) overinterpreted.

Having derived the components of standard statistical model outputs, we can hopefully see at least two problems with star-gazing such tables:

  • The problem of multiple testing: If we do not start with a specific hypothesis, and simply star-gaze a large table of coefficients, we are more likely to see a statistically significant result by chance alone.
  • The problem of inappropriate Null hypotheses: For instance, if we are fairly sure that, were a predictor to have an effect on the response variable, it would be in the positive direction, then our test should be one-sided. By contrast the P values reported in statistical model summaries are for two sided tests.
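As a small illustration of the one-sided point, using objects from the iris example above (the direction of the hypothesis is chosen purely for illustration):

# suppose we had good prior reasons to expect the Sepal.Width coefficient to be negative;
# a one-sided test then only counts the lower tail, halving the two-sided P value
t_sw <- our_t_values[["Sepal.Width"]]
pt(t_sw, df = df)              # one-sided P value (alternative: coefficient < 0)
2 * pt(-abs(t_sw), df = df)    # two-sided P value, as reported in the summary table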

To help think further about the latter point, let’s return to the model specification that was the focal example of the third section of my statistics series. This specification is as follows:

head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5
mod_full_int <- lm(len ~ log(dose) * supp, data = ToothGrowth)

mod_full_int

Call:
lm(formula = len ~ log(dose) * supp, data = ToothGrowth)

Coefficients:
     (Intercept)         log(dose)            suppVC  log(dose):suppVC  
          20.663             9.255            -3.700             3.845  

The formula for this model specification is len ~ log(dose) * supp, where the * indicates that the main effects of, and the interaction between, the terms on either side of the * should all be included in the model. In this case the main-effect terms are log(dose) and supp, and the interaction term between them is log(dose):suppVC. The summary report is as follows:

summary(mod_full_int)

Call:
lm(formula = len ~ log(dose) * supp, data = ToothGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5433 -2.4921 -0.5033  2.7117  7.8567 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       20.6633     0.6791  30.425  < 2e-16 ***
log(dose)          9.2549     1.2000   7.712  2.3e-10 ***
suppVC            -3.7000     0.9605  -3.852 0.000303 ***
log(dose):suppVC   3.8448     1.6971   2.266 0.027366 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.72 on 56 degrees of freedom
Multiple R-squared:  0.7755,    Adjusted R-squared:  0.7635 
F-statistic:  64.5 on 3 and 56 DF,  p-value: < 2.2e-16

Say we are interested in investigating the following hypothesis:

The supplement allocated (VC or OJ) does not influence growth (the len variable)

Which of the four coefficients in the model summary need to be looked at, and which associated p value, in order to evaluate this hypothesis?

Now, because there’s an interaction term between log(dose) and supp included in the model, there are two coefficients which involve suppVC, each representing a path of influence through which the selection of supplement could affect the response variable, len. These are:

  • The direct path of influence: suppVC
  • The indirect path of influence: log(dose):suppVC

The hypothesis, as stated, does not distinguish between these two paths of influence. Therefore, no single coefficient and associated P value represents the alternative hypothesis being considered.

So, in this example, we need to construct our own restricted model for comparison, and use the F test or equivalent. Our restricted model would be the one in which both the terms for the direct and indirect paths of influence are removed, i.e. 

mod_restricted_int <- lm(len ~ log(dose), data = ToothGrowth)

summary(mod_restricted_int)

Call:
lm(formula = len ~ log(dose), data = ToothGrowth)

Residuals:
   Min     1Q Median     3Q    Max 
-8.061 -3.274 -1.061  3.063 10.434 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.8133     0.5496   34.23   <2e-16 ***
log(dose)    11.1773     0.9711   11.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.257 on 58 degrees of freedom
Multiple R-squared:  0.6955,    Adjusted R-squared:  0.6902 
F-statistic: 132.5 on 1 and 58 DF,  p-value: < 2.2e-16

We can now compare the unrestricted model, with both paths of influence, with the restricted model, with no paths of influence, as follows:

anova(mod_full_int, mod_restricted_int)
Analysis of Variance Table

Model 1: len ~ log(dose) * supp
Model 2: len ~ log(dose)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     56  774.89                                  
2     58 1051.26 -2   -276.37 9.9865 0.0001953 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here anova() reports that the restricted model has two fewer estimated parameters than the unrestricted model (a Df of -2), rather than the single-parameter difference we saw in the earlier comparison. The associated P value is very low, well below 0.001, suggesting that the supplement terms should be retained as predictors of the response in the model.

Conclusion

This (fairly long) post has aimed to demystify and disenchant the standard statistical model outputs people are presented with after running a regression model. Demystify, in the sense of showing exactly how the standard model summary outputs are calculated, and how we can generate them from scratch. And disenchant, in the sense of trying to break the spell of the Cult of Statistical Significance that has long pervaded and perverted quantitative research throughout much of academia and the clinical sciences.

I have suggested that the standard summary outputs from statistical regression models definitely have some value, and are inexpensive ways of thinking about the potential balance and trade-off between models in terms of their complexity (number of parameters) on the one hand, and their fit (deviation between observed and predicted outputs) on the other hand. I’ve argued that models should be thought of mainly as data reduction tools, and the standard summary outputs as cheap advice on how to best spend a fixed data budget, i.e. the number of observations in the dataset.

However, the magical, totemic position that these model outputs, and especially the stars printed alongside coefficients, have acquired over many decades ultimately - in my view - represents a disservice to, and a drag on, applied quantitative science. The reliance on interpreting individual coefficients as either ‘statistically significant’ or ‘not statistically significant’ has the ironic consequence of rendering the main presented output of quantitative models as something qualitative: to be (statistically significant), or not to be. That is the question (those who apply this approach assume).

Many of the individual elements discussed in this post have been introduced or utilised in earlier posts I’ve written. For example, I first discussed the overreliance on interpreting individual coefficients as quantities of interest in this post on beta values; I made use of terminology like restricted and unrestricted models, and of AIC and anova, in this post on selecting a model specification; and I introduced hypothesis testing in this post on the infer package. However, this is the first post where I’ve tried to focus on the concepts of P values and statistical significance of the type most people are familiar with. I hope this post therefore manages to tie these disparate pieces together a bit more coherently.

Thanks for reading!

References

McCloskey, Deirdre, and Steve Ziliak. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of Michigan Press. https://doi.org/10.3998/mpub.186351.