# node coordinates for the DAG drawn in the scientific model section below
coords <- data.frame(
  name = c('H', 'W', 'U'),
  x = c(1, 2, 3),
  y = c(0, 0, 0)
)
Lecture 03 Notes
Geocentric models
Prediction without explanation: the geocentric model places Earth at the center of the universe and uses epicycles, yet it predicts planetary motion accurately.
Linear regressions are geocentric: they describe associations and make predictions, but they are mechanistically wrong. They are still useful when handled with care.
Normal distribution
Summed fluctuations tend towards the normal distribution.
For estimating only the mean and variance, the normal distribution is the least informative distribution (maximum entropy).
Note: variables do not have to be normally distributed for normal models to be useful.
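A quick simulation makes the first point concrete: sums of many small, independent fluctuations look roughly normal even when the individual fluctuations are not. This is only an illustrative sketch; the number of fluctuations and replicates are arbitrary choices.
set.seed(1)
# each simulated value is the sum of 16 small uniform fluctuations
sums <- replicate(1e4, sum(runif(16, min = -1, max = 1)))
# the distribution of the sums is approximately bell-shaped
hist(sums, breaks = 50, main = 'sums of uniform fluctuations')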
Linear regression
Workflow
- Question/goal/estimand
- Scientific model
- Statistical model(s)
- Validate model
- Analyze data
Example: Howell data
1. Estimand
Describe the association between adult weight and height
2. Scientific model
How does height influence weight?
H -> W <- U
library(ggdag)

# uses the coords data frame defined at the top of these notes
dagify(
  W ~ H + U,
  coords = coords
) |> ggdag(seed = 2, layout = 'auto') + theme_dag()
This DAG corresponds to: “weight is some function of height and unobserved variables” (\(W = f(H, U)\)).
\[W = \beta H + U\]
The unobserved variable \(U\) is normally distributed with a mean of 0 and a standard deviation of \(\sigma\).
\[U_{i} \sim Normal(0, \sigma)\]
Height is uniformly distributed between 130 and 170 cm.
\[H_{i} \sim Uniform(130, 170)\]
There is a deterministic relationship for W and distributional relationships for U and H.
Corresponding generative code is:
sim_weight <- function(H, beta, sigma) {
  # unobserved influences on weight
  U <- rnorm(length(H), mean = 0, sd = sigma)
  # weight is a deterministic function of height plus the unobserved influences
  W <- beta * H + U
  return(W)
}

heights <- seq.int(10, 50)  # a simple grid of heights for illustration
sim_weight(heights, beta = 0.5, sigma = 0.1)
[1] 4.910309 5.518485 6.158785 6.386962 6.991975 7.513242 8.070795
[8] 8.476030 9.198447 9.486121 10.041765 10.598175 10.960730 11.396033
[15] 12.178223 12.268893 13.087860 13.503581 14.101283 14.543227 15.209082
[22] 15.380007 16.158964 16.695465 17.000494 17.254829 18.047724 18.440344
[29] 19.079220 19.528964 20.073894 20.531896 21.107616 21.471584 21.922332
[36] 22.440434 22.827402 23.409742 23.944094 24.475349 24.961641
3. Statistical model
\(E(W_{i} | H_{i}) = \alpha + \beta H_{i}\)
- \(E(W_{i} | H_{i})\): average weight conditional on height
- \(\alpha\): intercept
- \(\beta\): slope, the expected change in weight per unit of height
Posterior distribution
\(Pr(\alpha, \beta, \sigma | H_{i}, W_{i}) = \frac{Pr(W_{i} | H_{i}, \alpha, \beta, \sigma) Pr (\alpha, \beta, \sigma)}{Z}\)
- \(Pr(\alpha, \beta, \sigma | H_{i}, W_{i})\): posterior probability of a specific line
- \(Pr(W_{i} | H_{i}, \alpha, \beta, \sigma)\): the garden of forking data, i.e. the number of ways we could see the observations conditional on a specific line
- \(Pr (\alpha, \beta, \sigma)\): prior
- \(Z\): normalizing constant
Alpha, beta, and sigma are unobserved variables (parameters); height and weight are observed.
The posterior distribution is proportional to the number of ways the observations could arise under our assumptions, multiplied by the prior.
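To make the proportionality concrete, here is a small grid-approximation sketch for this model. The grids, the fixed sigma, and the simulated data are assumptions for illustration, and the prior is flat over the grid, so the posterior is proportional to the likelihood.
# grid approximation of the posterior for alpha and beta (sigma held fixed)
W <- sim_weight(heights, beta = 0.5, sigma = 5)
sigma <- 5
grid <- expand.grid(
  alpha = seq(-10, 10, length.out = 50),
  beta = seq(0, 1, length.out = 50)
)
# log of the number of ways to see the observations for each candidate line
grid$loglik <- sapply(seq_len(nrow(grid)), function(i) {
  sum(dnorm(W, mean = grid$alpha[i] + grid$beta[i] * heights, sd = sigma, log = TRUE))
})
# flat prior over the grid, so the posterior is proportional to the likelihood
grid$posterior <- exp(grid$loglik - max(grid$loglik))
grid$posterior <- grid$posterior / sum(grid$posterior)
# the posterior mean of beta should be near the simulated value of 0.5
sum(grid$beta * grid$posterior)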
Prior predictive simulation
What do the observable variables look like with these priors?
\(W_{i} \sim Normal(\mu_{i}, \sigma)\)
\(\mu_{i} = \alpha + \beta H_{i}\)
\(\alpha \sim Normal(0, 10)\)
\(\beta \sim Uniform(0, 1)\)
\(\sigma \sim Uniform(0, 10)\)
These priors encode simple constraints: when height is 0, expected weight is 0 (so \(\alpha\) is centered near 0); weight increases with height on average (so \(\beta\) is positive); weight (kg) is numerically less than height (cm) (so \(\beta\) is less than 1); and \(\sigma\) must be positive.
Priors:
- there are no correct priors, only scientifically justifiable priors
- justify them with information outside the data, like the rest of the model
- priors become more important as models get more complex
- always simulate from the priors; this is better than trying to intuit their implications from the definitions (a sketch follows this list)
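A minimal prior predictive sketch, assuming the priors above and the 130-170 cm height range; the number of sampled lines is an arbitrary choice:
n_lines <- 50
alpha <- rnorm(n_lines, mean = 0, sd = 10)
beta <- runif(n_lines, min = 0, max = 1)
# draw the regression lines implied by samples from the priors
plot(NULL, xlim = c(130, 170), ylim = c(-20, 120),
  xlab = 'height (cm)', ylab = 'weight (kg)')
for (i in seq_len(n_lines)) {
  abline(a = alpha[i], b = beta[i], col = adjustcolor('black', alpha.f = 0.3))
}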
4. Validate model
Test the statistical model with simulated observations from the scientific model. A stronger test is simulation-based calibration.
- simulate individuals with known parameters from the scientific model
- use the statistical model to check whether it recovers those parameters (see the sketch below)
- use a large sample
- test with different parameter values and sample sizes
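A minimal parameter-recovery sketch, assuming the rethinking package (quap and precis come from it); the simulated sample size and true parameter values are arbitrary choices for this check, and the priors are the ones defined above:
library(rethinking)

set.seed(93)
H <- runif(200, min = 130, max = 170)
W <- sim_weight(H, beta = 0.5, sigma = 5)

# fit the statistical model to the simulated individuals
fit <- quap(
  alist(
    W ~ dnorm(mu, sigma),
    mu <- a + b * H,
    a ~ dnorm(0, 10),
    b ~ dunif(0, 1),
    sigma ~ dunif(0, 10)
  ),
  data = list(W = W, H = H)
)

# the posterior summaries should be close to beta = 0.5 and sigma = 5
precis(fit)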
5. Analyze the data
Once you’ve done steps 1-4, you are ready to analyze the real data.
Parameters are not independent; do not interpret them separately.
Use the posterior distribution to generate predictions on the outcome scale, then describe and interpret those.
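As a sketch of describing results on the outcome scale, continuing the hypothetical fit object from the validation sketch above (link, sim, PI, and shade are functions from the rethinking package):
H_new <- seq(130, 170, length.out = 50)
mu <- link(fit, data = list(H = H_new))      # posterior samples of mean weight
W_pred <- sim(fit, data = list(H = H_new))   # predicted weights on the outcome scale

# plot the posterior mean relationship with compatibility intervals
plot(H_new, colMeans(mu), type = 'l',
  xlab = 'height (cm)', ylab = 'weight (kg)')
shade(apply(mu, 2, PI), H_new)      # uncertainty in the mean
shade(apply(W_pred, 2, PI), H_new)  # uncertainty in individual predictions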