Lecture 13 Notes

Author

Alec L. Robitaille

Published

April 17, 2024

Clusters and features

Clusters: kinds of groups in the data

Features: aspects of the model (parameters) that vary by cluster

Examples (clusters -> features):

  • tanks -> survival
  • stories -> treatment effect
  • individuals -> average responses
  • departments -> admission rate, bias

Adding more clusters involves adding more index variables and more population priors.

Adding more features involves adding more parameters, increasing the dimensions in each population prior.
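For example, in the Bangladesh models below, districts are the cluster: using them adds the index \(D[i]\) and one population prior, \(\alpha_{j} \sim Normal(\bar{\alpha}, \sigma)\). Adding a varying urban slope is adding a feature: a second parameter per district and a second population prior, \(\beta_{j} \sim Normal(\bar{\beta}, \tau)\).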

Varying effects are a way to measure unmeasured confounds. Unmeasured features of clusters leave an imprint on the data that can be estimated given repeat observations of each cluster and partial pooling among clusters.

This helps us, from the predictive perspective, to regularize cluster-level variance, and, from the causal perspective, to manage unobserved confounds and competing causes using cluster-level causes.

Practical difficulties of varying effects:

  • How to use more than one cluster type at the same time?
  • How to calculate predictions? At which level?
  • How to sample chains efficiently?
  • Group-level confounding

Strategy for modeling varying effects

  1. Varying intercept on one cluster
  2. Varying intercepts on one cluster plus varying slopes (e.g. the district intercepts and urban slopes below). These are two separate 1-dimensional distributions, so no information is shared between the intercepts and the slopes.
  3. Correlated varying effects

Example: Bangladesh fertility survey

Outcome: contraceptive use

Variables: age, living children, urban/rural, districts

library(ggdag)

# nodes: A = age, C = contraceptive use, D = district, K = living children (kids), U = urban
coords <- data.frame(
    name = c('A', 'C', 'D', 'K', 'U'),
    x =    c(1,   2,    3,   1.5, 2.5),
    y =    c(0,   0.5,  0,  -1,  -1)
)

dagify(
    C ~ A + K + D + U,
    K ~ A + U,
    U ~ D,
    coords = coords
) |> ggdag(seed = 2, layout = 'auto') + theme_dag()

In addition,

  • group-level confounds: unmeasured things about districts may influence features of individuals in those districts (e.g. the history of the districts)

  • unmeasured variables: family
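Before fitting the models below, the data can be set up along these lines. This is a minimal sketch, assuming the bangladesh dataset from the rethinking package with columns use.contraception, district, and urban; the list name dat and its element names are illustrative.

library(rethinking)

data(bangladesh)
d <- bangladesh

dat <- list(
    C = d$use.contraception,     # contraceptive use, 0/1
    D = as.integer(d$district),  # district ID, 1 to 61 (one district has no rows)
    U = d$urban                  # urban (1) vs rural (0)
)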

Varying intercepts on each district

\(C_{i} \sim Bernoulli(p_{i})\)

\(logit(p_{i}) = \alpha_{D[i]}\)

\(\alpha_{j} \sim Normal(\bar{\alpha}, \sigma)\)

\(\bar{\alpha} \sim Normal(0, 1)\)

\(\sigma \sim Exponential(1)\)

(Bernoulli distribution since contraception is 0/1 outcome)
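A rough sketch of fitting this model with ulam from the rethinking package, using the dat list above (the model name and the explicit 61 district slots, which let the district without data still get a prior-based estimate, are assumptions of this sketch):

m_D <- ulam(
    alist(
        C ~ bernoulli(p),
        logit(p) <- a[D],
        vector[61]:a ~ normal(abar, sigma),  # partial pooling across the 61 districts
        abar ~ normal(0, 1),
        sigma ~ exponential(1)
    ), data = dat, chains = 4, cores = 4
)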

Note the uneven sampling across districts: because of partial pooling there is shrinkage in the clusters. There is more shrinkage in clusters with fewer observations (e.g. the districts with 2 and 13 observations) and less in clusters with more observations (e.g. the districts with 35 and 45 observations).

In addition, for one district there is no data at all; the minimum sample size is 0, because the prior still provides an estimate. That prior is informed, since it is estimated from all the other districts, including the variation across districts, so the estimate for the empty district is in essence a prediction.
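One way to get those per-district predictions, sketched with the (assumed) m_D fit above: push each district's intercept samples back through the inverse logit.

post <- extract.samples(m_D)      # post$a is a samples-by-61 matrix
p_district <- inv_logit(post$a)   # probability of contraceptive use per district
colMeans(p_district)              # posterior mean per district, including the one with no data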

Varying intercepts on districts and slopes on urban

District features are potential group-level confounds. To estimate the total effect of urban (U), we need to stratify by district but not by kids (K) because the total effect of U passes through K.

\(C_{i} \sim Bernoulli(p_{i})\)

\(logit(p_{i}) = \alpha_{D[i]} + \beta_{D[i]}U_{i}\)

\(\alpha_{j} \sim Normal(\bar{\alpha}, \sigma)\)

\(\beta_{j} \sim Normal(\bar{\beta}, \tau)\)

\(\bar{\alpha}, \bar{\beta} \sim Normal(0, 1)\)

\(\sigma, \tau \sim Exponential(1)\)
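A sketch of this centered parameterization in ulam, continuing from the data list above (names are illustrative):

m_DU <- ulam(
    alist(
        C ~ bernoulli(p),
        logit(p) <- a[D] + b[D]*U,
        # centered: abar/sigma and bbar/tau sit inside the population priors
        vector[61]:a ~ normal(abar, sigma),
        vector[61]:b ~ normal(bbar, tau),
        abar ~ normal(0, 1),
        bbar ~ normal(0, 1),
        sigma ~ exponential(1),
        tau ~ exponential(1)
    ), data = dat, chains = 4, cores = 4
)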

Result: roughly 4 of 2000 transitions end with a divergence, and for 2000 samples, tau’s number of effective samples (n_eff) is quite low.

The problem is that some parameters appear inside the priors for other parameters: \(\bar{\alpha}\) and \(\sigma\) locate and scale the prior for each \(\alpha_{j}\) (and \(\bar{\beta}\) and \(\tau\) do the same for each \(\beta_{j}\)). This parameterization is called “centered”. We can re-express the same model with non-centered priors, replacing the nested priors with standardized Normal(0, 1) priors that are then transformed back. The two versions are mathematically equivalent, but the non-centered model samples much more efficiently.

A z-score is a standardized Gaussian deviation: subtract the mean and divide by the standard deviation. It is easily reversed by multiplying by the standard deviation and adding the mean back.
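In symbols: \(z = (x - \mu) / \sigma\), and reversing, \(x = \mu + z\sigma\).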

\(C_{i} \sim Bernoulli(p_{i})\)

\(logit(p_{i}) = \alpha_{D[i]} + \beta_{D[i]}U_{i}\)

\(\alpha_{j} = \bar{\alpha} + Z_{\alpha, j} * \sigma\)

\(\beta_{j} = \bar{\beta} + Z_{\beta, j} * \tau\)

\(Z_{\alpha, j} \sim Normal(0, 1)\)

\(Z_{\beta, j} \sim Normal(0, 1)\)

\(\bar{\alpha}, \bar{\beta} \sim Normal(0, 1)\)

\(\sigma, \tau \sim Exponential(1)\)
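A sketch of the non-centered version in ulam. One way to write it is to scale and shift the z-scores inside the linear model and then recover the district effects as generated quantities; this is an illustrative translation of the math above, not verbatim lecture code.

m_DU_nc <- ulam(
    alist(
        C ~ bernoulli(p),
        # non-centered: z-scores scaled by sigma/tau and shifted by abar/bbar
        logit(p) <- abar + za[D]*sigma + (bbar + zb[D]*tau)*U,
        vector[61]:za ~ normal(0, 1),
        vector[61]:zb ~ normal(0, 1),
        abar ~ normal(0, 1),
        bbar ~ normal(0, 1),
        sigma ~ exponential(1),
        tau ~ exponential(1),
        # recover district intercepts and slopes on the original scale
        gq> vector[61]:a <<- abar + za*sigma,
        gq> vector[61]:b <<- bbar + zb*tau
    ), data = dat, chains = 4, cores = 4
)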

Result: much more efficient sampling.

The effective sample size within each cluster drops as you add more cluster types and cut the data into smaller groups.

The urban estimates look more uncertain, possibly because of smaller sample sizes in urban areas (there are fewer urban observations).