Lecture 02 Notes

Author

Alec L. Robitaille

Published

January 9, 2023

Bayesian data analysis

Bayesian data analysis is Bayesian inference applied to scientific data analysis. For each possible explanation of the sample, count all the ways the sample could have happened; explanations with more ways to produce the sample are more plausible.

There are no guarantees except logical ones: probability theory is a method of logically deducing the implications of the data under assumptions that you must choose.

Workflow

  1. Define a generative model of the sample
  2. Define a specific estimand
  3. Design a statistical way to produce the estimand
  4. Test the statistical method using the generative model
  5. Analyse the sample, summarize

Globe tossing model

Estimand: proportion of globe covered in water

Given N samples, how should we use the sample? Produce a summary of the sample? Represent uncertainty?

First, think about how the sample is produced and how the variables influence one another.

  • p = proportion of water
  • L = land observations
  • W = water observations
  • N = number of tosses

Note

Variables can be things you observe or things you want to estimate.

N influences W and L: if you were to intervene on N (e.g. increase the number of observations), it would change W and L.

To make the DAG generative, we need to define these relationships as a function.

\(W, L = f(p, N)\)

What is the function?

Sample: W L W W (the first three tosses, then one more W)

  Composition   Ways to produce W L W   × ways for the 4th W   Total ways
  4 L, 0 W      0 × 4 × 0 = 0           0 × 0                  0
  3 L, 1 W      1 × 3 × 1 = 3           3 × 1                  3
  2 L, 2 W      2 × 2 × 2 = 8           8 × 2                  16
  1 L, 3 W      3 × 1 × 3 = 9           9 × 3                  27
  0 L, 4 W      4 × 0 × 4 = 0           0 × 4                  0

Given the table and our scientific understanding, the function is:

\(\text{ways}(W, L \mid p) = (4p)^{W} \times (4 - 4p)^{L}\)

where \(4p\) is the number of water sides and \(4 - 4p\) is the number of land sides on the 4-sided globe.
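As a check, here is a minimal Python sketch of this counting argument for the 4-sided globe (the function name ways is my own):

    def ways(water_sides, W, L, sides=4):
        # number of ways a globe with `water_sides` water faces can
        # produce W water and L land observations
        land_sides = sides - water_sides
        return water_sides ** W * land_sides ** L

    sample_W, sample_L = 3, 1  # the sample W L W W
    counts = [ways(w, sample_W, sample_L) for w in range(5)]
    total = sum(counts)
    for w, c in enumerate(counts):
        print(f"p = {w}/4: {c:2d} ways, plausibility = {c / total:.3f}")

This reproduces the table: 3/4 water has 27 of the 46 total ways, making it the most plausible explanation.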

Generative simulation

Simulate a data set with known parameters, then test if your statistical estimator returns these known parameters.

Repeat with different sampling designs and with extreme simulations (e.g. all W or all L).
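A minimal Python sketch of such a simulation, assuming a known true proportion p (sim_globe is my own name):

    import random

    random.seed(42)  # reproducible simulation

    def sim_globe(p, N):
        """Simulate N globe tosses with true water proportion p."""
        return ["W" if random.random() < p else "L" for _ in range(N)]

    print(sim_globe(p=0.7, N=10))

    # Extreme simulations: p = 1 must produce all W, p = 0 all L
    assert sim_globe(p=1.0, N=10).count("L") == 0
    assert sim_globe(p=0.0, N=10).count("W") == 0

    # With a large N, the sample proportion should approach p
    big = sim_globe(p=0.7, N=100_000)
    print(big.count("W") / len(big))  # close to 0.7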

Bayesian inference

  • No minimum sample size
  • The shape of the posterior distribution embodies the sample size (see the sketch after this list)
  • No point estimate. The distribution is the estimate, always use the entire distribution.
  • No one true interval. Nothing happens at the boundary.
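A minimal grid-approximation sketch, assuming a flat prior (posterior_grid is my own name), showing how the posterior's shape embodies the sample size: the same proportion of W gives a narrower posterior at larger N.

    def posterior_grid(W, L, grid_size=11):
        """Grid-approximate posterior for p with a flat prior."""
        ps = [i / (grid_size - 1) for i in range(grid_size)]
        unnorm = [p ** W * (1 - p) ** L for p in ps]  # likelihood × flat prior
        total = sum(unnorm)
        return ps, [u / total for u in unnorm]

    # Same proportion of W, increasing sample size: the posterior narrows
    for W, L in [(3, 1), (30, 10)]:
        ps, post = posterior_grid(W, L)
        print(f"W={W:2d}, L={L:2d}:", " ".join(f"{pr:.2f}" for pr in post))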

Posterior predictive simulations

Use samples from the entire posterior distribution to determine the implications of the model (see the sketch after the list below).

Uncertainty → causal model → implications

Types:

  • model-based forecasts
  • causal effects
  • counterfactuals
  • prior predictions
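A minimal sketch of a posterior predictive simulation for the globe model (grid posterior with a flat prior; variable names are my own): draw values of p from the posterior, then simulate new samples with each draw, so the uncertainty in p propagates into the predictions.

    import random
    from collections import Counter

    random.seed(42)

    # grid-approximate posterior for the sample W = 3, L = 1, flat prior
    ps = [i / 10 for i in range(11)]
    unnorm = [p ** 3 * (1 - p) for p in ps]
    total = sum(unnorm)
    post = [u / total for u in unnorm]

    # draw p from the posterior, then simulate 10 new tosses per draw
    draws = random.choices(ps, weights=post, k=10_000)
    pred_W = [sum(random.random() < d for _ in range(10)) for d in draws]
    print(Counter(pred_W))  # posterior predictive distribution of W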

Misclassification

Suppose the true number of water samples is unobserved, e.g. due to measurement error. W* (W_star) is the misclassified sample, influenced by the measurement process (M).

Note: this DAG is simplified to remove L since N - L = W

Next, use a simulation to determine the potential impact of misclassification.

Define a function with a new variable x, the misclassification rate, that takes a simulated sample and then introduces error. The misclassification estimator uses a new formula for the posterior distribution that accounts for this error rate, as in the sketch below.
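A minimal sketch of this simulation and estimator, assuming each toss is recorded in the wrong category with probability x (function names are my own). The adjusted probability of observing water is \(p(1 - x) + (1 - p)x\).

    import random

    random.seed(42)

    def sim_globe_mis(p, N, x):
        """Simulate N tosses with true proportion p, then flip each
        record to the wrong category with probability x."""
        flip = {"W": "L", "L": "W"}
        sample = ["W" if random.random() < p else "L" for _ in range(N)]
        return [flip[s] if random.random() < x else s for s in sample]

    def posterior_mis(W_star, N, x, grid_size=101):
        """Grid posterior (flat prior) that accounts for
        misclassification: Pr(observe W) = p(1 - x) + (1 - p)x."""
        ps = [i / (grid_size - 1) for i in range(grid_size)]
        unnorm = [(p * (1 - x) + (1 - p) * x) ** W_star
                  * ((1 - p) * (1 - x) + p * x) ** (N - W_star)
                  for p in ps]
        total = sum(unnorm)
        return ps, [u / total for u in unnorm]

    obs = sim_globe_mis(p=0.7, N=1000, x=0.1)
    ps, post = posterior_mis(obs.count("W"), len(obs), x=0.1)
    mode = ps[max(range(len(ps)), key=lambda i: post[i])]
    print(mode)  # near the true 0.7; a naive estimator peaks near 0.66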

If there is measurement error, it is better to model it than to ignore it. Samples do not need to be representative of the population in order to provide good estimates of the population. The priority is to determine in which ways (and why) the sample differs from the population; then you can model the sampling process and the causal differences for the population.