Lecture 20 Notes

Author

Alec L. Robitaille

Published

June 17, 2024

Stargazing

Fortune telling frameworks (eg. horoscopes, tarot cards, linear models) has to be vague - derived from vague facts the advice has to be vague. Also, often has an exaggerated importance.

Cannot offload subjective responsibility onto an objective procedure

Tendency to focus on parts that are mathematical, objective (the quality of the data analysis)

Other things that are also important:

quality of theory
quality of data
quality of code and procedures
code documentation
reporting

Planning

Goal setting

- Define estimates at the beginning

Theory building

- Which assumptions will we make to construct an appropriate estimator?
- Causal model

Levels of theory building

Heuristic causal model (DAGs)
Structural causal models (synthetic functions that identify in precise mathematical ways the relationships between variables)
Dynamic models (eg. ODEs)
Agent-based models (most fine grained approach)

These all specify or imply algebraic systems that can be analysed for their implications.

Best way to learn models is to read models.

Heuristic causal models (DAGs)

Treatment and outcome
Other causes
Other effects
Unobserved causes

Justified sampling plan
Justified analysis plan
Documentation
Open software and data formats
- “Especially for academics, not ethical to use closed, proprietary, expensive software or data formats”
- Future self might also thank you, since you may no longer have access to software in the future

Working

Express theory as probabilistic program
Prove planned analysis could work (conditionally on assumptions)
Test pipeline on synthetic data
Run pipeline on empirical data

Control

version control (Git)
history
backup
accountability

Incremental testing

build things iteratively
test each piece

Documentation

comment everything
for you and for others

Review

at least two people should look at each thing you do
explain the code to someone (rubber ducky)

Reporting

Sharing materials

Papers are an advertisement, the data and its analysis are the product. Data and code should be available through a link, not “by request”

Describing methods

math-stats notation of statistical model (software independent)
explanation of how math-stats model provides estimand
algorithm used to produce estimate
diagnostics, code tests
cite software packages

“To estimate the reciprocity within dyads, we model the correlation within dyads in giving, using a multilevel mixed-membership model (textbook citation). To control for confounding from generalized giving and receiving, as indicated by the DAG in the previous section, we stratify giving and receiving by household. The full model with priors is presented at right. We estimated the posterior distribution using Hamiltonian Monte Carlo as implemented in Stan version 2.29 (citation). We validated the model on simulated data and assessed convergence by inspection of trace plots, R-hat values, and effective sample sizes. Diagnostics are reported in Appendix B and all results can be replicated using the code available at LINK.”

Justifying priors

“Priors were chosen through prior predictive simulation so that pre-data predictions span the range of scientifically plausible outcomes. In the results, we explicitly compared the posterior distribution to the prior, so that the impact of the sample is obvious.”

Responding to reviewers

- change discussion from statistics to causal models, scientific models.
- Point readers to a primer paper on Bayesian statistics in your field.

Describing data

Sample size, but specifically the structure of your data: how many observations of how many units?
At which level (across or within clusters) are variables measured?
Missing values

Describing results

Focus of results typically are on estimands, presented using marginal causal effects
Warn against causal interpretation of control variables (Table 2 fallacy)
Sample realizations > Densities > Intervals

Making decisions

Academic research: communicate uncertainty, conditional on sample and models
Industry, applied research: what should we do, given uncertainty, conditional on sample and models?

Bayesian decision theory:

State costs and benefits of outcomes
Compute posterior benefits of hypothetical policy choices (interventions)

Horoscopes for research

Fixes:

No statistics without associated causal model
Prove that your code works in principle
Share as much as possible
Beware of proxies for research quality