Lecture 20 Notes
Stargazing
Fortune-telling frameworks (e.g. horoscopes, tarot cards, linear models) have to be vague: advice derived from vague facts can only be vague. Their importance is also often exaggerated.
Cannot offload subjective responsibility onto an objective procedure
Tendency to focus on parts that are mathematical, objective (the quality of the data analysis)
Other things that are also important:
- quality of theory
- quality of data
- quality of code and procedures
- code documentation
- reporting
Planning
Goal setting
- Define estimands at the beginning
Theory building
- Which assumptions will we make to construct an appropriate estimator?
- Causal model
Levels of theory building
- Heuristic causal model (DAGs)
- Structural causal models (functions that specify, in precise mathematical terms, the relationships between variables)
- Dynamic models (eg. ODEs)
- Agent-based models (most fine grained approach)
These all specify or imply algebraic systems that can be analysed for their implications.
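As a small illustration of the step from a heuristic DAG to a structural causal model, suppose a hypothetical DAG with treatment X, outcome Y, and a common cause Z. One possible linear version, with symbols and functional forms chosen only for this sketch and with mutually independent noise terms, is:

```latex
\begin{aligned}
Z &= \varepsilon_Z \\
X &= \alpha_X + \beta_{ZX} Z + \varepsilon_X \\
Y &= \alpha_Y + \beta_{XY} X + \beta_{ZY} Z + \varepsilon_Y \\[4pt]
\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  &= \beta_{XY} + \beta_{ZY}\,\beta_{ZX}\,\frac{\operatorname{Var}(Z)}{\operatorname{Var}(X)}
\end{aligned}
```

The last line is an implication of the algebra: under these assumptions, a regression of Y on X alone does not return the causal effect \beta_{XY} unless Z is accounted for.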
Best way to learn models is to read models.
Heuristic causal models (DAGs)
- Treatment and outcome
- Other causes
- Other effects
- Unobserved causes
Justified sampling plan
Justified analysis plan
Documentation
Open software and data formats
- “Especially for academics, not ethical to use closed, proprietary, expensive software or data formats”
- Future self might also thank you, since you may no longer have access to software in the future
Working
- Express theory as probabilistic program
- Prove planned analysis could work (conditionally on assumptions)
- Test pipeline on synthetic data
- Run pipeline on empirical data
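The "test pipeline on synthetic data" step can be sketched as follows. This is a minimal example, not the lecture's own code: the causal model (Z -> X, Z -> Y, X -> Y), the coefficient values, and the function names `simulate` and `estimate_effect` are all assumptions made for illustration, and ordinary least squares stands in for a full Bayesian model.

```python
import numpy as np

def simulate(n, b_xy, b_zx=1.0, b_zy=2.0, seed=0):
    """Generate synthetic data from the assumed model Z -> X, Z -> Y, X -> Y."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = b_zx * z + rng.normal(size=n)
    y = b_xy * x + b_zy * z + rng.normal(size=n)
    return x, y, z

def estimate_effect(x, y, z=None):
    """Least-squares slope of y on x, optionally adjusting for z."""
    cols = [np.ones_like(x), x] + ([z] if z is not None else [])
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on x

true_effect = 0.5  # known only because the data are synthetic
x, y, z = simulate(n=5000, b_xy=true_effect)
naive = estimate_effect(x, y)        # ignores the common cause Z
adjusted = estimate_effect(x, y, z)  # stratifies by Z
print(f"true={true_effect:.2f} naive={naive:.2f} adjusted={adjusted:.2f}")
```

Because the true effect is known in advance, the adjusted estimate can be checked against it before the same pipeline is run on empirical data.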
Control
- version control (Git)
- history
- backup
- accountability
Incremental testing
- build things iteratively
- test each piece
Documentation
- comment everything
- for you and for others
Review
- at least two people should look at each thing you do
- explain the code to someone (rubber ducky)
Reporting
Sharing materials
- Papers are an advertisement; the data and its analysis are the product. Data and code should be available through a link, not “by request”.
Describing methods
- math-stats notation of statistical model (software independent)
- explanation of how math-stats model provides estimand
- algorithm used to produce estimate
- diagnostics, code tests
- cite software packages
“To estimate the reciprocity within dyads, we model the correlation within dyads in giving, using a multilevel mixed-membership model (textbook citation). To control for confounding from generalized giving and receiving, as indicated by the DAG in the previous section, we stratify giving and receiving by household. The full model with priors is presented at right. We estimated the posterior distribution using Hamiltonian Monte Carlo as implemented in Stan version 2.29 (citation). We validated the model on simulated data and assessed convergence by inspection of trace plots, R-hat values, and effective sample sizes. Diagnostics are reported in Appendix B and all results can be replicated using the code available at LINK.”
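To make the R-hat diagnostic mentioned in the sample paragraph concrete, here is a rough sketch of the basic Gelman-Rubin statistic. It is a simplified illustration: Stan reports a more refined variant, and the function name `rhat` and the simulated "chains" below are assumptions made for this example.

```python
import numpy as np

def rhat(chains):
    """Basic Gelman-Rubin R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)   # variance between chain means
    pooled = (n - 1) / n * within + between / n     # pooled variance estimate
    return np.sqrt(pooled / within)

rng = np.random.default_rng(1)
# Four stand-in chains drawn from the same distribution (well mixed).
mixed = rng.normal(size=(4, 1000))
# Same draws, but the last chain is shifted to mimic a chain stuck elsewhere.
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])

print(f"mixed chains: {rhat(mixed):.3f}, one stuck chain: {rhat(stuck):.3f}")
```

Values close to 1 are consistent with the chains agreeing. Values well above 1 indicate that they do not.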
Justifying priors
“Priors were chosen through prior predictive simulation so that pre-data predictions span the range of scientifically plausible outcomes. In the results, we explicitly compared the posterior distribution to the prior, so that the impact of the sample is obvious.”
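A prior predictive simulation of this kind might look like the sketch below. The model (a linear regression on a standardized predictor and outcome), the prior scales, and the plausibility limit of 3 standard deviations are all assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims = 1000
x = np.linspace(-2, 2, 50)  # standardized predictor values (assumed)

def prior_predictive(slope_sd):
    """Draw regression lines mu = a + b*x from Normal priors on a and b."""
    a = rng.normal(0.0, 0.5, size=n_sims)
    b = rng.normal(0.0, slope_sd, size=n_sims)
    return a[:, None] + b[:, None] * x[None, :]

def share_plausible(mu, limit=3.0):
    """Fraction of simulated lines that stay within +/- limit over the range of x."""
    return np.mean(np.all(np.abs(mu) <= limit, axis=1))

wide = share_plausible(prior_predictive(slope_sd=10.0))
tight = share_plausible(prior_predictive(slope_sd=0.5))
print(f"plausible lines: wide prior {wide:.2f}, tighter prior {tight:.2f}")
```

A prior whose simulated lines mostly leave the plausible range is a candidate for tightening before any data are seen.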
Responding to reviewers
- Change the discussion from statistics to causal and scientific models.
- Point readers to a primer paper on Bayesian statistics in your field.
Describing data
- Sample size, but specifically the structure of your data: how many observations of how many units?
- At which level (across or within clusters) are variables measured?
- Missing values
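The structural summary described above can be computed directly from long-format data. The records, unit labels, and the helper name `describe_structure` below are hypothetical and serve only to show the kind of counts worth reporting.

```python
from collections import Counter

# Hypothetical long-format records: (unit id, outcome); None marks a missing value.
records = [
    ("h1", 2.0), ("h1", 3.5), ("h1", None),
    ("h2", 1.0), ("h2", 4.0),
    ("h3", None),
]

def describe_structure(rows):
    """Summarize how many observations of how many units, and missing outcomes."""
    per_unit = Counter(unit for unit, _ in rows)
    n_missing = sum(1 for _, value in rows if value is None)
    return {
        "n_observations": len(rows),
        "n_units": len(per_unit),
        "min_per_unit": min(per_unit.values()),
        "max_per_unit": max(per_unit.values()),
        "n_missing": n_missing,
    }

summary = describe_structure(records)
print(summary)
```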
Describing results
- Results typically focus on estimands, presented as marginal causal effects
- Warn against causal interpretation of control variables (Table 2 fallacy)
- Order of preference for showing uncertainty: sample realizations > densities > intervals
Making decisions
- Academic research: communicate uncertainty, conditional on sample and models
- Industry, applied research: what should we do, given uncertainty, conditional on sample and models?
Bayesian decision theory:
- State costs and benefits of outcomes
- Compute posterior benefits of hypothetical policy choices (interventions)
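The two steps above can be sketched with posterior draws and a simple utility function. Everything numeric here is an assumption for illustration: the draws are simulated rather than taken from a fitted model, and the linear value and cost figures and the policy names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for posterior draws of a treatment effect (assumed, not from a fitted model).
effect_draws = rng.normal(loc=0.3, scale=0.2, size=4000)

def expected_net_benefit(effect, coverage, value_per_unit=100.0, cost_per_unit=20.0):
    """Average net benefit over posterior draws when a share `coverage` is treated."""
    benefit = value_per_unit * effect * coverage
    cost = cost_per_unit * coverage
    return np.mean(benefit - cost)

# Hypothetical policy choices, expressed as the share of units treated.
policies = {"none": 0.0, "half": 0.5, "all": 1.0}
scores = {name: expected_net_benefit(effect_draws, c) for name, c in policies.items()}
best = max(scores, key=scores.get)

# Posterior probability that treating a unit costs more than it returns.
prob_loss = np.mean(100.0 * effect_draws < 20.0)
print(scores, best, f"P(loss per unit) = {prob_loss:.2f}")
```

Averaging the utility over the whole posterior, rather than plugging in a point estimate, keeps the uncertainty in the decision. The loss probability shows how much risk the highest-scoring policy still carries.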
Horoscopes for research
Fixes:
- No statistics without associated causal model
- Prove that your code works in principle
- Share as much as possible
- Beware of proxies for research quality