Homework 09

Author

Alec L. Robitaille

Published

May 4, 2023

source('R/packages.R')

Data

The data in data(Achehunting) are 14,364 individual hunting trips by 147 men among the indigenous Aché of Paraguay. Each trip has recorded the hunter’s identity (anonymized id), hunter’s age at the time of the trip, the duration in hours of the trip, and the kilograms of meat returned. Many of the trips have missing values for duration.

# Data
source('R/data_achehunting.R')

# month : Month of record
# day : Day of record
# year : Year of record
# id : Identifier of individual man
# age : Man's age at time of record
# kg.meat : Kilograms of meat returned from hunt
# hours : Duration in hours of hunting trip
# datatype : 1 if duration of trip known, 3 otherwise

DT_hunting <- data_achehunting()

Question 1

In this first problem, estimate the influence of age on the probability of trip success. Define “success” as a trip that returns any non-zero amount of meat. Ignore individual hunter identities for now. You can use any functional relationship for age and success that you think is sensible, but be sure to justify it and check it against posterior predictions.

coords <- data.frame(
    name = c('S', 'A'),
    x =    c(1,    2),
    y =    c(0,    1)
)

dagify(
    S ~ A,
    coords = coords
) |> ggdag(seed = 2) + theme_dag()

Functional relationship

First, fit with a log base 10 line.

range_age <- data.table(
    age = seq(min(DT_hunting$age), max(DT_hunting$age))
)

ggplot(DT_hunting[, .(prob_success = sum(success) / .N), age], 
    aes(x = age, prob_success)) + 
    geom_point() + 
    stat_function(fun = function(x) -1 + log10(x)) + 
    labs(x = 'Age', y = 'Probability of success')

Short of more background on these data, I am going to take it as practice fitting a GAM with brms (through mgcv).

Priors

get_prior(success ~ s(age), family = 'binomial', data = DT_hunting)

                prior     class   coef group resp dpar nlpar lb ub       source
               (flat)         b                                         default
               (flat)         b sage_1                             (vectorized)
 student_t(3, 0, 2.5) Intercept                                         default
 student_t(3, 0, 2.5)       sds                               0         default
 student_t(3, 0, 2.5)       sds s(age)                        0    (vectorized)

Recall: predicted draws returns posterior predictions on the scale of the response (0, 1), epred draws returns posterior predictions with the inverse transformation.

tar_load(m_h09_q01_prior)
m_h09_q01_prior$prior

          prior     class   coef group resp dpar nlpar lb ub       source
 normal(0, 0.5)         b                                            user
 normal(0, 0.5)         b sage_1                             (vectorized)
 normal(0, 1.5) Intercept                                            user
 exponential(1)       sds                               0            user
 exponential(1)       sds s(age)                        0    (vectorized)

pred_q01_prior <- add_predicted_draws(range_age, m_h09_q01_prior)
table(pred_q01_prior$.prediction)


    0     1 
32198 32802

epred_q01_prior <- add_epred_draws(range_age, m_h09_q01_prior)
range(epred_q01_prior$.epred)

[1] 0.0007812558 0.9999527929

ggplot(epred_q01_prior) + 
    stat_lineribbon(aes(age, .epred)) + 
    ylim(0, 1) +
    scale_fill_grey(start = 0.8, end = 0.4) + 
    labs(x = 'Age', y = 'Prior probability of success')

Posterior

tar_load(m_h09_q01)
m_h09_q01

 Family: bernoulli 
  Links: mu = logit 
Formula: success ~ s(age) 
   Data: DT_hunting (Number of observations: 14364) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(sage_1)     2.85      0.78     1.63     4.67 1.00     1090     1946

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.10      0.02    -0.13    -0.06 1.00     3767     2803
sage_1        0.05      0.49    -0.91     1.02 1.00     2427     2725

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

epred_q01 <- add_epred_draws(range_age, m_h09_q01)
range(epred_q01$.epred)

[1] 0.06100912 0.59011762

ggplot(epred_q01) + 
    stat_lineribbon(aes(age, .epred)) + 
    ylim(0, 1) +
    scale_fill_grey(start = 0.8, end = 0.4) + 
    labs(x = 'Age', y = 'Posterior probability of success') + 
    geom_point(aes(age, prob_success), 
               data = DT_hunting[, .(prob_success = sum(success) / .N), age])

Question 2

Now incorporate individual hunter varying effects into your analysis from the previous problem. Allow the influence of age on success to vary by each individual. How much variation in success is explained by individuals and how much by age?

Priors

priors_q02 <- get_prior(success ~ t2(age, by = id), family = 'binomial', data = DT_hunting)
unique(priors_q02$class)

[1] "b"         "Intercept" "sds"

Posterior

tar_load(m_h09_q02)
#m_h09_q02
# m_h09_q02_s_var_id
DT_hunting[, uniqueN(id), age]

      age    V1
    <int> <int>
 1:    67     6
 2:    68     5
 3:    62    11
 4:    63     9
 5:    64     7
 6:    56    19
 7:    57    19
 8:    69     4
 9:    70     6
10:    71     5
11:    72     2
12:    74     3
13:    75     2
14:    50    12
15:    48    12
16:    49    11
17:    53    13
18:    65     6
19:    66     4
20:    47    11
21:    51    11
22:    45    17
23:    46    12
24:    59    14
25:    60    10
26:    61     8
27:    44    17
28:    43    15
29:    58    15
30:    41    18
31:    42    16
32:    40    19
33:    39    21
34:    38    24
35:    52    11
36:    54    10
37:    55    11
38:    37    19
39:    33    26
40:    34    26
41:    35    19
42:    32    20
43:    30    19
44:    29    22
45:    31    12
46:    28    14
47:    27    14
48:    26     8
49:    25    12
50:    24    15
51:    23    13
52:    36    13
53:    22    11
54:    21     9
55:    20    12
56:    19    10
57:    18    10
58:    17     9
59:    15     7
60:    16     8
61:    14    10
62:    12     1
63:    13     2
64:    11     1
      age    V1

epred_q02 <- add_epred_draws(unique(DT_hunting[, .(age, id)]), m_h09_q02)

setDT(epred_q02)
var_id <- epred_q02[, .(var_epred = var(.epred)), by = id]
var_age <- epred_q02[, .(var_epred = var(.epred)), by = age]

var_id[, mean(var_epred)]

[1] 0.009940593

var_age[, mean(var_epred)]

[1] 0.01740102

ggplot(var_id, aes(var_epred)) + 
    stat_halfeye() + 
    xlim(0, 0.1)

ggplot(var_age, aes(var_epred)) + 
    stat_halfeye() + 
    xlim(0, 0.1)

ggplot(var_age, aes(age, var_epred)) + 
    geom_point()

ggplot(var_id[order(var_epred)]) + 
    geom_point(aes(id, var_epred))

ggplot(epred_q02) + 
    stat_lineribbon(aes(age, .epred)) + 
    ylim(0, 1) +
    scale_fill_grey(start = 0.8, end = 0.4) + 
    labs(x = 'Age', y = 'Posterior probability of success') + 
    geom_point(aes(age, prob_success), 
               data = DT_hunting[, .(prob_success = sum(success) / .N), .(id, age)])

Question 3

Now include trip duration in the model from the previous problem. The focus is still the causal relationship between age and success, but duration is a potential competing cause or a mediator, so it can be useful to include it for either increased efficiency (competing cause) or to study how variation in trip durations mediate differences in hunting returns. Since there are many missing values for trip duration, compare (1) a complete case analysis, which drops all the trips with missing durations, to (2) an analysis which uses all 14,364 trips and imputes missing duration values where necessary. Use any functional relationship between duration and success that you think is sensible, but be sure to justify it and check it against posterior predictions.

tar_load(m_h09_q03_complete_cases)
m_h09_q03_complete_cases

DT_hunting_naomit <- na.omit(copy(DT_hunting))[, id := factor(id)]
epred_q03 <- add_epred_draws(DT_hunting_naomit, m_h09_q03_complete_cases)

imp <- tar_read(DT_hunting_mice)

tar_load(m_h09_q03_mice)
m_h09_q03_mice

mcmc_areas(m_h09_q03_mice, regex_pars = 'sd|b')

conditional_effects(m_h09_q03_mice)