Global Burden of Disease Lab
Evan Thomas, Professor, University of Colorado Boulder
Overview
In this lab you will estimate the health impacts of environmental exposures using two complementary approaches:
- Part 1 — Water Quality: Conduct a quantitative microbial risk assessment (QMRA) to estimate how the burden of disease in the United States would change if household drinking water quality resembled the water quality in Boulder Creek.
- Part 2 — Air Quality: Use the Household Air Pollution Intervention Tool (HAPIT) to evaluate the cost-effectiveness of different cookstove technologies across three countries.
You will synthesize both parts into a formal lab report (see Lab Report Requirements and Grading Rubric).
Getting Started with R and RStudio
This lab uses R, a free programming language widely used in public health, epidemiology, and data science. You don’t need any prior programming experience — this section will walk you through everything you need to get set up. Complete these steps at least one day before the lab so you can troubleshoot any issues in advance.
What Are R and RStudio?
Think of R as the engine and RStudio as the dashboard. R is the programming language that does the actual computation — it runs your statistical models, generates plots, and crunches numbers. RStudio is a free application that gives you a visual interface for working with R: a place to write code, see your results, manage files, and view plots, all in one window. You need both installed, but you’ll only ever open RStudio directly.
Step 1: Install R
Download R from the official repository. Choose the link for your operating system:
- Windows: https://cran.r-project.org/bin/windows/base/ — Click “Download R for Windows,” then run the installer. Accept all default settings.
- Mac: https://cran.r-project.org/bin/macosx/ — Download the .pkg file that matches your Mac (Apple Silicon for M1/M2/M3/M4 chips, Intel for older Macs). If you’re not sure which you have, click the Apple menu → “About This Mac” and look for “Chip.”
- Linux: https://cran.r-project.org/bin/linux/ — Follow the instructions for your distribution (Ubuntu, Fedora, etc.).
Step 2: Install RStudio
Download the free version of RStudio Desktop from https://posit.co/download/rstudio-desktop/. The page should auto-detect your operating system and recommend the right installer. Download it, run the installer, and accept the defaults.
Once installed, open RStudio (not R). You should see a window that looks something like this:
┌─────────────────────────────────┬──────────────────────────┐
│                                 │  Environment / History   │
│  Source Editor                  │  (your variables &       │
│  (where you write & edit code)  │  command history)        │
│                                 │                          │
├─────────────────────────────────┼──────────────────────────┤
│                                 │  Files / Plots / Help    │
│  Console                        │  (file browser, plot     │
│  (where code runs & output      │  viewer, documentation)  │
│  appears)                       │                          │
└─────────────────────────────────┴──────────────────────────┘
The four panes are:
- Source Editor (top-left): Where you write and edit R scripts and R Markdown files. This is where you’ll do most of your work in this lab.
- Console (bottom-left): Where R actually runs your code and prints results. You can type commands here directly, but it’s usually better to write them in the Source Editor so you can save and rerun them.
- Environment (top-right): Shows all the variables and data you’ve created. As you run code, you’ll see items appear here.
- Files / Plots / Help (bottom-right): A file browser, a panel for viewing plots, and a searchable help system.
Step 3: Run Your First Code
Let’s make sure everything works. In the Console (bottom-left pane), click next to the > prompt, type the following, and press Enter:
1 + 1
You should see [1] 2 appear below your command. The [1] just means “this is the first element of the result” — R always labels its output this way. Congratulations, you’ve run R code.
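If you are curious about that [1] label, a longer result makes it clearer: the output wraps across console lines, and each line starts with the bracketed index of its first element. A quick sketch (the vector here is arbitrary):

```r
# Print a 30-element vector; the console wraps it across lines,
# and each line begins with the index of its first element
evens <- seq(2, 60, by = 2)
evens
length(evens)  # 30
```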
Now try creating a variable and making a simple plot:
# Create a variable (the arrow <- means "assign to")
my_numbers <- c(3, 7, 2, 9, 5, 11, 4)
# Calculate the mean
mean(my_numbers)
# Make a simple histogram
hist(my_numbers, col = "steelblue", main = "My First Histogram")
You can type these lines one at a time in the Console, or paste them all at once. The histogram should appear in the Plots tab in the bottom-right pane. Lines starting with # are comments — R ignores them, and they’re there to explain the code to human readers (including your future self).
Step 4: Understand R Markdown
This lab uses an R Markdown file (.Rmd). R Markdown lets you combine written text (like a report) with executable R code in a single document. When you’re done, you “knit” it into a polished HTML or PDF file — this is how you’ll produce your lab report.
An R Markdown file has two types of content:
- Text sections: Written in plain text with simple formatting. Use # Heading for section headers, **bold** for bold, and *italic* for italics.
- Code chunks: Sections of R code enclosed in special markers. They look like this:
```{r}
# Your R code goes here
mean(c(1, 2, 3))
```
To run a code chunk in RStudio, click the green play button (▹) in the top-right corner of the chunk, or place your cursor inside the chunk and press Ctrl+Shift+Enter (Windows) or Cmd+Shift+Enter (Mac). The output will appear directly below the chunk.
To produce your final report, click the Knit button at the top of the Source Editor (it has a small yarn ball icon). This runs all the code chunks in order and generates a formatted document.
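Putting the pieces together, a minimal .Rmd file might look like the sketch below (the title and chunk contents are illustrative placeholders, not part of the lab template):

````markdown
---
title: "My Lab Report"
output: html_document
---

## Introduction

Some explanatory text with **bold** and *italic* formatting.

```{r}
# A code chunk that runs when you knit
mean(c(1, 2, 3))
```
````

The section between the `---` markers is the YAML header, which controls the document title and output format when you knit.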
Step 5: Install Packages
R comes with basic functionality built in, but most real work uses packages — add-on libraries that provide specialized tools. Think of them like apps on a phone: R is the operating system, and packages are the apps you install for specific tasks.
You only need to install a package once (like downloading an app), but you need to load it with library() each time you start a new R session (like opening the app). Run the following in your Console to install all the packages needed for this lab:
install.packages(c("MonteCarlo", "data.table", "ggplot2",
"scales", "dplyr", "tidyr", "rmarkdown"))
This may take a few minutes. You’ll see a lot of red text scrolling by — that’s normal. It’s only a problem if you see the word ERROR (not WARNING, which is usually fine). Once installation is complete, verify everything worked by running:
# If these all run without errors, you're set
library(MonteCarlo)
library(data.table)
library(ggplot2)
library(scales)
library(dplyr)
library(tidyr)
Troubleshooting Common Issues
| Problem | Solution |
|---|---|
| Error in install.packages or “package not available” | Make sure you’re connected to the internet. Try running install.packages("ggplot2") for a single package to isolate the issue. On Mac, you may need to install Xcode Command Line Tools first: open Terminal and run xcode-select --install. |
| Error in library(X): there is no package called 'X' | The package didn’t install correctly. Run install.packages("X") again and check for ERROR messages in the output. |
| RStudio says “R not found” or the Console shows no > prompt | R may not be installed, or RStudio can’t find it. Reinstall R first, then restart RStudio. On Mac, make sure you downloaded the correct version (Apple Silicon vs. Intel). |
| Knitting fails with an error | Read the error message — it usually tells you which line caused the problem. The most common issue is running code chunks out of order. Try clicking Run All Chunks Above (downward arrow icon at the top of a chunk) to make sure earlier code has executed. Also verify that the rmarkdown package is installed. |
| Plots don’t appear | Check the Plots tab in the bottom-right pane. If it says “Plot area too small,” drag the pane border to make it larger. |
| R is “stuck” or shows a + instead of > in the Console | R is waiting for you to finish a command (usually a missing closing parenthesis or quote). Press Escape to cancel and try again. |
Optional: Learning More About R
You don’t need to learn R in depth for this lab — the code is provided, and you’re encouraged to use Claude for help. But if you want to build a stronger foundation, these free resources are excellent starting points:
- RStudio Education: Beginners — Short tutorials from the team that makes RStudio.
- R for Data Science (2e) — A free online textbook. Chapters 1–3 cover the basics you’ll use in this lab.
- RStudio Cheat Sheets — One-page visual references for ggplot2, dplyr, R Markdown, and more.
Using Claude in This Lab
This lab encourages you to use Claude, an AI assistant made by Anthropic, as a learning and productivity tool. Think of Claude as a knowledgeable tutor who is available around the clock — you can ask it to explain concepts, help you debug code, review your writing, or walk you through a calculation step by step. Learning to work effectively with AI tools is an increasingly important skill in public health and data science, and this lab is a good place to start.
What Is Claude?
Claude is a large language model — a type of AI that can read and generate text, understand code, analyze data, and hold a conversation. You interact with it by typing messages in a chat interface, much like texting. You can ask it questions in plain English, paste in code or error messages, share your writing for feedback, or ask it to explain something you’ve read. Claude responds in real time and you can ask follow-up questions to dig deeper.
A few important things to know about Claude:
- It’s very good at explaining things — especially code, statistics, and scientific concepts. If you don’t understand a code block in this lab, Claude can walk you through it line by line.
- It can write and debug R code — you can paste an error message and Claude will usually identify the problem and suggest a fix.
- It can be wrong — Claude sometimes generates plausible-sounding information that is incorrect, especially specific numbers or niche facts. Always verify quantitative claims against your own results and cited references.
- It doesn’t have access to your computer — Claude can’t run your R code, open the GBD Compare tool, or use HAPIT for you. You still need to do those things yourself and bring the results back into the conversation.
Getting Access
Go to claude.ai and create a free account. The free tier is sufficient for this assignment. Claude works in any web browser — no software installation required. There is also a mobile app for iOS and Android if you prefer to work from your phone.
If you get stuck during the lab, you can paste the relevant code from your .Rmd and ask Claude to help you find the problem. You can also paste screenshots of error messages or plots.
How to Use Claude During This Lab
Below are example prompts organized by the type of help you might need. You don’t have to use these exact words — Claude understands natural language, so just describe what you need in your own way.
Understanding Code
When you encounter a code block you don’t understand, paste it into Claude and ask for an explanation:
“I’m new to R. Can you explain what this code does line by line? I don’t understand what the beta_poisson function is calculating or what the parameters alpha and n50 mean.”
[paste the code block]
Debugging Errors
When your code produces an error, copy the exact error message and the code that triggered it:
“I’m getting this error when I try to run my Monte Carlo simulation. Here’s the error message and the code. What’s wrong and how do I fix it?”
[paste error message and code]
Understanding Concepts
Use Claude to build intuition for the public health and statistical concepts in this lab:
“Explain DALYs to me like I’m hearing about them for the first time. How are they calculated and why do public health researchers use them instead of just counting deaths?”
“What is a Monte Carlo simulation and why would I use one instead of just plugging in average values? Can you give me a simple non-medical example first?”
“What does ‘cost-effective’ mean in the WHO-CHOICE framework? Why is it tied to GDP per capita, and what are the criticisms of that approach?”
Improving Figures
Share your ggplot code and ask Claude to help you make it clearer or more polished:
“Here’s my ggplot code for comparing DALYs across pathogens. The plot works but looks pretty basic. Can you help me improve the colors, labels, and overall readability for a formal lab report?”
[paste your ggplot code]
Strengthening Your Writing
Paste a draft paragraph from your discussion section and ask Claude for feedback:
“Here’s my draft paragraph discussing why my simulated DALYs are higher than the GBD baseline. Does my reasoning make sense? Am I missing any important explanations? Are there counter-arguments I should address?”
[paste your paragraph]
Checking Your Assumptions
When you’re unsure whether a parameter value or modeling choice is reasonable:
“In this lab we assume that 100% of the population is susceptible to campylobacter infection. Is that realistic? What does the literature say about natural immunity or age-related susceptibility?”
Tips for Getting Better Results
- Be specific. “Help me with my code” will get a generic answer. “This ggplot isn’t showing a legend and I don’t know why — here’s my code” will get a targeted fix.
- Include context. Tell Claude what class this is for, that you’re new to R, and what you’re trying to accomplish. It adjusts its explanations based on your level.
- Ask follow-up questions. If Claude’s explanation uses a term you don’t know, just ask. The conversation is iterative — keep going until it clicks.
- Paste your actual code and errors. Don’t try to describe code from memory. Copy and paste the exact text so Claude can see precisely what you’re working with.
- Push back when something doesn’t seem right. If Claude gives you an answer that contradicts your results or your reading of a paper, say so. It will often reconsider and give a better response.
- Use it to learn, not just to get answers. After Claude fixes your code, ask why the fix works. After it drafts a paragraph, ask yourself whether you agree with every sentence. The goal is for you to understand the material by the end.
Before You Start
- R (≥ 4.1) and RStudio installed (see Getting Started with R)
- Required packages installed and verified (see Step 5)
- Claude account created at claude.ai (see Using Claude)
- Access to GBD Compare (no account needed)
- Access to HAPIT (no account needed — verify it loads before lab day)
- Access to GHDx for population data
- Access to World Bank GDP per capita data for Part 2
Tip: Set cache = TRUE in your Rmd chunk options for simulation blocks so they only run once unless the code changes. Do not delete the _cache folder between knitting sessions.
Part 1
Water Quality Impacts on Health
1. Current Burden of Disease
Before estimating a hypothetical disease burden, you need to establish a baseline. Use the GBD Compare visualization tool from the Institute for Health Metrics and Evaluation (IHME).
For the United States (most recent year available):
1. Navigate to “Risks by Cause” (second icon on the left) and select Level 4 on the slider.
2. Select “United States” for location and find “Unsafe water” under “Unsafe water, sanitation and hygiene.”
3. Record the total DALYs, percent of total DALYs, and DALY rate (per 100,000).
4. Switch to the arrow diagram view. With Level 4 still selected, record the same three metrics plus the cause ranking for unsafe water (change Category to “All Risk Factors”).
For Sub-Saharan Africa (most recent year available):
- Repeat steps 1–4 above, selecting Sub-Saharan Africa as the location.
Reference values (GBD 2019):
- United States — Enteric infections from unsafe water: 14,503 DALYs; 0.013% of total DALYs; 4.42 per 100,000 (95% UI: 0.7–11.76)
- Sub-Saharan Africa — Diarrheal diseases from unsafe water: ~36 million DALYs; 7.05% of total DALYs; 3,338 per 100,000
2. Quantitative Microbial Risk Assessment
You will now estimate the disease burden that would result from drinking water contaminated at levels similar to Boulder Creek. This involves four steps: estimating the exposed population, simulating water consumption, modeling dose-response relationships, and calculating health impact in DALYs.
Recommended References
- Havelaar, A.H., & Melse, J.M. (2003). Quantifying public health risk in the WHO Guidelines for drinking-water quality: A burden of disease approach. RIVM. Link
- Brown, J., & Clasen, T. (2012). High Adherence Is Necessary to Realize Health Gains from Water Quality Interventions. PLoS ONE, 7(5), e36735. DOI
- U.S. EPA (2011). Exposure Factors Handbook, Chapter 3: Ingestion of Water and Other Select Liquids. Link — Source for water consumption distribution parameters.
2a. Population
What was the population of the United States in the most recent year available? See GHDx. For background on Monte Carlo methods, see Wikipedia and Investopedia.
library(MonteCarlo)
library(data.table)
library(ggplot2)
library(scales)
library(dplyr)
library(tidyr)
set.seed(42) # for reproducibility — do not change
# Update this value to the most recent year available from GHDx
population <- 331893745
2b. Water Consumption
How much water do people drink per day? Rather than using a single average, we simulate daily consumption to capture variability across the population. Real water intake follows a continuous distribution — the EPA Exposure Factors Handbook reports a mean of approximately 1.4 L/day of direct water intake for adults, with substantial right-skew (some individuals consume much more due to physical activity or climate).
We model this with a gamma distribution, which naturally captures the positive-valued, right-skewed nature of water consumption data. The parameters below (shape = 4.0, rate = 2.0) produce a distribution with a mean of ~2.0 L/day, a mode of ~1.5 L/day, and a long tail extending past 5 L/day.
# Simulate daily water consumption (liters) for 10,000 people over 365 days
# Gamma distribution: shape=4.0, rate=2.0
# mean = 2.0 L/day, mode ~ 1.5 L/day, sd = 1.0
# Parameters informed by EPA Exposure Factors Handbook (2011), Ch. 3
generate_water <- function(n = 1, shape = 4.0, rate = 2.0) {
rgamma(n, shape = shape, rate = rate)
}
# Using 10,000 people (not 100,000) to keep simulation runtime manageable
# Results are per-capita, so this does not affect DALY rates
n_people <- 10000
dt_water <- data.table(replicate(365, generate_water(n_people)))
# Verify the distribution looks reasonable
hist(dt_water$V1, breaks = 50,
main = "Simulated Daily Water Consumption (Day 1)",
xlab = "Liters per day", col = "steelblue", border = "white")
cat("Mean:", round(mean(dt_water$V1), 2),
"| Median:", round(median(dt_water$V1), 2),
"| SD:", round(sd(dt_water$V1), 2))
2c. Dose-Response
How many organisms would people ingest per liter of water, and what is the probability that exposure leads to infection and illness?
Below are two approaches for estimating organism concentrations: a log-normal distribution and an empirical distribution based on E. coli water quality testing of Boulder Creek. Before choosing one, you will compare them quantitatively.
# Option 1: Log-normal distribution
# meanlog and sdlog are parameters of the underlying normal distribution
# on the log scale. A meanlog of 1 with sdlog of 1 gives:
# median ~ 2.7, mean ~ 4.5 organisms/L
organisms_lnorm <- function(n = 1, meanlog = 1, sdlog = 1) {
rlnorm(n, meanlog = meanlog, sdlog = sdlog)
}
# Option 2: Empirical distribution from Boulder Creek E. coli testing (CFU/100mL)
# Values multiplied by 10 to convert to per-liter estimates
organisms_empirical <- function(n = 1) {
sample(c(14,75,9,18,14,22,8,14,6,12,26,6,1,2,8,4),
n, replace = TRUE) * 10
}
Exercise: Compare the Two Distributions
Before choosing a distribution for the simulation, compare them visually and quantitatively. Run the code below to plot both distributions, compute summary statistics, and perform a Kolmogorov-Smirnov test.
# Generate 10,000 samples from each distribution
samp_lnorm <- organisms_lnorm(10000, meanlog = 1, sdlog = 1)
samp_empirical <- organisms_empirical(10000)
# Summary statistics
compare_df <- data.frame(
Distribution = c("Log-normal", "Empirical"),
Mean = c(mean(samp_lnorm), mean(samp_empirical)),
Median = c(median(samp_lnorm), median(samp_empirical)),
SD = c(sd(samp_lnorm), sd(samp_empirical)),
Min = c(min(samp_lnorm), min(samp_empirical)),
Max = c(max(samp_lnorm), max(samp_empirical))
)
print(compare_df)
# Side-by-side histograms
par(mfrow = c(1, 2))
hist(samp_lnorm, breaks = 50, main = "Log-normal",
xlab = "Organisms/L", col = "#4a90d9", border = "white", xlim = c(0, 800))
hist(samp_empirical, breaks = 30, main = "Empirical (Boulder Creek)",
xlab = "Organisms/L", col = "#d97a4a", border = "white", xlim = c(0, 800))
par(mfrow = c(1, 1))
# Kolmogorov-Smirnov test: are the two distributions significantly different?
# (Expect a warning about ties: the empirical data are discrete, so repeated
# values occur. The p-value is still a reasonable guide here.)
ks.test(samp_lnorm, samp_empirical)
Now calculate each person's daily dose of organisms and the probability of infection using the beta-Poisson dose-response model:
# Generate concentrations for 10,000 people over 365 days
# (use your chosen distribution — here we show the log-normal)
dt_organisms <- data.table(replicate(365, organisms_lnorm(n_people, 1, 1)))
# Daily dose = water consumed (L) x organisms per liter
dose <- dt_water * dt_organisms
# Beta-Poisson dose-response model
beta_poisson <- function(dose, alpha = 0.145, n50 = 896) {
1 - (1 + (dose / n50) * (2^(1 / alpha) - 1))^(-alpha)
}
# Probability of daily infection (campylobacter parameters)
p_inf_daily <- beta_poisson(dose)
# Probability of illness given infection (0.3 for campylobacter)
p_illness_inf <- 0.3
p_illness <- p_inf_daily * p_illness_inf
2d. Health Impact
You now have a matrix of daily illness probabilities for 10,000 individuals over one year. Convert this to disability-adjusted life years (DALYs) using the disease burden per case of campylobacteriosis.
# Proportion of population susceptible to campylobacter infection
susceptible <- 1.00
# Disease burden per case of campylobacteriosis (DALYs/case)
db_case <- 4.6e-3
# DALYs per person per year
DALYs_year <- p_illness * db_case * susceptible
# Total DALYs per 10,000 — scale to per 100,000 for comparison with GBD data
DALYs_total <- sum(DALYs_year) * (100000 / n_people)
cat("Estimated DALYs per 100,000:", round(DALYs_total, 2))
3. Monte Carlo Simulation
The calculation above has two important limitations. First, it assumes a person could develop campylobacteriosis every day, when in reality illness duration is already incorporated into the disease burden per case estimate. Second, we used fixed parameter values when many of them — especially organism concentration — are uncertain.
A Monte Carlo simulation addresses both limitations: it converts daily probabilities into a single annual illness risk per person, and it runs the calculation many times with varying inputs to produce a distribution of outcomes with confidence intervals. See Introducing the MonteCarlo Package for background.
Your responsibility is to understand how this code works so that you can describe the methods and interpret the results in your lab report.
Disease Burden Function
# This function is called repeatedly with different parameter combinations
disease_burden <- function(people, meanlog, alpha, n50,
                           p_illness_inf, susceptible, db_case) {
  # Simulate water consumption (gamma distribution)
  water_consumed <- rgamma(people, shape = 4.0, rate = 2.0)
  # Simulate organism concentration (log-normal; sdlog = 1, as in Section 2c)
  number_of_organisms <- rlnorm(people, meanlog = meanlog, sdlog = 1)
  # Daily dose
  dose <- water_consumed * number_of_organisms
  # Probability of daily infection (beta-Poisson model)
  p_inf_daily <- 1 - (1 + (dose / n50) * (2^(1 / alpha) - 1))^(-alpha)
  # Probability of illness
  p_illness <- p_inf_daily * p_illness_inf
  # Annual illness risk (assuming independent daily events)
  p_year <- 1 - (1 - p_illness)^365
  # Total DALYs per 100,000
  DALYs_year <- p_year * db_case * susceptible
  DALYs_total <- sum(DALYs_year) * (100000 / people)
  return(list("DALYs" = DALYs_total))
}
Pathogen Parameters
Each pathogen has different dose-response characteristics and disease severity. The meanlog parameter is varied across runs to explore the sensitivity of results to organism concentration.
# Campylobacter — Sources: Haas et al. (1999); Havelaar & Melse (2003)
param_campylobacter <- list(
"people" = 10000,
"meanlog" = c(1, 0.1, 0.01, 0.001, 0.0001),
"alpha" = 0.145,
"n50" = 896,
"p_illness_inf" = 0.3,
"susceptible" = 1.00,
"db_case" = 4.6e-3
)
# Rotavirus — Sources: Ward et al. (1986); Havelaar & Melse (2003)
param_rotavirus <- list(
"people" = 10000,
"meanlog" = c(1, 0.1, 0.01, 0.001, 0.0001),
"alpha" = 0.253,
"n50" = 6.17,
"p_illness_inf" = 0.5,
"susceptible" = 0.06,
"db_case" = c(0.2, 0.3, 0.4, 0.5)
)
# Shiga toxin-producing E. coli — Sources: Strachan et al. (2005); Teunis et al. (2004)
param_ecoli <- list(
"people" = 10000,
"meanlog" = c(1, 0.1, 0.01, 0.001, 0.0001),
"alpha" = 0.16,
"n50" = 2.11e6,
"p_illness_inf" = 1.00,
"susceptible" = 1.00,
"db_case" = 5.47e-2
)
# Cryptosporidium — Sources: DuPont et al. (1995); Messner et al. (2001)
param_crypto <- list(
"people" = 10000,
"meanlog" = c(0.0001, 0.001, 0.01, 0.1, 1),
"alpha" = 0.115,
"n50" = 132,
"p_illness_inf" = 0.7,
"susceptible" = 1.00,
"db_case" = 1.5e-3
)
Running the Simulations
# cache = TRUE recommended in your Rmd chunk options for this block
# These simulations may take 10-20 minutes total
campylobacter <- MonteCarlo(func = disease_burden, nrep = 1000,
param_list = param_campylobacter)
rotavirus <- MonteCarlo(func = disease_burden, nrep = 1000,
param_list = param_rotavirus)
ecoli <- MonteCarlo(func = disease_burden, nrep = 1000,
param_list = param_ecoli)
crypto <- MonteCarlo(func = disease_burden, nrep = 1000,
param_list = param_crypto)
# Summarize results
summary(campylobacter$results$DALYs)
summary(rotavirus$results$DALYs)
summary(ecoli$results$DALYs)
summary(crypto$results$DALYs)
Visualizing Results
The code below produces a working boxplot comparing campylobacter DALY estimates across organism concentration levels. Use it as a starting point and create additional figures for your report.
# Extract results into a data frame
camp_df <- MakeFrame(campylobacter)
# Boxplot: DALYs by organism concentration level
ggplot(camp_df, aes(x = factor(meanlog), y = DALYs)) +
geom_boxplot(fill = "#4a90d9", alpha = 0.7, outlier.size = 0.8) +
scale_y_continuous(labels = comma) +
labs(x = "Mean Log Organism Concentration",
y = "Total DALYs per 100,000",
title = "Campylobacter Disease Burden by Exposure Level",
subtitle = "1,000 Monte Carlo repetitions per parameter combination") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
4. Sensitivity Analysis: Annual Risk Calculation
The Monte Carlo simulation uses the formula p_year = 1 - (1 - p_illness)^365, which treats each day as an independent Bernoulli trial. In reality, a person who develops campylobacteriosis will be sick for several days and cannot be “re-infected” during that period. The disease burden per case already accounts for illness duration, so treating daily events as fully independent may overestimate risk.
In this exercise you will compare the daily-independence approach against a simpler effective-exposure-days approach that accounts for illness duration. For campylobacteriosis, the average illness duration is approximately 7 days, meaning a person can effectively experience at most ~52 independent exposure windows per year (365/7) rather than 365.
# Sensitivity analysis: compare two annual risk approaches
# Run for a single meanlog value (1) to keep it simple
set.seed(42)
n_sens <- 10000
# Simulate one set of inputs
water <- rgamma(n_sens, shape = 4.0, rate = 2.0)
organisms <- rlnorm(n_sens, meanlog = 1)
dose_sens <- water * organisms
p_inf <- 1 - (1 + (dose_sens / 896) * (2^(1/0.145) - 1))^(-0.145)
p_ill <- p_inf * 0.3
# Approach 1: Daily independence (365 independent trials)
p_year_365 <- 1 - (1 - p_ill)^365
dalys_365 <- sum(p_year_365 * 4.6e-3 * 1.00) * (100000 / n_sens)
# Approach 2: Effective exposure days (accounting for 7-day illness duration)
effective_days <- floor(365 / 7) # ~52 independent exposure windows
p_year_eff <- 1 - (1 - p_ill)^effective_days
dalys_eff <- sum(p_year_eff * 4.6e-3 * 1.00) * (100000 / n_sens)
# Compare
cat("DALYs per 100,000 (365 independent days):", round(dalys_365, 2), "\n")
cat("DALYs per 100,000 (52 effective windows): ", round(dalys_eff, 2), "\n")
cat("Ratio (365/effective): ",
round(dalys_365 / dalys_eff, 2), "x\n")
Part 2
Air Quality Impacts on Health
Cost-Effectiveness Scenarios
Use the Household Air Pollution Intervention Tool (HAPIT) to compare the relative impact of different cookstove technologies across three countries: Ethiopia, India, and one country of your choosing.
HAPIT Input Parameters
| Parameter | Value | Source |
|---|---|---|
| Pre-intervention PM2.5 (three-stone fire) | 386 μg/m³ | Balakrishnan et al. (2013), wood fuel, kitchen median 24-hr |
| Post-intervention PM2.5 (rocket stove) | 216 μg/m³ | HAPIT default, wood fuel |
| Post-intervention PM2.5 (LPG stove) | 54 μg/m³ | HAPIT default, LPG fuel |
| Targeted households | 25,000 | — |
| Adoption rate | 60% | — |
| Stove lifespan | 5 years | Ecozoom Dura and LPG |
| All other parameters | Default values | — |
For each country and intervention scenario, record the total averted DALYs and total averted deaths from HAPIT.
Cost-Effectiveness Analysis
To evaluate cost-effectiveness, apply the WHO-CHOICE thresholds based on per capita GDP. Under this framework, an intervention costing less than three times the national annual GDP per capita per DALY averted is considered cost-effective, and one costing less than one times GDP per capita per DALY averted is considered highly cost-effective. See Revill et al. (2014) for a discussion of this approach and its limitations.
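The threshold logic can be sketched as a small helper function (the name classify_ce and the example numbers are ours, for illustration only):

```r
# Minimal sketch of the WHO-CHOICE cost-effectiveness classification
classify_ce <- function(cost_per_daly, gdp_per_capita) {
  if (cost_per_daly <= gdp_per_capita) {
    "Highly cost-effective"          # under 1x GDP per capita per DALY
  } else if (cost_per_daly <= 3 * gdp_per_capita) {
    "Cost-effective"                 # under 3x GDP per capita per DALY
  } else {
    "Not cost-effective"
  }
}
classify_ce(800, 1000)   # "Highly cost-effective"
classify_ce(2500, 1000)  # "Cost-effective"
```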
Cost Assumptions
To compute cost per DALY averted, you need to estimate the total program cost over the intervention period. Use the following assumptions (you may adjust with justification):
| Cost Component | Rocket Stove | LPG Stove |
|---|---|---|
| Stove purchase price (per unit) | $40 | $50 |
| Annual fuel cost (per household) | $0 (collected wood) | $60 |
| Distribution & training (per household) | $10 | $15 |
| Lifespan | 5 years | 5 years |
| Number of households | 25,000 | 25,000 |
Worked Example
Here is how to calculate cost per DALY averted for a hypothetical scenario (replace with your actual HAPIT results):
# Worked example — replace placeholder values with your HAPIT results
# Total program cost: (stove + distribution) + (fuel x years), all x households
rocket_cost_per_hh <- 40 + 10 + (0 * 5) # = $50 per household over 5 years
lpg_cost_per_hh <- 50 + 15 + (60 * 5) # = $365 per household over 5 years
n_households <- 25000
rocket_total_cost <- rocket_cost_per_hh * n_households # = $1,250,000
lpg_total_cost <- lpg_cost_per_hh * n_households # = $9,125,000
# Example HAPIT output (REPLACE THESE with your actual results)
averted_dalys_rocket_ethiopia <- 500 # placeholder
averted_dalys_lpg_ethiopia <- 1200 # placeholder
# Cost per DALY averted
cost_per_daly_rocket <- rocket_total_cost / averted_dalys_rocket_ethiopia
cost_per_daly_lpg <- lpg_total_cost / averted_dalys_lpg_ethiopia
cat("Rocket stove: $", round(cost_per_daly_rocket), "per DALY averted\n")
cat("LPG stove: $", round(cost_per_daly_lpg), "per DALY averted\n")
# WHO-CHOICE thresholds
# Look up GDP per capita at: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
gdp_ethiopia <- 1027 # example — update with current value
cat("\nEthiopia GDP/capita: $", gdp_ethiopia, "\n")
cat("Highly cost-effective threshold (1x GDP): $", gdp_ethiopia, "\n")
cat("Cost-effective threshold (3x GDP): $", gdp_ethiopia * 3, "\n")
Working with HAPIT Results in R
After running your HAPIT scenarios, enter your results into the data frame below so you can create figures for your report. Replace all placeholder values with your actual HAPIT outputs.
```r
library(dplyr)  # for %>%, mutate(), and case_when()

# Enter your HAPIT results here — replace ALL placeholder values
hapit_results <- data.frame(
  Country        = rep(c("Ethiopia", "India", "[Your Country]"), each = 2),
  Intervention   = rep(c("Rocket Stove", "LPG Stove"), 3),
  Averted_DALYs  = c(0, 0, 0, 0, 0, 0),  # replace with your values
  Averted_Deaths = c(0, 0, 0, 0, 0, 0),  # replace with your values
  Total_Cost     = rep(c(1250000, 9125000), 3),
  GDP_Per_Capita = c(0, 0, 0, 0, 0, 0)   # look up for each country
)

# Calculate cost-effectiveness
# (note: until you replace the 0 placeholders, Cost_Per_DALY will be Inf)
hapit_results <- hapit_results %>%
  mutate(
    Cost_Per_DALY = Total_Cost / Averted_DALYs,
    Threshold_1x  = GDP_Per_Capita,
    Threshold_3x  = GDP_Per_Capita * 3,
    Rating = case_when(
      Cost_Per_DALY <= Threshold_1x ~ "Highly Cost-Effective",
      Cost_Per_DALY <= Threshold_3x ~ "Cost-Effective",
      TRUE ~ "Not Cost-Effective"
    )
  )
print(hapit_results)
```
Figure: Averted DALYs by Country and Intervention
```r
library(ggplot2)
library(scales)  # for comma axis labels

# Grouped bar chart — averted DALYs
ggplot(hapit_results, aes(x = Country, y = Averted_DALYs, fill = Intervention)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("Rocket Stove" = "#d97a4a",
                               "LPG Stove"    = "#4a90d9")) +
  scale_y_continuous(labels = comma) +
  labs(y = "Averted DALYs",
       title = "Health Impact of Cookstove Interventions by Country",
       subtitle = "25,000 households, 60% adoption, 5-year lifespan") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")
```

```r
library(tidyr)  # for pivot_longer()

# Stacked bar chart — averted DALYs + deaths
hapit_long <- hapit_results %>%
  select(Country, Intervention, Averted_DALYs, Averted_Deaths) %>%
  pivot_longer(cols = c(Averted_DALYs, Averted_Deaths),
               names_to = "Metric", values_to = "Value") %>%
  mutate(Metric = gsub("Averted_", "Averted ", Metric))

ggplot(hapit_long, aes(x = interaction(Intervention, Country),
                       y = Value, fill = Metric)) +
  geom_col() +
  scale_y_continuous(labels = comma) +
  labs(x = "", y = "Count",
       title = "Averted DALYs and Deaths by Scenario") +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"),
        legend.position = "bottom")
```
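You might also add a figure that shows cost-effectiveness directly rather than raw averted DALYs, by plotting cost per DALY against each country's WHO-CHOICE thresholds. A sketch with clearly labeled placeholder numbers — substitute the corresponding columns of your own `hapit_results`:

```r
library(ggplot2)
library(scales)  # for comma axis labels

# Placeholder data for illustration only — replace with the Country,
# Intervention, Cost_Per_DALY, Threshold_1x, and Threshold_3x columns
# of your own hapit_results.
ce <- data.frame(
  Country       = rep(c("Country A", "Country B"), each = 2),
  Intervention  = rep(c("Rocket Stove", "LPG Stove"), 2),
  Cost_Per_DALY = c(2500, 7600, 1200, 5000),
  Threshold_1x  = rep(c(1000, 2500), each = 2),
  Threshold_3x  = rep(c(3000, 7500), each = 2)
)

# Bars = cost per DALY averted; dashed line = 1x GDP per capita
# (highly cost-effective cutoff); dotted line = 3x GDP per capita
# (cost-effective cutoff), drawn per facet.
ggplot(ce, aes(x = Intervention, y = Cost_Per_DALY, fill = Intervention)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_hline(aes(yintercept = Threshold_1x), linetype = "dashed") +
  geom_hline(aes(yintercept = Threshold_3x), linetype = "dotted") +
  facet_wrap(~ Country) +
  scale_y_continuous(labels = comma) +
  labs(x = "", y = "Cost per DALY averted (USD)",
       title = "Cost per DALY vs. WHO-CHOICE Thresholds") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
```

A figure like this makes the Rating column visual: any bar below the dashed line is highly cost-effective, between the lines is cost-effective, and above the dotted line is not.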
Discussion Questions for Part 2
- How do you account for the differences in health outcomes between countries when all other inputs were held constant? What assumptions are built into HAPIT’s calculations? (Refer to Chapter 10 of Broken Promises and the HAPIT Info tab.)
- Do you agree with the results? Which assumptions seem least plausible?
- Can you envision using HAPIT as a policy tool for selecting air quality interventions? Why or why not?
- What are the ethical implications of the WHO-CHOICE threshold approach? Consider: Does tying cost-effectiveness to GDP per capita mean that the same intervention could be “cost-effective” in a low-income country but not in a high-income one? What does that imply about how we value health across contexts?
Grading Rubric
| Component | Weight | Criteria |
|---|---|---|
| Introduction & Methods | 20% | Clear research question; methods described accurately enough for replication; assumptions stated and justified with references; appropriate use of QMRA and HAPIT frameworks. |
| Results & Figures | 30% | DALY estimates reported with uncertainty ranges; figures are well-labeled, appropriately chosen, and visually clear; HAPIT results presented in a summary table; GBD baseline comparison is quantitative, not just qualitative. |
| Discussion & Critical Analysis | 35% | All discussion questions addressed substantively; sensitivity analysis results interpreted; limitations are specific (not generic); ethical dimensions of both QMRA and WHO-CHOICE engaged with thoughtfully; connections drawn between Part 1 and Part 2. |
| Code Quality & Reproducibility | 15% | Code runs without errors when knitted; set.seed() preserved; comments explain non-obvious steps; distribution choice justified; no hardcoded paths or unnecessary package installations in final submission. |
Lab Report Requirements
Compose a formal lab report covering both Part 1 and Part 2. The report should not exceed five pages (excluding appendices) and may be submitted as a knitted R Notebook (PDF) or a standalone PDF.
Required Sections
Your report must include a brief introduction, methods, results, discussion, and conclusion.
Required Figures and Tables
- At least one figure comparing DALY estimates across pathogens (Part 1)
- Sensitivity analysis comparison (Section 4)
- At least one figure comparing averted DALYs and/or deaths across countries and interventions (Part 2)
- A summary table of HAPIT results and cost-effectiveness ratings in an appendix
Discussion Questions to Address
Water Quality (Part 1)
- How does your simulated disease burden (assuming Boulder Creek water quality) compare to the IHME baseline? How does it compare with other causes of disease burden in the United States? In low-SDI countries?
- Is disease burden per case the same regardless of context? Justify your answer.
Air Quality (Part 2)
- Would the disease burden from cooking with rocket stoves exceed the current disease burden from smoking in the United States?
- How does the modeled disease burden compare with other leading causes in the United States? In low-SDI countries?
Both Parts
- Simulations require many assumptions. Which assumptions felt most uncertain or difficult to justify from the literature? Which most influenced your results? How confident are you in your findings?
- Why is it important to conduct a Monte Carlo simulation instead of relying on average values?
- Both the QMRA and WHO-CHOICE approaches require assigning numerical values to human health. What are the ethical implications of this? Consider how assumptions about population susceptibility, equal weighting of life-years, and GDP-based thresholds might systematically advantage or disadvantage certain populations.
Using Claude as a Learning Tool
You are encouraged to use Claude (Anthropic’s AI assistant) throughout this assignment. Working effectively with AI tools is a professional skill in public health, data science, and policy analysis — and this lab is a good place to start developing it. That said, AI is most useful when you bring your own understanding to the conversation. The goal is to use Claude to learn more deeply, not to bypass the learning.
Where Claude Can Help
- Debugging R code: Paste error messages or unexpected outputs and ask Claude to help you diagnose the problem. This is often faster than searching Stack Overflow, and Claude can explain why something broke.
- Understanding the code: Ask Claude to walk you through any code block you don’t fully understand. For example: “Explain what the beta-Poisson function does and why the parameters alpha and n50 matter.”
- Improving your figures: Share your ggplot code and ask Claude to help you refine axis labels, color palettes, or layout for a more polished result.
- Strengthening your writing: Paste a draft paragraph from your discussion and ask Claude to identify gaps in your reasoning, suggest counter-arguments, or improve clarity.
- Exploring concepts: Use Claude to go deeper on topics like DALYs, dose-response modeling, or the WHO-CHOICE framework. Ask follow-up questions until the concept clicks.
- Checking assumptions: Ask Claude whether a particular parameter value or distributional choice is reasonable, and what the literature says about alternatives.
What’s Still Your Responsibility
- Running the simulations and collecting data: You need to execute the code, record your GBD Compare and HAPIT results, and generate your own outputs. Claude cannot access these tools for you.
- Making analytical decisions: Choosing between the log-normal and empirical distributions, interpreting your sensitivity analysis, and evaluating cost-effectiveness are judgment calls that should be yours. Claude can help you think through the tradeoffs, but the decision and justification must reflect your understanding.
- Forming your own arguments: The discussion questions ask for your critical assessment. If Claude drafts a paragraph for you, read it carefully — do you actually agree? Can you defend every claim? Revise until every sentence reflects something you understand and believe.
- Verifying AI outputs: Claude can be confidently wrong, especially about specific numbers, parameter values, or niche QMRA literature. Always verify quantitative claims against your own results and the cited references.
Transparency Requirement
Include a brief AI Use Statement at the end of your report (does not count toward the five-page limit). In two to three sentences, describe how you used Claude and for what parts of the assignment. For example:
“I used Claude to debug an error in my Monte Carlo function where the DALY scaling was incorrect, and to help me understand the difference between the beta-Poisson and exponential dose-response models. I also asked Claude to review my discussion of the WHO-CHOICE thresholds and suggest a counter-argument I hadn’t considered.”
There is no penalty for using Claude extensively and no bonus for avoiding it. The AI Use Statement is about practicing transparency, not gatekeeping.
See the syllabus for submission deadlines and formatting requirements.
```r
# Include this at the end of your Rmd for debugging support
sessionInfo()
```