This document details the generative model used by wwinference
to infer global and local infection dynamics from count data (e.g. cases or hospital admissions) and wastewater concentration data, and use that to produce nowcasts and forecasts.
The wwinference
model assumes that wastewater concentration data is available for one or more "local" sites that represent a subset of the total population that produce the "global" epidemiological indicators (cases or admissions).
In the future, we plan to provide the user with functionality for other types of data structures, e.g. multiple data streams of hospital admissions in addition to multiple wastewater concentration data streams, but for now this structure is the only option provided to the user.
The model is structured into four generative components: infections, hierarchical subpopulation-level infection dynamics, hospital admissions, and viral genome concentrations in wastewater.
The model imposes a hierarchical structure where the infection dynamics in the subpopulations represented by the wastewater concentration data are assumed to be localized outbreaks similar to one another and centered around the "global" infection dynamics that give rise to the hospital admissions. Note, we will describe the model in terms of the generation of hospital admissions, but the user can choose to replace this with any "count" dataset with a delay distribution from infection to the generation of that count data, e.g. cases would also work well here.
Our models are constructed from a set of generative components. These are:
- Infection component: A renewal model for the infection dynamics, which generates estimates of incident latent infections per capita.
- Hierarchical subpopulation-level infection dynamics: A model for the relation between the subpopulation infection dynamics and the "global" infection dynamics
- Hospital admissions component: A model for the expected number of hospital admissions given incident latent infections per capita.
- Viral genome concentration in wastewater: A model for the expected genome concentration given incident infections per capita.
See the notation section for an overview of the mathematical notation we use to describe the model components, including how probability distributions are parameterized.
This component assumes that latent (unobserved) expected incident infections per capita
Where
This process is initialized by estimating an initial exponential growth2 of infections for 50 days prior to the calibration start time
where
We decompose the instantaneous reproduction number
We assume that the unadjusted reproduction number
where
The damping term we use is based on Asher et al. 20185 but extended to be applicable to a renewal process. It assumes that the instantaneous reproduction number is damped by recent infections weighted by the generation interval. This is a simple way to account for the fact that the instantaneous reproduction number is likely to decrease when there are many infections in the population, due to factors such as immunity, behavioral changes, and public health interventions. The damping term is defined as:
where
The structure of this model assumes that we have hospital admissions data coming from a larger "global" population (e.g. an entire state or county) and localized wastewater concentration measurements coming from subsets of the global population. We therefore divide the "global" population into subpopulations representing sampled wastewater sites' catchment populations, with an additional subpopulation to represent individuals who do not contribute to the sampled wastewater.
We model infection dynamics in these subpopulations hierarchically: subpopulation infection dynamics are distributed about a central jurisdiction-level infection dynamic, and the total infections that generate the hospital admissions observations are simply the sum of the subpopulation-level infections.
The total population consists of
Whenever the sum of the wastewater catchment population sizes
The total number of subpopulations is then
This amounts to modeling the wastewater catchments populations as approximately non-overlapping; every infected individual either does not contribute to measured wastewater or contributes principally to one wastewater catchment. This approximation is reasonable if we restrict our analyses to primary wastewaster treatment plants, which avoids the possibility that an individual might be sampled once in a sample taken upstream and then sampled again in a more aggregated sample taken further downstream.
When converting from predicted per capita incident hospital admissions
This amounts to making two key additional modeling assumptions:
- Any individuals who contribute to wastewaster measurements but are not part of the total population are distributed among the catchment populations approximately proportional to catchment population size.
- Whenever
$\sum n_k \ge n$ , the fraction of individuals in the jurisdiction not covered by wastewater is small enough to have minimal impact on the jurisdiction-wide per capita infection dynamics.
The hierarchical subpopulation structure linking infection dynamics in each subpopulation to a central or "global" dynamic is implemented using a reference subpopulation.
The reference subpopulation is by default the subpopulation not covered by wastewater, or in the case where the sum of the wastewater site catchment populations meet or exceed the total population (
We couple the subpopulation and total population infection dynamics at the level of the un-damped instantaneous reproduction number in the reference subpopulation,
We model the subpopulations as having infection dynamics that are similar to one another but can differ from the reference subpopulation dynamic.
We represent this with a hierarchical model where we estimate the reference subpopulation's un-damped effective reproductive number
The refrence value for the undamped instantaneous reproductive number
where
The time-varying subpopulation effect on
where
We chose a prior of
The subpopulation
From
To obtain the number of infections per capita
We infer the site level initial per capita incidence
Following other semi-mechanistic renewal frameworks, we model the expected hospital admissions per capita
To account for day-of-week effects in hospital reporting, we use an estimated weekday effect
Where
We define the discrete hospital admissions delay distribution
In the models that include fits to wastewater data, we allow the population-level infection-hospitalization rate (IHR) to change over time. An inferred change in the IHR could reflect either a true change in the rate at which infections result in hospital admissions (e.g. the age distribution of cases could shift, a more or less severe variant could emerge, or vaccine coverage could shift) or a change in the relationship between infections and genomes shed in wastewater
Therefore, we model the proportion of infections that give rise to hospital admissions
The values
where
where
We chose a relatively strong prior of
In the version of the model where we do not fit the wastewater data (which we refer to as the "hospital admissiosn only" model), we model the IHR as constant. We assign this constant IHR the same prior distribution that we assign
We model the observed hospital admission counts
where the "global" population size
Currently, we do not explicitly model the delay from hospital admission to reporting of hospital admissions. In reality, corrections (upwards or downwards) in the admissions data after the report date are possible and do happen. This is an active area of further development, but for now, we advise the user to manually exclude hospital admissions data points that appear implausible. Future work will include incorporation of a simple model for right-truncation when data is rolling in in real-time with incomplete reporting in recent days. However, the current workflow assumes mandatory and for the most part complete reporting of hospital admissions.
We model site-specific viral genome concentrations in wastewater
where
This approach assumes that
We model the shedding kinetics
where
Future iterations of this model will evaluate the utility of mechanistic modeling of wastewater collection and processing.
Genome concentration measurements can vary between sites, and even within a site through time, because of differences in sample collection and lab processing methods. To account for this variability, we add a scaling term
Both
such that
See Prior Distributions for priors on
In the rare cases when a site submits multiple concentrations for a single date and lab method, we treat each record as an independent observation.
Lab processing methods have a finite limit of detection (LOD), such that not all wastewater measurements can be modeled using the log-normal approach above. This limit of detection varies across sites, between methods, and potentially also over time.
If an observed value
where
(This is mathematically equivalent to integrating the probability density function of the log-normal distribution from zero to the LOD.)
The default parameters provided by the wwinference
package are used to fit a model of COVID-19 hospital admissions and wastewater concentrations in terms of reported SARS-CoV-2 genome copies per mL.
Below we will describe the priors and parameters provided. If fitting the model to a different epidemiological indicator (e.g. cases) or a different pathogen (e.g. flu) a number of these will have to be modified accordingly.
We use informative priors for parameters that have been well characterized in the literature and weakly informative priors for parameters that have been less well characterized.
Parameter | Prior distribution | Source |
---|---|---|
Initial hospitalization probability | Perez-Guzman et al. 2023 8 | |
Time to peak fecal shedding | Russell et al. 2023 9, Huisman et al. 2022 10, Cavany et al. 2022 11 | |
Peak viral shedding |
Miura et al. 2021 12 | |
Duration of shedding | Cevik et al. 2021 13, Russell et al. 2023 9 | |
Total genomes shed per infected individual | Watson et al 202314 | |
Initial infections per capita |
where |
|
Initial exponential growth rate | Chosen to assume flat dynamics prior to observations | |
Infection feedback term | Weakly informative prior chosen to have a mode of 500 in natural scale, based on posterior estimates of peaks from prior seasons in a few jurisdictions | |
Day of the week effects | Weakly informative prior with a mode at even daily reporting (no effects) | |
Standard deviation of the log of the site-lab level multiplier |
Weakly informative prior chosen to allow average magnitude of concentrations to be either similar or different among individual sites, depending on data | |
Modal site-level observation standard deviation | Weakly informative prior chosen to allow the mode to be either small or large | |
Standard deviation of the Normal distribution of individual log observation standard deviations |
Weakly informative prior which allows for individual s.d.s to be either clustered around the mode or more dispersed |
Parameter | Value | Source |
---|---|---|
Maximum generation interval |
|
|
Maximum infection to hospital admissions delay |
|
|
Wastewater produced per person-day |
|
Ortiz 202415 |
The discrete generation interval probability mass function
We derive the distribution
We model the incubation period with a discretized, modified Weibull distribution6 with probability mass function
We model the symptom onset to hospital admission delay distribution with a Negative Binomial distribution with probability mass function
The infection-to-hospitalization delay distribution
This resulting infection to hospital admission delay distribution has a mean of 12.2 days and a standard deviation of 5.67 days.
Our framework is an extension of the widely used 18 19, semi-mechanistic renewal framework {EpiNow2}
202, using a Bayesian latent variable approach implemented in the probabilistic programming language Stan 21 using 22 to interface with R.
We fit the model using the “No-U-Turn Sampler Markov chain Monte Carlo” method. This is a type of Hamiltonian Monte Carlo (HMC) algorithm and is the core fitting method used by cmdstan
.
The dfault parameter settings are set to run 4 chains for 750 warm-up iterations and 500 sampling iterations, with a target average acceptance probability of 0.95 and a maximum tree depth of 12.
The user can adjust these settings using the get_mcmc_options()
function.
We identify potential outlier genome concentrations for each unique site and lab pair with an approach based on
Briefly, we compute
- For purposes of outlier detection, exclude wastewater observations below the LOD.
- For purposes of outlier detection, exclude observations more than 90 days before the forecast date.
- For each site
$i$ , compute the change per unit time between successive observations$t$ and$t'$ :$(\log[c_{it'}] - \log[c_{it}])/(t' - t)$ . - Compute
$z$ -scores for$\log[c_{it}]$ across all sites$i$ and timepoints$t$ . Flag values with$z$ -scores over 3 as outliers and remove them from model calibration. - Compute
$z$ -scores for the change per unit time values across all sites and pairs of timepoints. For values with$z$ -scores over 2, flag the corresponding wastewater concentrations$c_{it}$ as outliers and remove them from model calibration.
The
The notation
We parameterize Normal distributions in terms of their mean and standard deviation:
We parameterize Beta distributions in terms of their two standard shape parameters
We parameterize Negative Binomial distributions in terms of their mean and their positive-constrained dispersion parameter (often denoted
We write
Observed data are labeled by data source:
Footnotes
-
Cori, A., Ferguson, N. M., Fraser, C., & Cauchemez, S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am. J. Epidemiol. 178, 1505-1512 (2013). https://doi.org/10.1093/aje/kwt133 ↩
-
Abbott, S. et al. EpiNow2: Estimate real-time case counts and time-varying epidemiological parameters. https://doi.org/10.5281/zenodo.3957489 ↩ ↩2 ↩3
-
Fraser, C. (2007). Estimating individual and household reproduction numbers in an emerging epidemic. PLoS One, 2(8), e758 (2007). https://doi.org/10.1371/journal.pone.0000758 ↩
-
Gostic, K.M. et al. Practical Considerations for Measuring the Effective Reproductive Number, Rt. PLoS Comput Biol. 16(12) (2020). https://doi.org/10.1371/journal.pcbi.1008409 ↩
-
Asher, J. Forecasting Ebola with a regression transmission model. Epidemics. 22, 50-55 (2018). https://doi.org/10.1016/j.epidem.2017.02.009 ↩ ↩2
-
Park, S.W. et al. Inferring the differences in incubation-period and generation-interval distributions of the Delta and Omicron variants of SARS-CoV-2. Proc Natl Acad Sci U S A. 120(22):e2221887120 (2023). https://doi.org/10.1073/pnas.2221887120 ↩ ↩2 ↩3
-
Larremore, D.B. et al. Test sensitivity is secondary to frequency and turnaround time for COVID-19 screening. Science Advances (2021). https://doi.org/10.1126/sciadv.abd5393 ↩
-
Perez-Guzman, P.N. et al. Epidemiological drivers of transmissibility and severity of SARS-CoV-2 in England. Nat Commun 14, 4279 (2023). https://doi.org/10.1038/s41467-023-39661-5 ↩
-
Russell, T.W. et al. Within-host SARS-CoV-2 viral kinetics informed by complex life course exposures reveals different intrinsic properties of Omicron and Delta variants. medRxiv (2023). https://doi.org/10.1101/2023.05.17.23290105 ↩ ↩2
-
Huisman, J.S. et al. Estimation and worldwide monitoring of the effective reproductive number of SARS-CoV-2 eLife 11:e71345 (2022). https://doi.org/10.7554/eLife.71345 ↩
-
Cavany S, et al. Inferring SARS-CoV-2 RNA shedding into wastewater relative to the time of infection. Epidemiology and Infection 150:e21 (2022). https://doi.org/10.1017/S0950268821002752 ↩
-
Miura F, Kitajima M, Omori R. Duration of SARS-CoV-2 viral shedding in faeces as a parameter for wastewater-based epidemiology: Re-analysis of patient data using a shedding dynamics model. Sci Total Environ 769:144549 (2021). https://doi.org/10.1016/j.scitotenv.2020.144549 ↩
-
Cevik, M. et al. SARS-CoV-2, SARS-CoV, and MERS-CoV viral load dynamics, duration of viral shedding, and infectiousness: a systematic review and meta-analysis. Lancet Microbe 2(1),e13-e22 (2021). https://doi.org/10.1016/S2666-5247(20)30172-5 ↩
-
Leighton, M. et al. Improving estimates of epidemiological quantities by combining reported cases with wastewater data: a statistical framework with applications to COVID-19 in Aotearoa New Zealand. medRxiv (2023). https://doi.org/10.1101/2023.08.14.23294060 ↩
-
Ortiz, P. Wastewater facts - statistics and household data in 2024. https://housegrail.com/wastewater-facts-statistics/ ↩
-
Park, S.W. et al. Estimating epidemiological delay distributions for infectious diseases. medRxiv (2024). https://doi.org/10.1101/2024.01.12.24301247 ↩ ↩2 ↩3
-
Danaché, C. et al. Baseline clinical features of COVID-19 patients, delay of hospital admission and clinical outcome: A complex relationship. PLoS One 17(1):e0261428 (2022). https://doi.org/10.1371/journal.pone.0261428 ↩
-
US Centers for Disease Control and Prevention. Current Epidemic Growth Status (Based on Rt) for States and Territories. https://www.cdc.gov/forecast-outbreak-analytics/about/rt-estimates.html (2024). ↩
-
US Centers for Disease Control and Prevention. Technical Blog: Improving CDC’s Tools for Assessing Epidemic Growth (2024). https://www.cdc.gov/forecast-outbreak-analytics/about/technical-blog-rt.html ↩
-
Abbott, S. et al. Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. Wellcome Open Res. 5:112 (2020). https://doi.org/10.12688/wellcomeopenres.16006.2 ↩
-
Stan Development Team. Stan Modeling Language Users Guide and Reference Manual. (2023). https://mc-stan.org ↩
-
CmdStanR: the R interface to CmdStan. (2024). https://mc-stan.org/cmdstanr/index.html ↩