Estimating the sensitivity of wastewater metagenomic sequencing using nasal swabs

Simon Grimm, Dan Rice, Mike McLaren

Date June 8, 2025

We thank Vanessa Smilansky for processing and sequencing swab samples, Jeff Kaufman for reviewing drafts of this post, and Evan Fields for code review.

Summary

To assess the sensitivity of untargeted wastewater sequencing for pathogen detection, we linked wastewater sequencing data and swab sequencing data in a Bayesian model. Common cold viruses (rhinoviruses and seasonal coronaviruses) showed high population prevalence (>1% for the majority of species) but relatively low detectability in wastewater. SARS-CoV-2 is again confirmed to be readily detectable. As we produce more wastewater and swab sequencing data, we will be able to narrow these estimates further and improve our understanding of how different biosurveillance methods complement one another.

Estimating the sensitivity of wastewater metagenomic sequencing

Prior to the Nucleic Acid Observatory scaling up its wastewater surveillance program, we investigated a key question: how deeply do you need to sequence a city’s wastewater to see a pathogen spreading through the population? Based on answers for SARS-CoV-2 and influenza, we believe that our scaled surveillance system could flag similar pathogens early enough to mount a response. However, knowing how well we can detect a wider range of pathogens is important to assess the potential and limitations of wastewater surveillance.

To assess wastewater sequencing sensitivity, we need two pieces of information: the amount of viral genetic material we find in our sequencing data and the percentage of infected people contributing to the sample. By combining this data we can calculate a summary statistic called RA_p(1%) (Grimm et al. 2023). This statistic represents the relative abundance (RA) of sequencing reads we would expect from a given virus if its prevalence (p) in the population contributing to the wastewater sample were 1%. Higher RA_p(1%) values indicate viruses are easier to detect.
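
As a rough illustration of this definition, the sketch below rescales an observed relative abundance to 1% prevalence. It assumes relative abundance is proportional to prevalence; the function name `naive_ra_at_1pct` is hypothetical, and this point calculation is a simplification rather than the Bayesian model used for the estimates in this post.

```python
def naive_ra_at_1pct(relative_abundance: float, prevalence: float) -> float:
    """Rescale an observed relative abundance to 1% prevalence.

    Assumption (illustrative only): relative abundance scales
    linearly with prevalence.
    """
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return relative_abundance * 0.01 / prevalence


# SARS-CoV-2: median wastewater relative abundance (Table S1) and
# posterior median prevalence (Table S2).
estimate = naive_ra_at_1pct(1.87e-7, 2.66e-3)
print(f"{estimate:.1e}")
```

For SARS-CoV-2 this gives roughly 7e-7, close to the posterior median in Table S3, though a point calculation like this carries none of the uncertainty the model reports.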

Until recently we weren’t able to estimate RA_p(1%) for most respiratory viruses: while we generate a lot of wastewater sequencing data, reliable prevalence information is available for only a small number of viral species.

However, as part of our effort to explore alternative sample types, we’ve recently begun collecting, pooling, and sequencing nasal swabs. By pooling many fewer individuals per sample (~50 vs tens or hundreds of thousands with wastewater), there is typically at most one person infected with a given virus per sample. Thus, the pattern of positive and negative swab samples gives us information we can use to estimate prevalence.
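
To make the link between pool positivity and prevalence concrete, here is a minimal sketch of the classic group-testing estimator. It assumes equal pool sizes and perfect detection (no false negatives), neither of which holds for our data, and the function name is hypothetical; the model described later relaxes both assumptions.

```python
def naive_pooled_prevalence(
    positive_pools: int, total_pools: int, pool_size: float
) -> float:
    """Group-testing estimate of prevalence from pooled samples.

    Uses P(pool negative) = (1 - p) ** pool_size, assuming equal pool
    sizes and no false negatives (illustrative simplifications).
    """
    if positive_pools >= total_pools:
        raise ValueError("estimate is undefined when every pool is positive")
    negative_fraction = 1 - positive_pools / total_pools
    return 1 - negative_fraction ** (1 / pool_size)


# HCoV-OC43 was detected in 5 of 12 pools; pool sizes are from Table S4.
pool_sizes = [37, 48, 33, 58, 60, 37, 33, 44, 60, 57, 33, 47]
mean_pool_size = sum(pool_sizes) / len(pool_sizes)
print(naive_pooled_prevalence(5, 12, mean_pool_size))
```

For HCoV-OC43 this gives roughly 1.2%, below the posterior median of about 2% in Table S2, which is the direction one would expect from an estimator that ignores false negatives.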

Comparing sequencing data from swabs and wastewater, we are able to estimate wastewater RA_p(1%) for an expanded set of respiratory viruses, using the following approach:

Identify human respiratory viruses (coronaviruses, Mononegavirales, influenza, and rhinoviruses) in both swab and wastewater sequencing datasets.
Count the number of reads assigned to each viral species in each sample.
Fit a Bayesian statistical model to this count data to estimate prevalence and RA_p(1%) for each species.
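
The counting step above can be sketched as follows. The input format (one `(sample_id, species)` pair per deduplicated viral read, plus a dictionary of total raw read counts) and the sample names are assumptions made for illustration; they are not the output format of our pipeline.

```python
from collections import Counter


def relative_abundance(assignments, total_reads):
    """Compute per-sample, per-species relative abundance.

    assignments: iterable of (sample_id, species) pairs, one per
        deduplicated viral read (hypothetical input format).
    total_reads: dict mapping sample_id to total raw read count.
    Returns a dict mapping (sample_id, species) to relative abundance.
    """
    counts = Counter(assignments)
    return {
        (sample_id, species): n / total_reads[sample_id]
        for (sample_id, species), n in counts.items()
    }


# Made-up example data, for illustration only.
reads = [
    ("sample_1", "HCoV-OC43"),
    ("sample_1", "HCoV-OC43"),
    ("sample_2", "Rhinovirus C"),
]
depths = {"sample_1": 1000, "sample_2": 500}
print(relative_abundance(reads, depths))
```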

Respiratory viruses are detected in both wastewater and swab samples

Between January 7th and February 12th, 2025, we collected 12 swab sample pools (ranging from 33 to 60 swabs per pool; see Table S4) in busy public places in Downtown Boston. We extracted RNA (protocol) and sequenced libraries using Oxford Nanopore sequencing¹. Simultaneously, as part of our routine wastewater surveillance system, our collaborators in the Johnson lab at the University of Missouri sequenced wastewater samples², which were collected at one-week intervals at Deer Island Wastewater Treatment Plant (Boston).

Identifying reads from respiratory viruses³, we detected 17 viral species in wastewater, and a subset of 12 of these species in swabs (Figure 1, Table S1). Beyond SARS-CoV-2 and influenza, which we’ve previously studied, wastewater contained all known species of rhinovirus and seasonal coronaviruses (the most common causes of the common cold), and a few other viruses, including respiratory syncytial and parainfluenza viruses.

In swab sequencing data, we similarly observed all rhinovirus and seasonal coronavirus species. Other viruses were harder to detect: we found influenza in only one swab pool despite a severe flu season and SARS-CoV-2 appeared in just a single swab pool, even though it showed up at much higher levels in wastewater than any other respiratory virus. This mismatch might be due to infected individuals tending to stay home rather than visiting the public spaces where we collected samples.

**Fig. 1: Wastewater relative abundance and swab sample positivity.** The left panel shows virus relative abundance in wastewater sequencing data, with individual points representing individual wastewater samples. The right panel shows the number of swab sample pools testing positive for a given virus.

Estimating prevalence and wastewater sensitivity

We designed a Bayesian model to jointly infer prevalence and RA_p(1%) from our sequencing data. See the Appendix for details of the model definition, prior distributions, model fitting, and model checking.

Estimating prevalence from pooled swab sequencing is challenging because we don’t know the method’s sensitivity. Zero reads for a given viral species and swab pool could indicate no infections, poor swabbing, or insufficient sequencing depth. We model the viral read generation process to account for this uncertainty. For instance, when the product of sequencing depth and expected viral abundance is low, our model assigns high probability to observing zero reads even with infections present. The model accounts for false negatives like these by adjusting the prevalence estimates upward.
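
The effect of sequencing depth on false negatives can be illustrated with a simple Poisson calculation. This is a sketch under a Poisson assumption, which is simpler than the overdispersed read-count model we actually fit (see Appendix); the relative abundance of 1e-6 is a made-up value.

```python
import math


def prob_zero_reads(sequencing_depth: int, expected_relative_abundance: float) -> float:
    """Probability of observing zero viral reads under a Poisson model.

    The Poisson mean is depth * expected relative abundance
    (an illustrative simplification of the fitted model).
    """
    expected_reads = sequencing_depth * expected_relative_abundance
    return math.exp(-expected_reads)


# Shallowest swab pool in Table S4 (175,859 reads) with a hypothetical
# expected relative abundance of 1e-6 in a pool containing an infection.
print(prob_zero_reads(175_859, 1e-6))
```

With these inputs the probability of seeing no viral reads is roughly 0.84, so a negative pool at this depth would say little about whether an infected person contributed to it.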

We applied the model to the seventeen respiratory viruses observed in our wastewater sequencing data. For coronaviruses and rhinoviruses our swab data was sufficient to estimate prevalence. However, for influenza and Mononegavirales⁴, we had too few viral reads in swabs to estimate prevalence⁵. Accordingly, we restrict further analyses to SARS-CoV-2, seasonal coronaviruses, and rhinoviruses.

Rhinoviruses and seasonal coronaviruses frequently have more than 1% prevalence

**Fig. 2: Posterior distributions of the prevalence for SARS-CoV-2, seasonal coronaviruses, and rhinoviruses.** White lines indicate 15th percentile, median, and 85th percentile.

Our prevalence estimates for the eight coronaviruses and rhinoviruses (Figure 2, Table S2) vary by an order of magnitude. On the low end, HCoV-229E⁶, detected in 1/12 pools, has a posterior median prevalence of 0.2% (with a 15th to 85th percentile range of 0.1% to 0.5%). On the high end, HCoV-OC43, detected in 5/12 pools, has a posterior median prevalence of 2% (1% to 4%). Four out of seven cold viruses have a median prevalence above 1%. The relatively wide credible intervals in our estimates are explained by two sources of uncertainty: our limited sample size (12 pools total) and unknown false negative rates from factors like poor swabbing technique or low viral shedding (see Appendix).

Cold viruses show relatively low wastewater sensitivity

**Fig. 3: Posterior distributions of both new RA_p(1%) estimates for SARS-CoV-2, seasonal coronaviruses, and rhinoviruses, and previous RA_i(1%) estimates for SARS-CoV-2 and influenza A.** White lines indicate 15th percentile, median, and 85th percentile.

Our estimates of RA_p(1%) reveal substantial variation between viruses: RA_p(1%) varies by a factor of 360 from 2e-9 (with a 15th to 85th percentile range of 8e-10 to 6e-9) for rhinovirus B, to 7e-7 (3e-7 to 2e-6) for SARS-CoV-2 (Figure 3, Table S3). Our estimate for SARS-CoV-2 is significantly higher than for the other viruses, consistent with our prior understanding that SARS-CoV-2 is the most readily detectable respiratory virus in wastewater. Our RA_p(1%) estimates have substantial uncertainty, most of which is due to uncertainty in the prevalences (see Appendix).

Previously, we estimated expected relative abundance at 1% weekly incidence (RA_i(1%)) for SARS-CoV-2 and Influenza A. Though comparing RA_p with RA_i is complicated⁷, our median RA_p(1%) estimates for rhinoviruses and seasonal coronaviruses are up to ten times lower than our previous median RA_i(1%) estimate for influenza (1e-8). For SARS-CoV-2, our current estimate (7e-7) is one order of magnitude higher than Grimm et al. 2023’s SARS-CoV-2 estimate (6e-8). Aside from the differences in metrics estimated (RA_p(1%) here vs. RA_i(1%) previously), there are other factors that could explain these differences, such as different sequencing protocols or our swab-based prevalence measurements underestimating true community prevalence.
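
One way to see why the two metrics differ is the standard steady-state approximation that prevalence is roughly weekly incidence multiplied by shedding duration in weeks. The sketch below applies that approximation; it is an assumption made for illustration, not a conversion we used in this analysis, and the function name and the two-week duration are hypothetical.

```python
def ra_p_from_ra_i(ra_i_1pct: float, shedding_weeks: float) -> float:
    """Convert RA_i(1%) to RA_p(1%) under a steady-state assumption.

    Assumes prevalence = weekly incidence * shedding duration (weeks),
    so 1% prevalence corresponds to a weekly incidence of
    1% / shedding_weeks.
    """
    if shedding_weeks <= 0:
        raise ValueError("shedding_weeks must be positive")
    return ra_i_1pct / shedding_weeks


# With one week of shedding the two metrics coincide; with a
# hypothetical two weeks, RA_p(1%) is half of RA_i(1%).
print(ra_p_from_ra_i(1e-8, 1.0))
print(ra_p_from_ra_i(1e-8, 2.0))
```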

Caveats

Though this work allows us to estimate wastewater sequencing sensitivity for a new set of viruses, several caveats apply:

The swab data only reflects prevalence among people present at our public sampling sites, which likely underestimates true population prevalence for two reasons: First, symptomatic individuals tend to stay home, perhaps explaining why we detected only one influenza read despite a severe flu season. Second, our limited sampling locations may miss localized outbreaks that would appear in wastewater data from the broader sewershed.
We do not know the probability of false negative pools from poor swabbing. Our model accounts for this uncertainty with a prior distribution on swab read count overdispersion. We chose this prior to allow for wide variation in the implied rate of poor swabbing (from less than 10% to over 50%). Because the data provide almost no information to constrain this parameter, our prevalence estimates depend on the prior distribution. For example, if we knew a priori that 50% of swabs taken were bad, our estimates of prevalence would be about twice as high as if we knew that almost none were bad. See Appendix for details.
Viral prevalences likely vary in time and space in ways that affect the swab and wastewater samples differently. Our model assumes a constant prevalence for each virus over the five-week study period.
Our model uses an operational definition of prevalence—the fraction of people shedding virus in the nose—which may be smaller than the total fraction of people infected.
Our model does not account for the probability of bioinformatic errors. For instance, false-positive assignment errors in our wastewater data will bias RA_p(1%) upward, while false-negative assignments will bias it downward.
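
The sensitivity to poor swabbing described in the second caveat can be illustrated with a back-of-the-envelope correction. The sketch assumes that a bad swab contributes no viral reads and that swab quality is independent of infection status; our model instead handles this through a prior on read count overdispersion, so the function below is hypothetical and only shows the direction and rough size of the effect.

```python
def adjust_for_bad_swabs(apparent_prevalence: float, bad_swab_fraction: float) -> float:
    """Scale an apparent prevalence up to account for failed swabs.

    Assumes a fraction of swabs yields no viral material regardless of
    infection status, so apparent = true * (1 - bad_swab_fraction).
    """
    if not 0 <= bad_swab_fraction < 1:
        raise ValueError("bad_swab_fraction must be in [0, 1)")
    return apparent_prevalence / (1 - bad_swab_fraction)


# A hypothetical apparent prevalence of 1% under two swab-quality scenarios.
print(adjust_for_bad_swabs(0.01, 0.0))
print(adjust_for_bad_swabs(0.01, 0.5))
```

If half of all swabs were bad, the implied prevalence would be twice the apparent one, matching the factor of about two noted in the caveat.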

Conclusion

The findings suggest that many common cold pathogens are detectable via untargeted wastewater sequencing, though sensitivity for these viruses appears lower than for influenza and SARS-CoV-2. As we gather more wastewater and swab sequencing data, we will be able to narrow these estimates further and develop our understanding of which type of pathogen surveillance is most effective for detecting different pathogen types.

Appendix

See here for the full appendix, including a more detailed description of the Bayesian model: https://naobservatory.github.io/swab-based-p2ra/

Supplemental tables

Pathogen	Median WW RA	Geomean (excl. 0 RA samples)	Min WW RA	Max WW RA	WW Presence	Swab Presence
Species
Flu B	4.64E-10	2.23E-09	0	4.99E-09	6/12	0/12
H1N1	1.12E-08	9.82E-09	0	2.36E-08	11/12	1/12
H3N2	1.21E-08	1.18E-08	2.96E-09	3.13E-08	12/12	0/12
HCoV-229E	2.27E-09	3.48E-09	0	1.14E-08	8/12	1/12
HCoV-HKU1	4.26E-10	1.30E-09	0	1.90E-09	6/12	1/12
HCoV-NL63	4.36E-09	4.33E-09	8.52E-10	1.52E-08	12/12	4/12
HCoV-OC43	1.56E-08	1.61E-08	5.98E-09	4.55E-08	12/12	5/12
HMPV-1	0	1.23E-09	0	1.23E-09	1/12	0/12
HPIV1	0	1.33E-09	0	1.90E-09	2/12	0/12
HPIV2	0	1.92E-09	0	3.05E-09	3/12	0/12
HPIV4	0	1.27E-09	0	1.90E-09	4/12	1/12
RSV-A	2.81E-09	4.10E-09	0	2.37E-08	10/12	1/12
RSV-B	4.26E-10	2.56E-09	0	1.14E-08	6/12	1/12
Rhinovirus A	2.94E-09	3.26E-09	0	1.23E-08	10/12	3/12
Rhinovirus B	4.64E-10	2.03E-09	0	4.74E-09	6/12	2/12
Rhinovirus C	1.66E-08	1.38E-08	9.97E-10	7.50E-08	12/12	4/12
SARS-CoV-2	1.87E-07	1.59E-07	5.98E-08	3.53E-07	12/12	1/12
Virus Groups
Coronaviruses (SARS-CoV-2)	1.87E-07	1.59E-07	5.98E-08	3.53E-07	12/12	1/12
Coronaviruses (seasonal)	2.57E-08	2.42E-08	1.02E-08	7.40E-08	12/12	7/12
Influenza	2.07E-08	2.11E-08	2.96E-09	4.78E-08	12/12	1/12
Mononegavirales	4.68E-09	4.65E-09	8.52E-10	3.89E-08	12/12	3/12
Rhinoviruses	2.12E-08	1.64E-08	9.97E-10	9.20E-08	12/12	8/12

Table S1. Summary of the relative abundance and presence of human-infecting respiratory viruses in wastewater and swab sequencing data. Column definitions: Median WW RA, Min WW RA, and Max WW RA show the median, minimum, and maximum relative abundance observed across all wastewater samples. Geomean (excl. 0 RA samples) shows the geometric mean across non-zero wastewater relative abundance measurements. WW Presence and Swab Presence show the number of wastewater and swab samples in which a given virus was observed. The relative abundance of a virus group is the sum of the relative abundances of that group’s species.

Species	Pos Pools	Viral Reads	Q15	Median	Q85	Q85 / Q15
SARS-CoV-2	1	78	9.15E-04	2.66E-03	6.57E-03	7.18
HCoV-229E	1	6008	6.73E-04	1.99E-03	4.97E-03	7.38
HCoV-HKU1	1	25	7.98E-04	2.52E-03	6.97E-03	8.73
HCoV-NL63	4	22	8.34E-03	1.70E-02	3.62E-02	4.34
HCoV-OC43	5	160	1.03E-02	1.94E-02	3.79E-02	3.67
Rhinovirus A	3	9	6.02E-03	1.40E-02	3.33E-02	5.53
Rhinovirus B	2	20	2.44E-03	5.96E-03	1.37E-02	5.62
Rhinovirus C	4	100	6.92E-03	1.37E-02	2.69E-02	3.89

Table S2. Summary of swab data and posterior distribution of the prevalence. Column definitions: Pos Pools shows the number of pools with viral reads; Viral Reads shows the total viral reads across pools; Q15, Median, and Q85 show the 15th, 50th, and 85th percentiles of the posterior distribution; Q85 / Q15 shows the ratio of the 85th to the 15th percentile, a measure of the posterior uncertainty (the interval spans approximately one geometric standard deviation on either side of the geometric mean).

Species	Q15	Median	Q85	Q85 / Q15
SARS-CoV-2	2.84E-07	7.21E-07	2.11E-06	7.43
HCoV-229E	5.45E-09	1.50E-08	4.66E-08	8.54
HCoV-HKU1	7.80E-10	2.45E-09	8.54E-09	10.96
HCoV-NL63	1.57E-09	3.53E-09	7.47E-09	4.74
HCoV-OC43	5.09E-09	1.04E-08	2.04E-08	4.01
Rhinovirus A	1.09E-09	2.82E-09	7.05E-09	6.49
Rhinovirus B	7.61E-10	2.01E-09	5.64E-09	7.42
Rhinovirus C	7.59E-09	1.58E-08	3.33E-08	4.38

Table S3. Summary of posterior distribution of RA_p(1%). Column definitions: Q15, Median, and Q85 show the 15th, 50th, and 85th percentiles of the posterior distribution; Q85 / Q15 shows the ratio of the 85th to the 15th percentile, a measure of the posterior uncertainty (the interval spans approximately one geometric standard deviation on either side of the geometric mean).
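
To relate the Q85 / Q15 column to a geometric standard deviation, the sketch below assumes the posterior is approximately log-normal, which is an assumption made for illustration and not something we have checked for every virus. Under that assumption the 15th and 85th percentiles sit about 1.04 standard deviations either side of the median on the log scale.

```python
import math
from statistics import NormalDist


def geometric_sd_from_ratio(q85_over_q15: float) -> float:
    """Approximate geometric standard deviation from the Q85 / Q15 ratio.

    Assumes an approximately log-normal distribution, so that
    log(Q85 / Q15) = 2 * z * sigma with z the 85th percentile of the
    standard normal distribution.
    """
    z = NormalDist().inv_cdf(0.85)  # about 1.04
    sigma = math.log(q85_over_q15) / (2 * z)
    return math.exp(sigma)


# SARS-CoV-2 row of Table S3: Q85 / Q15 = 7.43.
print(geometric_sd_from_ratio(7.43))
```

For SARS-CoV-2 this implies a geometric standard deviation of about 2.6, meaning that one standard deviation corresponds to a factor of roughly 2.6 above or below the median.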

Date	No. swabs	No. reads
Jan 7, 2025	37	328894
Jan 9, 2025	48	243294
Jan 13, 2025	33	407678
Jan 15, 2025	58	857131
Jan 16, 2025	60	226099
Jan 22, 2025	37	534204
Jan 23, 2025	33	753766
Jan 27, 2025	44	817608
Jan 29, 2025	60	1238229
Feb 4, 2025	57	661238
Feb 5, 2025	33	175859
Feb 12, 2025	47	216826

Table S4. Swab sample sequencing depth and swab counts. All samples were collected in Boston.

Footnotes

Swab samples were sequenced on an Oxford Nanopore PromethION P2 Solo.↩︎
Wastewater samples were sequenced on an Illumina NovaSeq X 25B flow cell.↩︎
We analyzed sequencing data with mgs-workflow (https://github.com/naobservatory/mgs-workflow). For swab samples we used mgs-workflow v2.8.3.2; for Illumina wastewater sequencing data we used v2.8.1. The documentation for how viral classification differs between Illumina short-read and Oxford Nanopore long-read data can be found here. We then subset mgs-workflow results to reads that match respiratory pathogens (Mononegavirales, Coronaviridae, Enterovirus (excluding non-respiratory species), and Orthomyxoviridae). Virus assignments were subsequently BLAST-validated, using an approach that tried to assign as many virus reads as possible with as few reference genomes as necessary. BLAST was not used for SARS-CoV-2 reads or for HCoV-229E reads in one sample (due to very long BLAST run times). For those two viruses we spot-checked a subset of reads manually. Relative abundance metrics are given as deduplicated read counts divided by total raw read count.↩︎
A virus order containing respiratory syncytial virus, metapneumovirus, and parainfluenza virus species.↩︎
For most of the viruses in these groups we observed zero or one read in our swab data. With such low read counts, it is difficult to distinguish between low prevalence and a high false negative rate. As a result, our prevalence and RA_p(1%) estimates for these viruses are very uncertain and dominated by our priors. As we sequence more swabs, we will be able to create more robust estimates for these viruses.↩︎
We note that HCoV-229E is a large outlier in our swab data with 6,008 viral reads in its single positive pool (vs 160 for the next-largest count for any virus). Our model interprets this observation as indicating a high expected relative abundance in positive swabs (posterior median 27% vs 0.67% for SARS-CoV-2) and relatively low probability that a positive swab contributes no viral reads to a pool (posterior median 9.1% vs 27% for SARS-CoV-2). With only one pool containing viral reads, it is unclear whether this interpretation is correct or whether the large viral read count reflects high variance instead.↩︎
We previously estimated RA_i(1%), i.e., relative abundance at 1% weekly incidence, which can give different results than RA_p(1%). When shedding duration is roughly one week, we expect these metrics to be approximately the same. When shedding duration is longer than a week, we expect prevalence to be higher than weekly incidence and vice versa.↩︎