Now that our blog is up, we’re taking the opportunity to post some written updates on the work our team has done over the past ~6 months. We’re hoping to post similar updates roughly quarterly. Since this post covers a longer period, it’s a bit longer than we expect future ones will be. If anything here is particularly interesting or if you’re working on similar problems, please reach out!
Wastewater Sequencing
In the fall & winter we partnered with CDC’s Traveler-based Genomic Surveillance program and Ginkgo Biosecurity to collect and sequence paired wastewater samples of aggregated airplane lavatory waste and municipal treatment plant influent. Initial sequencing is complete, and we have banked nucleic acids for additional sequencing. We have continued processing weekly treatment plant samples and banking the nucleic acids.
Developing a good approach for extracting the nucleic acids from these samples took a lot of iteration. Wastewater is a challenging sample type, with a complex and variable composition. We experimented with different concentration methods, DNA/RNA extraction kits, dissociation reagents, and filters, looking for a protocol that would optimize for viruses relative to bacteria while giving sufficient yield, with a series of steps that would be feasible for daily processing. We also needed to adjust our protocols to handle settled solids (“primary sludge”) and airplane lavatory waste in addition to influent. We’ve published all three protocols (influent, sludge, airplane waste) to protocols.io.
We’ve sequenced a subset of these samples at MIT’s BioMicroCenter, using their standard protocols for bulk RNA library preparation. We think a more custom protocol would likely give significantly better results, and we also want a deeper understanding of, and more control over, exactly how our sequencing libraries are produced. So we’re very excited to be collaborating with experts at the Broad Institute’s Sabeti Lab on adapting some of their custom MGS protocols to wastewater. Developing in-house library prep expertise will also allow us to use the Broad’s walk-up sequencing service, which by using newer sequencers is much cheaper per read than the BMC’s offering but requires ready-to-go libraries.
We’re also collaborating with Marc Johnson and Clayton Rushford at the University of Missouri and Jason Rothman in Katrine Whiteson’s lab at the University of California, Irvine. Both are doing weekly wastewater RNA sequencing to a depth of around 2B read pairs, and we’ve received about 25B from each lab. We’ve been very happy with this data, and the opportunity to work with and learn from the very experienced folks in both labs.
Because we now have municipal wastewater RNA data from multiple different groups, we’re in a good position to compare protocols. While we don’t have this as formalized as we’d like, we think of a protocol as working well if we see a large fraction of reads (“relative abundance”) from human-infecting viruses, and good coverage of the various kinds of viruses. The current best NAO protocol gives results in the same range as the best of the other groups, but includes an expensive ribosomal RNA depletion step. Since we’ve now seen good data generated without depletion we think we should be able to iterate on our protocol to remove the need for it.
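To make the “relative abundance” and coverage ideas above concrete, here is a minimal Python sketch. It assumes each read has already been assigned a taxon label. The function name, the labels, and the toy counts are hypothetical, made up for illustration, and are not taken from our pipeline or from any real dataset.

```python
from collections import Counter


def relative_abundance(read_taxa, target_taxa):
    """Return (fraction of reads assigned to target_taxa,
    number of distinct target taxa seen at least once)."""
    if not read_taxa:
        raise ValueError("need at least one read")
    counts = Counter(read_taxa)
    # Counter returns 0 for labels that never appear, so absent targets are safe.
    hits = sum(counts[t] for t in target_taxa)
    distinct = sum(1 for t in target_taxa if counts[t] > 0)
    return hits / len(read_taxa), distinct


# Toy input (fabricated labels, not real data):
# 10,000 reads, 3 of them assigned to human-infecting viruses.
reads = ["bacteria"] * 9997 + ["norovirus", "norovirus", "rhinovirus"]
human_viruses = {"norovirus", "rhinovirus", "influenza A"}
ra, kinds = relative_abundance(reads, human_viruses)
print(f"relative abundance: {ra:.1e}, distinct viruses seen: {kinds}")
```

Comparing two protocols on these terms would mean running something like this on data from each and looking at both numbers together, since a protocol could score well on one and poorly on the other.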
Pooled Individual Sequencing
We’re starting a new effort to collect and sequence pooled nasal swab samples. We’ll be going to busy public places like mass transit stations and asking for volunteers to swab their noses. We don’t know the best times and places to visit, whether people will need to be compensated (and if so how much), or even whether nasal swabs are the ideal sample type (we’re also planning to compare throat swabs, mouth swabs, and saliva), so initially we’re planning a lot of small collection runs to get an idea of what works. We now have approval to begin sampling, and our first collection run is planned for tomorrow.
Reviewing the literature [2024-07-04: blog post], we expect these samples to have a much higher relative abundance of respiratory viruses, enough that unlike with wastewater we’re now guessing the primary constraint will be getting enough participants and not the depth of sequencing. This makes long-read Nanopore sequencing very attractive since it’s cheaper per run and we can do it in the lab on our timeline instead of relying on an external partner. While we haven’t previously worked with Nanopore, others in the Sculpting Evolution lab have, and we’re excited to be learning from them and building up our own experience.
Other Sampling Strategies
Over the past few years we’ve put a lot of work into understanding the relative promise of different sample types for pathogen-agnostic early detection. We’ve recently prioritized rounding out and publishing this work, including:
A white paper setting out a framework for comparing different sampling strategies for early detection of stealth biothreats.
A detailed review of air sampling for viral biosurveillance, including sources of airborne viruses, suitable air sampling mechanisms, and promising locations for air sampling of viruses. This supersedes our earlier report.
A blog post comparing the expected relative abundance of SARS-CoV-2 in metagenomic sequencing of respiratory swabs to that in municipal wastewater [2024-07-04: blog post].
Additional documents on swabs and saliva sampling which we’re hoping to post when they’re ready.
Nucleic Acid Tracers
One of our first projects, started in early 2022, was to develop a collection of virus-like barcoded tracers for use in ‘deposition’ experiments. The tracers can be deposited into the sewer system (for example, by flushing down a toilet) and then measured in wastewater samples to understand sewage dynamics and calibrate wastewater detection systems. In late 2023, we published a preprint describing the creation and characterization of these tracers, including showing that they are harmless to people and the environment. Regulatory review has been a complex and slow process, but we hope to receive approval later this year to use these tracers in deposition experiments. If you’re interested in working with these tracers, please get in touch.
Analysis of Sequencing Data
On the computational side, a major effort over the past few months has been redesigning and reimplementing the metagenomic sequencing pipeline we use to get an overall understanding of sequencing data. This is what takes raw short-read data, removes sequencing artifacts, and assigns individual reads to taxonomic nodes. The first version of our pipeline was something we put together relatively quickly, gluing together tools with custom Python and bash, and wasn’t designed to scale beyond a single machine. The new version is built on top of Nextflow, and has involved carefully comparing tooling options for each stage of the pipeline.
We’ve also been developing a pipeline to flag reads that could be genetically engineered. This is a component of our Near-Term First effort (see below) and looks for reads where part is a good match for a human-infecting virus and part is not. It still generates too many false positives to put into production, but the rate is decreasing, and we have a bunch of ideas we’re trying out to reduce it further.
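The core idea of “part is a good match and part is not” can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual pipeline: it assumes some aligner has already reported which span of each read matches a human-infecting virus, and the `ReadAlignment` type, the function name, and the 40-base thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadAlignment:
    """Where a read aligns to a human-infecting virus reference (hypothetical type)."""
    read_length: int
    aligned_start: int  # 0-based, inclusive, position on the read
    aligned_end: int    # exclusive, position on the read


def is_candidate_chimera(aln, min_match=40, min_nonmatch=40):
    """Flag reads with a long viral match AND a long unmatched flank.

    Thresholds are arbitrary placeholders chosen for illustration.
    """
    matched = aln.aligned_end - aln.aligned_start
    # Length of the longer unaligned flank on either side of the match.
    unmatched = max(aln.aligned_start, aln.read_length - aln.aligned_end)
    return matched >= min_match and unmatched >= min_nonmatch


half_viral = ReadAlignment(read_length=150, aligned_start=0, aligned_end=80)
fully_viral = ReadAlignment(read_length=150, aligned_start=0, aligned_end=150)
print(is_candidate_chimera(half_viral), is_candidate_chimera(fully_viral))
```

A filter this simple would flag many benign reads, such as sequencing artifacts or natural variation missing from the reference, which is one way to see why false positives are the main challenge.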
We’re also collaborating with Willie Neiswanger and Oliver Liu at the University of Southern California, and Ryan Teo in Nicole Wheeler’s lab at the University of Birmingham. The two groups are taking different angles on the problem of interpreting and modeling metagenomic sequencing data to identify concerning sequences. We’re sharing wastewater sequencing data with them for development and meeting with them to give context on how the data is generated. We see the development of computational tools that can flag suspicious sequencing reads as really important and also really parallelizable. If you’re interested in collaborating with us on this problem, please reach out.
Last year’s work on estimating relative abundance has continued in the background as we prepare our preprint for publication. With our recent sequencing data we now have relative abundance information for many more pathogens, but turning this into estimates of RAi(1%), the predicted relative abundance when 1% of people became infected in the last week, depends on assembling good public health estimates of incidence for each pathogen. We have enough urgent work that we’re not planning to gather those estimates this quarter, but this is another place where we’d be excited to collaborate.
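To show why incidence estimates are the missing ingredient, here is a back-of-the-envelope Python sketch. It assumes relative abundance scales linearly with weekly incidence, which is a simplification made for this illustration and not a description of the statistical model in the preprint. The function name and the example numbers are hypothetical.

```python
def ra_at_one_percent(observed_ra, weekly_incidence):
    """Scale an observed relative abundance to a 1% weekly incidence.

    Assumes relative abundance is proportional to weekly incidence
    (an illustrative simplification, not the preprint's model).
    weekly_incidence is the fraction of people infected in the last week.
    """
    if not 0 < weekly_incidence <= 1:
        raise ValueError("weekly_incidence must be in (0, 1]")
    return observed_ra * 0.01 / weekly_incidence


# Made-up numbers: a pathogen seen at 2e-7 relative abundance
# while 0.2% of people were infected in the past week.
estimate = ra_at_one_percent(observed_ra=2e-7, weekly_incidence=0.002)
print(f"RAi(1%) under linear scaling: {estimate:.1e}")
```

Without a trustworthy value for `weekly_incidence`, the observed relative abundance alone cannot be converted, which is why each pathogen needs its own public health estimate.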
Cost Modeling
In the first months of 2024, we made progress on three projects with the goal of understanding the cost of detecting “stealth” pandemics via metagenomic sequencing:
We mathematically analyzed the shape of the cost-sensitivity curve of detection. This curve represents how expensive it would be to detect a pandemic as a function of the fraction of people infected (“cumulative incidence”) at the time you raise the alarm. This expanded on past work, clarifying assumptions, incorporating noise associated with sample collection and sequencing, and comparing sampling frequency and methodology. You can read our results starting from the NAO Cost Estimate summary post.
We conducted a theoretical analysis to understand the potential advantages of monitoring air travelers arriving into a community over monitoring individuals in that community, for the purposes of detecting an emerging pandemic. We used deterministic simulations of a pandemic spreading via air travel, under the assumption that detection requires reaching a threshold cumulative incidence among the monitored group (either incoming travelers or the local population). Preliminary results support our intuition that the advantage of monitoring travelers is more pronounced for faster-spreading pathogens and communities that receive fewer daily arrivals per capita. We’re not yet ready, however, to draw firm conclusions about the advantages and disadvantages of airport monitoring for real-world detection.
We wrote a simplified in-browser simulator for comparing the cost and efficacy of different approaches to sampling and sequencing (blog post).
Organizational Updates
We’ve recently grouped the team internally into two sections: Robust Detection (led by Mike McLaren) and Near-Term First (led by Jeff Kaufman). This reflects a trade-off between figuring out how to build a system that reliably detects any sort of stealth pandemic and getting a system up and running as quickly as possible even if it has significant coverage gaps. We’re planning to allocate roughly ¾ of our efforts to Near-Term First until we have that initial system up and running.
Our Research Technician, Ari Machtinger, is leaving the NAO to start graduate school at the University of Wisconsin-Madison. He’ll be in Dave O’Connor’s lab, where he’s hoping to continue working in environmental pathogen surveillance. We’re sad to lose him, but also very excited for his next steps! This also means we’re now looking for a new wet-lab hire, which could be another Research Technician or a Research Scientist. We’ll have job postings out soon [2024-07-04: specialist post, scientist post], but in the meantime, if you know anyone who is really into sequencing please point them our way.
We recently finished a hiring round for a Bioinformatics Research Scientist (job description), but didn’t end up making a hire. We’re not planning another round at the moment, focusing instead on filling our wet-lab opening, but if this is a role that would be a great fit for you we’d encourage you to submit a general application.