Evaluating Outward Assembly on a Surprising Junction

Evan Fields

Date July 24, 2025

Thanks to Jeff Kaufman and Mike McLaren for reviewing drafts of this post.

Background

Our chimera detection system recently flagged a suspicious junction with the structure [SARS-CoV-2][synthetic construct] for manual review. After investigating, including running outward assembly, we determined this was almost surely lab contamination: the lab works with SARS-CoV-2, and the synthetic portion matches a synthetic plasmid they’ve been working with.

This investigation is atypical in that, post-investigation, we actually know the ground truth for the synthetic side of the investigated chimera. This provides a great opportunity to evaluate outward assembly both quantitatively and qualitatively: even though we don’t know how the plasmid got stuck to SARS-CoV-2 in the lab, we can test how well the synthetic side of the outward assembly contigs matches the known plasmid.

Evaluating outward assembly

Outward assembly extends an initial seed or short contig iteratively: (1) search for reads matching the current contig; (2) assemble those reads into a longer contig; (3) repeat. We seeded outward assembly with the observed chimeric junction, [13bp SARS-CoV-2][13bp synthetic construct], and as data we chose a collection of 22 billion read pairs (6.6 terabases) comprising the last two months of relevant samples¹. Outward assembly ran for three search-then-assemble iterations: the third iteration’s contig didn’t extend the second iteration’s contig, so the algorithm terminated there. On a single AWS c7a.48xlarge EC2 instance, this process took 113 minutes (roughly $19).
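The search-then-assemble loop described above can be sketched in a few lines of Python. This is a toy illustration, not the actual outward assembly implementation: the function names, the exact k-mer matching used for the search step, and the error-free merge used for the assemble step are all assumptions chosen for brevity. A real pipeline would need to tolerate sequencing errors and scale to billions of reads.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def find_matching_reads(contig, reads, k=16):
    """Search step (toy): keep reads sharing an exact k-mer with the contig,
    flipped into the contig's orientation when needed."""
    kmers = {contig[i:i + k] for i in range(len(contig) - k + 1)}
    hits = []
    for read in reads:
        for oriented in (read, reverse_complement(read)):
            if any(oriented[j:j + k] in kmers
                   for j in range(len(oriented) - k + 1)):
                hits.append(oriented)
                break
    return hits

def merge_read(contig, read, k=16):
    """Place a read on the contig via a shared k-mer and merge the two.
    Returns None if no placement agrees exactly over the whole overlap."""
    index = {contig[i:i + k]: i for i in range(len(contig) - k + 1)}
    for j in range(len(read) - k + 1):
        i = index.get(read[j:j + k])
        if i is None:
            continue
        offset = i - j  # contig coordinate of the read's first base
        start = max(0, offset)
        end = min(len(contig), offset + len(read))
        if contig[start:end] != read[start - offset:end - offset]:
            continue  # overlap disagrees; try another anchor
        return read[:max(0, -offset)] + contig + read[end - offset:]
    return None

def assemble(contig, reads, k=16):
    """Assemble step (toy): merge each read at most once until nothing fits."""
    pending = list(reads)
    progress = True
    while progress:
        progress = False
        remaining = []
        for read in pending:
            merged = merge_read(contig, read, k)
            if merged is None:
                remaining.append(read)
            else:
                contig = merged
                progress = True
        pending = remaining
    return contig

def outward_assembly(seed, reads, k=16, max_iterations=10):
    """Return the contig from each iteration that extended the previous one."""
    contigs = []
    contig = seed
    for _ in range(max_iterations):
        hits = find_matching_reads(contig, reads, k)
        new_contig = assemble(contig, hits, k)
        if len(new_contig) == len(contig):
            break  # no extension this iteration, so stop
        contig = new_contig
        contigs.append(contig)
    return contigs
```

Because the sketch only assembles reads found by the most recent search, each iteration can grow the contig by at most roughly one read length per side, which is why several iterations are needed to move far from the seed.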

We can evaluate the outward assembly contigs by comparing the synthetic portion (everything after the chimeric junction) of each contig to the known plasmid sequence.

The synthetic portion of the iteration 1 contig matched 462bp of the plasmid at 100% identity.
The synthetic portion of the iteration 2 contig matched 959bp of the plasmid at 99.4% identity.
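As a rough illustration of this comparison, the sketch below cuts a contig at the junction and scores the synthetic side against a reference sequence. The post does not say which tool produced the identity figures above, so everything here is an assumption: the function names are hypothetical, the contig is assumed to already be in the seed's orientation, and the scoring is a simple ungapped, single-strand comparison rather than a real aligner. For scale, 99.4% identity over 959bp corresponds to roughly six differing positions.

```python
def synthetic_portion(contig, seed, viral_len=13):
    """Return everything after the junction.

    Assumes the seed is [viral_len bp virus][synthetic] and that the contig
    is in the same orientation as the seed.
    """
    start = contig.find(seed)
    if start == -1:
        raise ValueError("seed not found; contig may be reverse-complemented")
    return contig[start + viral_len:]

def best_ungapped_identity(query, reference):
    """Slide the query along the reference without gaps.

    Returns (offset, matches, identity) for the best full-length placement,
    or (-1, -1, 0.0) if the query is longer than the reference.
    """
    if not query:
        raise ValueError("query is empty")
    best = (-1, -1, 0.0)
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        matches = sum(q == r for q, r in zip(query, window))
        if matches > best[1]:
            best = (offset, matches, matches / len(query))
    return best
```

A real evaluation would more likely use a gapped aligner that also checks the reverse strand and handles the plasmid being circular; this sketch only shows what "N bp at X% identity" measures.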

These are extremely promising results: we recovered a long section of the synthetic region at nearly full accuracy, and we were able to use this recovered section to confirm we had the right plasmid.

Figure 1: A SARS-CoV-2 genome (green) is attached to a plasmid (blue) in the lab, creating an unknown chimera which contaminates a few samples. Starting from a seed of 13bp on either side of the observed chimeric junction, outward assembly recovers 1371bp of the unknown chimera in two iterations. The SARS-CoV-2 and plasmid genomes extend for thousands of bases beyond the region shown in this figure.

Our qualitative experience of using outward assembly for a real-time flag investigation was somewhat mixed. On the plus side, the iterative nature of outward assembly was quite helpful: outward assembly produces a contig every iteration, so we could start evaluating the first contig after just 30 minutes rather than waiting for the entire algorithm to finish.

On the other hand, outward assembly was missing some features which would have made our investigation easier:

Orienting contigs relative to seed sequences. Flagged junctions have a natural orientation defined by the coding strand of the virus genome. But outward assembly agnostically returns contigs in either the forward or reverse complement orientation, requiring us to spend a little investigation time checking orientations and reverse complementing as needed.
Identifying samples where reads were found. It’s helpful to know which samples contributed reads used to build a contig, e.g. to assess whether the contig is supported by reads from multiple labs or locations. But outward assembly doesn’t retain sample information when collating reads.
Deduplication. Sequencing data may contain duplicate reads such as PCR or optical duplicates. We shouldn’t count duplicate reads as independent pieces of evidence for a contig, so it’d be helpful to deduplicate the set of reads used by outward assembly.
Pileup visualization. We use pileup visualizations to assess how well a collection of reads jointly supports a contig. However, our code to generate these visualizations isn’t integrated with outward assembly, so creating each pileup required manual data munging and execution in the middle of an investigation.
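Three of these gaps (orientation, sample tracking, and deduplication) are simple enough to sketch. The code below is a hypothetical illustration rather than a planned implementation: the function names and the (sample_id, read1, read2) tuple layout are assumptions, and duplicates are detected by exact sequence match within a sample, which is stricter than what dedicated deduplication tools do.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def orient_to_seed(contig, seed):
    """Return the contig in the orientation that contains the seed."""
    if seed in contig:
        return contig
    flipped = reverse_complement(contig)
    if seed in flipped:
        return flipped
    raise ValueError("seed not found in either orientation")

def deduplicate_pairs(pairs):
    """Drop exact duplicate read pairs within each sample.

    pairs: iterable of (sample_id, read1, read2) tuples (assumed layout).
    A pair and the same pair with read1 and read2 swapped count as duplicates.
    """
    seen = set()
    unique = []
    for sample_id, read1, read2 in pairs:
        key = (sample_id, *sorted((read1, read2)))
        if key not in seen:
            seen.add(key)
            unique.append((sample_id, read1, read2))
    return unique

def contributing_samples(pairs):
    """Set of sample IDs that contributed at least one read pair."""
    return {sample_id for sample_id, _, _ in pairs}
```

Carrying the sample ID alongside each read pair is what makes both the per-sample deduplication and the "which samples support this contig" question cheap to answer after the fact.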

Overall, this investigation has strongly validated the outward assembly approach; being able to build targeted contigs out of tens of billions of read pairs in under two hours, with a first contig available after about 30 minutes, was key for our investigation. We plan to address the missing features and add some automation this quarter so that future flag investigations will be much faster.

Footnotes

¹ In a typical junction investigation, the most relevant samples are those from the same locations as the samples within which we first observed the suspicious junction. Since in this case the suspicious junction came from lab contamination, the locations where samples were collected didn’t turn out to be relevant – just the lab where samples were processed. The contamination was not widespread; we ran a larger search (171 billion read pairs) and didn’t see the junction beyond the handful of samples where we first identified it, which contained 9 non-duplicate read pairs covering the junction.