## Sequencing errors: where do they come from?

Antibody repertoire sequencing relies primarily on the Illumina MiSeq sequencer due to the relatively low error rate, high throughput (~30M reads per run), and sufficiently long reads to cover the variable region of antibodies (two overlapping reads of 300nt each). However, despite these desirable qualities, the average base error rate of the MiSeq is 1% (1), primarily in the form of substitution errors, which means that there will be an expected 3-4 errors over the typical ~360nt variable region of the antibody. While this is a simplification of the MiSeq error rate (e.g., positional dependence, etc), it is a useful simplification.

Knowing that errors are likely to occur in each read, different approaches have been used to process reads in the literature:

- Do nothing. Use all reads assuming errors are minor, collapsing unique reads.
- Collapse unique reads and retain all above a global threshold, e.g., min abundance of 2.
- Perform clustering based error correction (Our approach, explained in more detail below)

While approach #1 and #2 seem similar, #2 can be interpreted as the simplest form of error correction: sequences observed independently multiple times are unlikely to have random errors occurring at the same positions and are thus likely correct. Unfortunately, method #2 retains only a small fraction of reads, thus is very wasteful, while method #1 ignores a very real problem for characterizing true diversity and not sequencing artifacts.

The figure to the right shows the expected distribution of errors over a sequence of 360nt at a probability of 0.01 (computed as a binomial distribution). This distribution further exemplifies why approach #1 is not reasonable; as only a small number of reads are expected to be error free.

## Effect of errors on antibody repertoire sequencing

#### Common approaches to handling errors

^{TM}pipeline.

Briefly, Reptor^{TM} performs error correction using a Hamming graph to cluster similar reads (3). Every unique read is a node in the Hamming graph HG=(V,E), and an edge is drawn between two nodes u,v in V if and only if HammingDist(u,v) <= tau. The tau parameter then controls how connected HG becomes. Once a graph is constructed, dense subgraphs are identified, and used to partition the reads. The consensus sequence is then taken on all reads within a partition to remove errors.

- 68,639 unique reads.
- 8,235 unique reads with min abundance 2 (only 12% of the total input reads).
- 9,105 unique clusters (consensus reads) with min abundance 2 (8,522 at the correct abundance of 10).

The figure to the right shows the distribution of counts for two of the methods (#1 and #3). Doing nothing retains all reads, but nearly all are (incorrectly) singletons. Performing a min abundance cutoff reduces the number of errors, but throws out 88% of the reads. Method #3 (using parameter tau=5) retains 94% of the total reads, with 88.4% at the correct abundance (abundance of 10).

#### Clustering fidelity

The table below shows the effect of the parameter tau on the Reptor^{TM} Hamming graph approach, where tau=0 corresponds to collapsing unique reads only (i.e., approach #1 from the previous section). This shows that from tau=5 onward, the partitions do not change; suggesting that effectively few additional reads are clustered together and no additional gains are realized. It also shows that on this uniform dataset, over-correction does not appear to occur.

Method | Jaccard | FM | Inflation index |
---|---|---|---|

k=1 | 0 | 0 | 0 |

k=n | 0 | 0 | 10 |

tau=0 | 0.121 | 0.34 | 6.858 |

tau=1 | 0.373 | 0.611 | 4.472 |

tau=2 | 0.573 | 0.757 | 2.829 |

tau=3 | 0.735 | 0.858 | 2.059 |

tau=5 | 0.816 | 0.903 | 1.743 |

tau=7 | 0.826 | 0.909 | 1.7 |

tau=10 | 0.826 | 0.909 | 1.693 |

The table above includes two dummy rows of k=n (i.e., clustering all reads in their own cluster), and k=1 (i.e., clustering all reads in a single cluster) to show the effects of the statistics in these edge cases.

## Other approaches to error correction

UMIs must be generated uniformly randomly to reduce the possibility of collisions. Synthesizing long stretches of random nucleotides from a uniform distribution can be difficult, and sometimes spacers need be incorporated, increasing the amplicon size. Furthermore, dimerization between two distinct UMIs can create chimera products that are difficult to detect. Finally, PCR and sequencing errors can accumulate in the UMI which then may require clustering on the UMI sequence (with a Hamming graph maybe?) to prevent loss of sequences (4). A full comparison between the Hamming graph error correction approach and UMIs is beyond the scope of this blog post, but certainly either approach is preferable to ignoring artifactual diversity in repertoires.

## References

(2) Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2011. (PubMed: 22199392)

(3) Y. Safonova, S. Bonissone, E. Kurpilyansky, E. Starostina, A. Lapidus, J. Stinson, L. DePalatis, W. Sandoval, J. Lill, and P. A. Pevzner. IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics, 31(12):53–61, Jun 2015 (PubMed: 26072509)

(4) Smith, Tom, Andreas Heger, and Ian Sudbery. “UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy.” Genome research 27.3 (2017): 491-499. (PubMed: 28100584)

We discussed the overall antibody repertoire sequencing process in a previous blog post. Sample preparation, and RNA quality in particular, are critical for accurate antibody repertoire sequencing, which we discussed in yet another blog post.