US 11,817,180 B2
Systems and methods for analyzing nucleic acid sequences
Zheng Zhang, Pasadena, CA (US); Danwei Guo, San Mateo, CA (US); Yuandan Lou, Cupertino, CA (US); Asim Siddiqui, San Francisco, CA (US); and Dumitru Brinza, San Mateo, CA (US)
Assigned to Life Technologies Corporation, Carlsbad, CA (US)
Filed by LIFE TECHNOLOGIES CORPORATION, Carlsbad, CA (US)
Filed on Feb. 19, 2019, as Appl. No. 16/279,315.
Application 16/279,315 is a continuation of application No. 13/097,677, filed on Apr. 29, 2011, abandoned.
Claims priority of provisional application 61/406,543, filed on Oct. 25, 2010.
Claims priority of provisional application 61/330,090, filed on Apr. 30, 2010.
Prior Publication US 2020/0051663 A1, Feb. 13, 2020
Int. Cl. G16B 30/10 (2019.01); G16B 30/00 (2019.01); G16B 30/20 (2019.01)
CPC G16B 30/10 (2019.02) [G16B 30/00 (2019.02); G16B 30/20 (2019.02)]
20 Claims
 
1. A system for nucleic acid sequence assembly, comprising:
a first data storage configured to store nucleic acid sequencing data;
a second data storage configured to store reference genome data; and
a computing device in communication with the first data storage and the second data storage, wherein the computing device is configured to:
obtain a nucleic acid sequence read from the first data storage,
obtain a reference genome from the second data storage,
iteratively select anchor sequences comprising differing lengths of a contiguous portion of the nucleic acid sequence read, wherein each anchor sequence begins at a same location on the nucleic acid sequence read,
map the anchor sequence to the reference genome using an approximate string mapping method that allows for a set number of mismatches with the reference genome and produces at least one match with the reference genome, wherein the iterative selection and mapping of the anchor sequence is performed until a number of matches of the anchor sequence to the reference genome is less than a threshold number,
map a remaining portion of the nucleic acid sequence read to the reference genome using an ungapped local alignment method that produces an alignment of the remaining portion extending from the at least one match to map the nucleic acid sequence read to the reference genome,
generate an indexed binary file that stores the mapping of the anchor sequence and the remaining portion to the reference genome,
filter the indexed binary file to generate a filtered indexed binary file, the filtered indexed binary file comprising nucleic acid sequence reads that are mapped to target enriched regions in the reference genome, and
perform a resequencing analysis on the filtered indexed binary file to assemble a nucleotide sequence based on the nucleotide sequence read.
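
The anchor selection, approximate mapping, and ungapped extension steps recited in claim 1 can be illustrated with a minimal Python sketch. It is a toy reading of the claim language and not the patented implementation. The function names (`hamming_matches`, `select_anchor`, `ungapped_extend`, `map_read`), the default anchor length and step, and the match/mismatch scores are all hypothetical. A brute-force Hamming-distance scan stands in for the "approximate string mapping method", and a simple best-score extension stands in for the "ungapped local alignment method"; neither is named by the claim. The indexed binary file, filtering, and resequencing steps are not covered.

```python
from typing import List, Tuple


def hamming_matches(anchor: str, reference: str, max_mismatches: int) -> List[int]:
    """Return each reference offset where the anchor aligns with at most
    max_mismatches substitutions (brute-force stand-in, an assumption)."""
    hits = []
    for pos in range(len(reference) - len(anchor) + 1):
        mismatches = 0
        for a, r in zip(anchor, reference[pos:pos + len(anchor)]):
            if a != r:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append(pos)
    return hits


def select_anchor(read: str, reference: str, start_len: int, step: int,
                  max_mismatches: int, hit_threshold: int) -> Tuple[int, List[int]]:
    """Grow an anchor that always begins at read position 0 until its number
    of reference matches drops below hit_threshold (or the read is used up)."""
    length = min(start_len, len(read))
    while True:
        hits = hamming_matches(read[:length], reference, max_mismatches)
        if len(hits) < hit_threshold or length == len(read):
            return length, hits
        length = min(length + step, len(read))


def ungapped_extend(read: str, reference: str, ref_pos: int, anchor_len: int,
                    match: int = 1, mismatch: int = -2) -> Tuple[int, int]:
    """Extend past the anchor without gaps; return (bases added, score) for
    the best-scoring extension. Scores are illustrative assumptions."""
    score = best_score = best_len = 0
    i, j = anchor_len, ref_pos + anchor_len
    while i < len(read) and j < len(reference):
        score += match if read[i] == reference[j] else mismatch
        i += 1
        j += 1
        if score > best_score:
            best_score, best_len = score, i - anchor_len
    return best_len, best_score


def map_read(read: str, reference: str, start_len: int = 4, step: int = 2,
             max_mismatches: int = 0, hit_threshold: int = 2) -> List[Tuple[int, int, int]]:
    """Return (reference offset, aligned read length, extension score) for
    each anchor match that survives the iterative selection."""
    anchor_len, hits = select_anchor(read, reference, start_len, step,
                                     max_mismatches, hit_threshold)
    results = []
    for pos in hits:
        ext_len, score = ungapped_extend(read, reference, pos, anchor_len)
        results.append((pos, anchor_len + ext_len, score))
    return results
```

In this sketch, a short anchor that matches a repeated region in several places is lengthened until fewer than `hit_threshold` candidate positions remain, and the rest of the read is then aligned outward from each surviving position without gaps. A production mapper would replace the linear scan with an indexed search over the reference.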