US 9,811,438 B1
Techniques for processing queries relating to task-completion times or cross-data-structure interactions
Ryan Barrett, San Francisco, CA (US); Katsuya Noguchi, San Francisco, CA (US); Nishant Bhat, San Francisco, CA (US); Zhengua Li, Saratoga, CA (US); and Kurt Smith, San Francisco, CA (US)
Assigned to COLOR GENOMICS, INC., Burlingame, CA (US)
Filed by Ryan Barrett, San Francisco, CA (US); Katsuya Noguchi, San Francisco, CA (US); Nishant Bhat, San Francisco, CA (US); Zhengua Li, Saratoga, CA (US); and Kurt Smith, San Francisco, CA (US)
Filed on May 11, 2017, as Appl. No. 15/592,949.
Application 15/592,949 is a continuation of application No. 15/366,409, filed on Dec. 1, 2016, granted, now Pat. No. 9,678,794.
Claims priority of provisional application 62/262,183, filed on Dec. 2, 2015.
Int. Cl. G06F 9/46 (2006.01); G06F 11/34 (2006.01); G06F 9/48 (2006.01); G06N 99/00 (2010.01); G06F 11/30 (2006.01)
CPC G06F 11/3419 (2013.01) [G06F 9/4887 (2013.01); G06F 11/3024 (2013.01); G06N 99/005 (2013.01)]
20 Claims
 
1. A computer-implemented method for using machine learning to identify anomaly subsets of sets of iteration data, the method comprising:
accessing a structure including at least part of a definition for a workflow, the workflow including:
a first task of aligning each read of a set of reads to a portion of a reference data set, wherein the reference data set includes a reference sequence;
a second task of generating a client data set for the respective client using the aligned set of reads, the client data set including a set of values associated with each of one or more units, wherein the client data set includes a client sequence, wherein each value of the set of values represents a base, each unit of the one or more units representing a gene and corresponding to a set of defined positions within a genomic data structure; and
a third task of detecting a presence of one or more sparse indicators associated with the respective client by comparing the set of values of the client data set to corresponding values in the reference data set, each sparse indicator of the one or more sparse indicators representing a variant indicative of a distinction between the client data set and the reference data set;
for each client of a plurality of clients:
accessing a set of reads based on a material associated with a respective client, wherein the material includes a biological material;
performing an iteration of the workflow using the set of reads;
generating iteration data based on the performance of the iteration of the workflow, wherein the iteration data includes or is based on:
a result of a task in the workflow;
a time required to perform one or more tasks in the workflow; and/or
a degree of usage of a computational resource while performing one or more tasks in the workflow;
storing the iteration data in association with an identifier of the client;
collecting a set of iteration data by retrieving, for each client of the plurality of clients, at least part of the stored iteration data;
using a machine-learning technique to process the set of iteration data to identify an anomaly subset of the set of iteration data; and
generating a communication that represents the anomaly subset.
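
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (REFERENCE, align_reads, build_client_sequence, detect_variants, run_iteration, store_iteration, find_anomalies, summarize) is hypothetical, the reference sequence and reads are toy data, and the alignment is a naive fewest-mismatch search. The claim does not specify a machine-learning technique, so a simple unsupervised outlier rule (a modified z-score based on the median absolute deviation) stands in for it here.

```python
import statistics
import time

REFERENCE = "ACGTACGTTAGC"  # toy reference sequence (assumption, not from the patent)


def align_reads(reads, reference):
    """First task: place each read at the offset with the fewest mismatches."""
    aligned = []
    for read in reads:
        offsets = range(len(reference) - len(read) + 1)
        best = min(offsets,
                   key=lambda o: sum(a != b for a, b in zip(read, reference[o:])))
        aligned.append((best, read))
    return aligned


def build_client_sequence(aligned, reference):
    """Second task: overlay aligned reads on a copy of the reference.

    Simplification: positions not covered by any read keep the reference base.
    """
    seq = list(reference)
    for offset, read in aligned:
        seq[offset:offset + len(read)] = read
    return "".join(seq)


def detect_variants(client_seq, reference):
    """Third task: report (position, reference base, client base) where they differ."""
    return [(i, r, c)
            for i, (r, c) in enumerate(zip(reference, client_seq)) if r != c]


def run_iteration(client_id, reads, reference=REFERENCE):
    """Run one workflow iteration and return its iteration data."""
    start = time.perf_counter()
    aligned = align_reads(reads, reference)
    client_seq = build_client_sequence(aligned, reference)
    variants = detect_variants(client_seq, reference)
    return {"client_id": client_id,
            "variants": variants,
            "seconds": time.perf_counter() - start}


def store_iteration(store, record):
    """Store iteration data in association with the client identifier."""
    store[record["client_id"]] = record


def find_anomalies(records, key, threshold=3.5):
    """Flag records whose modified z-score on `key` exceeds `threshold`.

    Stand-in for the unspecified machine-learning technique.
    """
    if not records:
        return []
    values = [r[key] for r in records]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: treat any deviating value as anomalous.
        return [r for r in records if r[key] != med]
    return [r for r in records if 0.6745 * abs(r[key] - med) / mad > threshold]


def summarize(anomalies, key):
    """Generate a communication that represents the anomaly subset."""
    ids = ", ".join(r["client_id"] for r in anomalies)
    return f"{len(anomalies)} anomalous iteration(s) by {key}: {ids}"
```

In use, run_iteration would be called once per client, each record passed to store_iteration, and list(store.values()) handed to find_anomalies with key="seconds" to flag iterations whose task-completion time departs sharply from the rest. A measure of computational-resource usage could be added to the record and screened the same way.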