| US 7,543,006 B2 | ||
| Flexible, efficient and scalable sampling | ||
| Paul Geoffrey Brown, San Jose, Calif. (US); and Peter Jay Haas, San Jose, Calif. (US) | ||
| Assigned to International Business Machines Corporation, Armonk, N.Y. (US) | ||
| Filed on Aug. 31, 2006, as Appl. No. 11/469,231. | ||
| Prior Publication US 2008/0059540 A1, Mar. 06, 2008 | ||
| Int. Cl. G06F 17/30 (2006.01) | ||
| U.S. Cl. 707—205 [707/101; 707/102; 707/103 R; 707/104.1; 707/201] | 7 Claims |

| 1. A sampling system to maintain a warehouse of uniform random samples, said system comprising:
a partitioning module to divide values in a data set into a plurality of mutually disjoint partitioned subsets;
a plurality of sampling modules, each of said sampling modules sampling one of said partitioned subsets in an independent and parallel manner to output uniform random samples;
a sample data warehouse to store uniform random samples output from each of said sampling modules; and
wherein said independent and parallel sampling allows said sampling system to provide flexible and scalable sampling of said data set, and
wherein at least one of said sampling modules:
i) adds each arriving data value from a partitioned subset to a sample represented as a histogram,
ii) if a memory bound is exceeded, then, purges said sample to obtain a purged subsample, expands a histogram representation of said purged subsample to a bag of values while sampling remaining data values of said partitioned subset; and yields a compact and uniform random sample of said partitioned subset by converting said expanded subsample representation into a histogram representation upon receiving a last data value; else, yields a compact and complete frequency distribution sample of said partitioned subset upon receiving a last data value.
|
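The behavior recited for the sampling module in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `sample_partition` and the parameter `max_entries` are hypothetical, and the memory bound is assumed to be a cap on the number of distinct histogram entries. The purge is assumed to be a simple random sample drawn without replacement, and the remaining values are assumed to be handled by classic reservoir sampling; the claim itself names neither technique. The `counts` argument of `random.sample` requires Python 3.9 or later.

```python
import random
from collections import Counter


def sample_partition(values, max_entries, rng=None):
    """Sample one partitioned subset; return (histogram, is_complete).

    Hypothetical sketch: `max_entries` stands in for the memory bound and
    also serves as the size of the purged subsample.
    """
    rng = rng or random.Random()
    hist = Counter()  # compact sample: value -> frequency
    bag = None        # expanded subsample, used only after a purge
    seen = 0          # number of data values processed so far

    for v in values:
        seen += 1
        if bag is None:
            # Step i): add each arriving value to the histogram sample.
            hist[v] += 1
            if len(hist) > max_entries:
                # Step ii): bound exceeded. Purge to a uniform subsample of
                # max_entries values and keep it as an expanded bag of values.
                items = list(hist)
                bag = rng.sample(items, max_entries,
                                 counts=[hist[x] for x in items])
                hist = None
        else:
            # Reservoir step (assumed technique): the new value replaces a
            # random slot with probability max_entries / seen.
            j = rng.randrange(seen)
            if j < max_entries:
                bag[j] = v

    if bag is None:
        # Bound never exceeded: complete frequency distribution.
        return hist, True
    # Last value received: convert the bag back into a histogram.
    return Counter(bag), False
```

Under these assumptions, a subset with few distinct values comes back as its exact frequency distribution, while a subset that overflows the bound comes back as a fixed-size uniform sample stored as a histogram. Each partitioned subset could be passed to its own call, which matches the independent and parallel arrangement described in the claim.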