US 11,755,571 B2
Customized data scanning in a heterogeneous data storage environment
Jie Huang, Shanghai (CN); Zelin Yan, Shanghai (CN); Fuyuan Kang, Shanghai (CN); and Gaoyuan Wang, San Jose, CA (US)
Assigned to PayPal, Inc., San Jose, CA (US)
Appl. No. 16/768,525
Filed by PAYPAL, INC., San Jose, CA (US)
PCT Filed May 13, 2020, PCT No. PCT/CN2020/090021
§ 371(c)(1), (2) Date May 29, 2020,
PCT Pub. No. WO2021/226875, PCT Pub. Date Nov. 18, 2021.
Prior Publication US 2023/0058870 A1, Feb. 23, 2023
Int. Cl. G06F 16/00 (2019.01); G06F 16/245 (2019.01); G06F 16/28 (2019.01); G06F 16/25 (2019.01)
CPC G06F 16/245 (2019.01) [G06F 16/25 (2019.01); G06F 16/285 (2019.01)]
20 Claims
 
1. A system, comprising:
a non-transitory memory; and
one or more hardware processors coupled with the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising:
receiving, from a device, a request for classifying data within a first data storage;
determining a first set of criteria for scanning the first data storage based on the request, wherein the determining the first set of criteria comprises selecting, from a plurality of data records within the first data storage, a first subset of data records;
performing a first scan on the first data storage according to the first set of criteria, wherein the performing the first scan comprises scanning the first subset of data records without scanning other data records in the first data storage, identifying, within a first data record in the first subset of data records, a secondary key value corresponding to a primary key of a second data record in a second data storage, and analyzing the second data record in the second data storage;
deriving a data fingerprint for the first data storage based on the first scan, wherein the data fingerprint indicates a percentage of the data within the first data storage corresponding to a first data type;
determining whether to perform a second scan on the first data storage according to a second set of criteria based on the derived data fingerprint, wherein the second set of criteria indicates a second subset of data records from the plurality of data records being different from the first subset of data records; and
classifying the data within the first data storage as a first data classification based at least in part on the first scan and/or the second scan.
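
Illustrative sketch (not part of the claim): the operations recited in claim 1 can be read as a sample-then-decide loop. The Python below is a minimal, non-authoritative example under stated assumptions. The function names, the `profile_id` secondary key field, the email pattern used as the "first data type", the `low`/`high`/`cutoff` thresholds, and the "sensitive"/"non-sensitive" labels are all hypothetical and do not come from the patent. The fingerprint here is only an estimate computed from the records actually scanned.

```python
import re

# Assumption: the "first data type" is an email-like string (hypothetical detector).
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def scan(records, indices, second_storage, fk_field="profile_id"):
    """Scan only the records at `indices`, skipping all others.

    If a record carries a secondary key value (fk_field) that matches a
    primary key in `second_storage`, the linked record is analyzed too.
    Returns the number of scanned records that contain an email-like value.
    """
    hits = 0
    for i in indices:
        record = records[i]
        values = [v for k, v in record.items() if k != fk_field]
        linked = second_storage.get(record.get(fk_field))
        if linked is not None:
            values.extend(linked.values())
        if any(isinstance(v, str) and EMAIL.match(v) for v in values):
            hits += 1
    return hits


def fingerprint(hits, scanned):
    """Percentage of scanned records matching the first data type."""
    return 100.0 * hits / scanned if scanned else 0.0


def classify(records, second_storage, first_subset,
             low=10.0, high=60.0, cutoff=35.0):
    """Return (label, percentage, second_scan_performed).

    A second scan over a different subset runs only when the first
    fingerprint falls in the inconclusive band [low, high).
    """
    hits = scan(records, first_subset, second_storage)
    scanned = len(first_subset)
    pct = fingerprint(hits, scanned)

    second_scan = low <= pct < high
    if second_scan:
        already = set(first_subset)
        second_subset = [i for i in range(len(records)) if i not in already]
        hits += scan(records, second_subset, second_storage)
        scanned += len(second_subset)
        pct = fingerprint(hits, scanned)

    label = "sensitive" if pct >= cutoff else "non-sensitive"
    return label, pct, second_scan
```

In this sketch the second subset is simply every record not covered by the first scan. The claim only requires that the second subset differ from the first, so a smaller or differently targeted sample would fit equally well.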