US 7,617,231 B2
Data hashing method, data processing method, and data processing system using similarity-based hashing algorithm
Hwa Shin Moon, Daejeon (Korea, Republic of); Sungwon Yi, Seoul (Korea, Republic of); Jintae Oh, Daejeon (Korea, Republic of); Jong Soo Jang, Daejeon (Korea, Republic of); and Changhoon Kim, Seoul (Korea, Republic of)
Assigned to Electronics and Telecommunications Research Institute, Daejeon (Korea, Republic of)
Filed on Dec. 06, 2006, as Appl. No. 11/634,731.
Claims priority of application No. 10-2005-0119074 (KR), filed on Dec. 07, 2005; and application No. 10-2006-0064012 (KR), filed on Jul. 07, 2006.
Prior Publication US 2007/0130188 A1, Jun. 07, 2007
Int. Cl. G06F 17/00 (2006.01)
U.S. Cl. 707—101  [707/102; 707/103 Z; 707/104.1] 23 Claims
1. A data hashing method using a similarity-based hashing (SBH) algorithm, the data hashing method comprising:
receiving computerized data; and
generating a hash value of the computerized data using the SBH algorithm in which two data are the same if calculated hash values are the same and two data are similar if the difference of calculated hash values is small,
wherein the hash value has at least two variable values that allow for a quick search of the computerized data for determining if the two data are similar,
wherein the generating of the hash value of the computerized data using the SBH algorithm comprises:
calculating a fingerprint value from the content of the computerized data;
changing a component value of an Nth-order hash vector to correspond to the fingerprint value according to a predetermined rule;
determining whether the entire amount of the content of the computerized data has been processed; and
if it is determined that the entire amount of the content of the computerized data has been processed, converting the Nth-order hash vector to the hash value, and
wherein the calculating of the fingerprint value comprises:
extracting a shingle, which is a continuous or discontinuous byte-string having a predetermined length, from the computerized data; and
generating a fingerprint value using a data hashing algorithm which satisfies uniformity and randomness criteria for the shingle and has a low possibility of collision.
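The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not specify the "predetermined rule", the vector-to-hash conversion, the shingle length, or the fingerprint function, so every such choice below is an assumption: contiguous 4-byte shingles, BLAKE2b as the fingerprint hash, a simhash-style sign-counting rule for updating the N-component vector, and a single N-bit integer as the final hash value compared by Hamming distance. The sketch does not model the "at least two variable values" limitation. The names shingles, fingerprint, sbh_sketch, and hamming are hypothetical.

```python
import hashlib


def shingles(data: bytes, k: int = 4):
    """Yield every contiguous k-byte substring (shingle) of data.

    Assumption: contiguous shingles of fixed length k; the claim also
    allows discontinuous byte-strings, which are not modeled here.
    """
    for i in range(max(len(data) - k + 1, 0)):
        yield data[i:i + k]


def fingerprint(shingle: bytes, n: int = 64) -> int:
    """Map a shingle to an n-bit integer (n must be a multiple of 8).

    Assumption: BLAKE2b stands in for "a data hashing algorithm which
    satisfies uniformity and randomness criteria".
    """
    digest = hashlib.blake2b(shingle, digest_size=n // 8).digest()
    return int.from_bytes(digest, "big")


def sbh_sketch(data: bytes, n: int = 64, k: int = 4) -> int:
    """Build an n-component vector from shingle fingerprints, then
    collapse it to an n-bit hash value.

    Assumed rule (simhash-style, not taken from the claim): for each
    fingerprint, add 1 to component i if bit i is set, else subtract 1.
    """
    vector = [0] * n
    for sh in shingles(data, k):
        fp = fingerprint(sh, n)
        for i in range(n):
            vector[i] += 1 if (fp >> i) & 1 else -1

    # All content has been processed: convert the vector to a hash value.
    # Assumed conversion: bit i is 1 when component i is positive.
    value = 0
    for i, component in enumerate(vector):
        if component > 0:
            value |= 1 << i
    return value


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")
```

Under these assumptions, identical inputs always produce identical hash values, and inputs that share most of their shingles tend to produce hash values with a small Hamming distance, for example when comparing sbh_sketch(a) and sbh_sketch(b) via hamming().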