| US 7,617,231 B2 | ||
| Data hashing method, data processing method, and data processing system using similarity-based hashing algorithm | ||
| Hwa Shin Moon, Daejeon (Korea, Republic of); Sungwon Yi, Seoul (Korea, Republic of); Jintae Oh, Daejeon (Korea, Republic of); Jong Soo Jang, Daejeon (Korea, Republic of); and Changhoon Kim, Seoul (Korea, Republic of) | ||
| Assigned to Electronics and Telecommunications Research Institute, Daejeon (Korea, Republic of) | ||
| Filed on Dec. 06, 2006, as Appl. No. 11/634,731. | ||
| Claims priority of application No. 10-2005-0119074 (KR), filed on Dec. 07, 2005; and application No. 10-2006-0064012 (KR), filed on Jul. 07, 2006. | ||
| Prior Publication US 2007/0130188 A1, Jun. 07, 2007 | ||
| Int. Cl. G06F 17/00 (2006.01) | ||
| U.S. Cl. 707/101 [707/102; 707/103 Z; 707/104.1] | 23 Claims |

| 1. A data hashing method using a similarity-based hashing (SBH) algorithm, the data hashing method comprising:
receiving computerized data; and
generating a hash value of the computerized data using the SBH algorithm in which two data are the same if calculated hash values are the same and two data are similar if the difference of calculated hash values is small,
wherein the hash value has at least two variable values that allows for a quick search of the computerized data for determining if the two data are similar, wherein the generating of the hash value of the computerized data using the SBH algorithm comprises:
calculating a fingerprint value from the content of the computerized data;
changing a component value of an Nth-order hash vector to correspond to the fingerprint value according to a predetermined rule;
determining whether the entire amount of the content of the computerized data has been processed; and
if it is determined that the entire amount of the content of the computerized data has been processed, converting the Nth-order hash vector to the hash value, and
wherein the calculating of the fingerprint value comprises:
extracting a shingle, which is a continuous or discontinuous byte-string having a predetermined length, from the computerized data; and
generating a fingerprint value using a data hashing algorithm which satisfies uniformity and randomness criteria for the shingle and has a low possibility of collision.
|
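
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim leaves the "predetermined rule", the shingle length, the vector order N, and the conversion of the vector to a hash value unspecified, so every concrete choice below is an assumption made for illustration: contiguous 4-byte shingles, BLAKE2b as the fingerprint function, a rule that increments the vector component selected by `fingerprint % N`, and a hash value that is simply the tuple of the N components. The names `fingerprint`, `sbh_hash`, and `hash_distance` are hypothetical.

```python
import hashlib

SHINGLE_LEN = 4  # assumed shingle length in bytes (not given in the claim)
N = 8            # assumed order of the hash vector (not given in the claim)


def fingerprint(shingle: bytes) -> int:
    """Fingerprint one shingle with a uniform, low-collision hash.

    BLAKE2b is an assumed stand-in; the claim only requires a hashing
    algorithm meeting uniformity and randomness criteria.
    """
    digest = hashlib.blake2b(shingle, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sbh_hash(data: bytes, n: int = N, shingle_len: int = SHINGLE_LEN) -> tuple:
    """Return an illustrative similarity-based hash of `data`."""
    vector = [0] * n  # the Nth-order hash vector
    # Slide over the data until the entire content has been processed.
    for i in range(max(len(data) - shingle_len + 1, 0)):
        shingle = data[i:i + shingle_len]  # contiguous shingle (assumed)
        fp = fingerprint(shingle)
        # Assumed "predetermined rule": bump the component chosen by fp.
        vector[fp % n] += 1
    # Assumed conversion: the hash value is the tuple of components,
    # so it consists of several variable values.
    return tuple(vector)


def hash_distance(a: tuple, b: tuple) -> int:
    """L1 distance between two hash values (assumed similarity measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Under these assumptions, identical inputs always produce identical hash values. Changing a single byte alters at most `SHINGLE_LEN` shingles, each of which moves one count between components, so the L1 distance between the two hash values is at most `2 * SHINGLE_LEN`. The converse does not hold in this sketch: unrelated inputs can also land close together, so a small distance only flags candidates for similarity.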