| US 7,552,116 B2 | ||
| Method and system for extracting web query interfaces | ||
| Kevin Chen-Chuan Chang, Champaign, Ill. (US); Zhen Zhang, Champaign, Ill. (US); and Bin He, Urbana, Ill. (US) | ||
| Assigned to The Board of Trustees of the University of Illinois, Urbana, Ill. (US) | ||
| Filed on Aug. 06, 2004, as Appl. No. 10/913,721. | ||
| Prior Publication US 2006/0031202 A1, Feb. 09, 2006 | ||
| Int. Cl. G06F 17/30 (2006.01); G06F 17/27 (2006.01) | ||
| U.S. Cl. 707/5 [707/6; 704/9] | 31 Claims |

| 1. A computer readable storage medium encoded with a computer program to be executed by a computer for extracting semantic
information about a plurality of documents autonomously created by different sources and being accessible via a computer network,
said computer readable storage medium comprising:
a tokenizer for causing the computer to generate a set of tokens indicative of document object model (DOM) nodes associated
with visual information in a displayed document image from one of the plurality of autonomously created documents;
a grammar mechanism for causing the computer to derive a non-prescribed visual grammar from the set of tokens to represent
a hidden syntax convention of a visual presentation; and
a best-effort parser for causing the computer to apply the derived visual grammar to construct multiple parse trees that represent
semantic structure of the document and interpret a maximum subset of the set of tokens,
wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents
to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or
heterogeneous Web documents; and
said grammar is a five-tuple <Σ, N, s, Pd, Pf> where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N
is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that
represent pattern precedence.
|
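The elements recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`Token`, `Instance`, `Production`, `Preference`, `VisualGrammar`, `parse`), the symbols `AttrLeft`, `AttrAbove` and `QI`, the pixel thresholds, and the sample layout are assumptions chosen for illustration. The sketch models the five-tuple <Σ, N, s, Pd, Pf> as a data structure, builds instances bottom-up from tokens that carry bounding boxes, discards instances that lose under a preference rule, and returns the remaining maximal trees together with the tokens that no tree interprets.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, FrozenSet, Tuple

@dataclass(frozen=True)
class Token:                       # one rendered DOM node (hypothetical shape)
    kind: str                      # terminal symbol, e.g. "text" or "textbox"
    left: int
    top: int
    right: int
    bottom: int

@dataclass(frozen=True)
class Instance:                    # a parse-tree node
    symbol: str
    covered: FrozenSet[Token]
    children: Tuple["Instance", ...] = ()

@dataclass(frozen=True)
class Production:                  # element of Pd: head -> body if constraint holds
    head: str
    body: Tuple[str, ...]
    constraint: Callable[..., bool]

@dataclass(frozen=True)
class Preference:                  # element of Pf: winner beats loser on shared tokens
    winner: str
    loser: str

@dataclass(frozen=True)
class VisualGrammar:               # the five-tuple <Sigma, N, s, Pd, Pf>
    terminals: FrozenSet[str]
    nonterminals: FrozenSet[str]
    start: str
    productions: Tuple[Production, ...]
    preferences: Tuple[Preference, ...]

def bbox(inst):
    ts = inst.covered
    return (min(t.left for t in ts), min(t.top for t in ts),
            max(t.right for t in ts), max(t.bottom for t in ts))

# Spatial constraints; the pixel thresholds are arbitrary assumptions.
def left_of(a, b):
    (_, at, ar, _), (bl, bt, _, _) = bbox(a), bbox(b)
    return abs(at - bt) <= 5 and 0 <= bl - ar <= 20

def above(a, b):
    (al, _, ar, ab), (bl, bt, br, _) = bbox(a), bbox(b)
    return 0 <= bt - ab <= 20 and al < br and bl < ar

def stacked(a, b):
    (al, _, _, ab), (bl, bt, _, _) = bbox(a), bbox(b)
    return 0 <= bt - ab <= 20 and abs(al - bl) <= 5

def parse(grammar, tokens):
    """Return (maximal partial trees, tokens left uninterpreted)."""
    instances = {Instance(t.kind, frozenset([t])) for t in tokens}
    changed = True
    while changed:                 # bottom-up fixpoint over the productions
        changed = False
        for p in grammar.productions:
            pools = [[i for i in instances if i.symbol == s] for s in p.body]
            for combo in product(*pools):
                covered = frozenset().union(*(c.covered for c in combo))
                if len(covered) != sum(len(c.covered) for c in combo):
                    continue       # components must not share tokens
                if not p.constraint(*combo):
                    continue
                new = Instance(p.head, covered, combo)
                if new not in instances:
                    instances.add(new)
                    changed = True
    # Apply preferences: a loser that shares a token with a winner is dropped.
    losers = {b for a in instances for b in instances for pf in grammar.preferences
              if a.symbol == pf.winner and b.symbol == pf.loser and a.covered & b.covered}
    def tainted(i):
        return i in losers or any(tainted(c) for c in i.children)
    kept = [i for i in instances if not tainted(i)]
    used = {c for i in kept for c in i.children}
    roots = [i for i in kept if i.symbol in grammar.nonterminals and i not in used]
    roots.sort(key=lambda i: (-len(i.covered), i.symbol))
    interpreted = frozenset().union(*(r.covered for r in roots))
    return roots, [t for t in tokens if t not in interpreted]

# Hypothetical layout: a heading, two stacked label/box rows, one distant row.
heading = Token("text", 60, 0, 160, 20)
author, author_box = Token("text", 0, 25, 50, 45), Token("textbox", 60, 25, 160, 45)
title, title_box = Token("text", 0, 55, 50, 75), Token("textbox", 60, 55, 160, 75)
zip_lbl, zip_box = Token("text", 300, 200, 330, 220), Token("textbox", 340, 200, 440, 220)
grammar = VisualGrammar(
    terminals=frozenset({"text", "textbox"}),
    nonterminals=frozenset({"AttrLeft", "AttrAbove", "QI"}),
    start="QI",
    productions=(Production("AttrLeft", ("text", "textbox"), left_of),
                 Production("AttrAbove", ("text", "textbox"), above),
                 Production("QI", ("AttrLeft", "AttrLeft"), stacked)),
    preferences=(Preference("AttrLeft", "AttrAbove"),),
)
trees, leftover = parse(
    grammar, [heading, author, author_box, title, title_box, zip_lbl, zip_box])
print([t.symbol for t in trees], leftover)
```

In this layout the heading sits directly above the author text box, so it could be read as that box's label. The preference rule resolves the conflict in favor of the label to the left. The parser therefore returns two trees, a `QI` covering the two stacked rows and a separate `AttrLeft` for the distant row, and reports the heading as uninterpreted. The sketch does not resolve overlaps between the returned trees and does not guarantee that the interpreted subset is maximum; it only illustrates the best-effort idea of keeping partial results rather than rejecting the whole input.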