US 7,552,116 B2
Method and system for extracting web query interfaces
Kevin Chen-Chuan Chang, Champaign, Ill. (US); Zhen Zhang, Champaign, Ill. (US); and Bin He, Urbana, Ill. (US)
Assigned to The Board of Trustees of the University of Illinois, Urbana, Ill. (US)
Filed on Aug. 06, 2004, as Appl. No. 10/913,721.
Prior Publication US 2006/0031202 A1, Feb. 09, 2006
Int. Cl. G06F 17/30 (2006.01); G06F 17/27 (2006.01)
U.S. Cl. 707/5 [707/6; 704/9]
31 Claims
 
1. A computer readable storage medium encoded with a computer program to be executed by a computer for extracting semantic information about a plurality of documents autonomously created by different sources and being accessible via a computer network, said computer readable storage medium comprising:
a tokenizer for causing the computer to generate a set of tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image from one of the plurality of autonomously created documents;
a grammar mechanism for causing the computer to derive a non-prescribed visual grammar from the set of tokens to represent a hidden syntax convention of a visual presentation; and
a best-effort parser for causing the computer to apply the derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret a maximum subset of the set of tokens,
wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or heterogeneous Web documents; and
said grammar is a five tuple <Σ, N, s, Pd, Pf> where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
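
For illustration only, the five-tuple grammar and the best-effort parsing idea recited in claim 1 might be sketched as follows. This is a toy under stated assumptions, not the patented implementation: the names Token, Production, Grammar, left_of, above, and best_effort_parse, the pixel thresholds, and the sample tokens are all hypothetical, and the conflict-resolution step is a simplification in which an instance is discarded when it shares a token with an instance that a preference rule ranks higher.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Token:
    kind: str        # terminal symbol, e.g. "text" or "textbox"
    x: int           # left edge in the rendered page (hypothetical pixels)
    y: int           # top edge in the rendered page
    label: str = ""  # visible text, if any


@dataclass(frozen=True)
class Production:
    head: str                                   # nonterminal being produced
    body: Tuple[str, ...]                       # terminal kinds, in order
    constraint: Callable[[Token, Token], bool]  # visual (spatial) pattern


@dataclass(frozen=True)
class Grammar:
    terminals: FrozenSet[str]                 # Sigma
    nonterminals: FrozenSet[str]              # N
    start: str                                # s, a member of N
    productions: Tuple[Production, ...]       # Pd: visual patterns
    preferences: Tuple[Tuple[str, str], ...]  # Pf: (winner, loser) heads


def left_of(a: Token, b: Token) -> bool:
    # Assumed pattern: same row (within 5 px) and a lies to the left of b.
    return abs(a.y - b.y) <= 5 and a.x < b.x


def above(a: Token, b: Token) -> bool:
    # Assumed pattern: same column (within 5 px) and a sits up to 30 px above b.
    return abs(a.x - b.x) <= 5 and 0 < b.y - a.y <= 30


def best_effort_parse(tokens: List[Token], grammar: Grammar):
    # 1. Instantiate every production on every ordered token tuple that fits.
    candidates = []
    for p in grammar.productions:
        for combo in permutations(tokens, len(p.body)):
            if tuple(t.kind for t in combo) == p.body and p.constraint(*combo):
                candidates.append((p.head, combo))

    # 2. Apply preference rules: drop an instance that shares a token with
    #    an instance whose head is preferred over its own.
    def loses(c) -> bool:
        return any(
            (w[0], c[0]) in grammar.preferences and set(w[1]) & set(c[1])
            for w in candidates
        )

    kept = [c for c in candidates if not loses(c)]

    # 3. Best effort: return the partial trees plus any uninterpreted tokens
    #    instead of rejecting the whole input.
    covered = {t for _, combo in kept for t in combo}
    leftover = [t for t in tokens if t not in covered]
    return kept, leftover


GRAMMAR = Grammar(
    terminals=frozenset({"text", "textbox"}),
    nonterminals=frozenset({"QueryInterface", "LeftLabel", "AboveLabel"}),
    start="QueryInterface",
    productions=(
        Production("LeftLabel", ("text", "textbox"), left_of),
        Production("AboveLabel", ("text", "textbox"), above),
    ),
    preferences=(("LeftLabel", "AboveLabel"),),
)

TOKENS = [
    Token("text", 0, 20, "Author"),
    Token("textbox", 80, 20),
    Token("text", 80, 0, "e.g. Knuth"),
    Token("text", 0, 60, "Title"),
    Token("textbox", 80, 60),
]

trees, leftover = best_effort_parse(TOKENS, GRAMMAR)
```

In this sample, the text "e.g. Knuth" and the text "Author" both compete for the first textbox. The preference rule favors the left-hand label, so the sketch returns two LeftLabel instances and reports "e.g. Knuth" as uninterpreted rather than failing the parse.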