US 7,558,792 B2
Automatic extraction of human-readable lists from structured documents
Eric A. Bier, Palo Alto, Calif. (US)
Assigned to Palo Alto Research Center Incorporated, Palo Alto, Calif. (US)
Filed on Jun. 29, 2004, as Appl. No. 10/879,843.
Prior Publication US 2005/0289456 A1, Dec. 29, 2005
Int. Cl. G06F 17/30 (2006.01)
U.S. Cl. 707/6
17 Claims
 
1. A computer-controlled method for extracting a human readable list from a document, said method comprising steps of:
accessing a file, said file containing data that represents a portion of said document, said data formatted in accordance with a document formatting description (DFD);
parsing said data, the parsing including:
making corrections to said data when said data is formed incorrectly with respect to the DFD;
identifying a plurality of tags in said data;
generating a plurality of container tokens, each of said plurality of container tokens corresponding to one of said plurality of tags; and
generating a plurality of textual tokens for said data not identified as one of said plurality of tags;
determining a set of tag paths, each of the tag paths being associated with at least one of the plurality of textual tokens;
determining tag paths of interest, the step of determining tag paths of interest including:
collecting all of the tag paths of the set of tag paths;
removing duplicate tag paths from the collected tag paths to generate a full tag path set for the document;
determining the number of textual tokens for each tag path of the full tag path set; and
finding tag paths of interest in the full tag path set, each of the tag paths of interest satisfying a predetermined criteria;
determining a context for each tag path of interest, said context defined by the tag path of interest and a matching pair of said container tokens within the tag path of interest;
determining contiguous lists, the step of determining contiguous lists including:
iterating through tokens of the tag paths of interest to find at least one locatable list;
for each locatable list, accumulating text tokens, the step of accumulating text tokens comprising:
identifying a first textual token in the respective tag path of interest;
accumulating the first textual token into a contiguous list;
identifying a second textual token having the same context as the first textual token; and
accumulating the second textual token into the contiguous list;
determining a separator pattern between one of said plurality of textual tokens and an adjacent textual token where both said one of said plurality of textual tokens and said adjacent textual token have said context, the step of determining a separator pattern including:
creating the separator pattern based on tokens between the first textual token and the second textual token, the step of creating a separator pattern including:
discarding white space in the tokens between the first textual token and the second textual token; and
discarding tag attributes in the tokens between the first textual token and the second textual token; and
checking for another occurrence of the separator pattern following the second textual token; and
for each occurrence of the separator pattern, extracting one or more of said plurality of textual tokens, wherein each of the extracted one or more of said plurality of textual tokens have said context, the step of extracting one or more of said plurality of textual tokens including:
extracting a subsequent text token associated with the occurrence of the separator pattern and accumulating the subsequent textual token into the contiguous list if the subsequent textual token is separated from the previous textual token by only the occurrence of the separator pattern; and
terminating the accumulating text tokens for the current locatable list if the subsequent textual token is not separated from the previous textual token by only the occurrence of the separator pattern; and
presenting one or more of said plurality of textual tokens as said human readable list, the step of presenting one or more of said plurality of textual tokens as said human readable list comprising at least one of:
presenting said human readable list on a display;
presenting said human readable list using audio; and
storing said human readable list in a retrievable electronic file.
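
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading under simplifying assumptions: the input is already well-formed HTML (the correction step is omitted), the "context" of a textual token is reduced to its tag path, and the "predetermined criteria" is taken to be a minimum number of textual tokens per tag path. The names Tokenizer, tokenize, paths_of_interest, extract_lists and min_count are hypothetical and do not appear in the patent.

```python
from collections import Counter
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Turns markup into container tokens (tags) and textual tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # ("open"|"close", tag) or ("text", text, tag_path)
        self.stack = []   # currently open tags, outermost first

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.tokens.append(("open", tag))  # tag attributes are discarded

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()  # white space is discarded
        if text:
            self.tokens.append(("text", text, "/".join(self.stack)))


def tokenize(markup):
    parser = Tokenizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens


def paths_of_interest(tokens, min_count=3):
    # Assumed criterion: a tag path is interesting if it holds at least
    # min_count textual tokens. The Counter also removes duplicate paths.
    counts = Counter(t[2] for t in tokens if t[0] == "text")
    return {path for path, n in counts.items() if n >= min_count}


def extract_lists(tokens, min_count=3):
    interesting = paths_of_interest(tokens, min_count)
    text_idx = [i for i, t in enumerate(tokens)
                if t[0] == "text" and t[2] in interesting]
    lists, start = [], 0
    while start + 1 < len(text_idx):
        first = text_idx[start]
        # Separator pattern: the tokens between the first and second text.
        separator = tokens[first + 1:text_idx[start + 1]]
        items, end = [tokens[first][1]], start
        # Accumulate while the next textual token shares the tag path and is
        # separated from the previous one by exactly the separator pattern.
        while (end + 1 < len(text_idx)
               and tokens[text_idx[end + 1]][2] == tokens[first][2]
               and tokens[text_idx[end] + 1:text_idx[end + 1]] == separator):
            end += 1
            items.append(tokens[text_idx[end]][1])
        if len(items) >= 3:  # the separator pattern occurred at least twice
            lists.append(items)
            start = end + 1
        else:
            start += 1
    return lists


markup = ("<html><body><h1>Fruits</h1><ul><li class='a'>Apple</li>\n"
          "<li>Banana</li><li>Cherry</li></ul><p>Done</p></body></html>")
print(extract_lists(tokenize(markup)))  # [['Apple', 'Banana', 'Cherry']]
```

In this sketch a run is kept only when the separator pattern recurs at least once after the second textual token, which mirrors the "checking for another occurrence" step. Accumulation stops as soon as the tokens between two neighbouring textual tokens differ from the pattern, so two adjacent lists under the same tag path come out as separate lists. Presenting the result on a display, by audio, or in a stored file is left to the caller.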