| US 7,558,792 B2 | ||
| Automatic extraction of human-readable lists from structured documents | ||
| Eric A. Bier, Palo Alto, Calif. (US) | ||
| Assigned to Palo Alto Research Center Incorporated, Palo Alto, Calif. (US) | ||
| Filed on Jun. 29, 2004, as Appl. No. 10/879,843. | ||
| Prior Publication US 2005/0289456 A1, Dec. 29, 2005 | ||
| Int. Cl. G06F 17/30 (2006.01) | ||
| U.S. Cl. 707—6 | 17 Claims |

| 1. A computer-controlled method for extracting a human readable list from a document, said method comprising steps of:
accessing a file, said file containing data that represents a portion of said document, said data formatted in accordance with a document formatting description (DFD);
parsing said data, the parsing including:
making corrections to said data when said data is formed incorrectly with respect to the DFD;
identifying a plurality of tags in said data;
generating a plurality of container tokens, each of said plurality of container tokens corresponding to one of said plurality of tags; and
generating a plurality of textual tokens for said data not identified as one of said plurality of tags;
determining a set of tag paths, each of the tag paths being associated with at least one of the plurality of textual tokens;
determining tag paths of interest, the step of determining tag paths of interest including:
collecting all of the tag paths of the set of tag paths;
removing duplicate tag paths from the collected tag paths to generate a full tag path set for the document;
determining the number of textual tokens for each tag path of the full tag path set; and
finding tag paths of interest in the full tag path set, each of the tag paths of interest satisfying a predetermined criteria;
determining a context for each tag path of interest, said context defined by the tag path of interest and a matching pair of said container tokens within the tag path of interest;
determining contiguous lists, the step of determining contiguous lists including:
iterating through tokens of the tag paths of interest to find at least one locatable list;
for each locatable list, accumulating text tokens, the step of accumulating text tokens comprising:
identifying a first textual token in the respective tag path of interest;
accumulating the first textual token into a contiguous list;
identifying a second textual token having the same context as the first textual token; and
accumulating the second textual token into the contiguous list;
determining a separator pattern between one of said plurality of textual tokens and an adjacent textual token where both said one of said plurality of textual tokens and said adjacent textual token have said context, the step of determining a separator pattern including:
creating the separator pattern based on tokens between the first textual token and the second textual token, the step of creating a separator pattern including:
discarding white space in the tokens between the first textual token and the second textual token; and
discarding tag attributes in the tokens between the first textual token and the second textual token; and
checking for another occurrence of the separator pattern following the second textual token; and
for each occurrence of the separator pattern, extracting one or more of said plurality of textual tokens, wherein each of the extracted one or more of said plurality of textual tokens have said context, the step of extracting one or more of said plurality of textual tokens including:
extracting a subsequent text token associated with the occurrence of the separator pattern and accumulating the subsequent textual token into the contiguous list if the subsequent textual token is separated from the previous textual token by only the occurrence of the separator pattern; and
terminating the accumulating text tokens for the current locatable list if the subsequent textual token is not separated from the previous textual token by only the occurrence of the separator pattern; and
presenting one or more of said plurality of textual tokens as said human readable list, the step of presenting one or more of said plurality of textual tokens as said human readable list comprising at least one of:
presenting said human readable list on a display;
presenting said human readable list using audio; and
storing said human readable list in a retrievable electronic file.
|
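
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that do not come from the claim: the DFD is taken to be HTML, the "predetermined criteria" is taken to be a minimum count of textual tokens per tag path (`min_count`), the context is simplified to the tag path alone (the claim also involves a matching pair of container tokens), and correction of malformed data is reduced to implicitly closing unclosed inner tags. The names `Tokenizer`, `tokenize`, `paths_of_interest`, and `extract_lists` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# HTML elements that never have a closing tag (assumed subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Tokenizer(HTMLParser):
    """Turns markup into container tokens (tags) and textual tokens."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.tokens = []  # tuples of (kind, value, tag_path or None)

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped, so separator patterns ignore them.
        self.tokens.append(("open", tag, None))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Simple correction: implicitly close unclosed inner tags.
        while self.stack[-1] != tag:
            self.tokens.append(("close", self.stack.pop(), None))
        self.stack.pop()
        self.tokens.append(("close", tag, None))

    def handle_data(self, data):
        text = data.strip()
        if text:  # whitespace-only runs are discarded
            self.tokens.append(("text", text, "/".join(self.stack)))


def tokenize(markup):
    parser = Tokenizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens


def paths_of_interest(tokens, min_count=3):
    """Tag paths carrying at least min_count textual tokens (assumed criterion)."""
    counts = Counter(path for kind, _, path in tokens if kind == "text")
    return {path for path, n in counts.items() if n >= min_count}


def extract_lists(markup, min_count=3):
    tokens = tokenize(markup)
    wanted = paths_of_interest(tokens, min_count)
    text_idx = [i for i, tok in enumerate(tokens) if tok[0] == "text"]

    def separator(a, b):
        # Container tokens strictly between two textual tokens.
        return tuple((kind, value) for kind, value, _ in tokens[a + 1:b])

    lists, i = [], 0
    while i < len(text_idx) - 1:
        first, second = text_idx[i], text_idx[i + 1]
        path = tokens[first][2]
        if path not in wanted or tokens[second][2] != path:
            i += 1
            continue
        pattern = separator(first, second)
        items = [tokens[first][1], tokens[second][1]]
        j = i + 1
        # Keep accumulating while the same separator pattern recurs.
        while j + 1 < len(text_idx):
            prev, nxt = text_idx[j], text_idx[j + 1]
            if tokens[nxt][2] != path or separator(prev, nxt) != pattern:
                break  # terminate the current contiguous list
            items.append(tokens[nxt][1])
            j += 1
        lists.append(items)
        i = j + 1
    return lists
```

As a usage example, calling `extract_lists("<p>One<br>Two<br>Three</p>")` treats the single `br` tag as the separator pattern and collects the three text runs into one list. Presenting the result (display, audio, or file storage) is left to the caller.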