US 9,811,727 B2
Extracting reading order text and semantic entities
Eunyee Koh, San Jose, CA (US); and Walter Wei-Tuh Chang, San Jose, CA (US)
Assigned to Adobe Systems Incorporated, San Jose, CA (US)
Filed by Eunyee Koh, San Jose, CA (US); and Walter Wei-Tuh Chang, San Jose, CA (US)
Filed on May 30, 2008, as Appl. No. 12/130,607.
Prior Publication US 2014/0301644 A1, Oct. 9, 2014
Int. Cl. G06F 17/00 (2006.01); G06K 9/00 (2006.01)
CPC G06K 9/00469 (2013.01)
17 Claims
 
1. A method comprising:
receiving a collection of strings, each string from the collection of strings having a corresponding bounding box describing a position of at least a portion of the string in a source document, the source document including multiple sections, each section presenting at least a portion of the collection of strings;
determining string densities occurring in a first subset of vertical portions of the source document by processing vertical position information from at least one bounding box by scanning left to right;
determining string densities occurring in a first subset of horizontal portions of the source document by processing horizontal position information from at least one bounding box by scanning top to bottom;
detecting a section boundary of one of the multiple sections in the source document by concurrently analyzing the string densities occurring in vertical portions and the string densities occurring in horizontal portions;
based on the section boundary, assigning each string from the collection of strings to either a pre-boundary collection of strings, or a post-boundary collection of strings;
recursively analyzing the pre-boundary collection of strings and the post-boundary collection of strings to search for additional section boundaries in each collection; and
arranging the collection of strings according to a reading order using the section boundary, the reading order corresponding to a language associated with the collection of strings.
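
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It is a simplified recursive cut over density profiles, built on several assumptions that do not come from the claim: bounding boxes have integer coordinates and positive size, with y increasing downward; a "string density" is the count of boxes covering each unit position along an axis; a section boundary is the widest run of zero density on either axis; and the reading order is left to right, top to bottom, as for English. The names `Box`, `density`, `widest_gap`, and `reading_order` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """A string plus its bounding box (hypothetical layout: y grows downward)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def density(boxes: List[Box], axis: str, lo: int, hi: int) -> List[int]:
    """Count how many boxes cover each unit position in [lo, hi) along one axis."""
    counts = [0] * (hi - lo)
    for b in boxes:
        start, end = (b.x0, b.x1) if axis == "x" else (b.y0, b.y1)
        for p in range(start, end):
            counts[p - lo] += 1
    return counts


def widest_gap(counts: List[int], lo: int) -> Optional[Tuple[int, int, int]]:
    """Return (width, start, end) of the longest zero-density run, or None."""
    best = None
    run_start = None
    for i, c in enumerate(counts):
        if c == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            width = i - run_start
            if best is None or width > best[0]:
                best = (width, lo + run_start, lo + i)
            run_start = None
    return best


def reading_order(boxes: List[Box]) -> List[Box]:
    """Recursively split at the widest empty gap, then concatenate the halves."""
    if len(boxes) < 2:
        return list(boxes)
    x_lo, x_hi = min(b.x0 for b in boxes), max(b.x1 for b in boxes)
    y_lo, y_hi = min(b.y0 for b in boxes), max(b.y1 for b in boxes)
    # Both profiles are computed and compared before choosing a boundary.
    gx = widest_gap(density(boxes, "x", x_lo, x_hi), x_lo)
    gy = widest_gap(density(boxes, "y", y_lo, y_hi), y_lo)
    if gx is None and gy is None:
        # No boundary found: fall back to a plain top-to-bottom, left-to-right sort.
        return sorted(boxes, key=lambda b: (b.y0, b.x0))
    # Assumption: on a tie in gap width, the split is made along y.
    split_on_x = gy is None or (gx is not None and gx[0] > gy[0])
    if split_on_x:
        pre = [b for b in boxes if b.x1 <= gx[1]]   # pre-boundary strings
        post = [b for b in boxes if b.x1 > gx[1]]   # post-boundary strings
    else:
        pre = [b for b in boxes if b.y1 <= gy[1]]
        post = [b for b in boxes if b.y1 > gy[1]]
    return reading_order(pre) + reading_order(post)


# A full-width title above two columns of two lines each.
page = [
    Box("R1", 55, 20, 100, 30),
    Box("L2", 0, 32, 45, 42),
    Box("Title", 0, 0, 100, 10),
    Box("R2", 55, 32, 100, 42),
    Box("L1", 0, 20, 45, 30),
]
print([b.text for b in reading_order(page)])
# ['Title', 'L1', 'L2', 'R1', 'R2']
```

In this example a plain sort by top edge and then left edge would interleave the columns (Title, L1, R1, L2, R2). The recursive split first separates the title from the body along y, then separates the two columns along x, which keeps each column together.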