US 11,704,785 B2
Techniques for image content extraction
David James Wheaton, Pittsboro, NC (US); Stuart Dakari Cooke, III, Durham, NC (US); and William Robert Nadolski, Raleigh, NC (US)
Assigned to SAS INSTITUTE INC., Cary, NC (US)
Filed by SAS Institute Inc., Cary, NC (US)
Filed on Aug. 17, 2022, as Appl. No. 17/889,801.
Application 17/889,801 is a continuation in part of application No. 17/397,470, filed on Aug. 9, 2021, granted, now 11,443,416.
Application 17/397,470 is a continuation in part of application No. 17/089,962, filed on Nov. 5, 2020, granted, now 11,087,077, issued on Aug. 10, 2021.
Application 17/089,962 is a continuation of application No. 17/083,568, filed on Oct. 29, 2020, granted, now 11,049,235, issued on Jun. 29, 2021.
Claims priority of provisional application 63/285,444, filed on Dec. 2, 2021.
Claims priority of provisional application 63/154,569, filed on Feb. 26, 2021.
Claims priority of provisional application 63/170,484, filed on Apr. 3, 2021.
Claims priority of provisional application 62/992,941, filed on Mar. 21, 2020.
Claims priority of provisional application 62/991,259, filed on Mar. 18, 2020.
Prior Publication US 2022/0392047 A1, Dec. 8, 2022
Int. Cl. G06K 9/00 (2022.01); G06T 7/00 (2017.01); G06F 16/81 (2019.01); G06F 16/93 (2019.01); G06F 40/284 (2020.01); G06F 40/186 (2020.01); G06F 40/169 (2020.01); G06F 3/04842 (2022.01); G06V 10/40 (2022.01); G06K 9/62 (2022.01); G06V 30/10 (2022.01); G06V 30/24 (2022.01); G06V 30/418 (2022.01)

CPC G06T 7/0002 (2013.01) [G06F 3/04842 (2013.01); G06F 16/81 (2019.01); G06F 16/93 (2019.01); G06F 40/169 (2020.01); G06F 40/186 (2020.01); G06F 40/284 (2020.01); G06V 10/40 (2022.01); G06K 9/6253 (2013.01); G06K 9/6276 (2013.01); G06T 2207/30168 (2013.01); G06T 2207/30176 (2013.01); G06V 30/10 (2022.01); G06V 30/248 (2022.01); G06V 30/418 (2022.01)]

30 Claims

1. An apparatus comprising a processor and a storage to store instructions that, when executed by the processor, cause the processor to perform operations comprising:

identify semi-structured data comprising an input document image (IDI), a set of IDI words, and a set of IDI word locations, the set of IDI word locations including a location in the IDI for each IDI word in the set, and each IDI word location including a first-dimensional value and a second-dimensional value;

identify a document template comprising a set of template words, a set of template word locations, and a breakpoint, the set of template word locations including a location in the document template for each template word in the set and each template word location including a first-dimensional value and a second-dimensional value, and the breakpoint comprising a location in the document template and the location in the document template comprising a first-dimensional value, wherein each respective template word in the set of template words is assigned a first value or a second value from a binary set of values based on a corresponding first-dimensional value for the respective template word relative to the first-dimensional value of the breakpoint;

compare the set of IDI words to the set of template words to identify the document template as a candidate template match for the IDI, wherein the set of IDI words and the set of template words include a set of common words occurring in the IDI and the document template;

compare the IDI to the candidate template match by performing a multiple regression analysis based on locations of the common words in the IDI and in the candidate template match, the multiple regression analysis including a first regression equation comprising a first dependent variable set equal to a summation of a first independent variable multiplied by a first regression coefficient and a second independent variable multiplied by a second regression coefficient, wherein the first dependent variable corresponds to the first-dimensional value of a respective common word in the IDI, the first independent variable corresponds to a respective first-dimensional value of the respective common word in the candidate template match, and the second independent variable corresponds to the first or second value assigned from the binary set of values to the respective common word in the candidate template match;

determine the candidate template match is an actual template match for the IDI based on the multiple regression analysis; and

extract structured data from the IDI based on the document template in response to determination of the candidate template match as the actual template match for the document template.
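
For illustration only, the breakpoint assignment and the first regression equation recited above might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the dictionary layout (one location per word), the use of an ordinary least-squares fit, the added intercept term, and the R-squared acceptance threshold are all hypothetical. The claim recites only that the dependent variable (first-dimensional value of a common word in the IDI) equals a sum that includes the template first-dimensional value times a first coefficient and the binary value times a second coefficient.

```python
import numpy as np

def assign_binary(template_locs, breakpoint_first):
    """Assign 0 or 1 to each template word from its first-dimensional
    value relative to the breakpoint (0 before it, 1 at or after it).
    Which side receives which value is an assumption."""
    return {w: int(loc[0] >= breakpoint_first) for w, loc in template_locs.items()}

def fit_first_dimension(idi_locs, template_locs, breakpoint_first):
    """Regress IDI first-dimensional values on the template
    first-dimensional values and the binary breakpoint indicator,
    using only words common to both documents."""
    common = sorted(set(idi_locs) & set(template_locs))
    flags = assign_binary(template_locs, breakpoint_first)
    y = np.array([idi_locs[w][0] for w in common], dtype=float)
    x1 = np.array([template_locs[w][0] for w in common], dtype=float)
    x2 = np.array([flags[w] for w in common], dtype=float)
    # Columns: template location, binary indicator, intercept (assumed).
    X = np.column_stack([x1, x2, np.ones_like(x1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot if ss_tot > 0 else 0.0
    return coef, r2

def is_actual_match(r2, threshold=0.95):
    """Hypothetical decision rule: accept the candidate when the fit
    explains most of the variance. The threshold is an assumption."""
    return r2 >= threshold

# Toy data: (first-dimensional, second-dimensional) location per word.
template = {"invoice": (10, 5), "date": (20, 40), "total": (60, 5), "due": (70, 40)}
idi = {"invoice": (12, 5), "date": (22, 40), "total": (77, 5), "due": (87, 40), "page": (95, 1)}

coef, r2 = fit_first_dimension(idi, template, breakpoint_first=50)
matched = is_actual_match(r2)
```

In this toy data the words after the breakpoint are shifted by a fixed extra amount in the IDI, as might happen when a variable-length region precedes them; the second regression coefficient absorbs that shift. A word that appears only in the IDI ("page") is ignored because it is not a common word.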