US 11,816,913 B2
Methods and systems for extracting information from document images
Mouli Rastogi, Gurgaon (IN); Syed Afshan Ali, Gurgaon (IN); Mrinal Rawat, Gurgaon (IN); Lovekesh Vig, Gurgaon (IN); Puneet Agarwal, Noida (IN); Gautam Shroff, Gurgaon (IN); and Ashwin Srinivasan, Goa (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on May 27, 2021, as Appl. No. 17/332,021.
Claims priority of application No. 202121008796 (IN), filed on Mar. 2, 2021.
Prior Publication US 2022/0284215 A1, Sep. 8, 2022
Int. Cl. G06K 9/00 (2022.01); G06F 16/93 (2019.01); G06F 16/901 (2019.01); G06K 9/62 (2022.01); G06K 9/40 (2006.01); G06K 9/46 (2006.01); G06V 30/418 (2022.01); G06V 10/30 (2022.01); G06V 10/426 (2022.01); G06V 10/75 (2022.01); G06V 30/413 (2022.01); G06V 30/414 (2022.01); G06F 18/21 (2023.01); G06F 18/22 (2023.01); G06V 30/18 (2022.01)
CPC G06V 30/418 (2022.01) [G06F 16/9024 (2019.01); G06F 16/93 (2019.01); G06F 18/21 (2023.01); G06F 18/22 (2023.01); G06V 10/30 (2022.01); G06V 10/426 (2022.01); G06V 10/751 (2022.01); G06V 30/18057 (2022.01); G06V 30/413 (2022.01); G06V 30/414 (2022.01)]
10 Claims
1. A processor-implemented method (200) for extracting information from images of one or more templatized documents comprising:
receiving (202), via an input/output interface, at least one image of each of the one or more templatized documents in a predefined sequence from which the information is to be extracted, and a template document dataset, wherein the template document dataset includes a predefined set of template documents, a knowledge graph for each template document and a rule set for each of the one or more template documents of the set of template documents;
preprocessing (204), via one or more hardware processors, the received at least one image of each of the one or more templatized documents using a cycle generative adversarial network (GAN) to obtain a pre-processed image of each of the one or more templatized documents, wherein the pre-processing includes de-noising;
identifying (206), via the one or more hardware processors, words and sentences along with a spatial relationship for each word from each pre-processed image of the one or more templatized documents using a vision model and an optical character recognition (OCR) technique, wherein the words and sentences are identified from text, tables, charts, and checkboxes present in each pre-processed image of the one or more templatized documents;
generating (208), via the one or more hardware processors, a knowledge graph for each pre-processed image of the one or more templatized documents using a schema from the identified words and sentences, and the spatial relationship for each word, wherein each word is represented by a node, and a sub-graph for each word is formed in the generated knowledge graph;
determining (210), via the one or more hardware processors, a similarity metric by comparing the generated knowledge graph of each pre-processed image with a knowledge graph of each template document present in the template document dataset using a Formal Concept Analysis (FCA), wherein the similarity metric provides at least one matched template document from the template document dataset for each pre-processed image of the one or more templatized documents; and
extracting (212), via the one or more hardware processors, the information of the pre-processed image by applying a rule set of the at least one matched template document from the template document dataset on the generated knowledge graph of each pre-processed image of the one or more templatized documents.
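Steps (208) through (212) of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The names `Word`, `build_graph`, `coverage`, and `extract`, the triple-based graph encoding, and the rule format are all hypothetical. The OCR output is hard-coded rather than produced by a vision model, and a simple edge-coverage ratio stands in for the Formal Concept Analysis comparison recited in step (210).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge of the word's bounding box (assumed OCR output)
    y: float  # top edge of the word's bounding box (assumed OCR output)


def build_graph(words, tol=5.0):
    """Step (208), simplified: each word is a node; edges record spatial relations."""
    edges = set()
    for w in words:
        # Nearest word to the right on roughly the same row.
        right = [o for o in words if abs(o.y - w.y) <= tol and o.x > w.x]
        if right:
            edges.add((w.text, "right", min(right, key=lambda o: o.x).text))
        # Nearest word below in roughly the same column.
        below = [o for o in words if abs(o.x - w.x) <= tol and o.y > w.y]
        if below:
            edges.add((w.text, "below", min(below, key=lambda o: o.y).text))
    return edges


def coverage(template_graph, doc_graph):
    """Step (210), simplified: share of template edges found in the document.

    Assumption: this ratio replaces the FCA-based similarity metric.
    """
    if not template_graph:
        return 0.0
    return len(template_graph & doc_graph) / len(template_graph)


def extract(doc_graph, rules):
    """Step (212), simplified: a rule maps a field to (anchor label, relation)."""
    result = {}
    for field, (anchor, relation) in rules.items():
        for head, rel, tail in doc_graph:
            if head == anchor and rel == relation:
                result[field] = tail
    return result


# Hypothetical template dataset: one graph and one rule set per template.
templates = {
    "form_a": {
        "graph": build_graph([Word("Name:", 0, 0), Word("Date:", 0, 20)]),
        "rules": {"name": ("Name:", "right"), "date": ("Date:", "right")},
    },
    "form_b": {
        "graph": build_graph([Word("Invoice", 0, 0), Word("Total", 0, 20)]),
        "rules": {"total": ("Total", "right")},
    },
}

# Hypothetical OCR result for one filled-in document image.
doc_graph = build_graph([
    Word("Name:", 0, 0), Word("Alice", 50, 0),
    Word("Date:", 0, 20), Word("2021-05-27", 50, 20),
])

# Pick the best-matching template, then apply its rule set to the document graph.
matched = max(templates, key=lambda t: coverage(templates[t]["graph"], doc_graph))
fields = extract(doc_graph, templates[matched]["rules"])
print(matched, fields)
```

In this sketch the template graph contains only the static labels of a form, so a filled-in document still covers every template edge. The rules then read each value by following a spatial relation from its label.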