US 11,816,428 B2
Automatically identifying chunks in sets of documents
Andrew Begun, Redmond, WA (US); Steven DeRose, Silver Spring, MD (US); Taqi Jaffri, Kirkland, WA (US); Luis Marti Orosa, Las Condes (CL); Michael Palmer, Edmonds, WA (US); Jean Paoli, Kirkland, WA (US); Christina Pavlopoulou, Emeryville, CA (US); Elena Pricoiu, Issaquah, WA (US); Swagatika Sarangi, Bellevue, WA (US); Marcin Sawicki, Kirkland, WA (US); Manar Shehadeh, Kirkland, WA (US); Michael Taron, Seattle, WA (US); Bhaven Toprani, Cupertino, CA (US); Zubin Rustom Wadia, Chappaqua, NY (US); David Watson, Seattle, WA (US); Eric White, San Luis Obispo, CA (US); Joshua Yongshin Fan, Bellevue, WA (US); Kush Gupta, Seattle, WA (US); Andrew Minh Hoang, Olympia, WA (US); Zhanlin Liu, Seattle, WA (US); Jerome George Paliakkara, Seattle, WA (US); Zhaofeng Wu, Seattle, WA (US); Yue Zhang, St Paul, MN (US); and Xiaoquan Zhou, Bellevue, WA (US)
Assigned to Docugami, Inc., Kirkland, WA (US)
Filed by Docugami, Inc., Kirkland, WA (US)
Filed on Aug. 5, 2020, as Appl. No. 16/986,139.
Application 16/986,139 is a continuation of application No. PCT/US2020/043606, filed on Jul. 24, 2020.
Claims priority of provisional application 62/900,793, filed on Sep. 16, 2019.
Prior Publication US 2021/0081602 A1, Mar. 18, 2021
Int. Cl. G06F 40/186 (2020.01); G06N 20/00 (2019.01); G06F 40/30 (2020.01); G06F 40/169 (2020.01); G06F 40/117 (2020.01); G06F 40/106 (2020.01); G06F 40/289 (2020.01); G06F 40/295 (2020.01); G06F 16/93 (2019.01); G06F 16/2457 (2019.01); G06F 16/248 (2019.01); G06V 30/414 (2022.01); G06V 30/416 (2022.01)

CPC G06F 40/186 (2020.01) [G06F 16/248 (2019.01); G06F 16/2457 (2019.01); G06F 16/93 (2019.01); G06F 40/106 (2020.01); G06F 40/117 (2020.01); G06F 40/169 (2020.01); G06F 40/289 (2020.01); G06F 40/295 (2020.01); G06F 40/30 (2020.01); G06N 20/00 (2019.01); G06V 30/414 (2022.01); G06V 30/416 (2022.01)]

20 Claims

1. A method implemented on a computer system executing instructions for analyzing and annotating documents, the method comprising:

importing documents in a document set;

performing visual extraction of the imported documents, including creating signatures for document parts;

automatically identifying a hierarchical structure of chunks within individual documents in the document set (a) based on the visual extraction, content, and contexts in the individual document; and (b) based on patterns of visual extraction and content across the documents in the document set, wherein the hierarchical structure includes small chunks comprising series of words from within individual sentences;

for at least some of the small chunks, in a process separate from identifying the small chunks, automatically selecting text from sentences surrounding the small chunks as labels for semantic roles played by the small chunks in a transaction described by the individual documents;

standardizing the labels for semantic roles across the documents in the document set; and

annotating documents in the document set wherein the annotations include locations of the identified small chunks and standardized labels for the semantic roles played by the identified small chunks at those locations.
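
The steps recited in claim 1 can be illustrated with a toy sketch. This is not the patented implementation, only a minimal example under loud assumptions: dollar amounts stand in for "small chunks", a label is taken from the few words preceding a chunk, labels are standardized by lowercasing and dropping stop words, and a part signature is a hash of three made-up visual features. All names here (`part_signature`, `find_small_chunks`, `pick_label`, `standardize`, `annotate`, `Annotation`) are hypothetical and do not come from the patent.

```python
import hashlib
import re
from dataclasses import dataclass

# Assumption: a tiny stop-word list is enough to make label variants converge.
STOP_WORDS = {"the", "a", "an", "of", "is", "shall", "be", "to"}
# Assumption: dollar amounts stand in for "small chunks" within a sentence.
AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


@dataclass
class Annotation:
    doc_id: str
    start: int   # character offset where the chunk begins
    end: int     # character offset just past the chunk
    text: str
    label: str   # standardized semantic-role label


def part_signature(font_size: int, indent: int, bold: bool) -> str:
    """Hash assumed visual features so like-styled parts compare equal."""
    raw = f"{font_size}|{indent}|{int(bold)}".encode()
    return hashlib.sha1(raw).hexdigest()[:12]


def find_small_chunks(sentence: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of candidate chunks inside a sentence."""
    return [m.span() for m in AMOUNT.finditer(sentence)]


def pick_label(sentence: str, start: int, window: int = 4) -> str:
    """As a separate step from chunk finding, use preceding words as a label."""
    words = re.findall(r"[A-Za-z]+", sentence[:start])
    return " ".join(words[-window:])


def standardize(label: str) -> str:
    """Lowercase and drop stop words so variants map to one label."""
    kept = [w for w in label.lower().split() if w not in STOP_WORDS]
    return " ".join(kept)


def annotate(docs: dict[str, str]) -> list[Annotation]:
    """Record each chunk's location together with its standardized label."""
    out = []
    for doc_id, text in docs.items():
        for start, end in find_small_chunks(text):
            label = standardize(pick_label(text, start))
            out.append(Annotation(doc_id, start, end, text[start:end], label))
    return out
```

Calling `annotate` on two short lease sentences such as "The monthly rent of $1,200 is due on the first." and "Monthly Rent: $950 payable in advance." yields one annotation per document, each carrying the chunk's offsets and the same standardized label. `part_signature` is shown on its own and is not wired into `annotate`; in this sketch it only illustrates how like-styled parts could be compared across documents.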