US 11,816,482 B2
Generalized acceleration of matrix multiply accumulate operations
Brent Ralph Boswell, Aloha, OR (US); Ming Y. Siu, Santa Clara, CA (US); Jack H. Choquette, Palo Alto, CA (US); Jonah M. Alben, San Jose, CA (US); and Stuart Oberman, Sunnyvale, CA (US)
Assigned to NVIDIA Corporation, Santa Clara, CA (US)
Filed by NVIDIA Corporation, Santa Clara, CA (US)
Filed on Aug. 18, 2022, as Appl. No. 17/890,706.
Application 17/890,706 is a continuation of application No. 17/351,161, filed on Jun. 17, 2021.
Application 17/351,161 is a continuation of application No. 17/141,082, filed on Jan. 4, 2021.
Application 17/141,082 is a continuation of application No. 16/459,191, filed on Jul. 1, 2019, granted, now 10,884,734, issued on Jan. 5, 2021.
Application 16/459,191 is a continuation of application No. 15/826,435, filed on Nov. 29, 2017, granted, now 10,338,919, issued on Jul. 2, 2019.
Claims priority of provisional application 62/503,159, filed on May 8, 2017.
Prior Publication US 2022/0405098 A1, Dec. 22, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/30 (2018.01); G06F 9/38 (2018.01); G06T 1/20 (2006.01)
CPC G06F 9/30014 (2013.01) [G06F 9/3001 (2013.01); G06F 9/3012 (2013.01); G06F 9/30036 (2013.01); G06F 9/3851 (2013.01); G06T 1/20 (2013.01)]
20 Claims
 
1. A processor, comprising:
an instruction cache;
an L1 cache;
an L2 cache;
a crossbar (Xbar);
arithmetic logic units (ALUs);
a front end unit to read commands written by a host processor;
a work distribution unit to dispatch tasks to a plurality of processing clusters;
a register file to store matrices specified in a matrix-fused multiply accumulate (MFMA) instruction, wherein the MFMA instruction is to multiply a first matrix with a second matrix and sum a result with a third matrix, and wherein each element of the matrices is to be encoded as signed integer;
logic circuitry to calculate a dot product, wherein the dot product includes:
accumulating a plurality of partial products generated by multiplying each element of a first vector with a corresponding element of a second vector; and
summing the plurality of partial products with an element of a matrix; and
wherein results of the MFMA instruction are to be accumulated in the register file.
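Illustrative note (not part of the claim): the arithmetic recited in claim 1 can be sketched in software as D = A x B + C over signed integers, where each output element is a dot product of a row of the first matrix with a column of the second, summed with the matching element of the third matrix. The sketch below is a plain reference model under stated assumptions, not the claimed hardware. The function names `dot_accumulate` and `mfma_reference`, the use of Python lists as a stand-in for the register file, and the unbounded integer width are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[int]]  # assumption: nested lists stand in for register-file storage


def dot_accumulate(a_row: List[int], b_col: List[int], c_elem: int) -> int:
    """Dot product with an addend (hypothetical helper).

    Multiplies each element of the first vector with the corresponding
    element of the second vector, accumulates the partial products, and
    sums the total with one element of a third matrix.
    """
    if len(a_row) != len(b_col):
        raise ValueError("vectors must have the same length")
    partial_products = [x * y for x, y in zip(a_row, b_col)]
    acc = 0
    for p in partial_products:
        acc += p
    return acc + c_elem


def mfma_reference(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """Reference model of D = A x B + C for signed-integer matrices.

    Assumption: Python's arbitrary-precision ints are used, so the
    overflow or saturation behavior of real hardware is not modeled.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B must match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have the shape of A x B")
    d = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            b_col = [b[t][j] for t in range(k)]  # column j of B
            d[i][j] = dot_accumulate(a[i], b_col, c[i][j])
    return d


if __name__ == "__main__":
    A = [[1, -2], [3, 4]]
    B = [[5, 6], [-7, 8]]
    C = [[1, 1], [1, 1]]
    print(mfma_reference(A, B, C))
```

Each output element is computed independently of the others, which is why this kind of operation lends itself to many dot-product units working in parallel; the sequential loops here are only for readability.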