US 11,809,516 B2
Apparatus and method for vector computing incorporating with matrix multiply and accumulation calculation
Zhou Hong, Shanghai (CN); and YuFei Zhang
Assigned to Shanghai Biren Technology Co., Ltd, Shanghai (CN)
Filed by Shanghai Biren Technology Co., Ltd, Shanghai (CN)
Filed on Jul. 2, 2021, as Appl. No. 17/366,485.
Claims priority of application No. 202011132750.6 (CN), filed on Oct. 21, 2020.
Prior Publication US 2022/0121727 A1, Apr. 21, 2022
Int. Cl. G06F 9/30 (2018.01); G06F 1/16 (2006.01); G06F 7/57 (2006.01); G06F 9/38 (2018.01)
CPC G06F 1/16 (2013.01) [G06F 7/57 (2013.01); G06F 9/3001 (2013.01); G06F 9/30036 (2013.01); G06F 9/30043 (2013.01); G06F 9/3802 (2013.01)]
16 Claims
 
1. An apparatus for vector computing incorporating with matrix multiply and accumulation (MMA) calculation, comprising:
a streaming multiprocessor (SM), comprising a general-purpose register (GPR); and
a general matrix multiply (GEMM) calculation unit, comprising an instruction queue and a first arithmetic logical unit (ALU),
wherein the first ALU coupled to the GPR is arranged operably to perform MMA calculation according to a GEMM instruction stored in the instruction queue, and store a calculation result in the GPR,
wherein the SM comprises a second ALU, the second ALU coupled to the instruction queue is arranged operably to: when a fetched instruction is the GEMM instruction, obtain source data from the GPR, and push the GEMM instruction and the source data into the instruction queue,
wherein the second ALU comprises:
a GEMM operation code (opcode) mapping table, arranged operably to store a first opcode of the GEMM instruction;
a demultiplexer, comprising an input terminal, a first output terminal, and a second output terminal, wherein the input terminal is coupled to an opcode register and a source register, the opcode register is arranged operably to store a second opcode, the source register is arranged operably to store a first address in the GPR, the first output terminal is coupled to a pipeline, and the second output terminal is coupled to the instruction queue;
a reading circuit, coupled to the GPR and the instruction queue; and
a comparator, coupled to the GEMM opcode mapping table and the demultiplexer, arranged operably to determine whether the first opcode matches the second opcode; and when the first opcode matches the second opcode, output a first control signal to the demultiplexer to output the second opcode to the instruction queue, and output a second control signal to the reading circuit so as to drive the reading circuit to read the source data from the first address in the GPR, and output the source data to the instruction queue.
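
The data flow recited in claim 1 can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the patented hardware: the class names `GemmUnit` and `SecondAlu`, the opcode value `0x40`, the dictionary used as the GPR, and the packing of the source data as a tuple of three matrices are all hypothetical. The claim also does not say what happens when the opcodes do not match; the sketch assumes the instruction then goes to the pipeline through the first output terminal of the demultiplexer.

```python
from collections import deque
from dataclasses import dataclass, field

# Stand-in for the GEMM opcode mapping table (holds the "first opcode").
# The value 0x40 is a made-up encoding; the claim specifies none.
GEMM_OPCODES = {0x40}


@dataclass
class GemmUnit:
    """Models the GEMM calculation unit: an instruction queue and a first ALU."""
    queue: deque = field(default_factory=deque)

    def step(self, gpr, dest):
        """First ALU: pop one queued entry, compute A*B + C, store it in the GPR."""
        _opcode, (a, b, c) = self.queue.popleft()
        rows, inner, cols = len(a), len(b), len(b[0])
        gpr[dest] = [
            [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)
        ]


@dataclass
class SecondAlu:
    """Models the second ALU in the SM: comparator, demultiplexer, reading circuit."""
    gemm: GemmUnit
    pipeline: list = field(default_factory=list)

    def dispatch(self, opcode, src_addr, gpr):
        """Route one fetched instruction (the "second opcode" and the first address)."""
        # Comparator: does the fetched opcode match an entry in the mapping table?
        if opcode in GEMM_OPCODES:
            # Reading circuit: read the source data from the first address in the GPR.
            source = gpr[src_addr]
            # Demultiplexer, second output terminal: opcode and data enter the queue.
            self.gemm.queue.append((opcode, source))
            return "queue"
        # Assumed non-matching path: first output terminal, toward the pipeline.
        self.pipeline.append((opcode, src_addr))
        return "pipeline"


if __name__ == "__main__":
    # GPR modeled as a dict from address to contents (an assumption of this sketch).
    gpr = {0: ([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])}
    unit = GemmUnit()
    alu = SecondAlu(gemm=unit)
    print(alu.dispatch(0x40, 0, gpr))  # routed to the instruction queue
    unit.step(gpr, dest=1)             # first ALU writes the MMA result to gpr[1]
    print(gpr[1])
```

In this model the second ALU never performs the multiply itself. It only classifies the instruction and forwards the opcode together with the operands, which mirrors the claim's division of work between the SM and the GEMM calculation unit.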