CPC G06F 1/16 (2013.01) [G06F 7/57 (2013.01); G06F 9/3001 (2013.01); G06F 9/30036 (2013.01); G06F 9/30043 (2013.01); G06F 9/3802 (2013.01)] | 16 Claims |
1. An apparatus for vector computing incorporating matrix multiply and accumulation (MMA) calculation, comprising:
a streaming multiprocessor (SM), comprising a general-purpose register (GPR); and
a general matrix multiply (GEMM) calculation unit, comprising an instruction queue and a first arithmetic logical unit (ALU),
wherein the first ALU coupled to the GPR is arranged operably to perform MMA calculation according to a GEMM instruction stored in the instruction queue, and store a calculation result in the GPR,
wherein the SM comprises a second ALU, the second ALU coupled to the instruction queue is arranged operably to: when a fetched instruction is the GEMM instruction, obtain source data from the GPR, and push the GEMM instruction and the source data into the instruction queue,
wherein the second ALU comprises:
a GEMM operation code (opcode) mapping table, arranged operably to store a first opcode of the GEMM instruction;
a demultiplexer, comprising an input terminal, a first output terminal, and a second output terminal, wherein the input terminal is coupled to an opcode register and a source register, the opcode register is arranged operably to store a second opcode, the source register is arranged operably to store a first address in the GPR, the first output terminal is coupled to a pipeline, and the second output terminal is coupled to the instruction queue;
a reading circuit, coupled to the GPR and the instruction queue; and
a comparator, coupled to the GEMM opcode mapping table and the demultiplexer, arranged operably to determine whether the first opcode matches the second opcode; and when the first opcode matches the second opcode, output a first control signal to the demultiplexer to output the second opcode to the instruction queue, and output a second control signal to the reading circuit so as to drive the reading circuit to read the source data from the first address in the GPR, and output the source data to the instruction queue.
|
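The dispatch path recited in claim 1 (comparator, demultiplexer, and reading circuit inside the second ALU) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the claimed hardware: the opcode value 0x40, the class and field names, and the use of a Python list as the GPR are all hypothetical, since the claim specifies no encoding or data layout.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical contents of the GEMM opcode mapping table ("first opcode").
# The claim does not give any concrete opcode encoding.
GEMM_OPCODE_TABLE = {0x40}


@dataclass
class Instruction:
    opcode: int    # value held in the opcode register ("second opcode")
    src_addr: int  # value held in the source register ("first address" in the GPR)


@dataclass
class SecondALU:
    gpr: list  # stand-in for the general-purpose register, indexed by address
    instruction_queue: deque = field(default_factory=deque)  # queue of the GEMM unit
    pipeline: list = field(default_factory=list)             # ordinary SM pipeline

    def dispatch(self, inst: Instruction) -> bool:
        # Comparator: check the fetched opcode against the mapping table.
        is_gemm = inst.opcode in GEMM_OPCODE_TABLE
        if is_gemm:
            # First control signal: the demultiplexer selects its second
            # output, so the opcode goes to the instruction queue.
            # Second control signal: the reading circuit fetches the source
            # data from the first address in the GPR and forwards it too.
            source_data = self.gpr[inst.src_addr]
            self.instruction_queue.append((inst.opcode, source_data))
        else:
            # No match: the demultiplexer selects its first output (pipeline).
            self.pipeline.append(inst)
        return is_gemm


alu = SecondALU(gpr=[10, 20, 30])
alu.dispatch(Instruction(opcode=0x40, src_addr=1))  # routed to the GEMM queue
alu.dispatch(Instruction(opcode=0x01, src_addr=2))  # routed to the pipeline
```

In this model the first ALU of the GEMM calculation unit would consume entries from instruction_queue, perform the MMA calculation, and write the result back into gpr; that stage is left out because the claim does not state the operand layout or the destination address.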