US 11,809,981 B1
Performing hardware operator fusion
Animesh Jain, Sunnyvale, CA (US); Tobias Joseph Kastulus Edler von Koch, Austin, TX (US); Yizhi Liu, Fremont, CA (US); Taemin Kim, Portland, OR (US); Jindrich Zejda, Saratoga, CA (US); Yida Wang, Palo Alto, CA (US); Vinod Sharma, Menlo Park, CA (US); Richard John Heaton, San Jose, CA (US); and Randy Renfu Huang, Morgan Hill, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Nov. 27, 2019, as Appl. No. 16/698,753.
Int. Cl. G06N 3/063 (2023.01); G06F 9/30 (2018.01); G06F 9/54 (2006.01)

CPC G06N 3/063 (2013.01) [G06F 9/30007 (2013.01); G06F 9/545 (2013.01)]

18 Claims

1. A method of generating executable instructions for a hardware accelerator, comprising:

receiving a first kernel of a first operator, the first kernel including:

first read instructions to a first virtual data node to obtain first input data for the first operator,

first operator instructions of applying the first operator to the first input data to generate first output data, and

first write instructions to a second virtual data node to store the first output data;

receiving a second kernel of a second operator, the second kernel including:

second read instructions to the second virtual data node to fetch elements of the first output data to assemble second input data for the second operator;

second operator instructions of applying the second operator to the second input data to generate second output data; and

second write instructions to a third virtual data node to store the second output data;

determining to fuse the first operator with the second operator to generate a fused operator;

based on the first read instructions providing inputs to the fused operator, converting the first read instructions to the first virtual data node into off-chip read instructions to obtain the first input data from an off-chip memory external to the hardware accelerator;

based on the second write instructions storing outputs of the fused operator, converting the second write instructions to the third virtual data node into off-chip write instructions to store the second output data at the off-chip memory;

determining, based on a mapping between the first output data and the second input data, pairs of corresponding first write instructions and second read instructions;

for each pair of corresponding first write instruction and second read instruction, converting the corresponding first write instruction and second read instruction to, respectively, an on-chip write instruction to store the first output data at an on-chip memory internal to the hardware accelerator and an on-chip read instruction to read the second input data from the on-chip memory;

extracting, from the first kernel, the first operator instructions;

extracting, from the second kernel, the second operator instructions; and

generating an instruction file executable by the hardware accelerator including the off-chip read instructions, the first operator instructions, the on-chip write instructions, the on-chip read instructions, the second operator instructions, and the off-chip write instructions.
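
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Instr`, `Kernel`, `fuse`, the `"off_chip"`/`"on_chip"` labels, the operator names `op_a`/`op_b`, and the one-to-one `mapping` dictionary are all hypothetical and chosen only for illustration. The sketch assumes each write of the first kernel corresponds to exactly one read of the second kernel.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instr:
    kind: str       # "read", "write", or "op"
    where: str      # virtual node name, memory space, or operator name
    index: int = 0  # element index the instruction touches


@dataclass
class Kernel:
    reads: List[Instr]
    ops: List[Instr]
    writes: List[Instr]


def fuse(first: Kernel, second: Kernel, mapping: Dict[int, int]) -> List[Instr]:
    """Return a fused instruction list (hypothetical sketch).

    mapping: index of a first-output element -> index of the
    second-input element assembled from it (assumed one-to-one).
    """
    # Inputs of the fused operator come from off-chip memory.
    off_reads = [Instr("read", "off_chip", r.index) for r in first.reads]
    # Outputs of the fused operator go back to off-chip memory.
    off_writes = [Instr("write", "off_chip", w.index) for w in second.writes]

    # Pair each first write with its corresponding second read.
    second_reads = {r.index: r for r in second.reads}
    on_writes, on_reads = [], []
    for w in first.writes:
        r = second_reads.get(mapping.get(w.index))
        if r is None:
            raise ValueError(f"no second read paired with write {w.index}")
        # The intermediate element stays in on-chip memory: the write and
        # its paired read are redirected to the same on-chip slot.
        on_writes.append(Instr("write", "on_chip", w.index))
        on_reads.append(Instr("read", "on_chip", w.index))

    # Operator instructions are extracted from both kernels unchanged.
    return off_reads + first.ops + on_writes + on_reads + second.ops + off_writes


# Usage: two kernels that share the virtual data node "node2".
first = Kernel(
    reads=[Instr("read", "node1", 0), Instr("read", "node1", 1)],
    ops=[Instr("op", "op_a")],
    writes=[Instr("write", "node2", 0), Instr("write", "node2", 1)],
)
second = Kernel(
    reads=[Instr("read", "node2", 0), Instr("read", "node2", 1)],
    ops=[Instr("op", "op_b")],
    writes=[Instr("write", "node3", 0)],
)
program = fuse(first, second, mapping={0: 0, 1: 1})
```

In this sketch, only the accesses to the first and third virtual data nodes become off-chip transfers; every access to the shared second node is redirected to on-chip memory, so the intermediate data never leaves the accelerator.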