US 9,811,556 B2
Source code search engine
Nathan Fontenot, Georgetown, TX (US); Fionnuala G. Gunter, Austin, TX (US); Michael T. Strosaker, Austin, TX (US); and George C. Wilson, Austin, TX (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Nov. 3, 2016, as Appl. No. 15/342,183.
Application 15/342,183 is a continuation of application No. 14/735,419, filed on Jun. 10, 2015.
Prior Publication US 2017/0046250 A1, Feb. 16, 2017
Int. Cl. G06F 17/30 (2006.01); G06F 9/44 (2006.01); G06F 11/36 (2006.01)
CPC G06F 17/30424 (2013.01) [G06F 8/31 (2013.01); G06F 11/3668 (2013.01); G06F 17/30637 (2013.01); G06F 17/30657 (2013.01); G06F 8/77 (2013.01)]
1 Claim
 
1. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising:
creating a respective abstract syntax tree (AST) of each of a user-defined query source code data set and at least one target source code data set, wherein the user-defined query source code data set comprises a selected portion of source code in a given programming language comprising a complete function, wherein the target source code data set comprises at least one file within at least one repository containing source code in the given programming language;
calculating a respective first similarity value for each of one or more portions of each of the at least one target source code data sets, wherein each respective first similarity value comprises a topological measure of similarity between the user-defined query source code data set and each respective portion of the at least one target source code data set, which comprises a respective target source code abstract syntax subtree, wherein calculating the respective first similarity value further comprises:
calculating, for the query source code abstract syntax tree, a first number of vertices and edges;
calculating, for each respective target source code abstract syntax subtree, a respective second number of vertices and edges;
calculating, for each respective target source code abstract syntax subtree, a respective absolute value of a difference between the first number and the respective second number; and
comparing, for each respective target source code abstract syntax subtree, the respective absolute value to a first threshold;
identifying portions of each of the at least one target source code data sets having the respective first similarity value less than or equal to the first threshold, wherein the first threshold comprises a permissible difference in the number of vertices, edges, or vertices and edges between the user-defined query source code abstract syntax tree and the respective target source code abstract syntax subtree;
calculating a respective second similarity value for each portion of the target source code data set having the respective first similarity value less than or equal to the first threshold, the respective second similarity value comprising a semantic measure of similarity between the user-defined query source code data set and each respective portion of the target source code data set having the respective first similarity value less than or equal to the first threshold, wherein calculating the respective second similarity value further comprises:
identifying one or more series of operations to transform the target source code abstract syntax subtree to the query source code abstract syntax tree, wherein said series of operations comprises one or more of insert, delete, and rename operations;
calculating, for each identified series of operations, a cost of the identified series of operations, wherein the cost of the identified series of operations is associated with one or more of insert, delete, and rename operations;
wherein the cost of the identified series of operations is the respective second similarity value; and
selecting the series of operations having a lowest cost;
outputting, to a user interface, each portion of each target source code data set having the second similarity value less than or equal to a second threshold, wherein each portion is ranked according to the second similarity value.
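
The two-stage comparison recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation; it only shows one way the steps could fit together, under several assumptions. Python and its standard `ast` module stand in for "the given programming language", each function definition in the target is treated as one target subtree, node labels are AST node type names (so identifier names are ignored), and every insert, delete, and rename costs 1. The names `to_tree`, `size`, `forest_distance`, and `search`, and the threshold values, are hypothetical. The edit-cost step uses a textbook memoized recursion for ordered tree edit distance, which is adequate for small functions only.

```python
import ast
from functools import lru_cache

def to_tree(node):
    """Convert an ast node into a hashable (label, children) tuple."""
    children = tuple(to_tree(c) for c in ast.iter_child_nodes(node))
    return (type(node).__name__, children)

@lru_cache(maxsize=None)
def forest_size(forest):
    """Number of vertices in a forest (a tuple of trees)."""
    return sum(1 + forest_size(children) for _, children in forest)

def size(tree):
    """Vertices plus edges; a tree with n vertices has n - 1 edges."""
    n = forest_size((tree,))
    return n + (n - 1)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Lowest cost of insert/delete/rename operations turning f1 into f2."""
    if not f1:
        return forest_size(f2)  # insert every remaining vertex
    if not f2:
        return forest_size(f1)  # delete every remaining vertex
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + kids1, f2) + 1,  # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + kids2) + 1,  # insert rightmost root of f2
        forest_distance(f1[:-1], f2[:-1])          # match the two roots,
        + forest_distance(kids1, kids2)            # align their children,
        + (label1 != label2),                      # and rename if labels differ
    )

def search(query_src, target_src, first_threshold, second_threshold):
    """Return (cost, function name) pairs ranked by ascending edit cost."""
    query = to_tree(ast.parse(query_src).body[0])
    query_size = size(query)
    hits = []
    for node in ast.walk(ast.parse(target_src)):
        if not isinstance(node, ast.FunctionDef):
            continue
        candidate = to_tree(node)
        # First similarity value: absolute difference in vertex-plus-edge counts.
        if abs(query_size - size(candidate)) > first_threshold:
            continue
        # Second similarity value: cost of transforming the target into the query.
        cost = forest_distance((candidate,), (query,))
        if cost <= second_threshold:
            hits.append((cost, node.name))
    return sorted(hits)

QUERY = "def add(a, b):\n    return a + b\n"
TARGETS = (
    "def minus(x, y):\n    return x - y\n\n"
    "def plus(x, y):\n    return x + y\n\n"
    "def big(items):\n"
    "    total = 0\n"
    "    for x in items:\n"
    "        if x > 0:\n"
    "            total += x\n"
    "    return total\n"
)

print(search(QUERY, TARGETS, first_threshold=4, second_threshold=2))
# prints [(0, 'plus'), (1, 'minus')]
```

In this sketch the inexpensive size comparison discards `big` before any edit cost is computed, which reflects the role of the first threshold as a prefilter for the more expensive second similarity value. The surviving candidates are ranked by ascending cost, so the structurally identical `plus` comes ahead of `minus`, which needs one rename.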