US 9,811,573 B1
Lineage information management in data analytics
Dong Xiang, Shanghai (CN); Stephen Todd, Shrewsbury, MA (US); and Qiyan Chen, Shanghai (CN)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC Corporation, Hopkinton, MA (US)
Filed on Sep. 27, 2013, as Appl. No. 14/039,537.
Int. Cl. G06F 17/30 (2006.01)
CPC G06F 17/30557 (2013.01)
17 Claims
1. A method comprising steps of:
obtaining an input data set and a first data analytics workload, wherein the first data analytics workload comprises one or more execution parameters for executing the first data analytics workload based on the input data set, and wherein the execution parameters comprise a location of the input data set;
obtaining an identifier specific to the first data analytics workload according to one or more predefined rules, wherein obtaining the identifier comprises creating at least one content address according to and uniquely identifying both of the input data set and the one or more execution parameters to generate the identifier;
commencing execution of the first data analytics workload based on the one or more execution parameters and the input data set, wherein the first data analytics workload writes output data generated during the execution of the first data analytics workload;
registering meta data associated with the output data generated during the execution of the first data analytics workload at a meta data store using said at least one identifier, wherein the meta data comprises lineage information for tracing and verifying at least one of creation, movement, use and alteration of the output data generated during execution of the first data analytics workload based at least in part on said identifier;
storing the meta data associated with the output data temporarily in a temporary store prior to completing execution of the first data analytics workload, wherein the lineage information is stored with both parent meta data associated with a parent file and child meta data associated with one or more child data files associated with the parent file;
merging the temporarily stored meta data from the temporary store into the meta data store upon completing execution of the first data analytics workload; and
performing one or more functions utilizing the lineage information from the merged meta data, the one or more functions comprising one or more of a data provenance function, a data analytics work scheduling function, and a data de-duplication function;
wherein performing the data provenance function comprises replaying the first data analytics workload by tracking a footprint derived from the lineage information to validate the output data generated during the execution of the first data analytics workload;
wherein performing the data analytics workload scheduling function comprises querying the meta data server to locate the lineage information responsive to re-obtaining the first data analytics workload, and returning data from an enterprise data store without re-executing the first data analytics workload responsive to locating the lineage information;
wherein performing the data de-duplication function comprises, responsive to obtaining the first data analytics workload and a second data analytics workload, comparing the first identifier with a second identifier specific to the second data analytics workload and, responsive to the first identifier matching the second identifier, registering the first and second data analytics workloads to point to common data; and
wherein one or more of the above steps are performed via at least one processing device.
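
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All names in it (content_address, LineageStore, run_workload, the "location" and "job" parameter keys) are hypothetical. SHA-256 is an assumed choice of hash, since the claim does not name one. In-memory dictionaries stand in for the temporary store and the meta data store. The sketch derives an identifier from both the input data set and the execution parameters, holds lineage meta data temporarily while the workload runs, merges it on completion, and reuses the registered result when the same workload is obtained again.

```python
import hashlib
import json


def content_address(input_bytes: bytes, params: dict) -> str:
    """Derive one identifier from both the input data and the execution parameters.

    SHA-256 and JSON canonicalization are assumptions made for this sketch.
    """
    h = hashlib.sha256()
    h.update(hashlib.sha256(input_bytes).digest())
    h.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


class LineageStore:
    """In-memory stand-in for the temporary store and the meta data store."""

    def __init__(self):
        self.meta = {}  # identifier -> merged lineage meta data
        self.temp = {}  # identifier -> meta data held while a workload runs

    def register_temp(self, ident: str, parent: str) -> None:
        # Parent meta data is recorded first; child entries are added as output is written.
        self.temp[ident] = {"parent": parent, "children": []}

    def add_child(self, ident: str, child: str) -> None:
        self.temp[ident]["children"].append(child)

    def merge(self, ident: str) -> None:
        # Move the temporary record into the meta data store on completion.
        self.meta[ident] = self.temp.pop(ident)

    def lookup(self, ident: str):
        return self.meta.get(ident)


def run_workload(store: LineageStore, input_bytes: bytes, params: dict, execute):
    """Return (identifier, lineage record, executed flag).

    If lineage for the same identifier is already registered, the workload is
    not re-executed and the existing record is returned instead.
    """
    ident = content_address(input_bytes, params)
    existing = store.lookup(ident)
    if existing is not None:
        return ident, existing, False

    store.register_temp(ident, parent=params["location"])
    for output_name in execute(input_bytes, params):
        store.add_child(ident, output_name)
    store.merge(ident)
    return ident, store.lookup(ident), True


if __name__ == "__main__":
    def demo_execute(data, params):
        # Hypothetical workload that writes two output files.
        return ["part-0", "part-1"]

    store = LineageStore()
    params = {"location": "/data/in.csv", "job": "wordcount"}
    print(run_workload(store, b"abc", params, demo_execute))
    print(run_workload(store, b"abc", params, demo_execute))
```

In this sketch, two workloads with identical input data and parameters produce the same identifier, so the second call finds the existing lineage record and both point to common data. A change to either the input bytes or any parameter yields a different identifier and triggers a fresh execution.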