US 7,480,817 B2
Method for replicating data based on probability of concurrent failure
Jinliang Fan, Redmond, Wash. (US); Zhen Liu, Tarrytown, N.Y. (US); and Dimitrios Pendarakis, Westport, Conn. (US)
Assigned to International Business Machines Corporation, Armonk, N.Y. (US)
Filed on Mar. 31, 2006, as Appl. No. 11/395,018.
Prior Publication US 2007/0234102 A1, Oct. 04, 2007
Int. Cl. G06F 11/00 (2006.01)
U.S. Cl. 714—4  [714/6; 370/238] 1 Claim
 
1. A computer-implemented method for replicating data, the method comprising the steps of:
determining, by a source node having a non-volatile data storage area, that the source node has data in the non-volatile data storage area to be replicated;
surveying, by the source node, all nodes coupled to the source node via a network so as to determine candidate replication nodes, the nodes being geographically distributed data storage entities, and the candidate replication nodes being the nodes that are functional, communicating nodes with memory capacity available to store at least a portion of the data to be replicated;
acquiring, by the source node, coordinates for each of the candidate replication nodes;
using, by the source node, the coordinates to determine a geographic location of each of the candidate replication nodes;
using, by the source node, the coordinates to determine a communication cost for each of the candidate replication nodes, the communication cost being determined based on communication parameters that include a physical distance, an electrical pathway distance, a number of switches in an electrical pathway, a cost of establishing a connection, and an electrical pathway signal carrying capacity;
rating, by the source node, each of the geographic locations based on probability of a concurrent failure of the source node and the candidate replication node, the probability being based on historical data and predictive mathematical models, the historical data including statistical records of previous events, and the predictive mathematical models including a model of a combination of independent and correlated failures;
using, by the source node, a branch-and-bound algorithm to assign values to sets of the candidate replication nodes based on a combination of the communication costs and the ratings of the geographic locations of the candidate replication nodes;
selecting, by the source node, one of the sets of candidate replication nodes based on the values that are assigned, the one set of candidate replication nodes being selected so as to obtain a lowest value of the combination of the communication cost and the probability of a concurrent failure;
replicating the data to be replicated on the nodes of the one set of candidate replication nodes; and
at least periodically monitoring, by the source node, all nodes coupled to the source node via the network to determine availability of new nodes.
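
The rating, value-assignment, and selection steps recited above can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it is hypothetical. It rests on the following assumptions, none of which come from the claim: each candidate carries a single precomputed communication cost; geographic location is reduced to a region label; a node is down if either its own independent failure or a region-wide event occurs, which stands in for the claimed model of independent and correlated failures; exactly k replicas are wanted; and the value of a set is alpha times total communication cost plus beta times the probability that the source and all replicas fail together. The branch-and-bound prunes a partial set when the cheapest possible completion already costs at least as much as the best value found so far.

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class Node:
    name: str
    region: str       # assumed: coarse location derived from coordinates
    comm_cost: float  # assumed: combined communication cost to this node
    p_fail: float     # assumed: independent failure probability

def concurrent_failure_prob(nodes, region_event_prob):
    """P(all given nodes are down at once) under the assumed model."""
    by_region = {}
    for n in nodes:
        by_region.setdefault(n.region, []).append(n.p_fail)
    # Per region: either the region-wide event occurs, or it does not
    # and every node in that region fails independently.
    return prod(
        region_event_prob[r] + (1 - region_event_prob[r]) * prod(ps)
        for r, ps in by_region.items()
    )

def value(source, chosen, region_event_prob, alpha, beta):
    """Combined value of a replica set; lower is better."""
    cost = sum(n.comm_cost for n in chosen)
    risk = concurrent_failure_prob([source, *chosen], region_event_prob)
    return alpha * cost + beta * risk

def select_replicas(source, candidates, k, region_event_prob,
                    alpha=1.0, beta=1000.0):
    """Branch-and-bound over size-k sets (alpha and beta must be >= 0)."""
    cands = sorted(candidates, key=lambda n: n.comm_cost)
    best = (float("inf"), ())

    def search(i, chosen, cost):
        nonlocal best
        need = k - len(chosen)
        if need == 0:
            v = value(source, chosen, region_event_prob, alpha, beta)
            if v < best[0]:
                best = (v, tuple(chosen))
            return
        if len(cands) - i < need:
            return  # not enough candidates left to complete the set
        # Lower bound: risk term is >= 0, and because cands is sorted by
        # cost, the cheapest completion uses the next `need` candidates.
        bound = alpha * (cost + sum(n.comm_cost for n in cands[i:i + need]))
        if bound >= best[0]:
            return  # prune: this branch cannot beat the incumbent
        search(i + 1, chosen + [cands[i]], cost + cands[i].comm_cost)
        search(i + 1, chosen, cost)

    search(0, [], 0.0)
    return best

# Illustrative data only; all figures are made up.
source = Node("src", "east", 0.0, 0.02)
candidates = [
    Node("a", "east", 1.0, 0.02),
    Node("b", "east", 1.5, 0.02),
    Node("c", "west", 4.0, 0.02),
    Node("d", "south", 6.0, 0.05),
]
region_event_prob = {"east": 0.01, "west": 0.01, "south": 0.02}

best_value, best_set = select_replicas(source, candidates, 2, region_event_prob)
print(best_value, [n.name for n in best_set])
```

In this sketch a larger beta favors replicas in regions other than the source's, since nodes sharing a region share the correlated failure term, while a larger alpha favors nearby, cheaper nodes.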