CPC G06F 3/0634 (2013.01) [G06F 3/0604 (2013.01); G06F 3/067 (2013.01); G06F 3/0619 (2013.01); G06F 3/0631 (2013.01)] | 20 Claims |
1. A method for maintaining fault tolerance in a storage cluster, comprising:
receiving, by a management component associated with a distributed data store on a cluster of host machines, a request to place a first host machine of the cluster of host machines in a maintenance mode, wherein the first host machine stores given data of the distributed data store;
after receiving the request, determining, by the management component, whether a second host machine that does not currently store any data of the distributed data store exists in the cluster of host machines;
determining, by the management component, based on whether the second host machine exists in the cluster of host machines, whether to transfer the given data of the distributed data store from the first host machine to the second host machine;
determining, by the management component, a number of failures to tolerate (FTT) of the distributed data store;
performing, based on the number of FTT of the distributed data store, at least one of:
decrementing, by the management component, the number of FTT of the distributed data store by one,
deactivating, by the management component, the distributed data store; or
recreating, by the management component, a state associated with the given data of the distributed data store on the second host machine; and
after determining whether to transfer the given data of the distributed data store from the first host machine to the second host machine, initiating, by the management component, the maintenance mode on the first host machine.
|
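The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `Host`, `DataStore`, `Action`, `find_empty_host`, and `enter_maintenance` are hypothetical. The claim says only that the action is chosen "based on the number of FTT" and does not specify a rule, so the sketch assumes one: recreate state on an empty second host if one exists, otherwise decrement FTT while it is above zero, otherwise deactivate the data store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Action(Enum):
    DECREMENT_FTT = "decrement_ftt"
    DEACTIVATE_STORE = "deactivate_store"
    RECREATE_STATE = "recreate_state"


@dataclass
class Host:
    name: str
    stored_data: Set[str] = field(default_factory=set)
    in_maintenance: bool = False


@dataclass
class DataStore:
    ftt: int              # number of failures to tolerate
    active: bool = True


def find_empty_host(hosts: List[Host], exclude: Host) -> Optional[Host]:
    """Return a host that currently stores no data of the data store, if any."""
    for host in hosts:
        if host is not exclude and not host.in_maintenance and not host.stored_data:
            return host
    return None


def enter_maintenance(hosts: List[Host], store: DataStore, first: Host) -> Action:
    """Handle a maintenance-mode request for `first` (assumed policy, see lead-in)."""
    # Determine whether a second host storing no data exists in the cluster.
    second = find_empty_host(hosts, first)
    # Decide whether to transfer based on whether that host exists.
    transfer = second is not None
    # Determine the current number of failures to tolerate.
    ftt = store.ftt

    if transfer:
        # Assumed policy: recreate the state of the given data on the second host.
        second.stored_data |= first.stored_data
        action = Action.RECREATE_STATE
    elif ftt > 0:
        # Assumed policy: no spare host, so tolerate one fewer failure.
        store.ftt = ftt - 1
        action = Action.DECREMENT_FTT
    else:
        # Assumed policy: no spare host and no remaining tolerance.
        store.active = False
        action = Action.DEACTIVATE_STORE

    # Maintenance mode starts only after the transfer decision has been made.
    first.in_maintenance = True
    return action
```

In a three-host cluster where only host `c` is empty, placing host `a` in maintenance under this assumed policy recreates its state on `c`. A later request for host `b` finds no empty host and decrements FTT instead.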