US 7,516,360 B2
System and method for execution of a job in a distributed computing architecture
Utz Bacher, Tuebingen (Germany); Oliver Benke, Leinfelden-Echterdingen (Germany); Boas Betzler, Magstadt (Germany); Thomas Lumpp, Reutlingen (Germany); and Eberhard Pasch, Tuebingen (Germany)
Assigned to International Business Machines Corporation, Armonk, N.Y. (US)
Filed on Sep. 09, 2004, as Appl. No. 10/937,682.
Claims priority of application No. 03103377 (EP), filed on Sep. 12, 2003.
Prior Publication US 2005/0081097 A1, Apr. 14, 2005
Int. Cl. G06F 11/00 (2006.01); G06F 11/20 (2006.01)
U.S. Cl. 714/12 [714/15; 714/43] 17 Claims
 
16. A method for executing jobs in a distributed computing infrastructure having a distributed management server, worker clients, and systems selectable as failover systems, wherein said distributed management server gets requests to perform a task, divides the task into smaller jobs, selects worker clients for each job and sends said jobs to said selected worker clients, said method at said systems being selectable as failover systems, said method comprising the steps of:
selecting a failover system by at least one worker client;
receiving checkpointing information from said at least one worker client;
monitoring said worker client in order to detect a failure;
taking over and continuing execution of said job by said failover system using said checkpointing information in case of a failure being detected; and
assigning at least one existing or a newly created failover system to the failover system which is continuing execution of said job.
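
The steps of claim 16 can be illustrated with a minimal, single-process sketch. It is not the patented implementation: the class names (FailoverSystem, Worker), the method names, the integer step counter used as checkpointing information, the "alive" flag used to signal failure, and the "-backup" naming of a newly created failover system are all assumptions made for illustration. Network transport and the distributed management server are omitted.

```python
from typing import Dict, List, Optional


class FailoverSystem:
    """Hypothetical failover system: stores checkpoints and watches workers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.checkpoints: Dict[str, int] = {}  # job id -> last completed step
        self.watched: List["Worker"] = []

    def receive_checkpoint(self, job_id: str, step: int) -> None:
        # Receive checkpointing information from a worker client.
        self.checkpoints[job_id] = step

    def monitor(self, spares: List["FailoverSystem"]) -> List["Worker"]:
        """Detect failed workers and take over their jobs."""
        resumed: List["Worker"] = []
        for worker in list(self.watched):
            if worker.alive:
                continue
            self.watched.remove(worker)
            # Continue from the last checkpoint instead of restarting the job.
            successor = Worker(
                self.name,
                worker.job_id,
                worker.total_steps,
                done=self.checkpoints.get(worker.job_id, 0),
            )
            # Assign an existing spare, or create a new failover system,
            # so the system now running the job is itself protected.
            backup = spares.pop() if spares else FailoverSystem(self.name + "-backup")
            successor.select_failover(backup)
            resumed.append(successor)
        return resumed


class Worker:
    """Hypothetical worker client executing one job as a sequence of steps."""

    def __init__(self, name: str, job_id: str, total_steps: int, done: int = 0) -> None:
        self.name = name
        self.job_id = job_id
        self.total_steps = total_steps
        self.done = done
        self.alive = True
        self.failover: Optional[FailoverSystem] = None

    def select_failover(self, system: FailoverSystem) -> None:
        # The worker client itself selects its failover system.
        self.failover = system
        system.watched.append(self)
        system.receive_checkpoint(self.job_id, self.done)

    def step(self) -> None:
        # Execute one step, then send a checkpoint to the failover system.
        if not self.alive or self.done >= self.total_steps:
            return
        self.done += 1
        if self.failover is not None:
            self.failover.receive_checkpoint(self.job_id, self.done)
```

In this sketch, a worker that fails after three of five steps is replaced by a successor on the failover system that resumes at step three, and that successor is given its own failover system before it continues.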