Description of problem:
If a defined qdisk heuristic "program" does not have a defined timeout, the "program" will run until it completes. The drawback is that the time to complete could be longer than (interval*tko). An example of this would be:

    <heuristic interval="2" program="sleep 100" score="1"/>

What is needed is a timer around the calling function that terminates the "program" as failed once it exceeds (interval*tko) and declares that the node failed the heuristic.

    static int
    check_heuristic(struct h_data *h, int block)
    {
        ...
        ret = waitpid(h->childpid, &status, block?0:WNOHANG);
        ...
    }

Version-Release number of selected component (if applicable):
cman-2.0.115-34.el5

How reproducible:
Every time

Steps to Reproduce:
1. Set up a cluster with qdisk
2. Define a heuristic "program" that will exceed (interval*tko) time.
3. Start qdiskd

Actual results:
The qdisk daemon does not notice that the heuristic "program" has run for too long and exceeded (interval*tko) time.

Expected results:
The qdisk daemon should notice that the heuristic "program" has run for too long and exceeded (interval*tko) time.

Additional info:
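For illustration only, the kind of timer requested in the description could look roughly like the sketch below. This is not the attached fix and not qdiskd code: run_heuristic_with_timeout() and elapsed_since() are hypothetical names, the 50 ms poll period is arbitrary, and the caller is assumed to pass interval*tko (in seconds) as timeout_sec. It polls waitpid() with WNOHANG and kills the child's process group once the deadline has passed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: seconds elapsed on the monotonic clock since 'start'. */
static double elapsed_since(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - start->tv_sec) +
           (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/*
 * Hypothetical helper: run 'cmd' via /bin/sh and allow it at most
 * 'timeout_sec' seconds (assumed to be interval * tko).
 * Returns 0 if the program exited with status 0, 1 if it failed,
 * and -1 if it was killed for exceeding the timeout or on error.
 */
static int run_heuristic_with_timeout(const char *cmd, int timeout_sec)
{
    struct timespec start, pause = { 0, 50 * 1000 * 1000 }; /* 50 ms */
    int status;
    pid_t pid, ret;

    assert(cmd != NULL && timeout_sec > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        setpgid(0, 0);          /* own process group, so the group can be killed */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);             /* exec failed */
    }
    setpgid(pid, pid);          /* also set in the parent to avoid a race */

    for (;;) {
        ret = waitpid(pid, &status, WNOHANG);
        if (ret == pid)         /* child finished in time */
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
        if (ret < 0 && errno != EINTR)
            return -1;

        if (elapsed_since(&start) >= (double)timeout_sec) {
            kill(-pid, SIGKILL);    /* deadline passed: kill the whole group */
            while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
                ;                   /* reap the child so no zombie is left */
            return -1;
        }
        nanosleep(&pause, NULL);    /* child still running; poll again shortly */
    }
}
```

With the heuristic from the description, run_heuristic_with_timeout("sleep 100", interval * tko) would return -1 after roughly interval*tko seconds instead of blocking for 100 seconds. A real daemon would more likely track the deadline inside its existing main loop than poll in a dedicated loop like this.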
Created attachment 448783: Fix
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=f2bfc93101e06cba918c2bb0c11ab6d668788019
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0036.html