Bug 589266

Summary: qdisk should have a timer for heuristic "programs" instead of relying on "program" to provide timeout
Product: Red Hat Enterprise Linux 5 Reporter: Shane Bradley <sbradley>
Component: cman Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5 CC: cluster-maint, edamato, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: cman-2.0.115-55.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 636243 (view as bug list) Environment:
Last Closed: 2011-01-13 22:33:49 UTC Type: ---
Attachments:
Description Flags
Fix none

Description Shane Bradley 2010-05-05 17:22:46 UTC
Description of problem:

If a qdisk heuristic "program" does not enforce its own timeout, the
"program" runs until it completes. The drawback is that the time to
complete can be longer than (interval*tko).

An example of this would be:
<heuristic interval="2" program="sleep 100" score="1"/>

What is needed is a timer around the calling function that terminates
the "program" once it has exceeded (interval*tko), treats it as failed,
and declares that the node failed the heuristic.

static int check_heuristic(struct h_data *h, int block) {
   ...
   ret = waitpid(h->childpid, &status, block?0:WNOHANG);
   ...
} 
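
To illustrate the kind of timer requested above, here is a minimal
sketch and not the attached fix or the qdiskd code. The helper name
run_with_timeout() and its return convention are hypothetical; it polls
waitpid() with WNOHANG against a monotonic-clock deadline and kills the
program's process group once the deadline passes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of qdiskd): run `cmd` through /bin/sh and
 * wait at most `timeout_sec` seconds for it to finish.
 * Returns the program's exit status (0-255) if it exited in time, or -1 if
 * it timed out, was killed by a signal, or could not be started.
 */
int run_with_timeout(const char *cmd, int timeout_sec)
{
    assert(cmd != NULL && timeout_sec > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: own process group so the whole group can be killed. */
        setpgid(0, 0);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }
    /* Also set it from the parent to avoid racing with the child. */
    setpgid(pid, pid);

    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += timeout_sec;

    int status;
    for (;;) {
        pid_t ret = waitpid(pid, &status, WNOHANG);
        if (ret == pid)
            break;              /* program finished in time */
        if (ret < 0)
            return -1;          /* waitpid() itself failed */

        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec > deadline.tv_sec ||
            (now.tv_sec == deadline.tv_sec &&
             now.tv_nsec >= deadline.tv_nsec)) {
            /* Deadline exceeded: kill the group and reap the child. */
            kill(-pid, SIGKILL);
            waitpid(pid, &status, 0);
            return -1;
        }

        /* Sleep 100 ms between polls. */
        struct timespec pause = { 0, 100 * 1000 * 1000 };
        nanosleep(&pause, NULL);
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In this sketch the caller would pass (interval*tko) as timeout_sec, so a
heuristic such as "sleep 100" would be reported as failed instead of
blocking past that window.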

Version-Release number of selected component (if applicable):
cman-2.0.115-34.el5

How reproducible:
Every time

Steps to Reproduce:
1. Set up a cluster with qdisk
2. Define a heuristic "program" that will exceed (interval*tko) time.
3. Start qdiskd
  
Actual results:
The qdisk daemon does not notice that the heuristic "program" has run
for too long and exceeded (interval*tko) time.

Expected results:
The qdisk daemon should notice that the heuristic "program" has run
for too long and exceeded (interval*tko) time.

Additional info:

Comment 1 Lon Hohberger 2010-09-21 19:22:11 UTC
Created attachment 448783
Fix

Comment 7 errata-xmlrpc 2011-01-13 22:33:49 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0036.html