Bug 435189

Summary: fenced admin override does not update cman, preventing rgmanager recovery
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Target Milestone: rc
Target Release: ---
Reporter: Lon Hohberger <lhh>
Assignee: Lon Hohberger <lhh>
QA Contact: GFS Bugs <gfs-bugs>
CC: cluster-maint, edamato, ricardo.arguello, teigland
Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:58:52 EDT
Type: ---
Attachments:
  Fix (flags: none)

Description Lon Hohberger 2008-02-27 15:47:22 EST
Description of problem:

If you override failed fencing, the member is never considered fenced:

 1   M      4   2008-02-27 16:10:38  node1.local
 2   X     16                        node2.local
        Node has not been fenced since it went down

This prevents rgmanager from recovering services, because it waits for fencing to complete.  The output should instead be:

 1   M      4   2008-02-27 16:10:38  node1.local
 2   X     16                        node2.local
        Last fenced:   2008-02-27 15:24:16 by override
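
The unfenced state shown above can be detected mechanically. This is a minimal sketch, assuming the listing keeps the layout shown in this report (a node row with status X followed by an indented detail line). The function name unfenced_nodes is hypothetical, and it reads the listing from standard input instead of calling any cluster tool itself.

```shell
#!/bin/sh
# Hypothetical helper: print the names of down nodes that were never fenced.
# Reads a node listing in the format shown in this report from standard input.
unfenced_nodes() {
    awk '
        # Node row: numeric ID in field 1, status in field 2, name in the last field.
        $1 ~ /^[0-9]+$/ { status = $2; name = $NF; next }
        # Detail line under a down (X) node reporting that no fencing happened.
        /Node has not been fenced since it went down/ && status == "X" { print name }
    '
}
```

On a cluster member this would presumably be fed from the fencing-aware node listing, for example `cman_tool nodes -f | unfenced_nodes`.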

Version-Release number of selected component (if applicable): Current
How reproducible: 100%
Steps to Reproduce:
1. Configure fencing
2. Take a node's fencing device down
3. Take the node down
4. Issue override
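
Step 4 can be sketched as a small wrapper around the override command that appears later in this report (`echo frederick > /var/run/cluster/fenced_override`). The function name issue_override and its optional second argument are assumptions made for illustration; the path is a parameter only so the function can be tried outside a cluster.

```shell
#!/bin/sh
# Hypothetical helper: tell fenced that NODE has been dealt with by hand.
# The default path is the one used in this report's own transcript.
issue_override() {
    node=$1
    override_path=${2:-/var/run/cluster/fenced_override}
    if [ -z "$node" ]; then
        echo "usage: issue_override NODE [OVERRIDE_PATH]" >&2
        return 2
    fi
    if [ ! -e "$override_path" ]; then
        echo "override path $override_path does not exist" >&2
        return 1
    fi
    # Same effect as: echo NODE > OVERRIDE_PATH
    printf '%s\n' "$node" > "$override_path"
}
```

Checking that the path exists first avoids silently creating a regular file when fenced is not running.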
  
Actual results:  Override works, but rgmanager doesn't recover services. 
Restarting rgmanager will cause services to be started.

Expected results:  Rgmanager should recover services immediately after the
override completes.
Comment 1 Lon Hohberger 2008-02-27 15:47:22 EST
Created attachment 296116
Fix
Comment 2 Lon Hohberger 2008-02-27 15:50:27 EST
Test packages (sorry, x86_64 + srpms only) here:

http://people.redhat.com/lhh/cman-2.0.73-1.el5.5.1lhh.src.rpm
http://people.redhat.com/lhh/cman-2.0.73-1.el5.5.1lhh.x86_64.rpm
Comment 3 Lon Hohberger 2008-02-27 17:24:47 EST
Another workaround is to configure manual fencing (blech) and add it as a
second fence level for each cluster node.
Comment 4 Lon Hohberger 2008-02-29 14:24:33 EST
Pushed to git in the rhel5 and master branches.
Comment 8 Lon Hohberger 2008-03-27 15:34:06 EDT
Mar 27 15:30:43 molly clurgmgrd[3942]: <info> Waiting for node #1 to be fenced
Mar 27 15:32:18 molly fenced[1603]: agent "fence_xvm" reports: Timed out waiting for response
Mar 27 15:32:18 molly fenced[1603]: fence "frederick" failed
...
[root@molly ~]# echo frederick > /var/run/cluster/fenced_override
...
Mar 27 15:32:21 molly fenced[1603]: fence "frederick" overridden by administrator intervention
Mar 27 15:32:21 molly clurgmgrd[3942]: <info> Node #1 fenced; continuing
Mar 27 15:32:22 molly clurgmgrd[3942]: <notice> Taking over service service:test from down member frederick
Mar 27 15:32:22 molly clurgmgrd[3942]: <notice> Service service:test started
...
[root@molly ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2008-03-27 14:17:26  /dev/xvdb1
   1   X   1032                        frederick
   2   M   1020   2008-03-27 14:17:12  molly
[root@molly ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   0   M      0   2008-03-27 14:17:26  /dev/xvdb1
   1   X   1032                        frederick
       Last fenced:   2008-03-27 15:27:55 by override
   2   M   1020   2008-03-27 14:17:12  molly
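
To confirm the fix from a script, the "Last fenced" detail for a given node can be pulled out of the listing above. This is a rough sketch, assuming the same output layout; the function name last_fenced is hypothetical, and it reads the listing from standard input.

```shell
#!/bin/sh
# Hypothetical helper: print the "Last fenced" detail for one node, if present.
# Exits non-zero when the node has no such detail line.
last_fenced() {
    awk -v want="$1" '
        # Node row: remember whether it names the node we were asked about.
        $1 ~ /^[0-9]+$/ { current = ($NF == want); next }
        # Detail line belonging to the most recent node row.
        current && /Last fenced:/ {
            sub(/^[ \t]*Last fenced:[ \t]*/, "")
            print
            found = 1
        }
        END { if (!found) exit 1 }
    '
}
```

With the listing above, `cman_tool nodes -f | last_fenced frederick` would be expected to report the override, while a node without a fencing record produces no output and a non-zero exit status.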

Comment 10 errata-xmlrpc 2008-05-21 11:58:52 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html