
Bug 437867

Summary: Fence storm with fence_egenera
Product: [Retired] Red Hat Cluster Suite
Reporter: Bryn M. Reeves <bmr>
Component: fence
Assignee: Jim Parsons <jparsons>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 4
CC: bstevens, cfeist, cluster-maint, cww, edamato, lpleiman, mgrac, nstraz, tao
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: fence-1.32.65-1.el4
Doc Type: Bug Fix
Last Closed: 2009-06-10 16:31:08 UTC
Bug Blocks: 459501    
Attachments:
test boot watch patch (flags: none)

Description Bryn M. Reeves 2008-03-17 19:59:07 UTC
Description of problem:
When initiating a manual fencing operation from one blade in an Egenera frame,
the node carrying out the fencing considers its fencing operation to have failed
and becomes stuck in a loop repeatedly trying to fence the target node.

In fact, the target has been successfully fenced and is repeatedly interrupted
during bootup by the subsequent retries from the fencing node.

Version-Release number of selected component (if applicable):
fence-1.32.50-2

How reproducible:
Unsure

Steps to Reproduce:
1. Manually initiate fencing from one node.
  
Actual results:
The fencing node continually reports that fencing has failed and re-attempts the
operation. The target node never finishes booting.

Expected results:
The first successful fence operation is correctly recognised by the fencing node
and is not re-tried forever.

Additional info:

Comment 1 RHEL Program Management 2008-03-17 20:08:29 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Jason Willeford 2008-03-24 16:02:15 UTC
Once a patch has been provided, the customer has agreed to test.

Comment 16 Jim Parsons 2008-04-02 15:54:34 UTC
Oh, never mind! The comment above says they will test for us ;)

Comment 18 Kiersten (Kerri) Anderson 2008-05-07 16:14:17 UTC
Moving this out to 4.8, and we need a way to reproduce this issue.

Comment 31 Issue Tracker 2008-08-08 15:35:06 UTC
As we know, the Egenera frame has redundant control blades (cBlades) that can
power the individual blades on and off. The cluster here has been configured
to fence a node twice, once from each control blade. This creates a race
condition between the fence_egenera runs against each one. As I see it, there
are two possible solutions:

1. Fix the configuration so that the second control blade is used only if the
fence attempt through the first control blade fails.

2. Fix the problem in fence_egenera that causes the race in the first place.
The script does not appear to understand the "Booting" status, so it reboots
the node again. This can be fixed either by waiting until the "Booting"
status changes to one the script recognizes, or by simply forcing a reboot
during the "Booting" stage. In either case, the script must return success at
the end to prevent the current looping behavior.
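Option 1 relies on how fence methods are evaluated: within a node's fence section of cluster.conf, methods are tried in order and the next one is used only if the current one fails, while every device listed inside a single method is run. The following is a simplified model of that ordering, not fenced's actual code; `fence_node`, `cblade`, and the names `cb1`/`cb2` are hypothetical stand-ins for fence_egenera invocations against each control blade.

```python
from typing import Callable, List

# Hypothetical stand-in for one fence agent invocation, e.g. fence_egenera
# aimed at one control blade. Returns True if the agent reports success.
FenceAgent = Callable[[str], bool]


def fence_node(node: str, methods: List[List[FenceAgent]]) -> bool:
    """Simplified model of fence method ordering for one cluster node.

    Methods are tried in order. A method succeeds only if every device in it
    succeeds; the next method is tried only when the current one fails.
    """
    for devices in methods:
        # all() stops at the first device in this method that fails
        if all(agent(node) for agent in devices):
            return True
    return False


# Usage: record which control blade is asked to fence the node.
calls: List[str] = []


def cblade(name: str, ok: bool = True) -> FenceAgent:
    """Hypothetical agent for one control blade; ok sets its result."""
    def agent(node: str) -> bool:
        calls.append(name)
        return ok
    return agent


# Both control blades in one method: the node is fenced twice.
fence_node("blade1", [[cblade("cb1"), cblade("cb2")]])
print(calls)  # ['cb1', 'cb2']

# One control blade per method: cb2 is used only as a fallback.
calls.clear()
fence_node("blade1", [[cblade("cb1")], [cblade("cb2")]])
print(calls)  # ['cb1']

calls.clear()
fence_node("blade1", [[cblade("cb1", ok=False)], [cblade("cb2")]])
print(calls)  # ['cb1', 'cb2']
```

Placing each control blade in its own method keeps the redundancy while ensuring a node that was fenced successfully is not power-cycled a second time.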

This event sent from IssueTracker by calvin_g_smith 
 issue 164929

Comment 32 Issue Tracker 2008-08-08 22:31:53 UTC
Attached is a patch for the fence_egenera script that recognizes if the
machine is already booting up and returns success if that is the case. 
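The attached patch itself is not reproduced here, so the following is only a sketch of the behavior described: if the blade already reports "Booting", the agent treats it as fenced and returns success instead of power-cycling it again. `fence_reboot`, `get_status`, and `power_cycle` are hypothetical wrappers around the Egenera management commands; "Running" is a placeholder status name, the timeout values are arbitrary, and it assumes `power_cycle` returns only after the blade has actually been switched off.

```python
import time
from typing import Callable

# Only "Booting" is named in the bug report; "Running" is a hypothetical
# placeholder for whatever the Egenera tools report for a running blade.
BOOTING = "Booting"
RUNNING = "Running"


def fence_reboot(get_status: Callable[[], str],
                 power_cycle: Callable[[], None],
                 timeout: float = 60.0,
                 interval: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Reboot a blade, treating one that is already booting as fenced."""
    if get_status() == BOOTING:
        # An earlier fence (e.g. from the other control blade) already
        # restarted this blade: do not interrupt its boot, report success.
        return True

    power_cycle()

    # Poll until the blade reports a state showing the restart took effect.
    waited = 0.0
    while waited < timeout:
        if get_status() in (BOOTING, RUNNING):
            return True
        sleep(interval)
        waited += interval
    return False


# Usage: a blade that is already booting is not power-cycled again.
print(fence_reboot(lambda: "Booting", lambda: print("power cycle")))  # True
```

Returning success for a booting blade is what breaks the retry loop described in the original report: the fencing node stops re-fencing a target that is already coming back up.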

This event sent from IssueTracker by calvin_g_smith 
 issue 164929
it_file 147843

Comment 33 Jason Willeford 2008-08-11 18:29:59 UTC
Created attachment 313993
test boot watch patch

Comment 35 Chris Feist 2009-06-10 16:31:08 UTC
This is fixed in RHEL4.8 in fence-1.32.65-1.el4 and beyond.