Description of problem:
When initiating a manual fencing operation from one blade in an Egenera frame, the node carrying out the fencing considers its fencing operation to have failed and becomes stuck in a loop repeatedly trying to fence the target node. In fact, the target has been successfully fenced and is repeatedly interrupted during bootup by the subsequent retries from the fencing node.

Version-Release number of selected component (if applicable):
fence-1.32.50-2

How reproducible:
Unsure

Steps to Reproduce:
1. Manually initiate fencing from one node.

Actual results:
Fencing node continually reports that fencing has failed & re-attempts the operation. Target node never gets to complete booting.

Expected results:
The first successful fence operation is correctly recognised by the fencing node and is not re-tried forever.

Additional info:
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Once a patch has been provided, the customer has agreed to test.
Oh, nevermind! Above comment says they will test for us ;)
Moving this out to 4.8, and we need a way to reproduce this issue.
As we know, the Egenera has redundant control blades that can power the individual blades on and off. The cluster here has been configured to fence a machine twice, once from each cblade. This creates a race condition between the two fence_egenera invocations.

As I see it, there are two solutions:

1. Fix the configuration to fence from the second control blade only if the attempted fence failed on the first control blade.
2. Fix the issue in fence_egenera that causes the race condition in the first place. It appears that the script does not understand what to do with the "Booting" status, so it reboots the node. This can be fixed either by waiting until the "Booting" status changes to something the script recognizes, or by going ahead and forcing a reboot in the "Booting" stage.

In either case the script must return success at the end to prevent the current condition.

This event sent from IssueTracker by calvin_g_smith issue 164929
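For illustration only, the first option above (use the second control blade only when the first attempt fails) can be sketched as an ordered fallback. This is a Python sketch, not the cluster's actual configuration mechanism; `fence_with_fallback` and the callables passed to it are hypothetical names introduced here.

```python
from typing import Callable, Sequence


def fence_with_fallback(attempts: Sequence[Callable[[], bool]]) -> bool:
    """Try each fencing path in order and stop at the first success.

    Each entry in `attempts` is a hypothetical callable that fences the
    target through one control blade and returns True on success.
    """
    for attempt in attempts:
        if attempt():
            # The target is fenced; later paths must not run, otherwise
            # they would interrupt the node while it is booting.
            return True
    # Every path failed, so the fence operation as a whole failed.
    return False
```

With two paths, the second control blade is contacted only if the first one reports failure, so the two attempts never race each other.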
Attached is a patch for the fence_egenera script that recognizes if the machine is already booting up and returns success if that is the case.

This event sent from IssueTracker by calvin_g_smith issue 164929 it_file 147843
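The behaviour described for the patch might look roughly like the following. This is a Python sketch under stated assumptions, not the patch itself: the patch contents are not shown in this report, and `fence_once`, `get_status`, and `reboot` are hypothetical names. Only the "Booting" status string comes from the comments above; the exit codes follow the usual convention of 0 for success and non-zero for failure.

```python
from typing import Callable


def fence_once(get_status: Callable[[], str], reboot: Callable[[], bool]) -> int:
    """Return an exit code for a single fence attempt.

    `get_status` and `reboot` are hypothetical stand-ins for however the
    agent queries and restarts the target blade.
    """
    if get_status() == "Booting":
        # The node is already coming back up, most likely because another
        # fence attempt succeeded. Report success and leave it alone.
        return 0
    # Any other status: perform the reboot and report its outcome.
    return 0 if reboot() else 1
```

Because a node that is already booting counts as fenced, a second invocation arriving moments after the first returns success instead of restarting the target again.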
Created attachment 313993
test boot watch patch
This is fixed in RHEL4.8 in fence-1.32.65-1.el4 and beyond.