+++ This bug was initially created as a clone of Bug #460054 +++

Description of problem:
I hit this problem again during RHEL 5.4 testing with revolver. In my three-node cluster, dash-01 was continuously fencing dash-02 until I intervened and rebooted dash-01.

Version-Release number of selected component (if applicable):
cman-2.0.101-1.el5

How reproducible:
Unknown

Steps to Reproduce:
1. run revolver

Actual results:
Message repeated in /var/log/messages on dash-01:

May 19 14:28:18 dash-01 fenced[8514]: fencing node "dash-02"
May 19 14:28:25 dash-01 fenced[8514]: agent "fence_apc" reports: Success: Rebooted Traceback (most recent call last): File "/sbin/fence_apc", line 216, in ? main() File "/sbin/fence_apc", line 211, in main conn.close() File "/usr/lib/python2.4/site-packages/pexpect.py", line 666, in close raise Except
May 19 14:28:25 dash-01 fenced[8514]: agent "fence_apc" reports: ionPexpect ('close() could not terminate the child using terminate()') pexpect.ExceptionPexpect: close() could not terminate the child using terminate() Exception exceptions.OSError: <exceptions.OSError instance at 0x2b28b126bc20> in <bound method fspawn.
May 19 14:28:25 dash-01 fenced[8514]: agent "fence_apc" reports: __del__ of <fencing.fspawn object at 0x2b28b0012e90>> ignored
May 19 14:28:25 dash-01 fenced[8514]: fence "dash-02" failed

Which cleans up as:

Success: Rebooted
Traceback (most recent call last):
  File "/sbin/fence_apc", line 216, in ?
    main()
  File "/sbin/fence_apc", line 211, in main
    conn.close()
  File "/usr/lib/python2.4/site-packages/pexpect.py", line 666, in close
    raise ExceptionPexpect ('close() could not terminate the child using terminate()')
pexpect.ExceptionPexpect: close() could not terminate the child using terminate()
Exception exceptions.OSError: <exceptions.OSError instance at 0x2b28b126bc20> in <bound method fspawn.__del__ of <fencing.fspawn object at 0x2b28b0012e90>> ignored

It looks like the exception fence_apc should actually catch is ExceptionPexpect rather than OSError.

Expected results:

Additional info:
While running regressions on 5.3.z I was able to hit this with the fence_wti agent also.

Jun 5 00:52:52 z1 fenced[5635]: fencing node "z4"
Jun 5 00:52:59 z1 fenced[5635]: agent "fence_wti" reports: Success: Rebooted Traceback (most recent call last): File "/sbin/fence_wti", line 109, in ? main() File "/sbin/fence_wti", line 106, in main conn.close() File "/usr/lib/python2.4/site-packages/pexpect.py", line 666, in close raise Except
Jun 5 00:52:59 z1 fenced[5635]: agent "fence_wti" reports: ionPexpect ('close() could not terminate the child using terminate()') pexpect.ExceptionPexpect: close() could not terminate the child using terminate() Exception exceptions.OSError: <exceptions.OSError instance at 0xb7eedd0c> in <bound method fspawn.__de
Jun 5 00:52:59 z1 fenced[5635]: agent "fence_wti" reports: l__ of <fencing.fspawn object at 0xb7c7492c>> ignored
Jun 5 00:52:59 z1 fenced[5635]: fence "z4" failed

This eventually led to z1 being overwhelmed with telnet processes and z1 needed to be fenced.

All fence agents which use pexpect.py should handle the ExceptionPexpect exception on conn.close().
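
For illustration only, here is a minimal sketch of what handling both exceptions around conn.close() could look like. It is not the committed fix. ExceptionPexpect below is a local stand-in for pexpect.ExceptionPexpect so the snippet runs without pexpect installed, and FakeConn and close_quietly are hypothetical names that do not come from the fence agents.

```python
class ExceptionPexpect(Exception):
    """Local stand-in for pexpect.ExceptionPexpect (assumption: used so the
    sketch runs without pexpect; a real agent would import it from pexpect)."""


class FakeConn(object):
    """Hypothetical connection whose close() can fail like the one in the log."""

    def __init__(self, error=None):
        self.error = error
        self.closed = False

    def close(self):
        if self.error is not None:
            raise self.error
        self.closed = True


def close_quietly(conn):
    """Close conn and swallow the two close-time errors seen in this report.

    Returns True if close() succeeded, False if a known error was ignored.
    Any other exception still propagates to the caller.
    """
    try:
        conn.close()
    except (OSError, ExceptionPexpect):
        # The fencing action has already reported success at this point,
        # so a failure to tear down the child should not fail the agent.
        return False
    return True


if __name__ == "__main__":
    msg = "close() could not terminate the child using terminate()"
    print(close_quietly(FakeConn()))
    print(close_quietly(FakeConn(ExceptionPexpect(msg))))
```

With a handler like this, the agent would not die with a traceback after "Success: Rebooted", so fenced should not see the run as a failed fence.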
Fixed in: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commit;h=ed2e22dd942b8bb2fcea3a453cc97fda4b5d9020
Verified that handling of ExceptionPexpect is included in cman-2.0.108-1.el5.
I hit this during revolver testing:

Jul 20 18:14:10 basic-p2 fenced[1699]: agent "fence_lpar" reports: Success: Rebooted Traceback (most recent call last): File "/sbin/fence_lpar", line 134, in ? main() File "/sbin/fence_lpar", line 128, in main except exceptions.OSError: NameError: global name 'exceptions' is not defined Exception exceptions.O
Jul 20 18:14:10 basic-p2 fenced[1699]: agent "fence_lpar" reports: SError: <exceptions.OSError instance at 0xf7cfe120> in <bound method fspawn.__del__ of <fencing.fspawn object at 0xf7cf37d0>> ignored

Which cleans up as:

Success: Rebooted
Traceback (most recent call last):
  File "/sbin/fence_lpar", line 134, in ?
    main()
  File "/sbin/fence_lpar", line 128, in main
    except exceptions.OSError:
NameError: global name 'exceptions' is not defined
Exception exceptions.OSError: <exceptions.OSError instance at 0xf7cfe120> in <bound method fspawn.__del__ of <fencing.fspawn object at 0xf7cf37d0>> ignored

It appears that a line was added to check for OSError, but the exceptions module was never imported in any of the fence agents the line was added to.
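
A small sketch of why this only shows up when close() actually fails: Python evaluates the expression in an except clause only while an exception is propagating, so the missing import goes unnoticed on the success path. The class and function names below are hypothetical and are not taken from the fence agents. The second function uses the builtin OSError, which needs no import; adding "import exceptions" to the agent would be the Python 2 equivalent.

```python
class OkConn(object):
    """Hypothetical connection whose close() succeeds."""

    def close(self):
        pass


class FailingConn(object):
    """Hypothetical connection whose close() raises OSError."""

    def close(self):
        raise OSError("child already gone")


def close_with_missing_import(conn):
    """Mirrors the broken handler: 'exceptions' is deliberately never imported."""
    try:
        conn.close()
    except exceptions.OSError:  # looked up only if close() raises
        pass


def close_with_builtin(conn):
    """Same handler written against the builtin OSError name."""
    try:
        conn.close()
    except OSError:
        pass


if __name__ == "__main__":
    close_with_missing_import(OkConn())  # success path: nothing is raised
    try:
        close_with_missing_import(FailingConn())
    except NameError as err:
        print("NameError: %s" % err)
    close_with_builtin(FailingConn())  # OSError is caught as intended
```

This matches the log above: the NameError is raised at the except line itself, and only on runs where close() throws.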
Created attachment 354830
Patch to fix exceptions.OSError - import + VMWare

Proposed patch to fix problem found during tests.
Fixed in: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commit;h=82a489cba5637cfcebf43a6e8b3312e8a4555351
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html