Bug 501586

Summary: fence agents (fence_apc, fence_wti) fail with pexpect exception
Product: Red Hat Enterprise Linux 5 Reporter: Nate Straz <nstraz>
Component: cman Assignee: Marek Grac <mgrac>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.3 CC: cfeist, cluster-maint, edamato
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: cman-2.0.112-1.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 460054
: 501890 504589 Environment:
Last Closed: 2009-09-02 11:09:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 460054    
Bug Blocks: 501890, 504589    
Attachments:
Description Flags
Patch to fix exceptions.OSError - import + VMWare none

Description Nate Straz 2009-05-19 20:51:55 UTC
+++ This bug was initially created as a clone of Bug #460054 +++

Description of problem:

I hit this problem again during RHEL 5.4 testing with revolver.  In my three-node cluster, dash-01 kept fencing dash-02 continuously until I intervened and rebooted dash-01.

Version-Release number of selected component (if applicable):
cman-2.0.101-1.el5

How reproducible:
Unknown

Steps to Reproduce:
1. run revolver
  
Actual results:

Message repeated in /var/log/messages on dash-01:

May 19 14:28:18 dash-01 fenced[8514]: fencing node "dash-02"
May 19 14:28:25 dash-01 fenced[8514]: agent "fence_apc" reports: Success: Rebooted Traceback (most recent call last):   File "/sbin/fence_apc", line 216, in ?     main()   File "/sbin/fence_apc", line 211, in main     conn.close()   File "/usr/lib/python2.4/site-packages/pexpect.py", line 666, in close     raise Except
May 19 14:28:25 dash-01 fenced[8514]: agent "fence_apc" reports: ionPexpect ('close() could not terminate the child using terminate()') pexpect.ExceptionPexpect: close() could not terminate the child using terminate() Exception exceptions.OSError: <exceptions.OSError instance at 0x2b28b126bc20> in <bound method fspawn.
May 19 14:28:25 dash-01 fenced[8514]: agent "fence_apc" reports: __del__ of <fencing.fspawn object at 0x2b28b0012e90>> ignored
May 19 14:28:25 dash-01 fenced[8514]: fence "dash-02" failed

Which cleans up as:

Success: Rebooted
Traceback (most recent call last):
  File "/sbin/fence_apc", line 216, in ?
    main()
  File "/sbin/fence_apc", line 211, in main
    conn.close()
  File "/usr/lib/python2.4/site-packages/pexpect.py", line 666, in close
    raise ExceptionPexpect ('close() could not terminate the child using terminate()')
pexpect.ExceptionPexpect: close() could not terminate the child using terminate()
Exception exceptions.OSError: <exceptions.OSError instance at 0x2b28b126bc20> in <bound method fspawn.__del__ of <fencing.fspawn object at 0x2b28b0012e90>> ignored

It looks like the exception that fence_apc should actually catch is ExceptionPexpect rather than OSError.

Expected results:


Additional info:

Comment 3 Nate Straz 2009-06-05 16:55:16 UTC
While running regressions on 5.3.z I was able to hit this with the fence_wti agent also.

Jun  5 00:52:52 z1 fenced[5635]: fencing node "z4"
Jun  5 00:52:59 z1 fenced[5635]: agent "fence_wti" reports: Success: Rebooted Traceback (most recent call last):   File "/sbin/fence_wti", line 109, in ?     main()   File "/sbin/fence_wti", line 106, in main     conn.close()   File "/usr/lib/python2.4/site-packages/pexpect.py", line 666, in close     raise Except
Jun  5 00:52:59 z1 fenced[5635]: agent "fence_wti" reports: ionPexpect ('close() could not terminate the child using terminate()') pexpect.ExceptionPexpect: close() could not terminate the child using terminate() Exception exceptions.OSError: <exceptions.OSError instance at 0xb7eedd0c> in <bound method fspawn.__de
Jun  5 00:52:59 z1 fenced[5635]: agent "fence_wti" reports: l__ of <fencing.fspawn object at 0xb7c7492c>> ignored
Jun  5 00:52:59 z1 fenced[5635]: fence "z4" failed

This eventually led to z1 being overwhelmed with telnet processes, and z1 itself needed to be fenced.

All fence agents which use pexpect.py should handle the ExceptionPexpect exception on conn.close().
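
A minimal sketch of the handling suggested above, not the actual patch. To keep it runnable without pexpect installed, ExceptionPexpect and FakeConn are hypothetical stand-ins defined here; close_quietly is also a made-up helper name. In a real agent the exception would be pexpect.ExceptionPexpect and conn would be the agent's spawned connection.

```python
class ExceptionPexpect(Exception):
    """Stand-in for pexpect.ExceptionPexpect (assumption: pexpect not installed)."""


class FakeConn(object):
    """Hypothetical connection whose close() fails as in the traceback above."""

    def close(self):
        raise ExceptionPexpect(
            "close() could not terminate the child using terminate()")


def close_quietly(conn):
    """Close conn; return True on a clean close, False if cleanup failed.

    The power action has already completed by the time close() is called,
    so a cleanup failure is reported to the caller instead of being raised.
    """
    try:
        conn.close()
    except (ExceptionPexpect, OSError):
        # Catch both: the pexpect error from close() and the OSError the
        # agents were already checking for.
        return False
    return True


if __name__ == "__main__":
    print(close_quietly(FakeConn()))
```

Whether a failed close() should still count as a successful fence is a policy decision for the agent; the sketch only shows that the exception no longer escapes as a traceback.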

Comment 5 Nate Straz 2009-06-18 19:04:03 UTC
Verified that handling of ExceptionPexpect is included in cman-2.0.108-1.el5.

Comment 6 Nate Straz 2009-07-21 22:01:42 UTC
I hit this during revolver testing:

Jul 20 18:14:10 basic-p2 fenced[1699]: agent "fence_lpar" reports: Success: Rebooted Traceback (most recent call last):   File "/sbin/fence_lpar", line 134, in ?     main()   File "/sbin/fence_lpar", line 128, in main     except exceptions.OSError: NameError: global name 'exceptions' is not defined Exception exceptions.O
Jul 20 18:14:10 basic-p2 fenced[1699]: agent "fence_lpar" reports: SError: <exceptions.OSError instance at 0xf7cfe120> in <bound method fspawn.__del__ of <fencing.fspawn object at 0xf7cf37d0>> ignored

Success: Rebooted
Traceback (most recent call last):
  File "/sbin/fence_lpar", line 134, in ?
    main()
  File "/sbin/fence_lpar", line 128, in main
    except exceptions.OSError:
NameError: global name 'exceptions' is not defined
Exception exceptions.OSError: <exceptions.OSError instance at 0xf7cfe120> in <bound method fspawn.__del__ of <fencing.fspawn object at 0xf7cf37d0>> ignored

It appears that a line was added to check for OSError, but the exceptions module was never imported in any of the fence agents the line was added to.
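
A small sketch of why the missing import only surfaces when close() actually fails: Python evaluates the expression in an except clause only when an exception reaches it. The class and function names below are hypothetical and exist only for illustration. The second function uses the builtin OSError name, which needs no import; that is one possible fix and not necessarily what the attached patch does.

```python
class GoodConn(object):
    """Hypothetical connection that closes cleanly."""

    def close(self):
        pass


class BadConn(object):
    """Hypothetical connection whose close() raises OSError."""

    def close(self):
        raise OSError("simulated failure in close()")


def close_with_missing_import(conn):
    try:
        conn.close()
    except exceptions.OSError:  # 'exceptions' is deliberately not imported
        pass


def close_with_builtin_name(conn):
    try:
        conn.close()
    except OSError:  # builtin name, no import required
        pass


if __name__ == "__main__":
    # Clean close: the except expression is never evaluated, so no error.
    close_with_missing_import(GoodConn())

    # Failing close: looking up 'exceptions' now raises NameError.
    try:
        close_with_missing_import(BadConn())
    except NameError as err:
        print("NameError: %s" % err)

    # With the builtin name the OSError is handled as intended.
    close_with_builtin_name(BadConn())
```

This matches the report: the agents behaved normally until a close() failure made the interpreter evaluate the except clause for the first time.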

Comment 8 Marek Grac 2009-07-23 09:37:03 UTC
Created attachment 354830 [details]
Patch to fix exceptions.OSError - import + VMWare

Proposed patch to fix problem found during tests.

Comment 13 errata-xmlrpc 2009-09-02 11:09:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html