248006 – fence_bladecenter fails when blade is not physically present

Summary: fence_bladecenter fails when blade is not physically present

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	cman
Sub Component:
Version:	5.0
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Marek Grac
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	487501

Reported:	2007-07-12 17:37 UTC by Scott Thistle
Modified:	2018-10-27 14:29 UTC
CC List:	5 users
Fixed In Version:	cman-2.0.115-16.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 08:41:05 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Modified version of fence_bladecenter (7.79 KB, text/plain) 2007-08-16 09:35 UTC, Cato Feness

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2010:0266	0	normal	SHIPPED_LIVE	cman bug fix and enhancement update	2010-03-29 12:54:44 UTC

Description Scott Thistle 2007-07-12 17:37:29 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4

Description of problem:
If a blade is not present (e.g. removed for maintenance), fence_bladecenter cannot check its power state because the bay is reported as empty. I think this is simple to fix for those versed in Perl. Normally the fence agent only runs against a blade that is present. If the blade is removed while the cluster is running, you run into this issue.

My case is shown below. Blade #3 is a good node. Blade #2 was removed. Fencing does not work with the blade removed.

system> env -T system:blade[3]
OK
system:blade[3]> power -state
On
system:blade[3]> env -T system:blade[2]
The target bay is empty.
system:blade[3]> env -T system:blade[1]
OK
system:blade[1]>

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Bring up cluster on two nodes
2. Physically remove blade running the service
3. Fencing fails, as shown in the log

Actual Results:
The clustered service does not fail over to the standby node

Expected Results:
The clustered service should fail over. The fence agent should detect that the fenced node is no longer present in the BladeCenter instead of hanging

Additional info:
Got this from James Parsons on the RHCS mailing list:

I believe this is what you want to happen...if state cannot be checked, fenced keeps trying. How could you determine it was safe to stop without persisting some value like the number of fence tries, and trying to reason out whether it was safe to stop? This will not happen if you remove the blade from the cluster before physically removing it. It is a snap to do this with one of the UIs, if you are not prejudiced against UIs :).

Also, removing the node from cluster membership before jerking it out of the rack tells rgmanager to move any services off of it - rather than having to depend on heartbeat failure to make this happen.

That said, if the blade catches fire and a cage IT guy notices and jerks it quick, (using his IT Oven Mitt, of course) it is silly for fenced to keep incessantly trying when the thing no longer even exists. Perhaps the correct solution would be to have the fence_bladecenter report success if the bladecenter admin unit reports that 'no status is available' for a particular blade - obviously if the thing is not there, it should be safe to say it is fenced :)
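
A minimal sketch of the behavior proposed above, assuming the agent sees the same console text as in the transcript from the description. The function names and the `missing_as_off` parameter are hypothetical and are not taken from the actual fence_bladecenter script.

```python
# Hypothetical illustration only; not the real fence_bladecenter code.

# Message the management module printed for blade[2] in the transcript above.
EMPTY_BAY_MESSAGE = "The target bay is empty."


def parse_blade_status(response: str) -> str:
    """Map raw management-module output to 'on', 'off', or 'missing'."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    # An empty bay is reported instead of a power state.
    if any(EMPTY_BAY_MESSAGE in line for line in lines):
        return "missing"
    for line in lines:
        if line.lower() == "on":
            return "on"
        if line.lower() == "off":
            return "off"
    raise ValueError("unrecognized response: %r" % response)


def is_fenced(status: str, missing_as_off: bool = False) -> bool:
    """Decide whether the node can be considered fenced.

    A powered-off blade is fenced. A missing blade counts as fenced only
    when the caller opts in, which is the behavior suggested in the mail.
    """
    if status == "off":
        return True
    if status == "missing":
        return missing_as_off
    return False
```

With the opt-in set, `is_fenced(parse_blade_status("The target bay is empty."), missing_as_off=True)` reports success, so fenced would have no reason to keep retrying against a bay that holds no blade.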

If this addresses your situation (I think it does), now would be a REALLY good time to file a ticket requesting this behavior - like today! I'll post a fixed version to the ticket when it is ready.

Thanks to Lon for discussing this with me...;)

Regards,

-Jim

Comment 1 Cato Feness 2007-08-16 09:35:26 UTC

Created attachment 161641
Modified version of fence_bladecenter

Comment 2 Cato Feness 2007-08-16 09:36:54 UTC

I ran into the same problem, and modified the fence_bladecenter script so that:

- Turning a blade off will report success if the blade is absent
- Rebooting a blade will report success if the blade is absent

Cluster Suite seems to use only the "reboot" command. The modified script is
attached.
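
The two rules above can be sketched roughly as follows. This is not the attached script: `FakeChassis` and `fence` are hypothetical names, and the chassis object is an in-memory stand-in for the BladeCenter management module so that the example runs on its own.

```python
# Hypothetical illustration only; not the attached fence_bladecenter script.


class FakeChassis:
    """In-memory stand-in for the management module (assumption)."""

    def __init__(self, bays):
        # Maps bay number to "on" or "off"; a bay without an entry is empty.
        self.bays = dict(bays)

    def status(self, bay):
        return self.bays.get(bay, "missing")

    def set_power(self, bay, state):
        if bay not in self.bays:
            raise LookupError("The target bay is empty.")
        self.bays[bay] = state


def fence(chassis, bay, action):
    """Run 'off' or 'reboot' against a bay; return True on success."""
    if action not in ("off", "reboot"):
        raise ValueError("unsupported action: %r" % action)
    # An absent blade cannot be running, so both actions report success.
    if chassis.status(bay) == "missing":
        return True
    chassis.set_power(bay, "off")
    if chassis.status(bay) != "off":
        return False
    if action == "reboot":
        chassis.set_power(bay, "on")
        return chassis.status(bay) == "on"
    return True
```

For example, with `FakeChassis({1: "on", 3: "on"})`, calling `fence(chassis, 2, "reboot")` returns `True` without touching the empty bay, while the same call for bay 3 powers the blade off and back on.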

Comment 3 Marek Grac 2009-10-14 13:08:35 UTC

The patch is now upstream. It adds the option:

--missing-as-off

http://git.fedorahosted.org/git/cluster.git?p=fence-agents.git;a=commit;h=460609f46485cd39f1ea158a00e1ca8fa3dd809f

Comment 5 Chris Ward 2010-02-11 10:09:18 UTC

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 6 Scott Thistle 2010-02-11 12:32:45 UTC

The earlier fix from Cato (2007-08-16) worked for me. We are unable to run any beta versions of software.

Comment 7 Jaroslav Kortus 2010-03-15 16:54:52 UTC

Yet again, the man page for fence_agent was not properly updated, so this change is an undocumented feature.

Comment 8 Perry Myers 2010-03-15 17:42:18 UTC

(In reply to comment #7)
> Yet again, the man page for fence_agent was not properly updated, so this
> change is an undocumented feature.

Jaroslav, can we mark this bug as VERIFIED and file a new bug for the man page issue?  I don't want to hold up the RHEL 5.5 release due to a man page deficiency, but I do want to track it so that we get it fixed for the next release.

Thanks

Comment 9 Jaroslav Kortus 2010-03-16 11:41:30 UTC

Moving the man page request to bug 573990 and marking this as verified.

Comment 11 errata-xmlrpc 2010-03-30 08:41:05 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
