Bug 276541

Summary:

fence_impilan blocks alternative fencing agents when connectivity to IPMI fails.

Product:

Red Hat Enterprise Linux 5

Reporter:

Reiner Rottmann <rrottmann>

Component:

cman

Assignee:

Jan Friesse <jfriesse>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

5.0

CC:

bkahn, bstevens, cfeist, cluster-maint, cmarthal, djansa, hlawatschek, jfriesse, michael.jansen, zheka

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

cman-2.0.100-1.el5

Doc Type:

Bug Fix

Doc Text:

Cause: Long TCP connection timeout Consequence: Fence_ipmi fails after long timeout, and no other fence device was called then Fix: Make TCP connection timeout shorter Result: Fence_ipmi fails, but after short timeout so other fence device can be called

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-09-02 11:08:36 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

472370

Attachments:

Description	Flags
Patch fixing this bug	none
Patch to add resending of IPMI command if it has not taken effect yet.	none

Description Reiner Rottmann 2007-09-04 15:48:48 UTC

Description of problem:
If there is no connection to IPMI that is used as fencing device, fence_imilan
fails and no other fencing devices get the chance to intervene (perhaps due to a
very long timeout).

Version-Release number of selected component (if applicable):
fence-1.32.25-1 - fence_ipmilan

How reproducible:
Everytime you disable connectivity to IPMI and want to use IPMI for fencing on
the cluster. For example with iptables rule that rejects packets to its
destination IP.

Steps to Reproduce:
1. iptables -A OUTPUT -d <ipmi_ip> -j REJECT
2. fence_node <nodename>
3. watch /var/log/messages and output of command
  
Actual results:
You will get a timeout after a _very_ long period. In the meantime and after the
command finally returns a failure, alternative fencing agents aren't tried. So
the whole fencing process fails even if there are other fencing agents enabled
and verified to work otherwise.


Expected results:
After a reasonable amount of time (much less than now) the agent should return
that it failed to fence the node. Then other fencing agents should get a chance
to fence the failed server with success.


Additional info:
# uname -a
Linux axqa02rc_1 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)

Comment 1 Jim Parsons 2007-09-12 19:44:37 UTC

Lon - what are your thoughts about adding/adjusting the ipmilan timeout?

Comment 2 Jan Friesse 2008-11-20 13:56:48 UTC

*** Bug 401481 has been marked as a duplicate of this bug. ***

Comment 3 Jan Friesse 2008-11-20 14:01:22 UTC

*** Bug 452894 has been marked as a duplicate of this bug. ***

Comment 4 Jan Friesse 2008-11-20 14:18:46 UTC

Created attachment 324178 [details]
Patch fixing this bug

Bug was cause by very long timeout in IPMI agent.

This patch adjust timeout to default value 10s which should be enough for most today IPMI implementations. It also removes retries, because this job is done
by fenced.

Because some devices still need longer timeouts, timeout is adjustable by parameter -t (or timeout for stdin and XML configuration).

Comment 7 Kris Lindgren 2009-02-07 00:35:13 UTC

Created attachment 331179 [details]
Patch to add resending of IPMI command if it has not taken effect yet.

Patch to add re-sending of IPMI power on or off command if the command has not worked yet.  Also changes the timeout values and the number of retries done for the power off or power on command.

Comment 8 Kris Lindgren 2009-02-07 00:38:18 UTC

Hello,

While this patch does fix the bug that happens when the connection to IMPI times out it does not fix the exact same issue if an ipmi command fails to successfully complete. In our particular case with fence_ipmilan doing a reboot command the IPMI card would succesfully power the machine off, however 9 times out of 10 it would not receive or not process the ipmi power on command. This would cause fence_ipmilan to wait 25 seconds (checking the status of the chassis power every 5 seconds) and then abort saying that it was unable to power the machine back on. Mean while the following error gets logged:

Feb 4 16:27:05 vzcluster1 ccsd[16660]: Attempt to close an unopened CCS descriptor (9390).
Feb 4 16:27:05 vzcluster1 ccsd[16660]: Error while processing disconnect: Invalid request descriptor

and the cluster does not attempt to do any other configured fencing method for the node.

I have attached a patch that increase's the number of retires for the status command from 5 to 7 and decreases the time waited between the retires to 2 seconds. Also, if the server is not within the correct expected state it re-sends the command that we want it to perform, sleep 2 seconds and then check the status again.

Also, I am not sure why the logic was to not attempt to resend the ipmi command if it has not taken effect yet. The worse thing I can see happening would be a delayed ipmi action, in which the ipmi card performs requested action and then is asked to perform it again (which would change nothing).

Eitheway with both of our DELL and Supermicro IPMI implementations this has fixed our IPMI fencing issues.

Comment 9 Jan Friesse 2009-02-11 09:41:58 UTC

Thanks for patch. It seems doesn't hurt anything so it will be included in new version. Again, thanks for patch.

Comment 12 Jan Friesse 2009-05-19 07:43:15 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Long TCP connection timeout
Consequence: Fence_ipmi fails after long timeout, and no other fence device was called then
Fix: Make TCP connection timeout shorter
Result: Fence_ipmi fails, but after short timeout so other fence device can be called

Comment 13 Michael Jansen 2009-07-15 05:11:18 UTC

Hi!

I get similar timeout problems, but they are due to a different cause.

I have IBM x3550 servers that I would like to use with IPMI fencing.  The BMC that comes standard with these machines shares ethernet with linux (eth0).

What happens is that fence_ipmilan tries to fence the other node, it
uses the sequence

perhaps (chassis power status)

chassis power off
chassis power status (wait for this to return "off")
chassis power on
chassis power status (wait for this to return "on")

and the BMC controller for some difficult to understand reason shuts itself
down after the "chassis power off".  I do not know how to correct this behaviour
in the BMC.  But: why does the ipmilan fence not use

chassis power cycle

seeing that there is a positive/negative response from the BMC controller?  I suspect that would work.  But there does not seem to be an option in fence_ipmilan
to use the "power cycle".

Comment 14 Jan Friesse 2009-07-15 10:55:26 UTC

(In reply to comment #13)
> Hi!
> 
> I get similar timeout problems, but they are due to a different cause.
> 
> I have IBM x3550 servers that I would like to use with IPMI fencing.  The BMC
> that comes standard with these machines shares ethernet with linux (eth0).
> 
> What happens is that fence_ipmilan tries to fence the other node, it
> uses the sequence
> 
> perhaps (chassis power status)
> 
> chassis power off
> chassis power status (wait for this to return "off")
> chassis power on
> chassis power status (wait for this to return "on")
> 
> and the BMC controller for some difficult to understand reason shuts itself
> down after the "chassis power off".  I do not know how to correct this
> behaviour
> in the BMC.  But: why does the ipmilan fence not use
> 
> chassis power cycle
> 
> seeing that there is a positive/negative response from the BMC controller?  I
> suspect that would work.  But there does not seem to be an option in
> fence_ipmilan
> to use the "power cycle".  

This is totally different problem then reported. If you can, please create new bugzilla, and we can discuss it more.

Thanks,
  Honza

Comment 15 Jan Friesse 2009-07-16 15:13:49 UTC

(In reply to comment #14)
> (In reply to comment #13)
> > Hi!
> > 
> > I get similar timeout problems, but they are due to a different cause.
> > 
> > I have IBM x3550 servers that I would like to use with IPMI fencing.  The BMC
> > that comes standard with these machines shares ethernet with linux (eth0).
> > 
> > What happens is that fence_ipmilan tries to fence the other node, it
> > uses the sequence
> > 
> > perhaps (chassis power status)
> > 
> > chassis power off
> > chassis power status (wait for this to return "off")
> > chassis power on
> > chassis power status (wait for this to return "on")
> > 
> > and the BMC controller for some difficult to understand reason shuts itself
> > down after the "chassis power off".  I do not know how to correct this
> > behaviour
> > in the BMC.  But: why does the ipmilan fence not use
> > 
> > chassis power cycle
> > 
> > seeing that there is a positive/negative response from the BMC controller?  I
> > suspect that would work.  But there does not seem to be an option in
> > fence_ipmilan
> > to use the "power cycle".  
> 
> This is totally different problem then reported. If you can, please create new
> bugzilla, and we can discuss it more.
> 
> Thanks,
>   Honza  

I think, I found better solution. I copied your problem to https://bugzilla.redhat.com/show_bug.cgi?id=482913, because maybe this is same problem (I hope). Please, comment that bug, rather than this.

Regards,
  Honza

Comment 17 errata-xmlrpc 2009-09-02 11:08:36 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html