Bug 482913 - ipmi fence failed
Summary: ipmi fence failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 522529 5.5TechNotes-Updates 564012 576036
Reported: 2009-01-28 19:37 UTC by Flo
Modified: 2016-04-26 15:56 UTC (History)

Fixed In Version: cman-2_0_115-5_el5
Doc Type: Bug Fix
Doc Text:
Cause:
=======
Strange behavior of some hardware IPMI implementations (for example the IBM x3550 and a brand-new Supermicro motherboard).

Consequence:
============
Old behavior:
perhaps (chassis power status)
chassis power off
chassis power status (wait for this to return "off")
chassis power on
chassis power status (wait for this to return "on")

and the BMC controller, for some difficult to understand reason, shuts itself down after the "chassis power off", so it is not able to power the machine back on.

Fix:
====
Added support for the power cycle command, which does not shut down the BMC controller.

Result:
=======
The old behavior is still the default, so nothing changes without reconfiguration. There is now a new "method" option; setting it to "cycle" selects the new behavior (the IPMI power cycle command is used).

Example of usage:
...
  <fencedevices>
                <fencedevice agent="fence_ipmilan_new" ipaddr="1.2.3.4" login="root" name="ipmifd1" passwd="password" method="cycle" />
...
Clone Of:
: 564012 576036 (view as bug list)
Environment:
Last Closed: 2010-03-30 08:38:07 UTC
Target Upstream Version:
Embargoed:


Attachments
Proposed patch (6.34 KB, patch)
2009-07-20 12:12 UTC, Jan Friesse
Binary version (compiled on RHEL 5.3) of ipmi lan with patch (10.06 KB, application/x-gzip)
2009-07-20 12:14 UTC, Jan Friesse
Patch committed to RHEL55 git branch (8.23 KB, patch)
2009-09-01 10:26 UTC, Jan Friesse


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0266 0 normal SHIPPED_LIVE cman bug fix and enhancement update 2010-03-29 12:54:44 UTC

Description Flo 2009-01-28 19:37:15 UTC
Description of problem:
The IPMI fence module in the RHEL 5 Cluster Suite powers the node off instead of rebooting it.
When I manually send "power off, power on" through the IPMI CLI, the result is correct (reboot); verified with "power status".

So there must be a bug in the fence_ipmilan module.

Output from our /etc/sysconfig/ipmi:
## Path:        Hardware/IPMI
## Description: Enable IPMI_POWEROFF if you want the IPMI poweroff module to be loaded.
## Type:        yesno
## Default:     "no"
## Config:      ipmi
# Enable IPMI_POWEROFF if you want the IPMI
# poweroff module to be loaded.
IPMI_POWEROFF=no

## Path:        Hardware/IPMI
## Description: Enable IPMI_POWERCYCLE if you want the system to be power-cycled on reboot
## Type:        yesno
## Default:     "no"
## Config:      ipmi
# Enable IPMI_POWERCYCLE if you want the system to be power-cycled (power
# down, delay briefly, power on) rather than power off, on systems
# that support such.  IPMI_POWEROFF=yes is also required.
IPMI_POWERCYCLE=no

IPMI Information:
ipmitool> mc info
Device ID                 : 17
Device Revision           : 1
Firmware Revision         : 1.29
IPMI Version              : 2.0
Manufacturer ID           : 11
Manufacturer Name         : Unknown (0xb)
Product ID                : 0 (0x0000)
Device Available          : yes
Provides Device SDRs      : yes

Version-Release number of selected component (if applicable):
RHEL 5.2

How reproducible:
Cluster Suite (node fencing configuration)

Steps to Reproduce:
1. Configure Intel IPMI as the fencing device.
2. Cause a network interruption (unplug the network cable).
3. Check the machine.
  
Actual results:
Poweroff (not rebooting)

Expected results:
Reboot (power off followed by power on)

Additional info:

Comment 1 Paul Kennedy 2009-02-09 22:03:09 UTC
The component for this bug was set to "Cluster_Administration", which designates the cluster administration guide. This does not appear to be a documentation bug; rather, it appears to be a bug in the fence_ipmilan module, as indicated in the problem description.

Setting component to "cman".

Comment 2 Jan Friesse 2009-02-10 16:10:22 UTC
Can you please test the component from the command line with the -v switch and send the output?

Please run like this:
fence_ipmilan -a 'ip' -l root -p pass -v

Result should be something like:
Rebooting machine @ IPMI:ip...Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power on'...
Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
Done

Comment 3 Flo 2009-02-11 08:33:10 UTC
Sorry, that's not possible at the moment.
It's a production customer system. :-(

We tested fencing (/usr/bin/ipmitool) during our overnight switch from the RH3 to the RH5 cluster (the old STONITH devices are not supported in the RH5 Cluster Suite).

The result was:
* /usr/bin/ipmitool -> correct -> reboot
* fence_ipmilan -> incorrect -> power down

Comment 4 Jan Friesse 2009-02-11 08:59:10 UTC
That's a pity :(

You tested ipmitool with the parameter chassis power reset, right? STONITH devices work the same way (they implement the on/off and reboot actions with the chassis power xxx command).

fence_ipmilan works differently. It first checks whether the chassis is powered on. If it is, it powers the chassis off and checks the status to verify that the chassis is really powered off, and then does the very same for the following power on.

I'm really not able to reproduce this bug. Please check http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig to see whether anything there helps you (for example, a different BMC NIC). Otherwise, it would help if you had an identical device on which you could test fencing.

From a fencing point of view, this bug is not critical, because the main function of fencing (power off) works.
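
The off/status/on flow described above can be sketched as follows. This is only an illustration, not the agent's actual code: `fence_onoff`, the `run` callable and the `FakeBMC` class are hypothetical names, and the status strings are assumed to end in "on" or "off" the way `ipmitool chassis power status` normally reports them.

```python
import time


def fence_onoff(run, poll_interval=0.0, max_polls=10):
    """Reboot via power off, wait for 'off', power on, wait for 'on'.

    `run` is a hypothetical callable: it receives the word that follows
    `chassis power` (for example "status") and returns the tool's output.
    """
    def wait_for(state):
        # Poll the status until it ends with the wanted state or we give up.
        for _ in range(max_polls):
            if run("status").strip().lower().endswith(" " + state):
                return True
            time.sleep(poll_interval)
        return False

    # Only power off if the chassis currently reports "on".
    if run("status").strip().lower().endswith(" on"):
        run("off")
        if not wait_for("off"):
            return False  # the BMC never confirmed "off"
    run("on")
    return wait_for("on")


class FakeBMC:
    """Stand-in for a BMC (assumption: used for illustration only)."""

    def __init__(self, dies_on_off=False):
        self.power = "on"
        self.alive = True
        self.dies_on_off = dies_on_off
        self.log = []

    def __call__(self, action):
        self.log.append(action)
        if not self.alive:
            return ""  # a BMC that shut itself down no longer answers
        if action == "status":
            return "Chassis Power is " + self.power
        if action == "off":
            self.power = "off"
            self.alive = not self.dies_on_off
            return "Chassis Power Control: Down/Off"
        if action == "on":
            self.power = "on"
            return "Chassis Power Control: Up/On"
        raise ValueError("unknown action: " + action)
```

With a `FakeBMC(dies_on_off=True)` the function returns False and the simulated machine stays powered off, which mirrors the symptom reported in this bug.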

Comment 6 Jan Friesse 2009-07-13 08:10:49 UTC
1. No, it isn't. The IPMI fence agent should reboot the machine.
2. It can be, provided somebody can give me more information, because I'm really not able to reproduce it.

If I run:
fence_ipmilan -i bar-08-mm -p pass -l root -v
Rebooting machine @ IPMI:bar-08-mm...Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power on'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power on'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power on'...
Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
Done

As you can see, the machine is powered off and then on. I tested this on RHEL 5.3 with the fence_ipmilan agent from 5.4.
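
For anyone comparing their own -v output with the trace above, a small sketch like the following can pull the sequence of chassis power actions out of the "Spawning:" lines. The function names are hypothetical, and the pattern assumes the log format shown in this comment.

```python
import re

# Matches the action word at the end of each spawned ipmitool command,
# e.g. "... -v chassis power off'..." yields "off".
SPAWN_RE = re.compile(r"Spawning: '.*? chassis power (\w+)'")


def chassis_actions(verbose_output):
    """Return the chassis power actions in the order they were spawned."""
    return SPAWN_RE.findall(verbose_output)


def powered_off_then_on(actions):
    """True if an "off" command is followed later by an "on" command."""
    return "off" in actions and "on" in actions[actions.index("off"):]
```

Note that this only shows which commands the agent sent, not whether the BMC acted on them.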

Are you able to run the fence_ipmilan command and send me the output showing what it is doing (with -v)?

Regards,
  Honza

Comment 7 Jan Friesse 2009-07-16 15:14:10 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=276541:

Comment #13 From  Michael Jansen (michael.jansen.au)

Hi!

I get similar timeout problems, but they are due to a different cause.

I have IBM x3550 servers that I would like to use with IPMI fencing.  The BMC
that comes standard with these machines shares ethernet with linux (eth0).

What happens is that fence_ipmilan tries to fence the other node, it
uses the sequence

perhaps (chassis power status)

chassis power off
chassis power status (wait for this to return "off")
chassis power on
chassis power status (wait for this to return "on")

and the BMC controller for some difficult to understand reason shuts itself
down after the "chassis power off".  I do not know how to correct this
behaviour in the BMC.  But: why does the ipmilan fence not use

chassis power cycle

seeing that there is a positive/negative response from the BMC controller?  I
suspect that would work.  But there does not seem to be an option in
fence_ipmilan to use the "power cycle".

Comment 8 Jan Friesse 2009-07-20 12:12:35 UTC
Created attachment 354335 [details]
Proposed patch

Patch adding power cycle:

The default behaviour (off / get status / on) doesn't work on
some IPMI implementations, because chassis power off also
turns off the IPMI management card, so the following power on
cannot be done automatically. Chassis power cycle, however,
appears to be supported and does what we need: it resets the machine.

The patch adds support for a -M (method) option, which can have
the values:
- onoff - the old behaviour (default)
- cycle - use the new power cycle
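
The difference between the two method values can be sketched as below. This is not the patch itself (the patch is in the attachment): `reboot_actions` and `ipmitool_argv` are hypothetical helpers, the onoff list is the simplified sequence quoted in comment 7, and the ipmitool flags are the ones visible in the -v traces in this report.

```python
def reboot_actions(method="onoff"):
    """Return the chassis power actions a reboot would issue per method."""
    if method == "onoff":
        # Old behaviour: several round trips to the BMC.
        return ["status", "off", "status", "on", "status"]
    if method == "cycle":
        # New behaviour: a single command, so the BMC is never asked to
        # power the chassis off and then answer further requests.
        return ["cycle"]
    raise ValueError("method must be 'onoff' or 'cycle', got %r" % (method,))


def ipmitool_argv(host, user, password, action):
    """Build the ipmitool command line for one chassis power action."""
    return ["/usr/bin/ipmitool", "-I", "lan", "-H", host,
            "-U", user, "-P", password, "chassis", "power", action]
```

For example, `[ipmitool_argv("1.2.3.4", "root", "password", a) for a in reboot_actions("cycle")]` yields a single command line ending in `chassis power cycle`.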

Comment 9 Jan Friesse 2009-07-20 12:14:57 UTC
Created attachment 354336 [details]
Binary version (compiled on RHEL 5.3) of ipmi lan with patch

sha1: 1b923ca1205214c661eb68027c377d91e46ea399  fence_ipmilan.gz

Comment 10 Jan Friesse 2009-07-20 12:16:00 UTC
Can any of you confirm or deny that this patch solves your problem?

Comment 11 Florencia Fotorello 2009-07-28 16:03:45 UTC
Hello Jan,

A customer wants to try this patch, but he needs a procedure to test it.

Could you provide a procedure to test this patch?

Thanks in advance.

Regards,

Florencia

Comment 12 Jan Friesse 2009-07-29 12:37:14 UTC
It depends on whether the customer wants to compile the source code or not.

- With compilation: patch the source code and compile it; the result is fence_ipmilan.
- Without compilation: download https://bugzilla.redhat.com/attachment.cgi?id=354336 and gunzip it.

In both cases you end up with a fence_ipmilan binary. It can be moved directly to /sbin/ (overwriting the existing agent) or, better, to /sbin/fence_ipmilan_new.

The only thing that then needs to be done is to change cluster.conf and add the "method" parameter with the value "cycle" to the IPMI fence agent.

So it can look like:
...

  <fencedevices>
                <fencedevice agent="fence_ipmilan_new" ipaddr="1.2.3.4" login="root" name="ipmifd1" passwd="password" method="cycle" />
...

After running ccs_tool update, it can be tested with fence_node (or from luci).
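
The cluster.conf change can also be scripted. The sketch below uses Python's standard xml.etree.ElementTree; `set_fence_method` is a hypothetical helper, and bumping `config_version` is an assumption based on the usual cluster.conf update workflow, not something stated in this report.

```python
import xml.etree.ElementTree as ET


def set_fence_method(conf_xml, device_name, method="cycle"):
    """Return cluster.conf text with method=... set on the named fencedevice."""
    root = ET.fromstring(conf_xml)
    changed = False
    for dev in root.iter("fencedevice"):
        if dev.get("name") == device_name:
            dev.set("method", method)
            changed = True
    if not changed:
        raise LookupError("no fencedevice named %r" % (device_name,))
    # Assumption: the configuration version should be incremented before
    # the new file is propagated; skipped if the attribute is absent.
    version = root.get("config_version")
    if version is not None and version.isdigit():
        root.set("config_version", str(int(version) + 1))
    return ET.tostring(root, encoding="unicode")
```

ElementTree does not preserve comments or the original formatting, so editing the file by hand, as described above, remains the safer route on a production cluster.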

Comment 13 Florencia Fotorello 2009-07-29 16:17:22 UTC
Thanks, I'll let you know the results.

Regards.

Comment 14 phil 2009-07-30 17:28:14 UTC
I was seeing this same behavior on a brand-new Supermicro motherboard; I am not sure of the model of the Supermicro IPMI module.

The new fence_ipmilan with an updated cluster.conf is working great now.

Comment 15 Florencia Fotorello 2009-08-05 20:58:32 UTC
The new fence_ipmilan works perfectly.

Thanks a lot for your help!

Regards,

--
Florencia Fotorello
Global Support Services
Red Hat Latin America

Comment 16 Florencia Fotorello 2009-08-27 13:14:29 UTC
Hello,

Could you please let me know when the new fence_ipmilan will be available in RHN to use in production systems?

Thanks in advance.

Regards,

Comment 17 Florencia Fotorello 2009-08-31 20:48:57 UTC
Hello,

Could you please confirm me if this fix will be this week in RHEL5.4?

Thanks.

Comment 18 Jan Friesse 2009-09-01 10:26:26 UTC
Created attachment 359372 [details]
Patch committed to RHEL55 git branch

This is the final patch committed to Git (fea471dc31137bb3c2583f369cf3af4a0e7eefcb)

Comment 19 Jan Friesse 2009-09-01 10:34:22 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:
=======
Strange behavior of some hardware IPMI implementations (for example the IBM x3550 and a brand-new Supermicro motherboard).

Consequence:
============
Old behavior:
perhaps (chassis power status)
chassis power off
chassis power status (wait for this to return "off")
chassis power on
chassis power status (wait for this to return "on")

and the BMC controller, for some difficult to understand reason, shuts itself
down after the "chassis power off", so it is not able to power the machine back on.

Fix:
====
Added support for the power cycle command, which does not shut down the BMC controller.

Result:
=======
The old behavior is still the default, so nothing changes without reconfiguration. There is now a new "method" option; setting it to "cycle" selects the new behavior (the IPMI power cycle command is used).

Example of usage:
...

  <fencedevices>
                <fencedevice agent="fence_ipmilan_new" ipaddr="1.2.3.4" login="root" name="ipmifd1" passwd="password" method="cycle" />
...

Comment 20 Jan Friesse 2009-09-01 10:35:36 UTC
(In reply to comment #17)
> Hello,
> 
> Could you please confirm me if this fix will be this week in RHEL5.4?
> 
> Thanks.  

No, it will be in 5.5 and maybe in 5.4.z.

Comment 26 Peter Robinson 2010-03-19 17:24:03 UTC
We're seeing the same issues on the same hardware (IBM 3550) on RHEL-4. Are there plans to backport this patch to that release too?

Comment 28 Aladin 2010-03-26 13:44:21 UTC
We are using ipmi fencing on 2 IBM x3650 nodes. If a fail-over starts, the
passive node keeps fencing the other node over and over, because it is not
getting a "fence success message within 10s (default timeout)". 

Is it possible to increase this timeout, which is supported by
/sbin/fence_ipmilan, in cluster.conf? and if yes how?

Can the option "method=cycle" solve this issue?

Thanx in advance,
Ala' Abu-Sharar.

Comment 29 Jan Friesse 2010-03-29 07:55:24 UTC
(In reply to comment #28)
Hi,

> We are using ipmi fencing on 2 IBM x3650 nodes. If a fail-over starts, the
> passive node keeps fencing the other node over and over, because it is not
> getting a "fence success message within 10s (default timeout)". 
> 
> Is it possible to increase this timeout, which is supported by
> /sbin/fence_ipmilan, in cluster.conf? and if yes how?

it is possible by passing timeout= parameter.

> 
> Can the option "method=cycle" solve this issue?
> 

It depends on your problem. If problem is shut down of BMC, then yes. If problem is really timeout, then it CAN (because instead of series getstatus/poweroff/getstatus/poweron/getstatus only one powercycle is sent).

> Thanx in advance,
> Ala' Abu-Sharar.

Comment 30 Aladin 2010-03-29 17:08:16 UTC
(In reply to comment #29)
> (In reply to comment #28)
> Hi,
> 
> > We are using ipmi fencing on 2 IBM x3650 nodes. If a fail-over starts, the
> > passive node keeps fencing the other node over and over, because it is not
> > getting a "fence success message within 10s (default timeout)". 
> > 
> > Is it possible to increase this timeout, which is supported by
> > /sbin/fence_ipmilan, in cluster.conf? and if yes how?
> 
> it is possible by passing timeout= parameter.
> 
> > 
> > Can the option "method=cycle" solve this issue?
> > 
> 
> It depends on your problem. If problem is shut down of BMC, then yes. If
> problem is really timeout, then it CAN (because instead of series
> getstatus/poweroff/getstatus/poweron/getstatus only one powercycle is sent).
> 
> > Thanx in advance,
> > Ala' Abu-Sharar.    

Thanx Jan,

Yesterday I added the parameter by manually editing cluster.conf on both nodes
as follows, and it worked:

<fence>
	<method name="1">
		<device timeout="20" lanplus="" name="fnc1"/>
	</method>
</fence>

the strange thing is, whenever I start "system-config-cluster" it keeps
complaining about syntax failure in cluster.conf! Which had led me to post
my question, thinking that this parameter is not accepted, since the GUI does
not have a field for timeout, You can never trust GUIs :)

We are using RHEL 5.4 with shipped "system-config-cluster". Do I need
to do anything, or just ignore the error message?

Regards,
Ala'.

Comment 31 errata-xmlrpc 2010-03-30 08:38:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html

Comment 32 Jan Friesse 2010-03-31 11:43:49 UTC
Hi,

> 
> the strange thing is, whenever I start "system-config-cluster" it keeps
> complaining about syntax failure in cluster.conf! Which had led me to post
> my question, thinking that this parameter is not accepted, since the GUI does
> not have a field for timeout, You can never trust GUIs :)
> 
> We are using RHEL 5.4 with shipped "system-config-cluster". Do I need
> to do anything, or just ignore the error message?

Ignore that message. You can file a bug against the system-config-cluster component to get that fixed.

> 
> Regards,
> Ala'.    

Regards,
  Honza
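
To double-check that a hand-edited fragment such as the one in comment 30 is at least well-formed XML, a sketch like the following can be used. `device_attributes` is a hypothetical helper. It only tests XML syntax; the assumption here is that the GUI's complaint comes from its own validation of attributes it does not know about, which this check cannot confirm.

```python
import xml.etree.ElementTree as ET


def device_attributes(fragment):
    """Parse a cluster.conf fragment and return each <device>'s attributes.

    Raises xml.etree.ElementTree.ParseError if the text is not
    well-formed XML.
    """
    root = ET.fromstring(fragment)
    return [dict(dev.attrib) for dev in root.iter("device")]
```

For the fragment from comment 30 this returns one entry containing the timeout, lanplus and name attributes, so the XML itself parses cleanly.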

