Description of problem:
The IPMI fence module (fence_ipmilan) in the RHEL 5 Cluster Suite powers the node off instead of rebooting it. When I manually send "power off" and then "power on" through the IPMI CLI, the result is correct (reboot), verified with "power status". So there must be a bug in the fence_ipmilan module.

Output from our /etc/sysconfig/ipmi:

  ## Path: Hardware/IPMI
  ## Description: Enable IPMI_POWEROFF if you want the IPMI poweroff module to be loaded.
  ## Type: yesno
  ## Default: "no"
  ## Config: ipmi
  # Enable IPMI_POWEROFF if you want the IPMI
  # poweroff module to be loaded.
  IPMI_POWEROFF=no

  ## Path: Hardware/IPMI
  ## Description: Enable IPMI_POWERCYCLE if you want the system to be power-cycled on reboot
  ## Type: yesno
  ## Default: "no"
  ## Config: ipmi
  # Enable IPMI_POWERCYCLE if you want the system to be power-cycled (power
  # down, delay briefly, power on) rather than power off, on systems
  # that support such. IPMI_POWEROFF=yes is also required.
  IPMI_POWERCYCLE=no

IPMI information:

  ipmitool> mc info
  Device ID             : 17
  Device Revision       : 1
  Firmware Revision     : 1.29
  IPMI Version          : 2.0
  Manufacturer ID       : 11
  Manufacturer Name     : Unknown (0xb)
  Product ID            : 0 (0x0000)
  Device Available      : yes
  Provides Device SDRs  : yes

Version-Release number of selected component (if applicable):
RHEL 5.2

How reproducible:
Cluster Suite (node fencing configuration)

Steps to Reproduce:
1. Configure Intel IPMI as the fencing device.
2. Cause a network interruption (remove the network cable).
3. Check the machine.

Actual results:
Power off (no reboot)

Expected results:
Reboot (power off, then power on)

Additional info:
The component for this bug was set to "Cluster_Administration", which designates the cluster administration guide. This does not appear to be a documentation bug; rather, it appears to be a bug in the fence_ipmilan module, as indicated in the problem description. Setting component to "cman".
Can you please test the agent from the command line with the -v switch and send the output? Please run it like this:

  fence_ipmilan -a 'ip' -l root -p pass -v

The result should be something like:

  Rebooting machine @ IPMI:ip...Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power off'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power on'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' -v chassis power status'...
  Done
Sorry, that's not possible at the moment. It's a production customer system. :-(

We tested fencing (/usr/bin/ipmitool) during our overnight switch from the RH3 to the RH5 cluster (the old STONITH devices are not supported in the RH5 Cluster Suite). The result was:

* /usr/bin/ipmitool -> correct -> reboot
* fence_ipmilan -> incorrect -> power down
It's sad :(

You tested ipmitool with the parameter chassis power reset, right? STONITH devices work the same way (they implement the on/off and reboot actions with the chassis power xxx command).

fence_ipmilan works differently. It first checks whether the chassis is powered on. If it is, it powers the chassis off, checks the status to confirm that the chassis is really powered off, and then does the same for the following power on.

I'm really not able to reproduce this bug. Please check http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig to see whether anything there helps you (like a different BMC NIC). Otherwise, it would help if you have the same device and can test fencing on it.

From a fencing point of view, this bug is not critical, because the main function of fencing (power off) works.
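The "check status, then confirm" step described above can be sketched as a small polling helper. This is only an illustration, not the agent's actual code: the function names, the retry count, and the POLL_INTERVAL variable are hypothetical, and the sketch assumes that the status command prints a line of the form "Chassis Power is on" or "Chassis Power is off".

```shell
#!/bin/sh
# Hypothetical helpers (not part of fence_ipmilan).

# Map a status line such as "Chassis Power is on" to on/off/unknown.
parse_power_state() {
    case "$1" in
        *"Chassis Power is on"*)  echo on ;;
        *"Chassis Power is off"*) echo off ;;
        *)                        echo unknown ;;
    esac
}

# Poll a status command until it reports the wanted state.
#   $1 = wanted state (on|off), $2 = maximum attempts,
#   remaining arguments = the status command to run.
# Returns 0 once the state matches, 1 if the attempts run out.
wait_for_power_state() {
    wanted=$1
    attempts=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        out=$("$@" 2>/dev/null) || out=""
        if [ "$(parse_power_state "$out")" = "$wanted" ]; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}
```

With a real BMC, the status command would be the same ipmitool call shown in the verbose output, for example: wait_for_power_state off 10 /usr/bin/ipmitool -I lan -H 'ip' -U 'root' -P 'pass' chassis power status. If the BMC stops answering after the power off, the helper simply runs out of attempts.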
1. No, it isn't. fence_ipmilan should reboot the machine.
2. It can be, if somebody can provide me with information, because I'm really not able to reproduce it.

If I run:

  fence_ipmilan -i bar-08-mm -p pass -l root -v

  Rebooting machine @ IPMI:bar-08-mm...Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power off'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power on'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power on'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power on'...
  Spawning: '/usr/bin/ipmitool -I lan -H 'bar-08-mm' -U 'root' -P 'pass' -v chassis power status'...
  Done

You can see that the machine is powered off and then on. I tested this on RHEL 5.3 with the fence_ipmilan agent from 5.4.

Are you able to run the fence_ipmilan command and send me the output showing what it is doing (with -v)?

Regards,
Honza
From https://bugzilla.redhat.com/show_bug.cgi?id=276541:

Comment #13 From Michael Jansen (michael.jansen.au)

Hi! I get similar timeout problems, but they are due to a different cause. I have IBM x3550 servers that I would like to use with IPMI fencing. The BMC that comes standard with these machines shares ethernet with linux (eth0). What happens is that fence_ipmilan tries to fence the other node, it uses the sequence

  perhaps (chassis power status)
  chassis power off
  chassis power status (wait for this to return "off")
  chassis power on
  chassis power status (wait for this to return "on")

and the BMC controller for some difficult to understand reason shuts itself down after the "chassis power off". I do not know how to correct this behaviour in the BMC.

But: why does the ipmilan fence not use

  chassis power cycle

seeing that there is a positive/negative response from the BMC controller? I suspect that would work. But there does not seem to be an option in fence_ipmilan to use the "power cycle".
Created attachment 354335 [details]
Proposed patch

Patch adding power cycle:

The default behaviour (off/get status/on) doesn't work on some IPMI implementations, because chassis power off also turns off the IPMI management card, so the next power on cannot be done automatically. It looks like chassis power cycle is supported, though, and does what we need: it resets the machine.

The patch adds support for a -M (method) option, which can have these values:
- onoff - the old behaviour (default)
- cycle - use the new power cycle
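As a rough illustration of the difference between the two method values, the sketch below prints the ipmitool chassis sub-commands each one would issue for a reboot. It is a dry run under stated assumptions, not the patch itself: fence_reboot_plan is a hypothetical name, the onoff list follows the sequence quoted earlier in this report, and the cycle branch assumes that a single power cycle command is sent.

```shell
#!/bin/sh
# Hypothetical helper: print the chassis sub-commands used for a reboot
# with the given method. It only prints text and never contacts a BMC.
fence_reboot_plan() {
    case "$1" in
        onoff)
            # Old default: status, off, confirm off, on, confirm on.
            printf '%s\n' \
                'chassis power status' \
                'chassis power off' \
                'chassis power status' \
                'chassis power on' \
                'chassis power status'
            ;;
        cycle)
            # New method: one command, so the BMC is never asked to
            # power the chassis off and then respond again afterwards.
            printf '%s\n' 'chassis power cycle'
            ;;
        *)
            echo "unknown method: $1" >&2
            return 1
            ;;
    esac
}
```

Calling fence_reboot_plan onoff lists five steps, while fence_reboot_plan cycle lists one, which is why the cycle method can also help when the whole sequence is slow.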
Created attachment 354336 [details] Binary version (compiled on RHEL 5.3) of ipmi lan with patch sha1: 1b923ca1205214c661eb68027c377d91e46ea399 fence_ipmilan.gz
Can any of you confirm or deny that this patch solves your problem?
Hello Jan, A customer wants to try this patch, but he needs a procedure to test it. Could you provide a procedure to test this patch? Thanks in advance. Regards, Florencia
It depends on whether the customer wants to compile the source code or not.
- With compilation: patch the source code and compile it. The result is fence_ipmilan.
- Without compilation: download https://bugzilla.redhat.com/attachment.cgi?id=354336 and gunzip it.

In both cases you end up with a fence_ipmilan binary. It can be moved directly to /sbin/ (overwriting the existing agent) or, better, moved to /sbin/fence_ipmilan_new.

The only thing left to do is to change cluster.conf and add the "method" parameter with the value "cycle" to the IPMI fence agent. It can look like this:

  ...
  <fencedevices>
    <fencedevice agent="fence_ipmilan_new" ipaddr="1.2.3.4" login="root" name="ipmifd1" passwd="password" method="cycle" />
  ...

After running ccs_tool update, it can be tested with fence_node (or from luci).
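Before pushing the new configuration, it may be worth checking that the edited file really carries the new parameter. The check below is a minimal sketch: uses_cycle_method is a hypothetical helper, and it assumes that the fencedevice element is written on a single line, as in the example above.

```shell
#!/bin/sh
# Hypothetical pre-check: succeed only if the given cluster.conf has a
# fencedevice line that names the renamed agent and sets method="cycle".
# Assumes each <fencedevice .../> element sits on one line.
uses_cycle_method() {
    grep 'agent="fence_ipmilan_new"' "$1" | grep 'method="cycle"' >/dev/null
}
```

For example, uses_cycle_method /etc/cluster/cluster.conf && echo "method=cycle is set" can be run before ccs_tool update and the fence_node test.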
Thanks, I'll let you know the results. Regards.
I was seeing this same behavior on a brand new supermicro motherboard, I am not sure the model of the supermicro IPMI module. The new fence_ipmilan with an updated cluster.conf is working great now.
The new fence_ipmilan works perfectly. Thanks a lot for your help! Regards, -- Florencia Fotorello Global Support Services Red Hat Latin America
Hello, Could you please let me know when the new fence_ipmilan will be available in RHN to use in production systems? Thanks in advance. Regards,
Hello, Could you please confirm whether this fix will be in RHEL 5.4 this week? Thanks.
Created attachment 359372 [details]
Patch committed to RHEL55 git branch

This is the final patch committed to GIT (fea471dc31137bb3c2583f369cf3af4a0e7eefcb)
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

Cause:
=======
Strange behavior of some hardware IPMI implementations (for example IBM x3550, brand new Supermicro motherboard).

Consequence:
============
Old behavior:

  perhaps (chassis power status)
  chassis power off
  chassis power status (wait for this to return "off")
  chassis power on
  chassis power status (wait for this to return "on")

and the BMC controller, for some difficult to understand reason, shuts itself down after the "chassis power off", so it is not able to power the machine on again.

Fix:
====
Add support for the power cycle command, which doesn't shut down the BMC controller.

Result:
=======
The old behavior is the default, so nothing changes without reconfiguration. But there is now a new method option, which can have the value cycle; this selects the new behavior (use the IPMI power cycle command).

Example of usage:

  ...
  <fencedevices>
    <fencedevice agent="fence_ipmilan_new" ipaddr="1.2.3.4" login="root" name="ipmifd1" passwd="password" method="cycle" />
  ...
(In reply to comment #17)
> Hello,
>
> Could you please confirm me if this fix will be this week in RHEL5.4?
>
> Thanks.

No, it will be in 5.5 and maybe in 5.4.z.
We're seeing the same issues on the same hardware (IBM 3550) on RHEL-4. Are there plans to backport this patch to that too?
We are using ipmi fencing on 2 IBM x3650 nodes. If a fail-over starts, the passive node keeps fencing the other node over and over, because it is not getting a "fence success message within 10s (default timeout)". Is it possible to increase this timeout, which is supported by /sbin/fence_ipmilan, in cluster.conf? and if yes how? Can the option "method=cycle" solve this issue? Thanx in advance, Ala' Abu-Sharar.
(In reply to comment #28)

Hi,

> We are using ipmi fencing on 2 IBM x3650 nodes. If a fail-over starts, the
> passive node keeps fencing the other node over and over, because it is not
> getting a "fence success message within 10s (default timeout)".
>
> Is it possible to increase this timeout, which is supported by
> /sbin/fence_ipmilan, in cluster.conf? and if yes how?

It is possible by passing the timeout= parameter.

> Can the option "method=cycle" solve this issue?

It depends on your problem. If the problem is the BMC shutting down, then yes. If the problem really is the timeout, then it CAN (because only one powercycle is sent instead of the series getstatus/poweroff/getstatus/poweron/getstatus).

> Thanx in advance,
> Ala' Abu-Sharar.
(In reply to comment #29)
> (In reply to comment #28)
> Hi,
>
> > We are using ipmi fencing on 2 IBM x3650 nodes. If a fail-over starts, the
> > passive node keeps fencing the other node over and over, because it is not
> > getting a "fence success message within 10s (default timeout)".
> >
> > Is it possible to increase this timeout, which is supported by
> > /sbin/fence_ipmilan, in cluster.conf? and if yes how?
>
> it is possible by passing timeout= parameter.
>
> > Can the option "method=cycle" solve this issue?
>
> It depends on your problem. If problem is shut down of BMC, then yes. If
> problem is really timeout, then it CAN (because instead of series
> getstatus/poweroff/getstatus/poweron/getstatus only one powercycle is sent).
>
> > Thanx in advance,
> > Ala' Abu-Sharar.

Thanx Jan,

Yesterday I added the parameter by editing cluster.conf manually on both nodes, as follows, and it worked:

  <fence>
    <method name="1">
      <device timeout="20" lanplus="" name="fnc1"/>
    </method>
  </fence>

The strange thing is that whenever I start "system-config-cluster", it keeps complaining about a syntax failure in cluster.conf! That is what led me to post my question, thinking that this parameter is not accepted, since the GUI does not have a field for the timeout. You can never trust GUIs :)

We are using RHEL 5.4 with the shipped "system-config-cluster". Do I need to do anything, or can I just ignore the error message?

Regards,
Ala'.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
Hi,

> the strange thing is, whenever I start "system-config-cluster" it keeps
> complaining about syntax failure in cluster.conf! Which had led me to post
> my question, thinking that this parameter is not accepted, since the GUI does
> not have a field for timeout, You can never trust GUIs :)
>
> We are using RHEL 5.4 with shipped "system-config-cluster". Do I need
> to do anything, or just ignore the error message?

Ignore that message. You can file a bug against the system-config-cluster component to get it fixed.

> Regards,
> Ala'.

Regards,
Honza