Bug 1191709 - HA VMs fail to start on other hosts after power failure
Summary: HA VMs fail to start on other hosts after power failure
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Martin Perina
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2015-02-11 19:35 UTC by Siddharth Patil
Modified: 2015-02-12 10:59 UTC
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-02-12 10:59:18 UTC
oVirt Team: ---
Embargoed:


Attachments
VDSM logs from vmh-01 and vmh-03, and engine.log (1.25 MB, application/zip)
2015-02-11 19:35 UTC, Siddharth Patil

Description Siddharth Patil 2015-02-11 19:35:17 UTC
Created attachment 990615 [details]
VDSM logs from vmh-01 and vmh-03, and engine.log

Description of problem:
HA-enabled VMs fail to start on other hosts after a complete failure of a Power Management-enabled host.

Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:
1. Create a 3-host cluster (vmh-01, vmh-02 and vmh-03) with Power Management enabled through IPMI, using shared storage on an iSCSI backend.
2. Create a new VM with HA enabled on vmh-02
3. Turn off power to vmh-02

Actual results:
The VM is not started on another host within the cluster. The VM status shows as "Unknown" in the Web UI. The following error keeps appearing in the UI:

"Host vmh-02 is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued."

Expected results:
The VM should restart on another host in the cluster.

Additional info:

When vmh-02 is up and running, results from fence_ipmilan are as expected:

Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Chassis power = On
Done 

Hosts: CentOS 6.6 with oVirt release 3.5.1
ovirt-engine: 3.5.1.1-1.el6

Relevant logs from vmh-01, vmh-03 and ovirt-engine are attached.

Comment 1 Martin Perina 2015-02-12 07:33:22 UTC
I've configured a test cluster with 2 hosts and shared NFS storage, all on CentOS 6.6 with ovirt-engine-3.5.1.1-1.el6 and vdsm-4.16.10-8.gitc937927.el6. I made one of the hosts non-responsive using these scenarios:

 1. Turn off the host using IPMI
 2. Block network connection between host and engine using iptables
 3. Shut down the network device on the host completely

In all 3 cases the non-responsive host was properly fenced and the HA VM was restarted on the other host.
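
For reference, scenario 2 can be reproduced with a couple of iptables rules on the host. This is a minimal sketch, with <ENGINE_IP> as a placeholder for your engine's address:

    # On the host: drop all traffic to and from the engine
    iptables -A INPUT -s <ENGINE_IP> -j DROP
    iptables -A OUTPUT -d <ENGINE_IP> -j DROP

(Delete or flush the rules afterwards to restore connectivity.)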

So the only strange thing I can find in your logs is that there was an error getting the power status of host vmh-02 and an error shutting the host down properly.

So are you sure that your IPMI interface works as expected? Could you please test the following commands:

 1. Turn off your server using IPMI

    fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o off -v -P

 2. Wait a few seconds until your server has really powered down

 3. Check your server status using IPMI

    fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o status -v -P


You should receive the following message:

  Chassis power = Off
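
Afterwards you can power the server back on with the same agent, using the "on" action (same placeholders as above):

    fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o on -v -P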

Thanks

Comment 2 Siddharth Patil 2015-02-12 07:53:15 UTC
[root@vmh-01 ~]# fence_ipmilan -a 10.9.1.11 -l ADMIN -p XXXXXXXX -o off -v -P
Powering off machine @ IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
ipmilan: Power still on
Failed
[root@vmh-01 ~]# fence_ipmilan -a 10.9.1.11 -l ADMIN -p XXXXXXXX -o status -v -P
Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Chassis power = Off
Done

However, in my case, neither vmh-01 nor vmh-03 was able to connect to the vmh-02 IPMI interface, since power to the server was completely turned off. The test case is a complete hardware failure.

Comment 3 Martin Perina 2015-02-12 08:32:47 UTC
If you completely turn off your power, then your IPMI interface will not work and we cannot get the status of the host and/or restart it. If you want to cover this scenario as well, you need another fencing device (independent of your server) which controls the power flow to your server (for example an APC switch integrated with your UPS), configured as a secondary fencing device.
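
Such a secondary device can be sanity-checked from the command line just like IPMI. For example, assuming an APC-style switched PDU and the legacy fence agent options (the address, credentials and outlet number are placeholders):

    # Check the state of the PDU outlet feeding the server
    fence_apc -a <PDU_IP> -l <USER> -p <PASSWORD> -n <OUTLET> -o status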

So can I close the bug (as it's not an oVirt bug)?

Comment 4 Siddharth Patil 2015-02-12 10:26:04 UTC
In that case, I feel this is a much-needed feature. HA-enabled VMs must restart on another host if we lose all connectivity to a host--that's the whole point of high availability.

Other products (e.g. Oracle VM) seem to work around this by reading timestamps off the shared storage, so why can't we?

Comment 5 Martin Perina 2015-02-12 10:59:18 UTC
In oVirt 3.5 we support only this scenario: if you have a cluster with compatibility version >= 3.5, you can enable "Skip fencing if host is connected to storage" in the Fencing Policy tab of the Cluster Detail popup in webadmin. If this feature is enabled, we check whether the host is still connected to storage before executing the PM stop. If the host is still connected, the PM stop is cancelled and the whole Non Responding Treatment finishes with the message:

  "Host XXX became non responsive and was not restarted due to the Cluster Fencing Policy"


But we don't have any way to really determine whether a host is off when none of its PM devices are accessible.

What you want may be covered by the Sanlock Fencing feature (BZ1086178).

Anyway, I'm closing this bug, because your case was a misuse of the fence agent included in the host. If you want your host to be restarted even when it's completely without power, you will need another external fencing device which controls the power flow to your server (an APC-based one, for example).

