Bug 1191709
Summary: | HA VMs fail to start on other hosts after power failure | |
---|---|---|---
Product: | [Retired] oVirt | Reporter: | Siddharth Patil <siddharth>
Component: | ovirt-engine-core | Assignee: | Martin Perina <mperina>
Status: | CLOSED NOTABUG | QA Contact: | Pavel Stehlik <pstehlik>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 3.5 | CC: | bugs, ecohen, gklein, iheim, lsurette, mperina, rbalakri, yeylon
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | infra | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-02-12 10:59:18 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | VDSM logs from vmh-01 and vmh-03, and engine.log (attachment 990615, see description) | |
I've configured a testing cluster with 2 hosts and shared NFS storage, all on CentOS 6.6 with ovirt-engine-3.5.1.1-1.el6 and vdsm-4.16.10-8.gitc937927.el6. I made one of the hosts non-responsive using these scenarios:

1. Turn off the host using IPMI
2. Block the network connection between the host and the engine using iptables
3. Shut down the network device on the host completely

In all 3 cases the non-responsive host was always properly fenced and the HA VM was restarted on the other host. The only strange thing I can find in your logs is that there was an error getting the power status of host vmh-02 and an error shutting the host down properly. So are you sure that your IPMI interface works as expected? Could you please test the following commands:

1. Turn off your server using IPMI:
   fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o off -v -P
2. Wait a few seconds until your server is really powered down
3. Check your server status using IPMI:
   fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o status -v -P

You should receive the following message:

   Chassis power = Off

Thanks

[root@vmh-01 ~]# fence_ipmilan -a 10.9.1.11 -l ADMIN -p XXXXXXXX -o off -v -P
Powering off machine @ IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
ipmilan: Power still on
Failed

[root@vmh-01 ~]# fence_ipmilan -a 10.9.1.11 -l ADMIN -p XXXXXXXX -o status -v -P
Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Chassis power = Off
Done

However, in my case, neither vmh-01 nor vmh-03 were able to connect to the vmh-02 IPMI interface, since power was completely turned off to the server. The test case is complete hardware failure.

If you completely turn off your power, then your IPMI interface will not work and we cannot get the host's status and/or restart it.
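(Editorial note, not part of the original comments.) To make the distinction above concrete, the check below queries each host's BMC from a surviving host, so you can tell in advance whether the fence agent would even be able to report status; it is only a sketch built around the fence_ipmilan invocation already shown, the BMC addresses other than 10.9.1.11 and the IPMI_PASSWORD variable are placeholders, and it assumes the usual fence-agent convention of a non-zero exit status on failure.

#!/bin/bash
# Sketch: probe the IPMI/BMC interface of each host from a surviving peer.
# Only 10.9.1.11 comes from the log excerpt above; the other addresses and
# IPMI_PASSWORD are placeholders to be replaced with your Power Management settings.
declare -A BMC_ADDR=(
  [vmh-01]="10.9.1.10"   # hypothetical BMC IP
  [vmh-02]="10.9.1.11"   # BMC IP used in the output above
  [vmh-03]="10.9.1.12"   # hypothetical BMC IP
)

for host in "${!BMC_ADDR[@]}"; do
    # Assumption: fence_ipmilan exits non-zero when the BMC cannot be reached
    # or the status query fails, as fence agents conventionally do.
    if fence_ipmilan -a "${BMC_ADDR[$host]}" -l ADMIN -p "$IPMI_PASSWORD" -o status -P; then
        echo "$host: BMC reachable, power status query succeeded"
    else
        echo "$host: BMC unreachable -- this is what a complete power loss looks like to the engine"
    fi
done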
If you want to cover this scenario as well, you need another fencing device (independent of your server) which controls the power flow to your server (for example an APC device integrated with your UPS), configured as a secondary fencing device. So can I close the bug (as it's not an oVirt bug)?

In that case, I feel this is a much-needed feature. HA-enabled VMs must restart on another host if we lose all connectivity to a host--that's the whole point of high availability. Other products (e.g. Oracle VM) seem to work around this by reading timestamps off the shared storage, so why can't we?

In oVirt 3.5 we support only this scenario: if you have a cluster with compatibility version >= 3.5, you can enable "Skip fencing if host is connected to storage" in the Fencing Policy tab of the Cluster Detail popup in webadmin. If this feature is enabled, we check whether the host is still connected to storage before executing the PM stop. If the host is still connected, the PM stop is cancelled and the whole Non Responding Treatment finishes with the message: "Host XXX became non responsive and was not restarted due to the Cluster Fencing Policy". But we have no way to really determine whether a host is off when none of its PM devices are accessible. What you want may be covered by the Sanlock Fencing feature (BZ 1086178).

Anyway, I'm closing this bug, because your case is a misuse of the fence agent included in the host. If you want your host to be restarted even when it's completely without power, you will need another external fencing device which controls the power flow to your server (for example an APC-based one).
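(Editorial note, not part of the original comments.) For readers who want to follow the secondary-device suggestion: a switched PDU can be exercised from the command line in the same way fence_ipmilan was tested earlier, before adding it as a secondary fence agent under the host's Power Management settings. This is only a sketch; the PDU address, credentials and outlet number are placeholders, and it assumes the fence_apc agent from the fence-agents package, which talks to the PDU itself rather than to the failed host.

# Query the outlet feeding vmh-02 (placeholders: PDU at 10.9.1.50, outlet 2)
fence_apc -a 10.9.1.50 -l apc -p <PASSWORD> -n 2 -o status

# Power the outlet off and back on -- this keeps working even when the server
# and its BMC are completely dead, because the PDU has its own power and network path.
fence_apc -a 10.9.1.50 -l apc -p <PASSWORD> -n 2 -o off
fence_apc -a 10.9.1.50 -l apc -p <PASSWORD> -n 2 -o on

Once these commands behave as expected, the device can be registered as the secondary fence agent for the host so the engine falls back to it when the IPMI interface is unreachable.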
Created attachment 990615 [details]
VDSM logs from vmh-01 and vmh-03, and engine.log

Description of problem:
HA-enabled VMs fail to start on other hosts after complete failure of a Power Management enabled host.

Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:
1. Create a 3-host cluster (vmh-01, vmh-02 and vmh-03) with Power Management enabled through IPMI. Shared storage through an iSCSI backend.
2. Create a new VM with HA enabled on vmh-02.
3. Turn off power to vmh-02.

Actual results:
The VM is not started on another host within the cluster. The VM status shows as "Unknown" in the Web UI. The following error keeps showing up in the UI: "Host vmh-02 is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued."

Expected results:
The VM should restart on another host in the cluster.

Additional info:
When vmh-02 is up and running, results from fence_ipmilan are as expected:

Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'...
Chassis power = On
Done

Hosts: CentOS 6.6 with oVirt release 3.5.1
ovirt-engine: 3.5.1.1-1.el6

Relevant logs from vmh-01, vmh-03 and ovirt-engine are attached.
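(Editorial note, not part of the original report.) When triaging similar reports, the Non Responding Treatment described above can be traced in the attached engine.log with a quick filter; this is only a convenience sketch and assumes the default log location on the engine host.

# Default engine log location on the oVirt engine host
grep -iE "vmh-02.*(not responding|non responsive|fence)" /var/log/ovirt-engine/engine.log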