Bug 1215434

Summary: Hosted Engine failed to migrate due to network failure
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Stuart James <stuartjames>
Component: General
Assignee: Martin Sivák <msivak>
Status: CLOSED WONTFIX
QA Contact: Artyom <alukiano>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: ---
CC: alukiano, amureini, bugs, dfediuck, ecohen, gklein, istein, lsurette, nashok, rbalakri, rgolan, stuartjames, yeylon
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: The HA agent's call to repoStats stalls because the ISO domain exported by the engine is unreachable while the engine VM itself has not started yet.
Consequence: The agent stalls and does not start the VM.
Fix: Avoid exporting an ISO domain from the engine VM altogether; export it from a node that is expected to be up before the engine is.
Result: The engine VM fails to start on this node.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-16 10:59:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Stuart James 2015-04-26 14:52:01 UTC
Description of problem:

After setting up a fresh two-node cluster by running hosted-engine deploy on both node1 and node2, with an external NFS server for the engine storage, the machine hosting the engine failed due to a network issue. As a result, the engine failed to return to operation on the remaining node in the cluster.

Both nodes were installed as per the instructions and are running ovirt-ha-broker and ovirt-ha-agent.


Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.3.1-1.el6.noarch
ovirt-engine-sdk-python-3.5.1.0-1.el6.noarch
ovirt-hosted-engine-setup-1.2.2-1.el6.noarch
ovirt-hosted-engine-ha-1.2.5-1.el6.noarch
ovirt-release35-003-1.noarch
CentOS 6.6 (latest)

How reproducible:
On demand: unplug the network cable on the node hosting the engine and wait for failover to occur.


Steps to Reproduce:
1. Unplug the network cable of the node hosting the "hosted engine"
2. Watch the other node's logs to see whether takeover occurs


Actual results:
Hosted-engine does not recover by failing over to the remaining node


Expected results:
The other node should take over the hosted engine; in this case it did not.

Additional info:
The node not hosting the engine does not appear to run any health check that determines 1) whether it can talk to the engine, 2) whether it can talk to the other node, or 3) whether it can talk to the gateway (the gateway check does happen).

But from the logs (agent.log / broker.log) it does not seem to do anything more:

Thread-2::INFO::2015-04-26 15:49:25,814::ping::52::ping.Ping::(action) Successfully pinged 192.168.0.1
Thread-3::INFO::2015-04-26 15:49:35,801::mgmt_bridge::59::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-6::INFO::2015-04-26 15:50:09,957::engine_health::63::engine_health.CpuLoadNoEngine::(action) VM not on this host
Thread-5::INFO::2015-04-26 15:50:19,782::cpu_load_no_engine::153::cpu_load_no_engine.EngineHealth::(update_stat_file) VM not on this host
Thread-5::INFO::2015-04-26 15:50:19,782::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0071, engine=0.0000, non-engine=0.0071
Thread-2::INFO::2015-04-26 15:50:25,953::ping::52::ping.Ping::(action) Successfully pinged 192.168.0.1
Thread-3::INFO::2015-04-26 15:50:36,463::mgmt_bridge::59::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports

Comment 1 Stuart James 2015-04-26 15:11:19 UTC
MainThread::INFO::2015-04-26 13:14:37,810::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1346 (Sun Apr 26 13:04:14 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'alive': True, 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1346}
MainThread::INFO::2015-04-26 13:14:37,810::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 15014.0, 'maintenance': False, 'cpu-load': 0.0079000000000000008, 'gateway': True}
MainThread::INFO::2015-04-26 13:14:37,810::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host node1.jamesnet.ca (id 1)
MainThread::INFO::2015-04-26 13:21:51,269::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-26 13:21:51,269::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1346 (Sun Apr 26 13:04:14 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'alive': False, 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1346}
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 15014.0, 'maintenance': False, 'cpu-load': 0.0071999999999999998, 'gateway': True}
MainThread::INFO::2015-04-26 14:27:21,288::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive


MainThread::ERROR::2015-04-26 15:01:41,035::hosted_engine::632::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=1ecfc7fe-0728-43e6-86e8-1d47155c2427, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 159, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested)\
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 301, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 508, in _initialize_vdsm
    self._sd_path = env_path.get_domain_path(self._config)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 58, in get_domain_path
    .format(sd_uuid, parent))
Exception: path to storage domain 1ecfc7fe-0728-43e6-86e8-1d47155c2427 not found in /rhev/data-center/mnt
MainThread::WARNING::2015-04-26 15:01:41,060::hosted_engine::339::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=1ecfc7fe-0728-43e6-86e8-1d47155c2427, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-04-26 15:01:41,060::hosted_engine::342::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 320, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 633, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=1ecfc7fe-0728-43e6-86e8-1d47155c2427, host_id=2): timeout during domain acquisition


The above logs imply, firstly, that it does not have enough data (to decide whether to take over / which node is down in the network) and, secondly, that something to do with monitoring fails (this is 2 hours after I first unplugged the cable on the second node).

If it does not have enough data to decide, then what is needed? I'd suggest that if it can talk to the gateway, it knows the node hosting the engine is down (by the fact that it can ping the gateway and has a local link), and the engine itself is non-responsive, then it should probably take over.
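
In rough Python terms, the rule I am suggesting would be something like the following (a sketch of the suggestion only, not what the agent actually implements; the function and parameter names are made up):

def should_take_over(gateway_ok, engine_responsive, other_host_reporting):
    # Proposed takeover rule (illustrative only): if this node still reaches
    # the gateway, the engine is unresponsive, and the node hosting the engine
    # has stopped reporting, assume that node is down and start the VM here.
    return gateway_ok and not engine_responsive and not other_host_reporting

# example: gateway reachable, engine dead, hosting node silent -> take over
print(should_take_over(True, False, False))   # True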

Even more concerning: after I plug node1 back in, the engine is still not alive.


[root@node1 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : node1.jamesnet.ca
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1346
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=1346 (Sun Apr 26 13:04:14 2015)
	host-id=1
	score=2400
	maintenance=False
	state=EngineUp


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : node2.jamesnet.ca
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 2124
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=2124 (Sun Apr 26 13:04:45 2015)
	host-id=2
	score=2400
	maintenance=False
	state=EngineDown
[root@node1 ~]# ping 192.168.0.233
PING 192.168.0.233 (192.168.0.233) 56(84) bytes of data.
From 192.168.0.101 icmp_seq=2 Destination Host Unreachable
From 192.168.0.101 icmp_seq=3 Destination Host Unreachable

[root@node1 ~]# hosted-engine --check-liveliness
No handlers could be found for logger "otopi.__main__"
Hosted Engine is not up!

Effectively, by pulling one plug I have been able to knock out the engine.

Prior to this test I was able to migrate the engine between the nodes, put the engine and a node into maintenance (interface + local maintenance), reboot the servers and reactivate them.

To recover (from node1)
[root@node1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
VM exists and is down, destroying it
Machine destroyed

c1ee0fdf-5270-456d-99eb-efa41d614008
	Status = WaitForLaunch
	nicModel = rtl8139,pv
	emulatedMachine = pc
	pid = 0
	vmName = HostedEngine
	devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '95ea6be9-2f60-4de9-9d6e-652b6baf1609', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/mnt/iso/CentOS-6.6-x86_64-minimal.iso', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '32e0ab9b-673b-495a-af41-c626a88a269f', 'imageID': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'specParams': {}, 'readonly': 'false', 'domainID': '1ecfc7fe-0728-43e6-86e8-1d47155c2427', 'optional': 'false', 'deviceId': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:7c:bc:0d', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '59969f54-d52d-41df-b42d-bd2fea88c765', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '2f6fc4a7-67d1-4842-8a62-6719b43edffd', 'alias': 'console0'}]
	smp = 2
	vmType = kvm
	display = qxl
	displaySecurePort = -1
	memSize = 4096
	displayPort = -1
	cpuType = Opteron_G5
	clientIp = 
	guestDiskMapping = {}
	displayIp = 0
	spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir


After waiting a couple of minutes I check the local processes; it seems as though it has issues:

[root@node1 ovirt-hosted-engine-ha]# ps aux | grep qemu
root      6653  0.0  0.0 103244   880 pts/0    S+   16:09   0:00 grep qemu
qemu     11227  0.0  0.0      0     0 ?        Z<   12:52   0:00 [supervdsmServer] <defunct>

[root@node2 ovirt-hosted-engine-ha]# ps aux | grep qemu
qemu     18747  0.0  0.0      0     0 ?        Z<   12:50   0:00 [supervdsmServer] <defunct>
root     25028  0.0  0.0 103244   880 pts/0    S+   16:10   0:00 grep qemu

Comment 2 Stuart James 2015-04-26 15:29:39 UTC
On rebooting node2, the shutdown process hangs while trying to unmount the storage domain; I had to press and hold the power button to restart it. Prior to the reboot it seems to have had an open file handle issue with the NFS server on the engine (not surprising, seeing as the engine is down):


Apr 26 15:58:03 node2 kernel: nfs: server engine.jamesnet.ca not responding, timed out


I assume the defunct process relates to existing bug 841486; I still get this
after a reboot of the node.


On rebooting node1 (where the engine was originally hosted) I also get the same NFS error and have to manually reboot (maybe it would time out eventually, I do not know).

On node2, once node1 is back online (after the reboot):

MainThread::INFO::2015-04-26 13:14:37,810::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1346 (Sun Apr 26 13:04:14 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'alive': True, 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1346}
MainThread::INFO::2015-04-26 13:14:37,810::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 15014.0, 'maintenance': False, 'cpu-load': 0.0079000000000000008, 'gateway': True}
MainThread::INFO::2015-04-26 13:14:37,810::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host node1.jamesnet.ca (id 1)
MainThread::INFO::2015-04-26 13:21:51,269::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-26 13:21:51,269::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1346 (Sun Apr 26 13:04:14 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'alive': False, 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1346}
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 15014.0, 'maintenance': False, 'cpu-load': 0.0071999999999999998, 'gateway': True}
MainThread::INFO::2015-04-26 14:27:21,288::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive


So to summarise, I had to reboot both nodes to get the hosted engine back up; node2 appears to decide to take it up only after it has re-established communication with node1 (I rebooted node2 first, followed by node1).

Comment 3 Doron Fediuck 2015-04-27 11:05:03 UTC
Hi James,
from an initial look it seems that the other node (node2) was trying to take over, but lost its storage connection as well:

Exception: path to storage domain 1ecfc7fe-0728-43e6-86e8-1d47155c2427 not found in /rhev/data-center/mnt

Since you're using NFS, the kernel lock-up you experienced is expected (unless you restore connectivity you may have processes going into D state). However, the main point to understand is that the nodes do not ping each other; they work through the shared storage. If your NFS storage is available to node2, it should have gotten a lease and started the HE VM.
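
As a rough analogy of how that arbitration works (plain POSIX file locking on a made-up lease file instead of the real sanlock lease on the hosted-engine storage domain; this is not the actual mechanism, only the idea):

import errno
import fcntl

# Analogy only: sanlock leases on the shared storage decide which host may
# start the engine VM. This sketch uses a POSIX lock on an ordinary file and
# a made-up path purely to illustrate the principle.
LEASE_FILE = "/tmp/hosted-engine-lease-example"

def try_acquire_lease(path=LEASE_FILE):
    handle = open(path, "a+")
    try:
        fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle              # keep it open for as long as we hold the lease
    except IOError as err:
        handle.close()
        if err.errno in (errno.EACCES, errno.EAGAIN):
            return None            # another host already holds the lease
        raise

lease = try_acquire_lease()
if lease is not None:
    print("lease held here: this host may start the HostedEngine VM")
else:
    print("lease held elsewhere: do not start the VM here")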

Comment 4 Stuart James 2015-04-27 11:15:13 UTC
Node2 is currently hosting the engine.

[root@node2 ovirt-hosted-engine-ha]# showmount -e 192.168.0.222
Export list for 192.168.0.222:
/mnt/raid/ovirt      192.168.0.0/24
/mnt/raid/selfhosted 192.168.0.0/24

[root@node2 ovirt-hosted-engine-ha]# hosted-engine --check-liveliness
Hosted Engine is up!


Next I pull the plug on node2 and run the following on node1:



[root@node1 vdsm]# showmount -e 192.168.0.222
Export list for 192.168.0.222:
/mnt/raid/ovirt      192.168.0.0/24
/mnt/raid/selfhosted 192.168.0.0/24


[root@node1 vdsm]# hosted-engine --check-liveliness
No handlers could be found for logger "otopi.__main__"
Hosted Engine is not up!

Looking at the logs:

MainThread::INFO::2015-04-27 12:10:47,766::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node2.jamesnet.ca (id: 2, score: 2400)
MainThread::INFO::2015-04-27 12:10:57,816::states::458::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (2400), attempting to start engine VM
MainThread::INFO::2015-04-27 12:10:57,826::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430133057.83 type=state_transition detail=EngineDown-EngineStart hostname='node1.jamesnet.ca'
MainThread::INFO::2015-04-27 12:10:57,915::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent
MainThread::INFO::2015-04-27 12:10:58,489::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStart (score: 2400)
MainThread::INFO::2015-04-27 12:10:58,546::hosted_engine::917::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2015-04-27 12:10:58,594::hosted_engine::929::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean
MainThread::INFO::2015-04-27 12:10:58,594::hosted_engine::884::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2015-04-27 12:10:58,889::hosted_engine::890::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: 
c1ee0fdf-5270-456d-99eb-efa41d614008
	Status = WaitForLaunch
	nicModel = rtl8139,pv
	emulatedMachine = pc
	pid = 0
	vmName = HostedEngine
	devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '95ea6be9-2f60-4de9-9d6e-652b6baf1609', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/mnt/iso/CentOS-6.6-x86_64-minimal.iso', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '32e0ab9b-673b-495a-af41-c626a88a269f', 'imageID': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'specParams': {}, 'readonly': 'false', 'domainID': '1ecfc7fe-0728-43e6-86e8-1d47155c2427', 'optional': 'false', 'deviceId': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:7c:bc:0d', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '59969f54-d52d-41df-b42d-bd2fea88c765', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '2f6fc4a7-67d1-4842-8a62-6719b43edffd', 'alias': 'console0'}]
	smp = 2
	vmType = kvm
	display = qxl
	displaySecurePort = -1
	memSize = 4096
	displayPort = -1
	cpuType = Opteron_G5
	clientIp = 
	guestDiskMapping = {}
	displayIp = 0
	spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir

MainThread::INFO::2015-04-27 12:10:58,889::hosted_engine::891::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: 
MainThread::ERROR::2015-04-27 12:10:58,889::hosted_engine::904::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2015-04-27 12:10:58,903::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430133058.9 type=state_transition detail=EngineStart-EngineStarting hostname='node1.jamesnet.ca'
MainThread::INFO::2015-04-27 12:10:58,960::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2015-04-27 12:10:59,522::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStarting (score: 2400)
MainThread::INFO::2015-04-27 12:11:09,584::states::721::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2015-04-27 12:11:09,593::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Mon Apr 27 12:21:09 2015 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
MainThread::INFO::2015-04-27 12:11:10,136::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStarting (score: 2400)


[root@node1 ovirt-hosted-engine-ha]# hosted-engine --check-liveliness
No handlers could be found for logger "otopi.__main__"
Hosted Engine is not up!


This is what I expected to happen yesterday; the secondary node certainly had access to the NFS server, as it is hosted entirely separately. The best I can do is try to re-create the issue; maybe there was some stale lock on an NFS volume.

Comment 5 Stuart James 2015-04-27 11:17:46 UTC
Sorry, I pasted the wrong result for the liveliness check:

[root@node1 ovirt-hosted-engine-ha]# hosted-engine --check-liveliness
Hosted Engine is up!

(this of course took a few minutes once node1 decided to take over)

Comment 6 Stuart James 2015-04-27 14:03:26 UTC
At 14:36, with the engine running on node1, I unplug node1's network and monitor the logs on node2.

For reference:
192.168.0.222 NFS server (self hosted engine / storage domain for virts)
192.168.0.233 Hosted Engine


agent.log
MainThread::INFO::2015-04-27 14:33:40,169::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 16541776
MainThread::INFO::2015-04-27 14:34:48,787::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140572602034064
MainThread::INFO::2015-04-27 14:34:48,787::hosted_engine::458::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2015-04-27 14:34:48,882::hosted_engine::541::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 2 is acquired (file: /rhev/data-center/mnt/192.168.0.222:_mnt_raid_selfhosted/1ecfc7fe-0728-43e6-86e8-1d47155c2427/images/91fe3737-3a52-4d82-8db3-491a9e1cc444/e7acc9ed-5bde-47c2-baaf-bbfe6de6aba5)
MainThread::INFO::2015-04-27 14:35:09,886::hosted_engine::575::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
MainThread::INFO::2015-04-27 14:35:09,917::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2015-04-27 14:35:09,917::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=80019 (Mon Apr 27 14:35:08 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 80019}
MainThread::INFO::2015-04-27 14:35:09,917::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 14984.0, 'maintenance': False, 'cpu-load': 0.0085000000000000006, 'gateway': True}
MainThread::INFO::2015-04-27 14:35:09,918::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430141709.92 type=state_transition detail=StartState-ReinitializeFSM hostname='node2.jamesnet.ca'
MainThread::INFO::2015-04-27 14:35:09,997::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? sent
MainThread::INFO::2015-04-27 14:35:10,562::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2015-04-27 14:35:20,643::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430141720.64 type=state_transition detail=ReinitializeFSM-EngineDown hostname='node2.jamesnet.ca'
MainThread::INFO::2015-04-27 14:35:20,731::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? sent
MainThread::INFO::2015-04-27 14:35:21,262::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:35:31,333::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:35:31,830::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:35:41,902::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:35:42,425::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:35:52,495::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:35:52,947::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:03,030::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:36:03,474::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:13,539::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host node1.jamesnet.ca (id 1)
MainThread::INFO::2015-04-27 14:36:14,017::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:14,018::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)

(As of 14:57 there are no further entries in agent.log beyond this point.)


broker.log
Thread-35286::INFO::2015-04-27 14:36:09,658::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-35285::INFO::2015-04-27 14:36:09,659::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-35286::INFO::2015-04-27 14:36:09,660::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-04-27 14:36:09,675::mem_free::53::mem_free.MemFree::(action) memFree: 14977
Thread-35287::INFO::2015-04-27 14:36:18,744::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-35288::INFO::2015-04-27 14:36:19,734::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-5::INFO::2015-04-27 14:36:47,838::cpu_load_no_engine::153::cpu_load_no_engine.EngineHealth::(update_stat_file) VM not on this host
Thread-5::INFO::2015-04-27 14:36:47,838::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0106, engine=0.0000, non-engine=0.0106
Thread-3::INFO::2015-04-27 14:36:48,339::mgmt_bridge::59::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-35209::INFO::2015-04-27 14:36:50,365::engine_health::63::engine_health.CpuLoadNoEngine::(action) VM not on this host
Thread-35205::INFO::2015-04-27 14:36:50,517::ping::52::ping.Ping::(action) Successfully pinged 192.168.0.1
Thread-35208::INFO::2015-04-27 14:36:50,614::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0105, engine=0.0000, non-engine=0.0105
Thread-5::INFO::2015-04-27 14:37:47,947::cpu_load_no_engine::153::cpu_load_no_engine.EngineHealth::(update_stat_file) VM not on this host
Thread-5::INFO::2015-04-27 14:37:47,948::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0047, engine=0.0000, non-engine=0.0047
Thread-35205::INFO::2015-04-27 14:37:50,606::ping::52::ping.Ping::(action) Successfully pinged 192.168.0.1
Thread-35208::INFO::2015-04-27 14:37:50,735::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0045, engine=0.0000, non-engine=0.0045
Thread-35209::INFO::2015-04-27 14:37:50,737::engine_health::63::engine_health.CpuLoadNoEngine::(action) VM not on this host

(broker.log continues saying the same thing)



[root@node2 ovirt-hosted-engine-ha]# ping 192.168.0.233
PING 192.168.0.233 (192.168.0.233) 56(84) bytes of data.
^C
--- 192.168.0.233 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 922ms



[root@node2 ovirt-hosted-engine-ha]# showmount -e 192.168.0.222
Export list for 192.168.0.222:
/mnt/raid/ovirt      192.168.0.0/24
/mnt/raid/selfhosted 192.168.0.0/24

[root@node2 ovirt-hosted-engine-ha]# findmnt 
TARGET                                                          SOURCE                                  FSTYPE      OPTIONS
(truncated)
├─/rhev/data-center/mnt/192.168.0.222:_mnt_raid_selfhosted      192.168.0.222:/mnt/raid/selfhosted      nfs         rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600
├─/rhev/data-center/mnt/192.168.0.222:_mnt_raid_ovirt           192.168.0.222:/mnt/raid/ovirt           nfs         rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600
└─/rhev/data-center/mnt/engine.jamesnet.ca:_var_lib_exports_iso engine.jamesnet.ca:/var/lib/exports/iso nfs         rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,soft,nosharecache,proto=tcp,timeo=600,r
[root@node2 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/192.168.0.222\:_mnt_raid_selfhosted/
1ecfc7fe-0728-43e6-86e8-1d47155c2427  __DIRECT_IO_TEST__



I have waited 15 minutes and node2 has not taken over the hosted engine. I assume it may still be registered as running on node1 at the moment, but even if it were, node1 should know it has no connectivity to the gateway (I realise it cannot really tell anyone anything about this, as its network is down).

What does node2 need in order to be convinced to take over? Is there some sort of quorum vote? I would have thought that knowing it has access to the gateway and that the engine is down would be sufficient (even if it does not check the host availability of node1 directly, it knows node1 is down based on the logs as well).


Regarding my previous testing, a point was made that it could not talk to the NFS server: it certainly cannot reach the NFS server on the engine for the ISO_DOMAIN, but it can reach the self-hosted storage domain and the storage domain in the default data center cluster.

The following freezes because the engine is not alive (I believe on reboot it locks the box up):

"ls /rhev/data-center/mnt/engine.jamesnet.ca:_var_lib_exports_iso" (freezes on console)

Comment 7 Martin Sivák 2015-05-05 14:12:47 UTC
Hi,

let's see if I got your network topology right:

gateway 192.168.0.101
node1 ???
node2 192.168.0.233
nfs 192.168.0.222


Now first a short explanation of how the synchronization works:

- each host publishes its status periodically to the NFS storage "whiteboard"
- each host reads all status reports and uses them to evaluate the situation
- hosts are using relative timestamps in their reports

We need at least two reports from any other host to be able to establish the time reference properly, or a couple of minutes with no activity to consider the remote side to be down. That was the reason for:

14:27:21,288::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive

So we are not trying to establish the liveness of other hosts directly; we rely on their own reports, (relative) timestamps and sanlock.
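
In very rough Python, the reader's side of that logic looks something like this (a simplified sketch only, not the real agent code; the 300-second timeout is an assumed value for illustration):

import time

# local bookkeeping for one remote host's whiteboard entries:
# (local_read_time, remote_relative_timestamp) pairs, newest last
HOST_TIMEOUT = 300   # assumed value; the real agent's timeout differs

def classify(readings, now=None):
    # Returns 'unknown' until we have two readings (the "not enough data to
    # decide which hosts are alive" state seen in the logs), 'alive' while the
    # remote relative timestamp keeps advancing, and 'dead' once it has been
    # static for HOST_TIMEOUT seconds of local time. Sketch only; the real
    # state machine in ovirt-hosted-engine-ha is more involved.
    now = time.time() if now is None else now
    if len(readings) < 2:
        return "unknown"
    (t_prev, ts_prev), (t_last, ts_last) = readings[-2], readings[-1]
    if ts_last > ts_prev:
        return "alive"              # remote clock is ticking: host still publishes updates
    if now - t_last > HOST_TIMEOUT:
        return "dead"               # static for too long: treat its reports as stale
    return "unknown"

# example: host-ts stuck at 1346 (as in the report above) eventually reads as dead
readings = [(0.0, 1346), (60.0, 1346)]
print(classify(readings, now=400.0))   # -> 'dead'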

Now what happens when the whole network dies:

Reports from node1 stop coming to the NFS storage when you unplug the cable. Depending on how hard the NFS mount freezes, some of the following might happen:

- NFS freezes hard and the kernel freezes the broker process
- agent might try to shut down the engine, but that will freeze (temporarily, but long enough) when sanlock tries to release the lock
- agent tries to get updates from the broker and freezes waiting for response
- the engine VM freezes or dies
- sanlock reboots the machine after some time, because it lost synchronization with the lease storage on NFS

Node 2 should keep running and see that there were no updates from node 1 for some time. At that time it should try to start the VM, but it will very probably fail during the first few attempts because sanlock protects the engine's hard drive and the timeout for sanlock leases is quite long.

The fact that you do not see any updates (or is it reporting the same thing all the time?) in agent.log on node2 would mean that something locked up the agent process there as well. It has to report status every 10 seconds when it is running.

I think the following might be the reason indeed (some indirect freezes):

Apr 26 15:58:03 node2 kernel: nfs: server engine.jamesnet.ca not responding, timed out

[root@node2 ovirt-hosted-engine-ha]# ps aux | grep qemu
qemu     18747  0.0  0.0      0     0 ?        Z<   12:50   0:00 [supervdsmServer] <defunct>
root     25028  0.0  0.0 103244   880 pts/0    S+   16:10   0:00 grep qemu 



The first thing that comes to my mind here is the NFS mount for the ISO domain. Losing network on the host that runs the engine VM then effectively means that all hosts have lost a link to one NFS share and that might cause a chain reaction of freezes for services that talk to each other (supervdsm / vdsm / ovirt-ha-agent).

Comment 8 Stuart James 2015-05-05 16:24:38 UTC
The network topology is as follows

gateway - 192.168.0.1 (router)
DNS/IPA - 192.168.0.131 (VM on .222 box)
node1 - 192.168.0.101 (physical server)
node2 - 192.168.0.102 (physical server)
NFS server - 192.168.0.222 (physical server)
Engine - 192.168.0.233 (VM on node1 or node2)

All times on all hosts are synced to a common NTP server.

As for the defunct process, this appears even after a fresh reboot, when the server is up and running in a good state with access to the engine's NFS ISO domain / storage domain (see bug 841486).


In my second test, failing the engine back over to node2: node2 was aware that the engine was down and also had access to the shared storage; however, it did not do anything to take over.

14:36:14,017::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:14,018::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)

(As I said above, there was no further output in agent.log.)

The broker.log says it is not hosting the engine. What exactly is needed at this point for node2 to recover the engine? (It knows the engine is down, it has access to the shared storage, and it can talk to the gateway.) Based on the last few entries in agent.log (before it stopped entirely), it claimed not to have enough information; what more information is needed?

Comment 9 Martin Sivák 2015-05-06 13:29:14 UTC
> (as i said above it did not have any further output into agent.log)

Which is the issue here. It means that the main loop in the agent stopped working. The broker only monitors and talks to storage. It is the agent that decides when to take over. And nothing will happen while the agent is frozen...

So we need to find out why the agent is frozen. And I suspect it is because the ISO domain is provided by the engine VM, which is not reachable from node2.

Comment 10 Doron Fediuck 2015-10-25 15:32:53 UTC
It's been several months now.
Is this still reproducible on recent (3.5.5) ovirt versions?

Comment 11 Stuart James 2015-11-03 16:21:33 UTC
Yes, it is an issue with NFS and the failover of the HA component (probably to do with a stale NFS lock). Maybe you should update the documentation to say NFS is not supported for an HA configuration; I assume iSCSI is the preferred route and NFS will always be problematic.

Comment 12 Martin Sivák 2015-11-04 09:52:59 UTC
Hi Allon,

is NFS with locking supported? I suppose it might break our own internal storage mechanisms badly for vdsm too.

Comment 13 Allon Mureinik 2015-11-08 08:27:33 UTC
Admittedly I only skimmed the above comments, but a stale NFS mount has nothing to do with any explicit locking a client may perform.

Regardless, the overarching principle in VDSM is that it owns the storage, and you shouldn't try to do anything fancy there outside of it.

Comment 14 Martin Sivák 2015-12-02 16:22:24 UTC
Ok so the issue here is related to the ISO domain that is exported from the hosted engine VM. Now what happens when the VM goes down:

- the hosted engine agent uses repoStats to get the state of its own storage
- VDSM's repoStats call tries to read from all NFS mounts and will freeze once it reaches the ISO domain (not responding because the VM is down)
- after some time (a TCP timeout, depending on the NFS configuration we use) it will time out and enter the hosted engine error handler
- the handler increases the error count, waits for 60 seconds and tries again
- the agent should quit after three failures in a row (sketched below)
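
In simplified form, that error handling amounts to the following (placeholder function names, not the actual ovirt-hosted-engine-ha code):

import time

MAX_ERRORS = 3        # from the description above: the agent quits after
RETRY_DELAY = 60      # three failures in a row, waiting 60 seconds between tries

def monitoring_loop(check_storage, run_state_machine):
    # Rough sketch of the failure handling described above. `check_storage`
    # stands in for the repoStats-based storage check and may block for a long
    # time on a dead NFS mount before raising.
    errors = 0
    while True:
        try:
            check_storage()          # may hang on the unreachable ISO domain
            errors = 0               # a successful cycle resets the counter
            run_state_machine()      # only now can the agent decide to start the VM
        except Exception as exc:
            errors += 1
            print("monitoring cycle failed (%d/%d): %s" % (errors, MAX_ERRORS, exc))
            if errors >= MAX_ERRORS:
                raise SystemExit("too many consecutive errors, agent shutting down")
            time.sleep(RETRY_DELAY)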


It is theoretically possible that the agent will be able to start the VM within three state machine cycles, but only if the NFS mount uses TCP and the initial state was close to EngineStart.


Allon: can we call getRepoStats selectively to only check the hosted engine storage domain we need?

We should also document that serving the ISO domain from the engine VM is not a good idea in a hosted engine environment.

Comment 15 Allon Mureinik 2015-12-07 13:09:02 UTC
(In reply to Martin Sivák from comment #14)
> Ok so the issue here is related to the ISO domain that is exported from the
> hosted engine VM. Now what happens when the VM goes down:
> 
> - the hosted engine agent uses repoStats to get the state of its own storage
Yikes. Why?
You already have sanlock there, just use it.

> Allon: can we call getRepoStats selectively to only check the hosted engine
> storage domain we need?
Nope.

> We should also document that serving the ISO domain from the engine VM is not
> a good idea in a hosted engine environment.
Not a good idea is a gross understatement.
Just block it.

Comment 16 Martin Sivák 2015-12-07 13:16:01 UTC
>> - the hosted engine agent uses repoStats to get the state of its own storage
> Yikes. Why?
> You already have sanlock there, just use it.

We have to be sure the storage is properly initialized to instruct sanlock to create our own (hosted engine specific) lockspace / lease.

Btw, we do not have any access to the vdsm/SPM leases, so we can't tell which one that is. But it might be a good idea to explore.

> Not a good idea is a gross understatement.

I fully agree, but nobody realized we do ask about that during engine install (and it is enabled by default).

> Just block it.

We can't. We do not have access to the engine-setup step atm. It is done manually by the user.

Comment 17 Allon Mureinik 2015-12-07 19:36:38 UTC
(In reply to Martin Sivák from comment #16)
> >> - the hosted engine agent uses repoStats to get the state of its own storage
> > Yikes. Why?
> > You already have sanlock there, just use it.
> 
> We have to be sure the storage is properly initialized to instruct sanlock
> to create our own (hosted engine specific) lockspace / lease.
That's a built-in race.
Why not just try to create it and catch the error?
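
For illustration, the difference between the racy check-then-create pattern and simply trying and catching the error (a generic sketch using a plain directory instead of a sanlock lockspace; the path is made up):

import errno
import os

LOCKSPACE_DIR = "/var/tmp/example-lockspace"   # hypothetical path, illustration only

# Check-then-act: racy, because another host or process can create the
# resource between the existence check and the create call.
def create_lockspace_racy():
    if not os.path.exists(LOCKSPACE_DIR):
        os.mkdir(LOCKSPACE_DIR)

# Try-and-catch (what comment 17 suggests): attempt the creation and treat
# "already exists" as success, so there is no window for a race.
def create_lockspace_eafp():
    try:
        os.mkdir(LOCKSPACE_DIR)
    except OSError as err:
        if err.errno != errno.EEXIST:
            raise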

Comment 18 Martin Sivák 2015-12-08 08:56:53 UTC
Oh, but we do that too, of course. There are other reasons as well: we can't start any VM without a properly initialized domain monitor.

Comment 19 Stuart James 2015-12-17 16:44:54 UTC
This needs to be documented properly: how to deploy a properly HA environment for the engine. My current understanding is that an NFS ISO domain exported from the engine on a failed node (loss of network) causes the entire HA component to fail.

Moving the problem elsewhere (moving the ISO domain to some other server) could also cause an issue if that server were itself to fail (stale NFS locks from the other nodes). Assume node1 hosted the engine, node2 was sitting there happily doing nothing, and the ISO domain was provided by some entirely unrelated server. If this unrelated server were to fail and node1 were to lose its network, I assume node2 would at that point fail to take over as well (as it does not have access to the ISO domain hosted by the unrelated server that has failed).

The question is: where should the ISO domain be hosted, if its unavailability can cause the entire HA setup to fail?

Comment 20 Martin Sivák 2016-05-04 08:04:23 UTC
*** Bug 1332813 has been marked as a duplicate of this bug. ***

Comment 21 Stuart James 2016-05-04 10:23:35 UTC
I see bug 1332813 has replicated this issue. How exactly is the configuration supposed to achieve HA if taking the ISO domain offline causes the failover or takeover to freeze?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Administration_Guide/chap-Storage.html

"ISO Domain: ISO domains store ISO files (or logical CDs) used to install and boot operating systems and applications for the virtual machines. An ISO domain removes the data center's need for physical media. An ISO domain can be shared across different data centers. ISO domains can only be NFS-based. Only one ISO domain can be added to a data center."

Maybe it would make more sense to offer the ISO domain as something other than NFS to avoid locks. Failover of the Manager should not be dependent upon an ISO domain being available to the nodes hosting the engine, and it is even worse when it locks up because it is NFS-based, which is the experience I had (even once the original engine hosting the ISO domain is back online with full network, the NFS locks on the nodes are stale and the process seems to block, as logging stopped).

Maybe for an HA configuration the ISO domain should be provided over a resilient GlusterFS replica rather than the NFSv3 configuration that is currently mandatory for ISO domains.