Bug 1215434
Summary: Hosted Engine failed to migrate due to network failure

| Field | Value | Field | Value |
|---|---|---|---|
| Product | [oVirt] ovirt-hosted-engine-ha | Reporter | Stuart James <stuartjames> |
| Component | General | Assignee | Martin Sivák <msivak> |
| Status | CLOSED WONTFIX | QA Contact | Artyom <alukiano> |
| Severity | urgent | Docs Contact | |
| Priority | unspecified | | |
| Version | --- | CC | alukiano, amureini, bugs, dfediuck, ecohen, gklein, istein, lsurette, nashok, rbalakri, rgolan, stuartjames, yeylon |
| Target Milestone | --- | | |
| Target Release | --- | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | sla | | |
| Fixed In Version | | Doc Type | Known Issue |
Doc Text:

- Cause: The HA agent's call to repoStats stalls because the ISO domain exported by the engine is unreachable, since the engine VM itself has not started yet.
- Consequence: The agent stalls and does not start the VM.
- Fix: Avoid exporting an ISO domain from the engine VM altogether; serve it from a node that is expected to be up before the engine is.
- Result: The engine VM fails to start on this node.
| Story Points | --- | | |
| Clone Of | | Environment | |
| Last Closed | 2015-12-16 10:59:28 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | SLA | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
Description (Stuart James, 2015-04-26 14:52:01 UTC)
```
MainThread::INFO::2015-04-26 13:14:37,810::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1346 (Sun Apr 26 13:04:14 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'alive': True, 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1346}
MainThread::INFO::2015-04-26 13:14:37,810::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 15014.0, 'maintenance': False, 'cpu-load': 0.0079000000000000008, 'gateway': True}
MainThread::INFO::2015-04-26 13:14:37,810::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host node1.jamesnet.ca (id 1)
MainThread::INFO::2015-04-26 13:21:51,269::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-26 13:21:51,269::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1346 (Sun Apr 26 13:04:14 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'alive': False, 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1346}
MainThread::INFO::2015-04-26 14:27:21,288::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 15014.0, 'maintenance': False, 'cpu-load': 0.0071999999999999998, 'gateway': True}
MainThread::INFO::2015-04-26 14:27:21,288::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::ERROR::2015-04-26 15:01:41,035::hosted_engine::632::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=1ecfc7fe-0728-43e6-86e8-1d47155c2427, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 159, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested)\
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 301, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 508, in _initialize_vdsm
    self._sd_path = env_path.get_domain_path(self._config)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 58, in get_domain_path
    .format(sd_uuid, parent))
Exception: path to storage domain 1ecfc7fe-0728-43e6-86e8-1d47155c2427 not found in /rhev/data-center/mnt
MainThread::WARNING::2015-04-26 15:01:41,060::hosted_engine::339::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=1ecfc7fe-0728-43e6-86e8-1d47155c2427, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-04-26 15:01:41,060::hosted_engine::342::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 320, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 633, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=1ecfc7fe-0728-43e6-86e8-1d47155c2427, host_id=2): timeout during domain acquisition
```

The logs above imply, first, that the agent does not have enough data (to decide whether to take over, or which node in the network is down) and, second, that something to do with domain monitoring fails (this is two hours after I first unplugged the cable on the second node). If it does not have enough data to decide, then what is needed? I'd suggest that if the host can talk to the gateway (so it has a working local link), knows the node hosting the engine is down, and the engine itself is non-responsive, then it should probably take over.

Even more concerning: after I plug node1 back in, the engine is still not alive.

```
[root@node1 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : node1.jamesnet.ca
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1346
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1346 (Sun Apr 26 13:04:14 2015)
    host-id=1
    score=2400
    maintenance=False
    state=EngineUp

--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : node2.jamesnet.ca
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 2124
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2124 (Sun Apr 26 13:04:45 2015)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown

[root@node1 ~]# ping 192.168.0.233
PING 192.168.0.233 (192.168.0.233) 56(84) bytes of data.
From 192.168.0.101 icmp_seq=2 Destination Host Unreachable
From 192.168.0.101 icmp_seq=3 Destination Host Unreachable

[root@node1 ~]# hosted-engine --check-liveliness
No handlers could be found for logger "otopi.__main__"
Hosted Engine is not up!
```

Effectively, by pulling one network cable I have been able to knock out the engine. Prior to this test I was able to migrate the engine between the two nodes, put the engine and a node into maintenance (interface + local maintenance), reboot servers, and reactivate them.
To recover (from node1):

```
[root@node1 ovirt-hosted-engine-ha]# hosted-engine --vm-start
VM exists and is down, destroying it
Machine destroyed
c1ee0fdf-5270-456d-99eb-efa41d614008
    Status = WaitForLaunch
    nicModel = rtl8139,pv
    emulatedMachine = pc
    pid = 0
    vmName = HostedEngine
    devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '95ea6be9-2f60-4de9-9d6e-652b6baf1609', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/mnt/iso/CentOS-6.6-x86_64-minimal.iso', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '32e0ab9b-673b-495a-af41-c626a88a269f', 'imageID': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'specParams': {}, 'readonly': 'false', 'domainID': '1ecfc7fe-0728-43e6-86e8-1d47155c2427', 'optional': 'false', 'deviceId': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:7c:bc:0d', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '59969f54-d52d-41df-b42d-bd2fea88c765', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '2f6fc4a7-67d1-4842-8a62-6719b43edffd', 'alias': 'console0'}]
    smp = 2
    vmType = kvm
    display = qxl
    displaySecurePort = -1
    memSize = 4096
    displayPort = -1
    cpuType = Opteron_G5
    clientIp =
    guestDiskMapping = {}
    displayIp = 0
    spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
```

After waiting a couple of minutes I check the local processes; it seems to have issues:

```
[root@node1 ovirt-hosted-engine-ha]# ps aux | grep qemu
root 6653 0.0 0.0 103244 880 pts/0 S+ 16:09 0:00 grep qemu
qemu 11227 0.0 0.0 0 0 ? Z< 12:52 0:00 [supervdsmServer] <defunct>

[root@node2 ovirt-hosted-engine-ha]# ps aux | grep qemu
qemu 18747 0.0 0.0 0 0 ? Z< 12:50 0:00 [supervdsmServer] <defunct>
root 25028 0.0 0.0 103244 880 pts/0 S+ 16:10 0:00 grep qemu
```

On rebooting node2, the shutdown process hangs while trying to unmount the storage domain; I had to press and hold the power button to restart it. Prior to the reboot it seems to have had an open file handle against the NFS server on the engine (not surprising, seeing the engine is down):

```
Apr 26 15:58:03 node2 kernel: nfs: server engine.jamesnet.ca not responding, timed out
```

I assume the defunct process relates to existing bug 841486; I still get this after a reboot of the node.
On rebooting node1 (where the engine was originally hosted) I also get the same NFS error and have to reboot manually (maybe it would eventually time out, I do not know). On node2, once node1 is back online after the reboot, agent.log shows the same sequence of entries quoted above: node1 seen as EngineUp at 13:14:37, then 'alive': False at 14:27:21, ending with "The engine is not running, but we do not have enough data to decide which hosts are alive".

So, to summarise: I had to reboot both nodes to get the hosted engine back up. Node2 appears to decide to take it up only after it has re-established communications with node1 (I rebooted node2 first, followed by node1).

---

Hi James, from an initial look it seems that the other node (node2) was trying to take over, but lost its storage connection as well:

```
Exception: path to storage domain 1ecfc7fe-0728-43e6-86e8-1d47155c2427 not found in /rhev/data-center/mnt
```

Since you're using NFS, the kernel lock-up you experienced is expected (unless you restore connectivity, you may have processes going into D state).
However, the main point is understanding that the nodes do not ping each other; they work with the shared storage. If your NFS storage is available to node2, it should have been able to get a lease and start the HE VM.

---

Node2 is currently hosting the engine:

```
[root@node2 ovirt-hosted-engine-ha]# showmount -e 192.168.0.222
Export list for 192.168.0.222:
/mnt/raid/ovirt      192.168.0.0/24
/mnt/raid/selfhosted 192.168.0.0/24
[root@node2 ovirt-hosted-engine-ha]# hosted-engine --check-liveliness
Hosted Engine is up!
```

Next I pull the plug on node2, then run the following on node1:

```
[root@node1 vdsm]# showmount -e 192.168.0.222
Export list for 192.168.0.222:
/mnt/raid/ovirt      192.168.0.0/24
/mnt/raid/selfhosted 192.168.0.0/24
[root@node1 vdsm]# hosted-engine --check-liveliness
No handlers could be found for logger "otopi.__main__"
Hosted Engine is not up!
```

Looking at the logs:

```
MainThread::INFO::2015-04-27 12:10:47,766::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node2.jamesnet.ca (id: 2, score: 2400)
MainThread::INFO::2015-04-27 12:10:57,816::states::458::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (2400), attempting to start engine VM
MainThread::INFO::2015-04-27 12:10:57,826::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430133057.83 type=state_transition detail=EngineDown-EngineStart hostname='node1.jamesnet.ca'
MainThread::INFO::2015-04-27 12:10:57,915::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent
MainThread::INFO::2015-04-27 12:10:58,489::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStart (score: 2400)
MainThread::INFO::2015-04-27 12:10:58,546::hosted_engine::917::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2015-04-27 12:10:58,594::hosted_engine::929::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean
MainThread::INFO::2015-04-27 12:10:58,594::hosted_engine::884::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2015-04-27 12:10:58,889::hosted_engine::890::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: c1ee0fdf-5270-456d-99eb-efa41d614008 Status = WaitForLaunch nicModel = rtl8139,pv emulatedMachine = pc pid = 0 vmName = HostedEngine devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '95ea6be9-2f60-4de9-9d6e-652b6baf1609', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/mnt/iso/CentOS-6.6-x86_64-minimal.iso', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '32e0ab9b-673b-495a-af41-c626a88a269f', 'imageID': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'specParams': {}, 'readonly': 'false', 'domainID': '1ecfc7fe-0728-43e6-86e8-1d47155c2427', 'optional': 'false', 'deviceId': '46e0ee67-b5f0-4f57-af3d-f81560527d4e', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:7c:bc:0d', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '59969f54-d52d-41df-b42d-bd2fea88c765', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '2f6fc4a7-67d1-4842-8a62-6719b43edffd', 'alias': 'console0'}] smp = 2 vmType = kvm display = qxl displaySecurePort = -1 memSize = 4096 displayPort = -1 cpuType = Opteron_G5 clientIp = guestDiskMapping = {} displayIp = 0 spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
MainThread::INFO::2015-04-27 12:10:58,889::hosted_engine::891::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr:
MainThread::ERROR::2015-04-27 12:10:58,889::hosted_engine::904::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2015-04-27 12:10:58,903::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430133058.9 type=state_transition detail=EngineStart-EngineStarting hostname='node1.jamesnet.ca'
MainThread::INFO::2015-04-27 12:10:58,960::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2015-04-27 12:10:59,522::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStarting (score: 2400)
MainThread::INFO::2015-04-27 12:11:09,584::states::721::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2015-04-27 12:11:09,593::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Mon Apr 27 12:21:09 2015 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
MainThread::INFO::2015-04-27 12:11:10,136::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStarting (score: 2400)
```

```
[root@node1 ovirt-hosted-engine-ha]# hosted-engine --check-liveliness
No handlers could be found for logger "otopi.__main__"
Hosted Engine is not up!
```

This is what I expected to happen yesterday; the secondary node certainly had access to the NFS server, as it is hosted entirely separately. The best I can do is try to re-create the issue; maybe there was some stale lock on an NFS volume.

Sorry, I pasted the wrong result for liveliness:

```
[root@node1 ovirt-hosted-engine-ha]# hosted-engine --check-liveliness
Hosted Engine is up!
```

(This of course took a few minutes once node1 decided to take over.)

At 14:36, with the engine running on node1, I unplug node1's network and monitor the logs on node2.
For reference:

- 192.168.0.222: NFS server (self-hosted engine / storage domain for the VMs)
- 192.168.0.233: Hosted Engine

agent.log:

```
MainThread::INFO::2015-04-27 14:33:40,169::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 16541776
MainThread::INFO::2015-04-27 14:34:48,787::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140572602034064
MainThread::INFO::2015-04-27 14:34:48,787::hosted_engine::458::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2015-04-27 14:34:48,882::hosted_engine::541::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 2 is acquired (file: /rhev/data-center/mnt/192.168.0.222:_mnt_raid_selfhosted/1ecfc7fe-0728-43e6-86e8-1d47155c2427/images/91fe3737-3a52-4d82-8db3-491a9e1cc444/e7acc9ed-5bde-47c2-baaf-bbfe6de6aba5)
MainThread::INFO::2015-04-27 14:35:09,886::hosted_engine::575::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
MainThread::INFO::2015-04-27 14:35:09,917::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2015-04-27 14:35:09,917::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host node1.jamesnet.ca (id 1): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=80019 (Mon Apr 27 14:35:08 2015)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': 'node1.jamesnet.ca', 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 80019}
MainThread::INFO::2015-04-27 14:35:09,917::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 14984.0, 'maintenance': False, 'cpu-load': 0.0085000000000000006, 'gateway': True}
MainThread::INFO::2015-04-27 14:35:09,918::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430141709.92 type=state_transition detail=StartState-ReinitializeFSM hostname='node2.jamesnet.ca'
MainThread::INFO::2015-04-27 14:35:09,997::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? sent
MainThread::INFO::2015-04-27 14:35:10,562::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2015-04-27 14:35:20,643::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1430141720.64 type=state_transition detail=ReinitializeFSM-EngineDown hostname='node2.jamesnet.ca'
MainThread::INFO::2015-04-27 14:35:20,731::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? sent
MainThread::INFO::2015-04-27 14:35:21,262::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:35:31,333::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:35:31,830::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:35:41,902::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:35:42,425::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:35:52,495::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:35:52,947::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:03,030::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2015-04-27 14:36:03,474::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:13,539::states::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host node1.jamesnet.ca (id 1)
MainThread::INFO::2015-04-27 14:36:14,017::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:14,018::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)
```

(At 14:57 there are no further entries in agent.log beyond this point.)

broker.log:

```
Thread-35286::INFO::2015-04-27 14:36:09,658::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-35285::INFO::2015-04-27 14:36:09,659::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-35286::INFO::2015-04-27 14:36:09,660::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-04-27 14:36:09,675::mem_free::53::mem_free.MemFree::(action) memFree: 14977
Thread-35287::INFO::2015-04-27 14:36:18,744::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-35288::INFO::2015-04-27 14:36:19,734::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-5::INFO::2015-04-27 14:36:47,838::cpu_load_no_engine::153::cpu_load_no_engine.EngineHealth::(update_stat_file) VM not on this host
Thread-5::INFO::2015-04-27 14:36:47,838::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0106, engine=0.0000, non-engine=0.0106
Thread-3::INFO::2015-04-27 14:36:48,339::mgmt_bridge::59::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-35209::INFO::2015-04-27 14:36:50,365::engine_health::63::engine_health.CpuLoadNoEngine::(action) VM not on this host
Thread-35205::INFO::2015-04-27 14:36:50,517::ping::52::ping.Ping::(action) Successfully pinged 192.168.0.1
Thread-35208::INFO::2015-04-27 14:36:50,614::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0105, engine=0.0000, non-engine=0.0105
Thread-5::INFO::2015-04-27 14:37:47,947::cpu_load_no_engine::153::cpu_load_no_engine.EngineHealth::(update_stat_file) VM not on this host
Thread-5::INFO::2015-04-27 14:37:47,948::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0047, engine=0.0000, non-engine=0.0047
Thread-35205::INFO::2015-04-27 14:37:50,606::ping::52::ping.Ping::(action) Successfully pinged 192.168.0.1
Thread-35208::INFO::2015-04-27 14:37:50,735::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0045, engine=0.0000, non-engine=0.0045
Thread-35209::INFO::2015-04-27 14:37:50,737::engine_health::63::engine_health.CpuLoadNoEngine::(action) VM not on this host
```

(broker.log continues saying the same thing.)

```
[root@node2 ovirt-hosted-engine-ha]# ping 192.168.0.233
PING 192.168.0.233 (192.168.0.233) 56(84) bytes of data.
^C
--- 192.168.0.233 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 922ms

[root@node2 ovirt-hosted-engine-ha]# showmount -e 192.168.0.222
Export list for 192.168.0.222:
/mnt/raid/ovirt      192.168.0.0/24
/mnt/raid/selfhosted 192.168.0.0/24

[root@node2 ovirt-hosted-engine-ha]# findmnt
TARGET SOURCE FSTYPE OPTIONS (truncated)
├─/rhev/data-center/mnt/192.168.0.222:_mnt_raid_selfhosted 192.168.0.222:/mnt/raid/selfhosted nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600
├─/rhev/data-center/mnt/192.168.0.222:_mnt_raid_ovirt 192.168.0.222:/mnt/raid/ovirt nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600
└─/rhev/data-center/mnt/engine.jamesnet.ca:_var_lib_exports_iso engine.jamesnet.ca:/var/lib/exports/iso nfs rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,soft,nosharecache,proto=tcp,timeo=600,r

[root@node2 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/192.168.0.222\:_mnt_raid_selfhosted/
1ecfc7fe-0728-43e6-86e8-1d47155c2427 __DIRECT_IO_TEST__
```

I have waited 15 minutes and node2 has not taken over the hosted engine. I assume the VM may still be running on node1, but even if it were, node1 should know it has no connectivity to the gateway (I realise it cannot really tell anyone about this, as its network is down). What does node2 need to be convinced to take over? Is there some sort of quorum vote? I would have thought that knowing it has access to the gateway, and that the engine is down, would be sufficient (even if it does not check node1's availability directly, it knows node1 is down based on the logs as well). From my previous testing there was a point made that it could not talk to the NFS server: it certainly cannot reach the NFS server on the engine for the ISO_DOMAIN, but it can reach the self-hosted storage domain and the storage domain in the default datacenter cluster.
The following freezes on the console while the engine is not alive (I believe on reboot it is what locks the box up):

```
ls /rhev/data-center/mnt/engine.jamesnet.ca:_var_lib_exports_iso
```

---

Hi, let's see if I got your network topology right:

```
gateway 192.168.0.101
node1   ???
node2   192.168.0.233
nfs     192.168.0.222
```

First, a short explanation of how the synchronization works:

- each host publishes its status periodically to the NFS storage "whiteboard"
- each host reads all status reports and uses them to evaluate the situation
- hosts use relative timestamps in their reports

We need at least two reports from any other host to establish its time reference properly, or a couple of minutes with no activity to consider the remote side to be down. That was the reason for:

```
14:27:21,288::states::445::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
```

So we do not try to establish the liveness of other hosts directly; we rely on their own reports, (relative) timestamps, and sanlock.

Now, what happens when the whole network dies: reports from node1 stop reaching the NFS whiteboard when you unplug the cable. Depending on how hard the NFS freezes, several things might happen:

- NFS freezes hard and the kernel freezes the broker process
- the agent might try to shut down the engine, but that will freeze (temporarily, but long enough) when sanlock tries to release the lock
- the agent tries to get updates from the broker and freezes waiting for a response
- the engine VM freezes or dies
- sanlock reboots the machine after some time, because it lost synchronization with the lease storage on NFS

Node2 should keep running and see that there have been no updates from node1 for some time. At that point it should try to start the VM, but it will very probably fail during the first few attempts, because sanlock protects the engine's hard drive and the timeout for sanlock leases is quite long.

The fact that you do not see any updates in agent.log on node2 (or is it reporting the same thing all the time?) would mean that something locked up the agent process there as well; it has to report status every 10 seconds when it is running. I think the following might indeed be the reason (some indirect freezes):

```
Apr 26 15:58:03 node2 kernel: nfs: server engine.jamesnet.ca not responding, timed out

[root@node2 ovirt-hosted-engine-ha]# ps aux | grep qemu
qemu 18747 0.0 0.0 0 0 ? Z< 12:50 0:00 [supervdsmServer] <defunct>
root 25028 0.0 0.0 103244 880 pts/0 S+ 16:10 0:00 grep qemu
```

The first thing that comes to mind here is the NFS mount for the ISO domain. Losing network on the host that runs the engine VM effectively means that all hosts have lost the link to one NFS share, and that might cause a chain reaction of freezes in the services that talk to each other (supervdsm / vdsm / ovirt-ha-agent).

---

The network topology is as follows:

```
gateway    192.168.0.1   (router)
DNS/IPA    192.168.0.131 (VM on the .222 box)
node1      192.168.0.101 (physical server)
node2      192.168.0.102 (physical server)
NFS server 192.168.0.222 (physical server)
Engine     192.168.0.233 (VM on node1 or node2)
```

All times on all hosts are synced to a common NTP server.
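As a rough illustration of the "two reports or a couple of minutes of silence" rule described above, the liveness decision looks something like the following sketch. This is not the actual ovirt-hosted-engine-ha code; the class, names, and thresholds are invented for the example.

```python
import time

# Hypothetical thresholds for the sketch; the real agent's values differ.
MIN_REPORTS = 2          # reports needed to establish a host's time reference
SILENCE_TIMEOUT = 150    # seconds without new reports before presuming a host dead


class HostView(object):
    """Tracks the relative timestamps one host publishes to the storage whiteboard."""

    def __init__(self):
        self.report_timestamps = []      # host-local, monotonically increasing values
        self.last_seen_wallclock = None  # when we last saw a *new* report

    def record_report(self, host_ts):
        # Only count a report as new if the host-local timestamp advanced;
        # re-reading the same stale metadata block must not refresh liveness.
        if not self.report_timestamps or host_ts > self.report_timestamps[-1]:
            self.report_timestamps.append(host_ts)
            self.last_seen_wallclock = time.time()

    def liveness(self):
        """Return 'alive', 'dead', or 'unknown' (not enough data to decide)."""
        silent_too_long = (
            self.last_seen_wallclock is not None and
            time.time() - self.last_seen_wallclock > SILENCE_TIMEOUT)
        if len(self.report_timestamps) < MIN_REPORTS:
            # No time reference yet: only a long silence lets us decide.
            return 'dead' if silent_too_long else 'unknown'
        return 'dead' if silent_too_long else 'alive'
```

Under this model, a freshly restarted agent that has read only one stale report from node1 stays in 'unknown', which matches the repeated "we do not have enough data to decide which hosts are alive" messages in the logs above.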
Regarding the defunct process: this appears to be the case even on a fresh reboot, when the server is up and running in a good state with access to the NFS engine ISO / storage domain (see bug 841486).

From my second test, failing back to node2: node2 was aware that the engine was down and also had access to shared storage, yet it did nothing to take over:

```
MainThread::INFO::2015-04-27 14:36:14,017::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2015-04-27 14:36:14,018::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host node1.jamesnet.ca (id: 1, score: 2400)
```

(As I said above, there was no further output in agent.log after this.)

The broker.log says it is not hosting the engine. What exactly is needed at this point for node2 to recover the engine? It knows the engine is down, it has access to shared storage, and it can talk to the gateway. Based on the last few entries in agent.log (before it stopped entirely), it claimed not to have enough information; what more information is needed?

---

> (as i said above it did not have any further output into agent.log)
Which is the issue here. It means that the main loop in the agent stopped working. The broker only monitors and talks to storage; it is the agent that decides when to take over, and nothing will happen while the agent is frozen...
So we need to find out why the agent is frozen. And I suspect it is because the ISO domain is provided by the engine VM, which is not reachable from node2.
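One way to see why a single dead NFS export can wedge the whole stack: any direct stat() or read() on the dead mount can block in the kernel (D state) for a very long time. A common defensive pattern, shown here as a hedged sketch rather than anything vdsm or the agent actually does, is to probe the path from a child process and enforce the deadline from the parent:

```python
import multiprocessing
import os


def _probe(path):
    # Runs in a child process; may block indefinitely in the kernel if the
    # NFS server behind `path` is gone. The parent enforces the deadline.
    os.listdir(path)


def mount_responds(path, timeout=10.0):
    """Return True if `path` can be listed within `timeout` seconds.

    The child may stay stuck in an uninterruptible NFS call, but the
    parent (e.g. a monitoring loop) keeps running either way.
    """
    proc = multiprocessing.Process(target=_probe, args=(path,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # best effort; a D-state child may linger regardless
        return False
    return proc.exitcode == 0


if __name__ == '__main__':
    # Path taken from the report; a dead export should make this return False
    # after about 10 seconds, instead of hanging the caller like a bare `ls`.
    iso = '/rhev/data-center/mnt/engine.jamesnet.ca:_var_lib_exports_iso'
    print(mount_responds(iso))
```

The trade-off is that each probe costs a process fork and a potentially orphaned child, which is presumably part of why a monitoring stack that calls straight into the kernel, as described in this thread, freezes instead.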
---

It's been several months now. Is this still reproducible on recent (3.5.5) oVirt versions?

---

Yes, it is an issue with NFS and the failover of the HA component (probably to do with an NFS stale lock). Maybe you should update the documentation to say NFS is not supported for an HA configuration; I assume iSCSI is the preferred route and NFS will always be problematic.

---

Hi Allon, is NFS with locking supported? I suppose it might break our own internal storage mechanisms badly for vdsm too.

---

Admittedly I only skimmed the above comments, but a stale NFS mount has nothing to do with any explicit locking a client may perform. Regardless, the overarching principle in VDSM is that it owns the storage, and you shouldn't try to do anything fancy there outside of it.

---

OK, so the issue here is related to the ISO domain that is exported from the hosted engine VM. What happens when the VM goes down:

- the hosted engine agent uses repoStats to get the state of its own storage
- VDSM's repoStats call tries to read from all NFS mounts and will freeze once it reaches the ISO domain (not responding, because the VM is down)
- after some time (a TCP timeout, depending on the NFS configuration we use) it will time out and enter the hosted engine error handler
- the handler increases the error count, waits 60 seconds, and tries again
- the agent should quit after three failures in a row

It is theoretically possible that the agent will be able to start the VM within three state machine cycles, but only if the NFS mount uses TCP and the initial state was close to EngineStart.

Allon: can we call getRepoStats selectively, to only check the hosted engine storage domain we need? We should also document that serving the ISO domain from the engine VM is not a good idea in a hosted engine environment.

---

(In reply to Martin Sivák from comment #14)

> - the hosted engine agent uses repoStats to get the state of its own storage

Yikes. Why? You already have sanlock there, just use it.

> Allon: can we call getRepoStats selectively, to only check the hosted engine storage domain we need?

Nope.

> We should also document that serving the ISO domain from the engine VM is not a good idea in a hosted engine environment.

"Not a good idea" is a gross understatement. Just block it.

---

> Yikes. Why? You already have sanlock there, just use it.

We have to be sure the storage is properly initialized before instructing sanlock to create our own (hosted-engine-specific) lockspace / lease. By the way, we do not have any access to the vdsm/SPM leases, so we can't tell which one that is. But it might be a good idea to explore.

> "Not a good idea" is a gross understatement.

I fully agree, but nobody realized we ask about that during engine install (and it is enabled by default).

> Just block it.

We can't. We do not have access to the engine-setup step at the moment; it is done manually by the user.

---

(In reply to Martin Sivák from comment #16)

> We have to be sure the storage is properly initialized before instructing sanlock to create our own (hosted-engine-specific) lockspace / lease.

That's a built-in race. Why not just try to create it and catch the error?

---

Oh, but we do that too, of course. There are other reasons as well: we can't start any VM without a properly initialized domain monitor.
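The error-handling loop described in the repoStats analysis above (count an error, wait 60 seconds, retry, give up after three consecutive failures) can be pictured with a short sketch. This is illustrative only, not the actual agent source; the `check_storage` and `run_state_machine` callables are invented stand-ins for the real repoStats call and state machine cycle.

```python
import time

RETRY_DELAY = 60   # seconds to wait after a failed storage check, per the comment above
MAX_ERRORS = 3     # consecutive failures before the agent gives up


def monitoring_loop(check_storage, run_state_machine):
    """check_storage() models the repoStats call: on a dead NFS mount it can
    block for the full client timeout before raising. run_state_machine() is
    one normal cycle of the agent's state machine."""
    errors = 0
    while errors < MAX_ERRORS:
        try:
            # On a dead ISO domain this call itself eats the whole NFS/TCP
            # timeout before the error handler even starts its 60s sleep.
            check_storage()
        except Exception:
            errors += 1
            time.sleep(RETRY_DELAY)
            continue              # retry; meanwhile no VM can be started
        errors = 0                # any success resets the consecutive count
        run_state_machine()
    raise SystemExit(
        "giving up after %d consecutive storage errors" % MAX_ERRORS)
```

This also makes the "theoretically possible" caveat concrete: the window in which the agent could still reach EngineStart is at most three such cycles, each lengthened by a full NFS timeout, so in practice the agent stalls or exits before starting the VM.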
---

This needs to be documented properly: how to deploy a genuinely HA environment for the engine. My current understanding is that an NFS ISO domain exported from the engine on a failed node (loss of network) causes the entire HA component to fail. Moving the problem elsewhere (moving the ISO domain to some other server) could also cause an issue if that server were itself to fail (stale NFS locks on the other nodes). Assuming node1 hosted the engine, node2 was sitting there happily doing nothing, and the ISO domain was provided by some entirely unrelated server: if that unrelated server were to fail and node1 were to lose its network, I assume node2 would fail to take over as well (as it does not have access to the ISO domain hosted by the unrelated server that has failed). The question is: where should the ISO domain be hosted, if its loss can cause the entire HA setup to fail?

---

*** Bug 1332813 has been marked as a duplicate of this bug. ***

---

I see bug 1332813 has replicated this issue. How exactly is the configuration supposed to achieve HA if taking the ISO domain offline causes the failover or takeover to freeze?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Administration_Guide/chap-Storage.html

"ISO Domain: ISO domains store ISO files (or logical CDs) used to install and boot operating systems and applications for the virtual machines. An ISO domain removes the data center's need for physical media. An ISO domain can be shared across different data centers. ISO domains can only be NFS-based. Only one ISO domain can be added to a data center."

Maybe it would make more sense to offer the ISO domain as something other than NFS, to avoid locks. Failover of the Manager should not depend on an ISO domain being available to the nodes hosting the engine; worse still if it locks up because it is NFS-based, which was my experience (even once the original engine hosting the ISO domain was back online with full network, the nodes' NFS locks were stale and the process seems to block, since logging stopped). Maybe for an HA configuration the ISO domain should be provided over a resilient GlusterFS replica rather than the NFSv3 configuration that is currently mandatory for ISO domains.