Description of problem:

1. Hosted Engine storage on Fibre Channel
2. Host is running HE
3. Host loses network
4. Score goes down by 1600 points
5. ha-agent decides to shut down the HE VM
6. Broker does not respond to the agent (notify state_transition)
7. Agent restarts, goto 6. (the agent talks fine to the broker on restart, but times out on state_transition)

Result: the HostedEngine VM is not shut down and is stuck forever on this host, which has no network.

I thought it was related, but BZ 1359059 is supposed to be fixed here:

$ cat installed-rpms | grep engine-ha
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

MainThread::INFO::2016-09-21 10:47:08,718::states::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to gateway status
MainThread::INFO::2016-09-21 10:47:08,720::hosted_engine::461::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 1800)
MainThread::ERROR::2016-09-21 10:47:18,770::states::413::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host <HOSTNAME> (id 2) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2016-09-21 10:47:18,772::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1474426038.77 type=state_transition detail=EngineUp-EngineStop hostname='<HOSTNAME>'

<30 seconds silence>

MainThread::ERROR::2016-09-21 10:47:48,802::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out

Meanwhile, in the broker:

Thread-624275::INFO::2016-09-21 10:47:16,268::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-624276::INFO::2016-09-21 10:47:16,268::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-624276::INFO::2016-09-21 10:47:16,269::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-300607::INFO::2016-09-21 10:47:16,282::mem_free::53::mem_free.MemFree::(action) memFree: 247721
Thread-624277::INFO::2016-09-21 10:47:19,729::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-624277::INFO::2016-09-21 10:47:19,767::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-624278::INFO::2016-09-21 10:47:19,768::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-624278::INFO::2016-09-21 10:47:19,769::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed

So the agent restarts...
and it's talking fine to the broker:

MainThread::INFO::2016-09-21 10:55:39,353::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.7.254'}
MainThread::INFO::2016-09-21 10:55:39,358::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140626423492496
MainThread::INFO::2016-09-21 10:55:39,358::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}

It refreshes the shared storage metadata and, surprise:

MainThread::INFO::2016-09-21 11:19:08,576::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1474427948.58 type=state_transition detail=StartState-ReinitializeFSM hostname='<HOSTNAME>'
MainThread::ERROR::2016-09-21 11:19:38,607::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2016-09-21 11:19:38,617::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': '<HOSTNAME>'}: Connection timed out' - trying to restart agent

It restarts again... same thing. The agent keeps looping forever and never tries to shut down the HE VM again. When the network comes back up, the agent and broker talk notifications again:

MainThread::INFO::2016-09-21 11:29:50,541::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1474428590.54 type=state_transition detail=ReinitializeFSM-EngineStarting hostname='<HOSTNAME>'
MainThread::INFO::2016-09-21 11:29:50,622::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? sent

FYI: BZ 1359059 is supposed to be fixed here:

$ cat installed-rpms | grep engine-ha
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

BZ 1377161 should not be related, as they are talking via a file socket.

Issues:

1. It looks like the broker was somehow stuck due to the network being down. Since it uses a file socket (broker.socket), it shouldn't be affected by the network outage, but the log timing says it is. Is the stall related to the ping failure? Or was the BZ 1359059 fix not 100% complete?

2. The ha-agent's first priority on failover should be to stop the HE VM and free sanlock, not to send emails or update the shared storage metadata. The ha-agent should not keep restarting when it fails to do "non-essential things"; this breaks the failover mechanism. If the VM had been stopped on the first try (instead of restarting to send an email?), the HE VM would have failed over. If the ha-agent shut down the engine and later looped somewhere else, that would be fine, as the HE would have failed over; in critical situations that is probably good enough. (See the sketch after this comment.)

3. Why don't we restart the broker as well? It might be hung somewhere.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

How reproducible:
100% on the customer site.

Actual results:
HostedEngine stuck on one host, without network, forever.

Expected results:
HostedEngine failover.
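To illustrate the point in issue 2: below is a minimal, hypothetical sketch of an agent treating the state-transition notification as best effort, so that a broker/notification failure does not abort the VM shutdown. The function and variable names (broker_notify, stop_engine_vm, failover_to_better_host) are illustrative assumptions, not the actual ovirt-hosted-engine-ha code paths.

# Hypothetical sketch (not the real ovirt-hosted-engine-ha code): make the
# state-transition notification best effort so a broker timeout cannot
# prevent the essential step -- stopping the HE VM and freeing sanlock.

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-sketch")


def stop_engine_vm():
    # Placeholder for the real "shut down the HE VM" action.
    log.info("Shutting down the HostedEngine VM")


def notify_state_transition(broker_notify, detail, hostname):
    # broker_notify is any callable that may raise (e.g. on a socket timeout).
    try:
        broker_notify(type="state_transition", detail=detail, hostname=hostname)
    except Exception:
        # The notification is non-essential: log it and keep going instead of
        # restarting the whole agent.
        log.exception("state_transition notification failed, continuing")


def failover_to_better_host(broker_notify, hostname):
    notify_state_transition(broker_notify, "EngineUp-EngineStop", hostname)
    stop_engine_vm()  # happens even if the notification timed out


if __name__ == "__main__":
    def broken_broker(**kwargs):
        raise RuntimeError("Connection timed out")

    failover_to_better_host(broken_broker, "host1.example.com")

The point of the sketch is only the ordering: the shutdown is not conditional on the notification succeeding.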
Can we please get the debug log from both broker and agent? Was that a new clean deployment or an upgraded one? The fact that you actually get a timeout indicates that the fix is working (it was stuck forever before). My guess is that we query DNS somewhere and that causes the issue.
One more question: what were the configured email address for receiving the notifications and the SMTP server?
Hi Martin,

Thanks for looking into this. Regarding our talk yesterday, I am afraid we are back to the drawing board...

$ cat broker.conf
[email]
smtp-server = localhost
smtp-port = 25
source-email = root@localhost
destination-emails = root@localhost

[notify]
state_transition = maintenance|start|stop|migrate|up|down

-> This is from the conf_volume and was also confirmed on the hosts.

I'll try to find some time to reproduce this within the next few hours. Also uploading the logs now.
(In reply to Martin Sivák from comment #2)
> Can we please get the debug log from both broker and agent? Was that a new
> clean deployment or an upgraded one?

Brand new. Logs attached.
Hi Martin,

Can't this be solved by not restarting the ha-agent during HE shutdown?

1. Wouldn't it make the whole mechanism more reliable? If it's detected that the HE should be shut down due to a lower score, then that's probably what should be done. If something else is failing, then after the HE is down the 'restart' behaviour can be re-enabled to try to fix it. AFAIK, in this case, even if the host status is not pushed to the shared storage, the HE VM would eventually fail over, right?

2. Any ideas on what exactly to debug at the customer side? I'm thinking of creating a debug package to log more information on what exactly is going on.

3. Do you still think this could be related to DNS? The mail server is localhost and the host itself is in /etc/hosts.

Thanks,
Just so we have the info here. We were discussing this with Germano and a test without notifications is needed. The Python smtplib tries to resolve its own hostname in https://hg.python.org/cpython/file/2.7/Lib/smtplib.py#l265 and the reverse lookup might be doing the DNS call.
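For context, a small sketch of where that implicit lookup happens in the Python 2.7 standard library and how it can be avoided. It assumes an SMTP server listening on localhost:25 (as in the broker.conf quoted earlier); the local_hostname value is just an example.

# Sketch: smtplib.SMTP() calls socket.getfqdn() inside SMTP.__init__ when no
# local_hostname is given, and that reverse lookup can hang if DNS is
# unreachable. Passing local_hostname avoids the lookup entirely.

import smtplib

# Implicit lookup (may block on DNS while building the EHLO/HELO name):
#   server = smtplib.SMTP("localhost", 25)

# Explicit name: no reverse DNS lookup is attempted.
server = smtplib.SMTP("localhost", 25, local_hostname="localhost")
server.quit()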
We are now waiting for the results of the proposed test.

Martin, wouldn't one of these two changes make sense?

1. Skip mail notifications in case of a network-related score penalty, or allow them to fail without restarting the agent.
2. Disable restarting the agent after it's found that the HE should be shut down.
Martin,

We changed, in notifications.py (broker):

def notify(**kwargs):
    <...>
    return send_email(smtp_config, email_body)

to:

    return True

So send_email is not called within the broker.

The broker stops pinging at 10:36:

Thread-10::INFO::2016-10-05 10:35:04,373::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254
Thread-10::INFO::2016-10-05 10:36:04,461::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254
Thread-10::WARNING::2016-10-05 10:36:24,468::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254
Thread-10::WARNING::2016-10-05 10:36:34,474::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254

But we get the same results on the agent side:

MainThread::ERROR::2016-10-05 10:46:57,348::states::413::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host XXX (id 1) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2016-10-05 10:46:57,351::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1475635617.35 type=state_transition detail=EngineUp-EngineStop hostname='XXX'
MainThread::ERROR::2016-10-05 10:47:27,381::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2016-10-05 10:47:27,381::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'rhv-prod-h03.etanah.ptgkl.gov.my'}: Connection timed out' - trying to restart agent

So the problem is somewhere else?

Thanks,
Germano
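For reference, this is roughly the shape of the test stub described above. The surrounding details (smtp_config, email_body, send_email) are paraphrased from the description, not copied from the actual notifications.py.

# Rough shape of the test change: short-circuit notify() so the broker never
# calls send_email(). Paraphrase of ovirt-hosted-engine-ha notifications.py,
# not an exact copy.

def notify(**kwargs):
    # ... original logic that builds smtp_config and email_body ...
    # return send_email(smtp_config, email_body)   # original last line
    return True                                    # test stub: skip email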
Hi Martin,

The new logs came in. This time it did not even lower the score, as the agent timed out before getting there. Nevertheless, the error is very, very similar to the other occurrences.

From what I can see, the issue is caused by this, in the broker:

def notify(**kwargs):
    ....
    logger.debug("nofity: %s" % (repr(kwargs),))   <--- typo, 'nofity'
    ....
    heconf = config.Config(logger=logger)
    heconf.refresh_local_conf_file(                <----------------- taking too long
        localcopy_filename=constants.NOTIFY_CONF_FILE,
        archive_fname=constants.NOTIFY_CONF_FILE_ARCHIVE_FNAME,
    )
    ....

During the "network cable unplug", refresh_local_conf_file takes too long, way more than the 30 seconds the agent waits. So the agent times out. Look at this:

The broker receives the notification request from the agent:

Thread-5::DEBUG::2016-10-28 18:25:52,951::listener::163::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Input: notify time=1477650352.95 type=state_transition detail=EngineUp-EngineUpBadHealth hostname='XXX'
Thread-5::DEBUG::2016-10-28 18:25:52,952::notifications::50::ovirt_hosted_engine_ha.broker.notifications.Notifications::(notify) nofity: {'hostname': 'XXX', 'type': 'state_transition', 'detail': 'EngineUp-EngineUpBadHealth', 'time': '1477650352.95'}

12 minutes later...

Thread-5::DEBUG::2016-10-28 18:37:25,918::config::251::ovirt_hosted_engine_ha.broker.notifications.Notifications.config::(refresh_local_conf_file) Reading 'broker.conf' from '/rhev/data-center/mnt/blockSD/ccd8c49a-11ec-4078-bc33-71f07c536f2e/images/ad0be349-ce6c-41e2-9c25-cc9c2ecc829b/656d4a6f-b68a-4dbc-aba2-9ae4d1dfcb65'
Thread-5::DEBUG::2016-10-28 18:37:25,918::heconflib::69::ovirt_hosted_engine_ha.broker.notifications.Notifications.config::(_dd_pipe_tar) executing: 'sudo -u vdsm dd if=/rhev/data-center/mnt/blockSD/ccd8c49a-11ec-4078-bc33-71f07c536f2e/images/ad0be349-ce6c-41e2-9c25-cc9c2ecc829b/656d4a6f-b68a-4dbc-aba2-9ae4d1dfcb65 bs=4k'
Thread-5::DEBUG::2016-10-28 18:37:25,918::heconflib::70::ovirt_hosted_engine_ha.broker.notifications.Notifications.config::(_dd_pipe_tar) executing: 'tar -tvf -'

I also noticed that this function takes roughly the same time to get vm.conf, and it looks very similar. One thing that I think is related is that the ISO domain went down when the network was unplugged. I don't know at which step in refresh_local_conf_file we reload storage domain IDs on vdsm (if we do that), but the whole thing moved on at the exact time vdsm gave up on the ISO domain.
Thread-12::ERROR::2016-10-28 18:37:25,918::monitor::484::Storage.Monitor::(_pathChecked) Error checking path /rhev/data-center/mnt/192.168.3.122:_volume1_iso__domain/d58998e0-a388-4c80-b6b6-0a0676e74c5d/dom_md/metadata
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 482, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 367, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/192.168.3.122:_volume1_iso__domain/d58998e0-a388-4c80-b6b6-0a0676e74c5d/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/192.168.3.122:_volume1_iso__domain/d58998e0-a388-4c80-b6b6-0a0676e74c5d/dom_md/metadata\': Input/output error\n"))
jsonrpc.Executor/3::DEBUG::2016-10-28 18:37:25,925::hsm::2367::Storage.HSM::(__prefetchDomains) Found SD uuids: (u'd58998e0-a388-4c80-b6b6-0a0676e74c5d',)
jsonrpc.Executor/3::DEBUG::2016-10-28 18:37:25,926::hsm::2427::Storage.HSM::(connectStorageServer) knownSDs: {ccd8c49a-11ec-4078-bc33-71f07c536f2e: storage.blockSD.findDomain, 4462059e-90bf-4c29-8d9f-715cb326d752: storage.blockSD.findDomain, d58998e0-a388-4c80-b6b6-0a0676e74c5d: storage.nfsSD.findDomain}
jsonrpc.Executor/3::INFO::2016-10-28 18:37:25,927::logUtils::52::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1e835544-1ac3-4f2f-a31e-4b7a6bccac6f'}]}

Was the broker waiting on vdsm?

I'm uploading the logs for you. For reference, here are the times:

Thread-6::INFO::2016-10-28 18:12:17,569::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254
<Network outage start>
Thread-6::WARNING::2016-10-28 18:13:27,658::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254
<Network outage stop>
Thread-6::INFO::2016-10-28 18:34:51,763::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254

Thanks
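As an illustration of the failure mode described above (the broker's notify() blocking on storage far longer than the agent's 30-second wait), here is a minimal, hypothetical sketch of bounding a potentially blocking call with a timeout so a storage stall cannot hold up the notification path. slow_refresh_local_conf_file() and the time budget are assumptions for the example; the real fix in ovirt-hosted-engine-ha may look completely different.

# Hypothetical sketch: give a blocking storage call a time budget and fall
# back if it is exceeded, instead of keeping the caller waiting indefinitely.

import queue
import threading
import time


def slow_refresh_local_conf_file():
    time.sleep(60)  # simulates a conf refresh hanging on unreachable storage
    return "broker.conf refreshed"


def refresh_with_budget(seconds=25):
    result = queue.Queue(maxsize=1)

    def worker():
        result.put(slow_refresh_local_conf_file())

    # Daemon thread: if the refresh never returns, it won't block the caller
    # past the budget or keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=seconds)
    except queue.Empty:
        # Give up and fall back (e.g. to a cached local copy) instead of
        # keeping the agent's notify request waiting past its 30-second timeout.
        return None


if __name__ == "__main__":
    print(refresh_with_budget(seconds=2))  # prints None after ~2 seconds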
Hi Germano, this is entirely possible! I will dig into the config refresh routine to see what can be done. Is the customer using NFS, iSCSI, FC or Gluster?
(In reply to Martin Sivák from comment #21)
> Hi Germano, this is entirely possible! I will dig into the config refresh
> routine to see what can be done. Is the customer using NFS, iSCSI, FC or
> Gluster?

Storage is FC, the ISO domain is NFS. When the network goes down, so does the ISO domain.
Two attempts to reproduce this issue by blocking storage communication (using iptables):

1. Blocked communication to the NFS port on the storage server.
   - The VM was down and, after restoring communication, it was brought up properly.

2. Isolated the ISO domain to a different IP and blocked only the communication to that IP (NFS/TCP).
   - The agent got to the point of "(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE" and did not move from there. After restoring connectivity the agent recovered and the VM was up and running.
(In reply to Yanir from comment #23)
> 2 attempts to reproduce this issue by blocking storage communication (using
> iptables):
>
> 1.block communication to nfs port on the storage server.
>
> -vm was down and after restoring communication brought up properly.
>
> 2.isolated the iso domain to a different IP and blocked only the
> communication to that IP (nfs/tcp).
>
> -agent got to a point where : (getEngineVMOVF) Extracting Engine VM OVF
> from the OVF_STORE and didn't move from there
> after restoring connectivity the agent recovered and vm was up and running.

I think that "network down" here means a physically disconnected NIC with no link up, which might be preventing the ha-agent from reaching the broker on the host. iptables won't simulate a real network disconnection, as it only blocks a specific range of IPs or ports while the TCP/IP stack continues working on that interface.

I'd try to reproduce this issue by disabling the NIC and then bringing it back online.
(In reply to Nikolai Sednev from comment #24)
> (In reply to Yanir from comment #23)
> > 2 attempts to reproduce this issue by blocking storage communication (using
> > iptables):
> >
> > 1.block communication to nfs port on the storage server.
> >
> > -vm was down and after restoring communication brought up properly.
> >
> > 2.isolated the iso domain to a different IP and blocked only the
> > communication to that IP (nfs/tcp).
> >
> > -agent got to a point where : (getEngineVMOVF) Extracting Engine VM OVF
> > from the OVF_STORE and didn't move from there
> > after restoring connectivity the agent recovered and vm was up and running.
>
> I think that "network down" here is meant for disconnected physically NIC
> with no link up, which might be preventing ha-agent from getting to broker
> on host.
> iptables won't simulate real network disconnection, as they're blocking only
> specific range of IPs or ports, and TCP/IP stack continues working on that
> interface.

The network can be disconnected on the other side of the path...
A drop (vs. block) should simulate it well enough.

>
> I'd be trying to reproduce this issue with disabling NIC and then bringing
> it back on line.
(In reply to Yaniv Kaul from comment #25)
> (In reply to Nikolai Sednev from comment #24)
> > (In reply to Yanir from comment #23)
> > > 2 attempts to reproduce this issue by blocking storage communication (using
> > > iptables):
> > >
> > > 1.block communication to nfs port on the storage server.
> > >
> > > -vm was down and after restoring communication brought up properly.
> > >
> > > 2.isolated the iso domain to a different IP and blocked only the
> > > communication to that IP (nfs/tcp).
> > >
> > > -agent got to a point where : (getEngineVMOVF) Extracting Engine VM OVF
> > > from the OVF_STORE and didn't move from there
> > > after restoring connectivity the agent recovered and vm was up and running.
> >
> > I think that "network down" here is meant for disconnected physically NIC
> > with no link up, which might be preventing ha-agent from getting to broker
> > on host.
> > iptables won't simulate real network disconnection, as they're blocking only
> > specific range of IPs or ports, and TCP/IP stack continues working on that
> > interface.
>
> The network can be disconnected on the other side of the path...
> A drop (vs. block) should simulate it well enough.
>
> > I'd be trying to reproduce this issue with disabling NIC and then bringing
> > it back on line.

Totally possible; the network might get lost in several possible scenarios. Even removing the IP address from the NIC might work.
(In reply to Nikolai Sednev from comment #26)
> (In reply to Yaniv Kaul from comment #25)
> > (In reply to Nikolai Sednev from comment #24)
> > > (In reply to Yanir from comment #23)
> > > > 2 attempts to reproduce this issue by blocking storage communication (using
> > > > iptables):
> > > >
> > > > 1.block communication to nfs port on the storage server.
> > > >
> > > > -vm was down and after restoring communication brought up properly.
> > > >
> > > > 2.isolated the iso domain to a different IP and blocked only the
> > > > communication to that IP (nfs/tcp).
> > > >
> > > > -agent got to a point where : (getEngineVMOVF) Extracting Engine VM OVF
> > > > from the OVF_STORE and didn't move from there
> > > > after restoring connectivity the agent recovered and vm was up and running.
> > >
> > > I think that "network down" here is meant for disconnected physically NIC
> > > with no link up, which might be preventing ha-agent from getting to broker
> > > on host.
> > > iptables won't simulate real network disconnection, as they're blocking only
> > > specific range of IPs or ports, and TCP/IP stack continues working on that
> > > interface.
> >
> > The network can be disconnected on the other side of the path...
> > A drop (vs. block) should simulate it well enough.
> >
> > > I'd be trying to reproduce this issue with disabling NIC and then bringing
> > > it back on line.
>
> Totally possible, network might get lost with several possible scenarios.
> Even erasing the IP address from NIC might work.

Blocking with iptables was done using a DROP rule (on the NFS port) on the storage server side, e.g.:

sudo iptables -I OUTPUT -d <IP> -p tcp --dport=2049 -j DROP
sudo iptables -I INPUT -d <IP> -p tcp --dport=2049 -j DROP
(In reply to Nikolai Sednev from comment #24)
> I think that "network down" here is meant for disconnected physically NIC
> with no link up, which might be preventing ha-agent from getting to broker
> on host.

AFAICS, the ha-agent and ha-broker talk via AF_UNIX, and this problem also reproduces with packages where BZ 1377161 is fixed. From what I can see, the problem is not between the agent and the broker, but between the broker and vdsm (refresh_local_conf_file). The communication channel between the broker and vdsm is just fine, but vdsm simply doesn't answer the broker because it's too worried about the unreachable ISO SD. So everything times out and the agent restarts.

> iptables won't simulate real network disconnection, as they're blocking only
> specific range of IPs or ports, and TCP/IP stack continues working on that
> interface.

Right, iptables is different from a cable unplug, as the interface link doesn't go down and the host IP is not removed from the FIB. But I don't think this is relevant here. The HE SD is over FC, which is never disconnected. When the network goes down, it looks like the ha-agent hangs on refreshing its local conf files because vdsm is "busy" on the ISO domain, which is now unreachable (and this probably reproduces with any other SD that depends on the network). This is similar to your attempt #2. Attempt #1 is unrelated to this problem.

If the ISO domain (or possibly any other iSCSI/NFS domain) is unreachable due to the network being down, it doesn't really seem to matter whether the disconnect is local or at a different point in the network.
Hi, Any progress on this? Is more data required? Thanks
Hi Germano, we think we know all we need and we even have reproduction steps. We will need some time to fix this, though, as the storage interaction is a bit tricky.
Hi Martin, looking at the "Depends On" bug I understand the amount of work involved is substantial. Thank you!
Moving to 4.2 as we need to cooperate with the storage team on this.
Martin, can you explain which storage-related issues are blocking you here?
The issue here is that a lost connection to an unrelated NFS ISO domain causes the monitoring loop (we call repoStats, iirc) to stall and wait for the NFS timeout before deciding we have an issue. This is related to our request for a domain-specific status check verb. We obviously also lose the connection to the hosted-engine-related storage in case of a total network failure, but that is nothing we can solve in software.
(In reply to Martin Sivák from comment #34)
> The issue here is that a lost connection to an unrelated NFS ISO domain
> causes the monitoring loop (we call repoStats iirc) to stale and wait for
> the NFS timeout before deciding we have an issue.

repoStats cannot block (I looked at the code again now). It must be some other request accessing storage, or the hosted engine agent accessing storage directly (like bug 1332813?).

> This is related to our
> request for domain specific status check verb.

I'm not sure this is really needed, but I posted this:
https://gerrit.ovirt.org/77224
Any update?
I think that until https://bugzilla.redhat.com/show_bug.cgi?id=1524119 is properly verified, this bug should remain open, because otherwise we won't get proper migration of the SHE-VM from an ha-host with a failed network interface to another best-score ha-host.
1. Hosted Engine storage on Fibre Channel
2. Host is running HE
3. Host loses network
4. Score goes down by 1600 points
5. ha-agent decides to shut down the HE VM
6. Broker does not respond to the agent (notify state_transition)
7. Agent restarts, goto 6. (the agent talks fine to the broker on restart, but times out on state_transition)

After step 3 I can clearly see from the second ha-host (purple-vds2) that the score of the first ha-host, which was running the SHE-VM, has changed to 1800 instead of 1600 as was reported. Then, after a few moments, it dropped to 0.

[root@purple-vds2 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : purple-vds1.qa.lab.tlv.redhat.com
Host ID                : 1
Engine status          : {"health": "good", "vm": "up", "detail": "Up"}
Score                  : 1800
stopped                : False
Local maintenance      : False
crc32                  : c155f4ad
local_conf_timestamp   : 73843
Host timestamp         : 73839
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73839 (Wed Dec 20 15:09:12 2017)
    host-id=1
    score=1800
    vm_conf_refresh_time=73843 (Wed Dec 20 15:09:16 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : purple-vds2.qa.lab.tlv.redhat.com
Host ID                : 2
Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 744c0e8a
local_conf_timestamp   : 73748
Host timestamp         : 73745
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73745 (Wed Dec 20 15:09:23 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=73748 (Wed Dec 20 15:09:27 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

[root@purple-vds2 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : purple-vds1.qa.lab.tlv.redhat.com
Host ID                : 1
Engine status          : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                  : 0
stopped                : True
Local maintenance      : False
crc32                  : cc791af5
local_conf_timestamp   : 73924
Host timestamp         : 73899
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73899 (Wed Dec 20 15:10:11 2017)
    host-id=1
    score=0
    vm_conf_refresh_time=73924 (Wed Dec 20 15:10:37 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : purple-vds2.qa.lab.tlv.redhat.com
Host ID                : 2
Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 0c7ebc95
local_conf_timestamp   : 73854
Host timestamp         : 73850
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73850 (Wed Dec 20 15:11:09 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=73854 (Wed Dec 20 15:11:13 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

Then failover started shifting the SHE-VM to the second ha-host, vds2:

[root@purple-vds2 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : False
Hostname               : purple-vds1.qa.lab.tlv.redhat.com
Host ID                : 1
Engine status          : unknown stale-data
Score                  : 0
stopped                : True
Local maintenance      : False
crc32                  : cc791af5
local_conf_timestamp   : 73924
Host timestamp         : 73899
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73899 (Wed Dec 20 15:10:11 2017)
    host-id=1
    score=0
    vm_conf_refresh_time=73924 (Wed Dec 20 15:10:37 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : purple-vds2.qa.lab.tlv.redhat.com
Host ID                : 2
Engine status          : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 5a9b55b8
local_conf_timestamp   : 73982
Host timestamp         : 73978
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73978 (Wed Dec 20 15:13:16 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=73982 (Wed Dec 20 15:13:21 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False

On the first ha-host I've seen the error message that appears in the attachment. Finally, the SHE-VM got started on the second ha-host, vds2:

[root@purple-vds2 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : False
Hostname               : purple-vds1.qa.lab.tlv.redhat.com
Host ID                : 1
Engine status          : unknown stale-data
Score                  : 0
stopped                : True
Local maintenance      : False
crc32                  : cc791af5
local_conf_timestamp   : 73924
Host timestamp         : 73899
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=73899 (Wed Dec 20 15:10:11 2017)
    host-id=1
    score=0
    vm_conf_refresh_time=73924 (Wed Dec 20 15:10:37 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : purple-vds2.qa.lab.tlv.redhat.com
Host ID                : 2
Engine status          : {"health": "good", "vm": "up", "detail": "Up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 52b8c1b3
local_conf_timestamp   : 74141
Host timestamp         : 74137
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=74137 (Wed Dec 20 15:15:55 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=74141 (Wed Dec 20 15:15:59 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

Moving to verified, as the reported issue has not been reproduced any more.

Tested with these components:
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64
Created attachment 1370467 [details] Screenshot from 2017-12-20 15-15-45.png
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1472