Description of problem:
HA score on the migration source host is zeroed. See time frame 6:09 in the screencast here: https://drive.google.com/a/redhat.com/file/d/1u0Z74jwESdOKPJqqOsAO1rL7XDvY_7KE/view?usp=sharing

Version-Release number of selected component (if applicable):
ovirt-engine-appliance-4.2-20171210.1.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.1-0.0.master.20171206172737.gitd3001c8.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE over Gluster on a pair of HA hosts using --ansible.
2. Attach an NFS data storage domain.
3. Assuming host2 is running the SHE-VM and is the SPM, migrate the SHE-VM from the UI to host1.
4. Once the environment is stable and the SHE-VM is running on host1, migrate the SHE-VM back to host2.
5. The HA score on host1 is zeroed for a moment (see the polling sketch below).

Actual results:
HA score on the migration source host is zeroed.

Expected results:
HA score should not be zeroed on the source host.

Additional info:
Logs from both hosts and the engine are attached. Look at the logs from 12/12/17 1:40 PM.
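For reference, a hypothetical polling helper like the one below (not part of the attached logs) can be used on the source host to catch the transient score drop. It assumes this build's "hosted-engine --vm-status" supports the --json output flag and that each host entry in the JSON carries "hostname" and "score" fields.

#!/usr/bin/env python
# Hypothetical helper: print the reported HA score for each host once per
# second so the brief drop to zero on the migration source host is visible.
import json
import subprocess
import time

while True:
    out = subprocess.check_output(['hosted-engine', '--vm-status', '--json'])
    status = json.loads(out)
    for key, value in status.items():
        # Host entries are dicts keyed by host id; skip scalar flags such
        # as global_maintenance.
        if not isinstance(value, dict):
            continue
        print('%s host=%s score=%s' % (
            time.strftime('%H:%M:%S'),
            value.get('hostname', key),
            value.get('score')))
    time.sleep(1)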
Created attachment 1366610 [details] logs from source host alma03
Created attachment 1366611 [details] logs from destination host alma04
Created attachment 1366616 [details] logs from the engine
What is 'for a moment'?
(In reply to Yaniv Kaul from comment #4)
> What is 'for a moment'?

For the time frame 6:09-6:14, as it appears in the screencast. Zeroed for ~5 seconds.
It seems like an issue talking to vdsm, which resulted in restarting the agent. This is why you see the score at zero. We need to understand what the issue with vdsm was that caused the restart:

MainThread::INFO::2017-12-12 13:43:21,220::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::ERROR::2017-12-12 13:43:41,857::hosted_engine::434::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 431, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 476, in _monitoring_loop
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 15 seconds
MainThread::WARNING::2017-12-12 13:43:43,878::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2017-12-12 13:43:43,878::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2017-12-12 13:43:54,383::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.1-0.0.master.20171130131317 started
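For clarity, here is a minimal sketch (not the actual ovirt_hosted_engine_ha code) of the retry-then-fail pattern the traceback points at: the agent keeps trying to open a JSON-RPC connection to vdsm and, after VDSM_MAX_RETRY * VDSM_DELAY seconds, raises the RuntimeError seen above; the monitoring loop treats it as an unhandled error, the agent restarts, and the host score is published as 0 until monitoring resumes. Names and constants below are illustrative.

import logging
import time

VDSM_MAX_RETRY = 15   # assumed retry count
VDSM_DELAY = 1        # assumed delay between retries, in seconds


def _create_json_rpc_client():
    # Stand-in for the real vdsm JSON-RPC client factory; it always fails
    # here to simulate vdsm being unreachable while it restarts.
    raise OSError('connection refused')


def connect_vdsm_json_rpc(logger):
    for attempt in range(VDSM_MAX_RETRY):
        try:
            return _create_json_rpc_client()
        except OSError:
            logger.debug('vdsm not reachable, attempt %d', attempt + 1)
            time.sleep(VDSM_DELAY)
    # After the retries are exhausted the agent gives up and restarts,
    # which is when the score is reported as 0.
    raise RuntimeError(
        "Couldn't connect to VDSM within %d seconds"
        % (VDSM_MAX_RETRY * VDSM_DELAY))


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    connect_vdsm_json_rpc(logging.getLogger('demo'))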
I've also seen these errors on alma04 (http://pastebin.test.redhat.com/540340), which may be a side effect though.
Could the code changes in the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1524119 be related to this issue?
Zeroing the score for 5 seconds is not a big deal, and a failure when connecting to VDSM is definitely the cause. But to know why it failed to connect we need to know:

- What was the VDSM version?
- Did vdsm restart?

And yes, the issues we uncovered in vdsm might have something to do with it.
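A hypothetical triage snippet for those two questions (nothing here is taken from the attached logs; only standard rpm/systemctl invocations are used): it prints the installed vdsm package version and when the vdsmd service last entered the active state, where a timestamp close to the migration would indicate a restart.

import subprocess


def run(cmd):
    # Run a command and return its trimmed stdout.
    return subprocess.check_output(cmd).decode().strip()


print('vdsm package:  %s' % run(['rpm', '-q', 'vdsm']))
print('vdsmd started: %s' % run(
    ['systemctl', 'show', 'vdsmd', '--property=ActiveEnterTimestamp']))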
Please check again with a fix for bug 1524119. If the issue still exists, please re-open.

*** This bug has been marked as a duplicate of bug 1524119 ***
Created attachment 1367440 [details] latest screencast
Artyom could not migrate because of broken vdsm recovery (bug 1524119 and bug 1522901), which was only needed because vdsm crashed (bug 1522878).

This bug happened specifically because of bug 1522878 and is either a duplicate or not a bug at all (a crashing VDSM MUST cause the score to drop; that is not a bug!). We can use the other bug number to close this one, but it does not change the resolution. The root cause for both of them is the same, except that we found another issue in the middle in Artyom's case.

Ansible vs. regular deployment does not matter here, because it only affects deployment and the engine database; the part being tested here is the same in both flows.