Bug 1524989 - HA score on migration source host zeroed.
Summary: HA score on migration source host zeroed.
Keywords:
Status: CLOSED DUPLICATE of bug 1524119
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
Version: 4.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Doron Fediuck
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-12 12:05 UTC by Nikolai Sednev
Modified: 2017-12-13 15:55 UTC
3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-12-13 13:14:26 UTC
oVirt Team: SLA
Embargoed:
nsednev: planning_ack?
nsednev: devel_ack?
nsednev: testing_ack?


Attachments
logs from source host alma03 (14.54 MB, application/x-xz) - 2017-12-12 12:09 UTC, Nikolai Sednev
logs from destination host alma04 (16.02 MB, application/x-xz) - 2017-12-12 12:10 UTC, Nikolai Sednev
logs from the engine (201.39 KB, application/x-gzip) - 2017-12-12 12:15 UTC, Nikolai Sednev
latest screencast (12.33 MB, application/octet-stream) - 2017-12-13 15:39 UTC, Nikolai Sednev

Description Nikolai Sednev 2017-12-12 12:05:13 UTC
Description of problem:
HA score on migration source host zeroed.

Take a look at the time frame starting at 6:09 in the screencast here: https://drive.google.com/a/redhat.com/file/d/1u0Z74jwESdOKPJqqOsAO1rL7XDvY_7KE/view?usp=sharing.

Version-Release number of selected component (if applicable):
ovirt-engine-appliance-4.2-20171210.1.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.1-0.0.master.20171206172737.gitd3001c8.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE over Gluster on a pair of ha-hosts using --ansible.
2. Attach an NFS data storage domain.
3. Assuming host2 is running the SHE-VM and is the SPM, migrate the SHE-VM from the UI to host1.
4. Once the environment is stable and the SHE-VM is running on host1, migrate the SHE-VM back to host2.
5. The HA score on host1 is zeroed for a moment (see the monitoring sketch below).
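
One way to watch the score during step 5 is to poll "hosted-engine --vm-status --json" on one of the ha-hosts while the migration runs. The sketch below is only an illustration: the JSON field names ("score", "hostname") are assumptions about the command's output, not something taken from this bug.

# Minimal sketch: poll the HA score of every host once per second.
# Assumes "hosted-engine --vm-status --json" maps host ids to dicts that
# contain "hostname" and "score" (assumed field names, not verified here).
import json
import subprocess
import time

def print_scores():
    out = subprocess.check_output(["hosted-engine", "--vm-status", "--json"])
    status = json.loads(out)
    for host_id, host in status.items():
        if not isinstance(host, dict) or "score" not in host:
            continue  # skip non-host entries such as global maintenance state
        print("%s host %s (%s): score=%s" % (
            time.strftime("%H:%M:%S"), host_id,
            host.get("hostname", "?"), host["score"]))

if __name__ == "__main__":
    while True:          # a drop to 0 on the source host reproduces the report
        print_scores()
        time.sleep(1)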

Actual results:
HA score on migration source host zeroed.

Expected results:
HA score should not be zeroed on source host.


Additional info:
Logs from both hosts and the engine are attached.
Look at the logs from around 12/12/17 1:40 PM.

Comment 1 Nikolai Sednev 2017-12-12 12:09:20 UTC
Created attachment 1366610 [details]
logs from source host alma03

Comment 2 Nikolai Sednev 2017-12-12 12:10:22 UTC
Created attachment 1366611 [details]
logs from destination host alma04

Comment 3 Nikolai Sednev 2017-12-12 12:15:21 UTC
Created attachment 1366616 [details]
logs from the engine

Comment 4 Yaniv Kaul 2017-12-12 15:28:00 UTC
What is 'for a moment' ?

Comment 5 Nikolai Sednev 2017-12-13 06:47:53 UTC
(In reply to Yaniv Kaul from comment #4)
> What is 'for a moment' ?

For the time frame 6:09-6:14, as it appears on the screencast.
Zeroed for ~5 seconds.

Comment 6 Doron Fediuck 2017-12-13 09:43:48 UTC
It seems like there was an issue talking to vdsm, which resulted in restarting the agent. This is why you see the score at zero. We need to understand what the issue with vdsm was that caused the restart:

MainThread::INFO::2017-12-12 13:43:21,220::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::ERROR::2017-12-12 13:43:41,857::hosted_engine::434::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 431, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 476, in _monitoring_loop
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't  connect to VDSM within 15 seconds
MainThread::WARNING::2017-12-12 13:43:43,878::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2017-12-12 13:43:43,878::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2017-12-12 13:43:54,383::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.1-0.0.master.20171130131317 started
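
The traceback shows the agent giving up after VDSM_MAX_RETRY * VDSM_DELAY (the "15 seconds" in the error) and the monitoring loop dying, which is what restarts the agent and publishes a score of 0. A simplified sketch of that retry pattern is below; it is not the real ovirt_hosted_engine_ha code, and create_client is a hypothetical stand-in for the actual JSON-RPC client setup.

# Simplified sketch of the bounded retry implied by the traceback above.
# NOT the real ovirt_hosted_engine_ha/lib/util.py; create_client is a
# hypothetical stand-in and the retry/delay values are assumed so that
# retry * delay == 15 s, matching the error message.
import time

VDSM_MAX_RETRY = 15
VDSM_DELAY = 1

def connect_vdsm_json_rpc(create_client, logger=None):
    for attempt in range(VDSM_MAX_RETRY):
        try:
            return create_client()   # raises while vdsmd is not reachable
        except Exception as e:
            if logger:
                logger.debug("vdsm not ready (attempt %d): %s", attempt + 1, e)
            time.sleep(VDSM_DELAY)
    # Once the budget is spent the monitoring loop dies, the agent restarts,
    # and the HA score is published as 0 until monitoring recovers.
    raise RuntimeError(
        "Couldn't connect to VDSM within %s seconds"
        % (VDSM_MAX_RETRY * VDSM_DELAY))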

Comment 7 Nikolai Sednev 2017-12-13 10:00:07 UTC
I've also seen these errors (http://pastebin.test.redhat.com/540340) on alma04, which may be a side effect though.

Comment 8 Nikolai Sednev 2017-12-13 10:13:05 UTC
Could the code changes in the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1524119 be related to this issue?

Comment 9 Martin Sivák 2017-12-13 10:23:37 UTC
Zeroing the score for 5 seconds is not a big deal, and the failure to connect to VDSM is definitely the cause.

But to know why it failed to connect, we need to know (a quick way to collect both answers is sketched below):

- What was the VDSM version?
- Did vdsm restart?

And yes, the issues we uncovered in vdsm might have something to do with it.
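
The sketch below is one way to collect both answers on the affected host; it only wraps rpm and journalctl, and the journal wording it matches for service start/stop events is an assumption.

# Hypothetical helper for comment 9: report the installed vdsm version and
# any vdsmd start/stop events in the relevant window. The journal message
# matching ("Start"/"Stopp") is an assumption about systemd's wording.
import subprocess

def vdsm_version():
    return subprocess.check_output(["rpm", "-q", "vdsm"]).decode().strip()

def vdsmd_restarts(since="2017-12-12 13:30:00"):
    out = subprocess.check_output(
        ["journalctl", "-u", "vdsmd", "--since", since, "--no-pager"]
    ).decode("utf-8", "replace")
    return [line for line in out.splitlines()
            if "Start" in line or "Stopp" in line]

if __name__ == "__main__":
    print(vdsm_version())
    for line in vdsmd_restarts():
        print(line)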

Comment 10 Doron Fediuck 2017-12-13 10:30:05 UTC
Please check again with the fix for bug 1524119.
If the issue still exists, please re-open.

*** This bug has been marked as a duplicate of bug 1524119 ***

Comment 14 Nikolai Sednev 2017-12-13 15:39:25 UTC
Created attachment 1367440 [details]
latest screencast

Comment 16 Martin Sivák 2017-12-13 15:55:58 UTC
Artyom could not migrate because of broken vdsm recovery (bug 1524119 and bug 1522901), which was only needed because vdsm crashed (bug 1522878).

This bug happened specifically because of bug 1522878, and it is either a duplicate or not a bug at all (a crashing VDSM MUST cause the score to drop; that is not a bug). We can use the other bug number to close this bug, but it does not change the resolution.

The root cause for both of them is the same, except that we found another issue in the middle in Artyom's case.

Ansible vs. regular deployment does not matter here, because it only affects deployment and the engine database; the part you test here is the same in both flows.

