Bug 1524989
Summary: HA score on migration source host zeroed.
Product: [oVirt] ovirt-engine
Component: BLL.HostedEngine
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Version: 4.2.0
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Reporter: Nikolai Sednev <nsednev>
Assignee: Doron Fediuck <dfediuck>
QA Contact: meital avital <mavital>
CC: bugs, msivak, nsednev
Keywords: Reopened
Flags: nsednev: planning_ack?, nsednev: devel_ack?, nsednev: testing_ack?
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2017-12-13 13:14:26 UTC
Type: Bug
Regression: ---
oVirt Team: SLA
Description
Nikolai Sednev, 2017-12-12 12:05:13 UTC
Created attachment 1366610 [details]
logs from source host alma03
Created attachment 1366611 [details]
logs from destination host alma04
Created attachment 1366616 [details]
logs from the engine
What is 'for a moment'?

(In reply to Yaniv Kaul from comment #4)
> What is 'for a moment'?

For a time frame of 6:09-6:14, as it appears on the screencast.

Zeroed for ~5 seconds. It seems like an issue talking to vdsm, which resulted in restarting the agent. This is why you see the score of zero. We need to understand what the issue with vdsm was that caused the restart:

MainThread::INFO::2017-12-12 13:43:21,220::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::ERROR::2017-12-12 13:43:41,857::hosted_engine::434::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 431, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 476, in _monitoring_loop
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 15 seconds
MainThread::WARNING::2017-12-12 13:43:43,878::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2017-12-12 13:43:43,878::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2017-12-12 13:43:54,383::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.1-0.0.master.20171130131317 started

I've also seen these http://pastebin.test.redhat.com/540340 on alma04, which may be a side effect though. Could the code changes from the fix in https://bugzilla.redhat.com/show_bug.cgi?id=1524119 be related to this issue?

Zeroing the score for 5 seconds is not a big deal, and a failure when connecting to VDSM is definitely the cause. But to know why it failed to connect we need to know:
- What was the VDSM version?
- Did vdsm restart?

And yes, the issues we uncovered in vdsm might have something to do with it. Please check again with a fix for Bug 1524119. If the issue still exists, please re-open.

*** This bug has been marked as a duplicate of bug 1524119 ***

Created attachment 1367440 [details]
latest screencast
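Note on the traceback above: the 15-second figure in the RuntimeError is simply the product VDSM_MAX_RETRY * VDSM_DELAY visible in the last frame. The sketch below illustrates what such a bounded retry loop looks like and why exhausting it briefly zeroes the score; it is not the actual ovirt_hosted_engine_ha/lib/util.py code, and the constant values and the connect_attempt callable are assumptions for illustration only.

```python
import time

# Names taken from the traceback above; the values here are illustrative
# assumptions chosen only so the product matches the 15 seconds in the error.
VDSM_MAX_RETRY = 15
VDSM_DELAY = 1  # seconds between attempts


def connect_vdsm_json_rpc_sketch(connect_attempt, logger=None):
    """Retry `connect_attempt` up to VDSM_MAX_RETRY times, sleeping
    VDSM_DELAY seconds between tries.

    `connect_attempt` is a hypothetical callable standing in for the real
    JSON-RPC client setup; it should return a connected client or raise.
    """
    for attempt in range(VDSM_MAX_RETRY):
        try:
            return connect_attempt()
        except Exception as exc:
            if logger:
                logger.debug("VDSM not reachable yet (try %d): %s",
                             attempt + 1, exc)
            time.sleep(VDSM_DELAY)
    # This is the error path visible in the agent log above: the unhandled
    # RuntimeError aborts the monitoring loop, the agent restarts, and the
    # host briefly reports a score of 0 until monitoring resumes.
    raise RuntimeError("Couldn't connect to VDSM within %s seconds"
                       % (VDSM_MAX_RETRY * VDSM_DELAY))
```

That restart window matches the ~5 seconds of zeroed score reported above: the score recovers as soon as the agent reinitializes its VDSM connection.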
Artyom could not migrate because of broken vdsm recovery (bug 1524119 and bug 1522901), which was only needed because vdsm crashed (bug 1522878).

This bug happened specifically because of bug 1522878, and is either a duplicate or not a bug at all (crashing VDSM MUST cause the score to drop; that is not a bug!). We can use the other bug number to close this bug, but it does not change the resolution. The root cause for both of them is the same, except that in Artyom's case we found another issue in the middle.

Ansible vs. regular deployment does not matter here, because it only affects deployment and the engine database; the part tested here is the same in both flows.