Bug 1524989
Summary: HA score on migration source host zeroed.
Product: [oVirt] ovirt-engine
Component: BLL.HostedEngine
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Version: 4.2.0
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Reporter: Nikolai Sednev <nsednev>
Assignee: Doron Fediuck <dfediuck>
QA Contact: meital avital <mavital>
CC: bugs, msivak, nsednev
Keywords: Reopened
Flags: nsednev: planning_ack?, nsednev: devel_ack?, nsednev: testing_ack?
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2017-12-13 13:14:26 UTC
Type: Bug
Regression: ---
oVirt Team: SLA
Description
Nikolai Sednev, 2017-12-12 12:05:13 UTC
Created attachment 1366610 [details]
logs from source host alma03
Created attachment 1366611 [details]
logs from destination host alma04
Created attachment 1366616 [details]
logs from the engine
What is 'for a moment'?

(In reply to Yaniv Kaul from comment #4)
> What is 'for a moment'?

For a time frame of 6:09-6:14, as it appears on the screencast.

Zeroed for ~5 seconds. It seems like an issue talking to vdsm, which resulted in restarting the agent. This is why you see the score of zero. We need to understand what the issue with vdsm was that caused the restart:

MainThread::INFO::2017-12-12 13:43:21,220::hosted_engine::494::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::ERROR::2017-12-12 13:43:41,857::hosted_engine::434::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 431, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 476, in _monitoring_loop
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 15 seconds
MainThread::WARNING::2017-12-12 13:43:43,878::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2017-12-12 13:43:43,878::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2017-12-12 13:43:54,383::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.1-0.0.master.20171130131317 started

I've also seen these http://pastebin.test.redhat.com/540340 on alma04, which may be a side effect though. Could the code changes from the fix in https://bugzilla.redhat.com/show_bug.cgi?id=1524119 be related to this issue?

Zeroing the score for 5 seconds is not a big deal, and a failure when connecting to VDSM is definitely the cause. But to know why it failed to connect we need to know:
- What was the VDSM version?
- Did vdsm restart?

And yes, the issues we uncovered in vdsm might have something to do with it. Please check again with a fix for Bug 1524119. If the issue still exists, please re-open.

*** This bug has been marked as a duplicate of bug 1524119 ***

Created attachment 1367440 [details]
latest screencast
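Note on the traceback above: the 15-second figure in the RuntimeError is simply the product VDSM_MAX_RETRY * VDSM_DELAY visible in the last frame. The sketch below illustrates what such a bounded retry loop looks like and why exhausting it briefly zeroes the score; it is not the actual ovirt_hosted_engine_ha/lib/util.py code, and the constant values and the connect_attempt callable are assumptions for illustration only.

```python
import time

# Names taken from the traceback above; the values here are illustrative
# assumptions chosen only so the product matches the 15 seconds in the error.
VDSM_MAX_RETRY = 15
VDSM_DELAY = 1  # seconds between attempts


def connect_vdsm_json_rpc_sketch(connect_attempt, logger=None):
    """Retry `connect_attempt` up to VDSM_MAX_RETRY times, sleeping
    VDSM_DELAY seconds between tries.

    `connect_attempt` is a hypothetical callable standing in for the real
    JSON-RPC client setup; it should return a connected client or raise.
    """
    for attempt in range(VDSM_MAX_RETRY):
        try:
            return connect_attempt()
        except Exception as exc:
            if logger:
                logger.debug("VDSM not reachable yet (try %d): %s",
                             attempt + 1, exc)
            time.sleep(VDSM_DELAY)
    # This is the error path visible in the agent log above: the unhandled
    # RuntimeError aborts the monitoring loop, the agent restarts, and the
    # host briefly reports a score of 0 until monitoring resumes.
    raise RuntimeError("Couldn't connect to VDSM within %s seconds"
                       % (VDSM_MAX_RETRY * VDSM_DELAY))
```

That restart window matches the ~5 seconds of zeroed score reported above: the score recovers as soon as the agent reinitializes its VDSM connection.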
Artyom could not migrate because of broken vdsm recovery (bug 1524119 and bug 1522901), which was only needed because vdsm crashed (bug 1522878).

This bug happened specifically because of bug 1522878, and is either a duplicate or not a bug at all (crashing VDSM MUST cause the score to drop; that is not a bug!). We can use the other bug number to close this bug, but it does not change the resolution. The root cause for both of them is the same, except that in Artyom's case we found another issue in the middle.

Ansible vs. regular deployment does not matter here, because it only affects deployment and the engine database; the part tested here is the same in both flows.