Bug 1420239 - [downstream clone - 4.0.7] Adding host to engine failed at first time but host was auto recovered after several mins
Summary: [downstream clone - 4.0.7] Adding host to engine failed at first time but hos...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.0.7
Target Release: ---
Assignee: Yevgeny Zaspitsky
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1415471
Blocks:
 
Reported: 2017-02-08 09:39 UTC by rhev-integ
Modified: 2020-04-15 15:14 UTC
CC List: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1415471
Environment:
Last Closed: 2017-03-16 15:33:12 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0542 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0.7 2017-03-16 19:25:04 UTC
oVirt gerrit 71452 0 master MERGED engine: refactor HostConnectivityChecker 2017-02-08 09:42:08 UTC
oVirt gerrit 71682 0 master MERGED engine: overcome VDSM recovery in PollVDSCommand 2017-02-08 09:42:08 UTC
oVirt gerrit 71776 0 ovirt-engine-4.1 MERGED engine: refactor HostConnectivityChecker 2017-02-08 09:42:08 UTC
oVirt gerrit 71777 0 ovirt-engine-4.1 MERGED engine: overcome VDSM recovery in PollVDSCommand 2017-02-08 09:42:08 UTC
oVirt gerrit 71875 0 None None None 2017-02-08 21:33:48 UTC
oVirt gerrit 71876 0 None None None 2017-02-08 21:34:32 UTC

Description rhev-integ 2017-02-08 09:39:33 UTC
+++ This bug is an upstream to downstream clone. The original bug is: +++
+++   bug 1415471 +++
======================================================================

Created attachment 1243307 [details]
engine log

Description of problem:
Adding RHVH to the engine failed the first time, but RHVH was auto-recovered after several minutes.

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.0.1-0.1.el7
redhat-virtualization-host-4.1-20170120
imgbased-0.9.6-0.1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install a rhvh4.1
2. Reboot the rhvh
3. Add host to engine

Actual results:
1. After step #3, adding the host to the engine failed.
2. When adding the host failed, after waiting about 2 minutes the RHVH recovered automatically and its status was Up on the RHV-M side.

Expected results:
1. After step #3, the host can be added to the engine successfully without any error.

Additional info:
Modify "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf, also caught this erro

(Originally by Daijie Guo)

Comment 1 rhev-integ 2017-02-08 09:39:48 UTC
Created attachment 1243308 [details]
host deploy log

(Originally by Daijie Guo)

Comment 3 rhev-integ 2017-02-08 09:39:57 UTC
Created attachment 1243309 [details]
vdsm

(Originally by Daijie Guo)

Comment 4 rhev-integ 2017-02-08 09:40:06 UTC
Created attachment 1243310 [details]
network scripts

(Originally by Daijie Guo)

Comment 5 rhev-integ 2017-02-08 09:40:15 UTC
No such issue in the previous build, so it's a regression.

(Originally by Daijie Guo)

Comment 6 rhev-integ 2017-02-08 09:40:25 UTC
Is this reproducible on RHEL?

(Originally by Ryan Barry)

Comment 7 rhev-integ 2017-02-08 09:40:34 UTC
(In reply to Ryan Barry from comment #5)
> Is this reproducible on RHEL?

Yes, it can also be reproduced on RHEL 7.3

[root@rhel7 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)
[root@rhel7 ~]# rpm -qa|grep vdsm
vdsm-hook-vmfex-dev-4.19.2-2.el7ev.noarch
vdsm-python-4.19.2-2.el7ev.noarch
vdsm-jsonrpc-4.19.2-2.el7ev.noarch
vdsm-cli-4.19.2-2.el7ev.noarch
vdsm-yajsonrpc-4.19.2-2.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
vdsm-xmlrpc-4.19.2-2.el7ev.noarch
vdsm-api-4.19.2-2.el7ev.noarch

(Originally by Daijie Guo)

Comment 8 rhev-integ 2017-02-08 09:40:44 UTC
Does it happen only with disableNetworkManager=False?

(Originally by danken)

Comment 9 rhev-integ 2017-02-08 09:40:52 UTC
It seems that vdsm was in recovery mode:

2017-01-22 04:10:19,584-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-7-thread-22) [67c967f0] Error: Recovering from crash or Initializing

(Originally by Piotr Kliczewski)

Comment 10 rhev-integ 2017-02-08 09:41:01 UTC
Proposing as blocker for 4.1

(Originally by Sandro Bonazzola)

Comment 11 rhev-integ 2017-02-08 09:41:11 UTC
(In reply to Dan Kenigsberg from comment #7)
> Does it happen only with disableNetworkManager=False?

Yes; as noted in the bug description, "VDSM/disableNetworkManager=bool:False" was set in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf.

(Originally by Ying Cui)

Comment 12 rhev-integ 2017-02-08 09:41:19 UTC
The additional info in the description may not have been clear, so I should clarify:

I have tested the two scenarios below; both hit this bug.

1. "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf
After registering to the engine, the NetworkManager daemon is kept running.

2. "VDSM/disableNetworkManager=bool:True" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf
After registering to the engine, the NetworkManager daemon is not running.
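
For reference, a minimal sketch of what that drop-in file could contain in either scenario (assuming the usual otopi [environment:default] section header; only the bool value differs between the two cases):

# /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf
[environment:default]
# bool:True  -> host deploy disables the NetworkManager daemon
# bool:False -> NetworkManager is left running after registration
VDSM/disableNetworkManager=bool:False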

(Originally by Daijie Guo)

Comment 13 rhev-integ 2017-02-08 09:41:29 UTC
Hi

I managed to reproduce this issue once a few days ago (with a RHEL server).
Note that although the host was auto-recovered, the 'ovirtmgmt' network wasn't persisted, so such a host won't come up after a reboot.

(Originally by Michael Burman)

Comment 14 rhev-integ 2017-02-08 09:41:38 UTC
I am not seeing any issue on the networking side.

Setup started at 04:10:20 and ended at 04:10:49.
It took 20 seconds for DHCP to return the same IP on the management bridge, and then the connectivity check succeeded and returned OK.

I am not clear on what happened on the engine side; it makes sense for it to lose connectivity for 20 seconds, but it should have recovered. The pings seem to have arrived at VDSM (see vdsm.log), so I am not clear what the errors in the engine logs are about.

(Originally by edwardh)

Comment 15 rhev-integ 2017-02-08 09:41:47 UTC
I suggest attempting to reproduce it with the latest jsonrpc, version 1.3.8.

(Originally by Piotr Kliczewski)

Comment 16 rhev-integ 2017-02-08 09:41:55 UTC
We seem to have several points that need treatment (they are not related to each other):
- On host deploy, when starting vdsmd, it may take a while until VDSM can actually service incoming requests (it is in recovery mode). It makes sense to block on service start until VDSM is indeed ready to accept requests (see the sketch below).

- Based on the engine logs, the first setupNetworks flow (issued after the deploy scripts) failed because VDSM was in recovery state, a case the engine does not handle (the engine does not expect VDSM to be in recovery state).

- It is not exactly clear why, but after the first flow failure mentioned above another flow is created and issues an additional setupNetworks, which is handled by VDSM but fails on the engine side due to an RPC connectivity problem.

For the 2nd point, yzaspits is posting a proposed patch.
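
To illustrate the first point, here is a minimal sketch of the "block until VDSM is ready" idea. This is not the actual host-deploy or engine code: the probe callable is a hypothetical stand-in for whatever poll the flow would issue, and it is assumed to raise NotReady while VDSM still answers with "Recovering from crash or Initializing".

import time

class NotReady(Exception):
    """Raised by the probe while VDSM is still recovering."""

def wait_until_vdsm_ready(probe, timeout=180, interval=3):
    """Retry `probe` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            probe()   # hypothetical poll of VDSM, e.g. the request PollVDSCommand sends
            return    # VDSM answered normally -> ready to accept requests
        except NotReady:
            if time.monotonic() >= deadline:
                raise     # still recovering after the timeout -> give up
            time.sleep(interval)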

(Originally by edwardh)

Comment 18 Michael Burman 2017-02-19 15:56:23 UTC
Verified on - 4.0.7.1-0.1.el7ev

Comment 20 errata-xmlrpc 2017-03-16 15:33:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html

