1415471 – Adding host to engine failed at first time but host was auto recovered after several mins

Bug 1415471 - Adding host to engine failed at first time but host was auto recovered after several mins

Summary: Adding host to engine failed at first time but host was auto recovered after ...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	General
Sub Component:
Version:	4.1.0.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	ovirt-4.1.1
Target Release:	4.1.1
Assignee:	Yevgeny Zaspitsky
QA Contact:	Michael Burman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1420239
TreeView+	depends on / blocked

Reported:	2017-01-22 10:30 UTC by dguo
Modified:	2023-09-14 03:52 UTC (History)
CC List:	19 users (show)
Fixed In Version:
Clone Of:
Clones:	1420239 (view as bug list)
Environment:
Last Closed:	2017-04-21 09:39:02 UTC
oVirt Team:	Network
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.1+ rule-engine: blocker+ cshao: testing_ack+

Attachments	(Terms of Use)
engine log (142.62 KB, text/plain) 2017-01-22 10:30 UTC, dguo	no flags	Details
host deploy log (513.74 KB, text/plain) 2017-01-22 10:32 UTC, dguo	no flags	Details
vdsm (10.27 KB, application/x-gzip) 2017-01-22 10:33 UTC, dguo	no flags	Details
network scripts (44.15 KB, application/x-gzip) 2017-01-22 10:33 UTC, dguo	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	71452	master	MERGED	engine: refactor HostConnectivityChecker	2017-02-05 19:38:53 UTC
oVirt gerrit	71682	master	MERGED	engine: overcome VDSM recovery in PollVDSCommand	2017-02-07 11:45:30 UTC
oVirt gerrit	71776	ovirt-engine-4.1	MERGED	engine: refactor HostConnectivityChecker	2017-02-08 08:26:47 UTC
oVirt gerrit	71777	ovirt-engine-4.1	MERGED	engine: overcome VDSM recovery in PollVDSCommand	2017-02-08 08:26:51 UTC

Description dguo 2017-01-22 10:30:22 UTC

Created attachment 1243307 [details]
engine log

Description of problem:
dding rhvh to engine failed at first time but rhvh was autorecovered after several mins

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.0.1-0.1.el7
redhat-virtualization-host-4.1-20170120
imgbased-0.9.6-0.1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install a rhvh4.1
2. Reboot the rhvh
3. Add host to engine

Actual results:
1. After step #3, add host to engine failed
2. When adding host to engine failed, wait for 2 mins, the rhvh were recover automatically and status up on rhvm side. 

Expected results:
1. After step#3, the host can be added to engine successfully without any error

Additional info:
Modify "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf, also caught this erro

Comment 1 dguo 2017-01-22 10:32:54 UTC

Created attachment 1243308 [details]
host deploy log

Comment 2 dguo 2017-01-22 10:33:19 UTC

Created attachment 1243309 [details]
vdsm

Comment 3 dguo 2017-01-22 10:33:44 UTC

Created attachment 1243310 [details]
network scripts

Comment 4 dguo 2017-01-22 10:37:10 UTC

No such issue in previous build, so it'a regression.

Comment 5 Ryan Barry 2017-01-22 15:13:02 UTC

Is this reproducible on RHEL?

Comment 6 dguo 2017-01-23 03:53:01 UTC

(In reply to Ryan Barry from comment #5)
> Is this reproducible on RHEL?

Yes, also can be caught on RHEL7.3

[root@rhel7 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)
[root@rhel7 ~]# rpm -qa|grep vdsm
vdsm-hook-vmfex-dev-4.19.2-2.el7ev.noarch
vdsm-python-4.19.2-2.el7ev.noarch
vdsm-jsonrpc-4.19.2-2.el7ev.noarch
vdsm-cli-4.19.2-2.el7ev.noarch
vdsm-yajsonrpc-4.19.2-2.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
vdsm-xmlrpc-4.19.2-2.el7ev.noarch
vdsm-api-4.19.2-2.el7ev.noarch

Comment 7 Dan Kenigsberg 2017-01-23 13:18:06 UTC

Does it happen only with disableNetworkManager=False?

Comment 8 Piotr Kliczewski 2017-01-23 13:18:46 UTC

It seems that vdsm was in recovery mode:

2017-01-22 04:10:19,584-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-7-thread-22) [67c967f0] Error: Recovering from crash or Initializing

Comment 9 Sandro Bonazzola 2017-01-23 13:26:14 UTC

Proposing as blocker for 4.1

Comment 10 Ying Cui 2017-01-23 13:52:38 UTC

(In reply to Dan Kenigsberg from comment #7)
> Does it happen only with disableNetworkManager=False?

Yes, in bug description "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf.

Comment 11 dguo 2017-01-24 09:17:04 UTC

Maybe not clear for additional info in the description, I should clarify that:

I have tested below two scenarios, both encounter this bug.

1. "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf
After registering engine, the NM daemon is kept running.

2. "VDSM/disableNetworkManager=bool:True" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf
After registering engine, the NM daemon is not running.

Comment 12 Michael Burman 2017-01-26 08:35:21 UTC

Hi

I managed to reproduce this issue one time few days ago(with rhel server).
Note that although the  host was auto recovred the 'ovirtmgmt' network wasn't persisted and such host won't come up after reboot.

Comment 13 Edward Haas 2017-01-31 08:29:27 UTC

I am not seeing any issue on the networking side.

Setup started @04:10:20 and ended @04:10:49
It took 20sec for DHCP to return the same IP on the management bridge and then connectivity check succeeded and returned back an OK.

I am not clear what happened on the engine side, it makes sense for it to loose connectivity for 20sec but it should have recovered. the pings seem to have arrived to VDSM (see vdsm.log) so I am not clear what are the errors in the Engine logs.

Comment 14 Piotr Kliczewski 2017-01-31 08:55:52 UTC

I suggest to attempt to reproduce it with the latest jsonrpc in version 1.3.8.

Comment 15 Edward Haas 2017-01-31 11:23:48 UTC

We seem to have several points that need treatment (but not related):
- On host deploy, when starting vdsmd, it may take a while until VDSM can actually service incoming requests (it is in recovery mode). It makes sense to block on service start until VDSM is indeed ready to accept requests.

- Based on Engine logs, the first flow of setupNetworks attempt (after the deploy scripts) failed because of VDSM being in recovery state and it has not handled such a case (Engine does not expect VDSM to be in recovery state).

- It is not exactly clear why, but after the first flow failure mentioned above another flow is created and issues an additional setupNetworks that is handled by VDSM but fails on Engine side due to RPC connectivity problem.

For the 2nd point, yzaspits is posting a proposed patch.

Comment 17 Michael Burman 2017-02-13 08:56:09 UTC

Verified on - rhevm-4.1.1-0.1.el7.noarch

Comment 19 Red Hat Bugzilla 2023-09-14 03:52:32 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

Note You need to log in before you can comment on or make changes to this bug.