Bug 1415471

Summary: Adding host to engine failed at first time but host was auto recovered after several mins

| Field | Value |
|---|---|
| Product | [oVirt] ovirt-engine |
| Component | General |
| Version | 4.1.0.1 |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | high |
| Reporter | dguo |
| Assignee | Yevgeny Zaspitsky <yzaspits> |
| QA Contact | Michael Burman <mburman> |
| CC | bugs, cshao, dguo, edwardh, eedri, gklein, huzhao, jiawu, jmoon, leiwang, mburman, pkliczew, pstehlik, qiyuan, rbarry, weiwang, yaniwang, ycui, yzhao |
| Target Milestone | ovirt-4.1.1 |
| Target Release | 4.1.1 |
| Keywords | Regression, ZStream |
| Flags | rule-engine: ovirt-4.1+, rule-engine: blocker+, cshao: testing_ack+ |
| Hardware | Unspecified |
| OS | Unspecified |
| oVirt Team | Network |
| Bug Blocks | 1420239 (view as bug list) |
| Type | Bug |
| Last Closed | 2017-04-21 09:39:02 UTC |

Attachments:
- Attachment 1243308 [details]: host deploy log
- Attachment 1243309 [details]: vdsm
- Attachment 1243310 [details]: network scripts
No such issue in the previous build, so it's a regression.

Is this reproducible on RHEL?

(In reply to Ryan Barry from comment #5)
> Is this reproducible on RHEL?

Yes, it can also be reproduced on RHEL 7.3:

```
[root@rhel7 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)
[root@rhel7 ~]# rpm -qa | grep vdsm
vdsm-hook-vmfex-dev-4.19.2-2.el7ev.noarch
vdsm-python-4.19.2-2.el7ev.noarch
vdsm-jsonrpc-4.19.2-2.el7ev.noarch
vdsm-cli-4.19.2-2.el7ev.noarch
vdsm-yajsonrpc-4.19.2-2.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
vdsm-xmlrpc-4.19.2-2.el7ev.noarch
vdsm-api-4.19.2-2.el7ev.noarch
```

Does it happen only with disableNetworkManager=False?

It seems that VDSM was in recovery mode:

```
2017-01-22 04:10:19,584-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-7-thread-22) [67c967f0] Error: Recovering from crash or Initializing
```

Proposing as blocker for 4.1.

(In reply to Dan Kenigsberg from comment #7)
> Does it happen only with disableNetworkManager=False?

Yes; the bug description sets "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf. The additional info in the description may not be clear, so I should clarify: I have tested the two scenarios below, and both hit this bug.

1. "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf — after registering to the engine, the NetworkManager daemon is kept running.
2. "VDSM/disableNetworkManager=bool:True" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf — after registering to the engine, the NetworkManager daemon is not running.

Hi, I managed to reproduce this issue once a few days ago (with a RHEL server). Note that although the host was auto recovered, the 'ovirtmgmt' network wasn't persisted, so such a host won't come up after reboot.

I am not seeing any issue on the networking side.
Setup started @04:10:20 and ended @04:10:49. It took 20 seconds for DHCP to return the same IP on the management bridge; then the connectivity check succeeded and returned an OK. I am not clear on what happened on the engine side. It makes sense for it to lose connectivity for 20 seconds, but it should have recovered. The pings seem to have arrived at VDSM (see vdsm.log), so I am not clear what the errors in the engine logs are. I suggest attempting to reproduce it with the latest jsonrpc, version 1.3.8.

We seem to have several points that need treatment (though they are not related):

- On host deploy, when starting vdsmd, it may take a while until VDSM can actually service incoming requests (it is in recovery mode). It makes sense to block on service start until VDSM is indeed ready to accept requests.
- Based on engine logs, the first setupNetworks attempt (after the deploy scripts) failed because VDSM was in recovery state, and the engine has not handled such a case (the engine does not expect VDSM to be in recovery state).
- It is not exactly clear why, but after the first flow failure mentioned above, another flow is created and issues an additional setupNetworks that is handled by VDSM but fails on the engine side due to an RPC connectivity problem.

For the 2nd point, yzaspits is posting a proposed patch.

Verified on rhevm-4.1.1-0.1.el7.noarch.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
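The first point above (blocking on service start until VDSM accepts requests) amounts to a bounded retry loop around a readiness probe. A minimal sketch follows; the probe command is a placeholder (on a real host it might be a VDSM API ping, which is an assumption, not the fix the patch implements), so `true` is used here to keep the sketch runnable:

```shell
#!/bin/sh
# wait_ready: poll a readiness probe until it succeeds or a timeout expires.
# $1 = probe command (placeholder; hypothetical stand-in for a VDSM ping)
# $2 = timeout in seconds
wait_ready() {
    probe="$1"; timeout="$2"; interval=1; elapsed=0
    while ! $probe >/dev/null 2>&1; do
        elapsed=$((elapsed + interval))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
    done
    echo "ready after ${elapsed}s"
}

wait_ready true 30
```

The point is that the deploy flow should not issue setupNetworks until such a loop has returned successfully, instead of treating "vdsmd started" as "VDSM ready".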
Created attachment 1243307 [details]
engine log

Description of problem:
Adding RHVH to engine failed at first time, but the RHVH was auto recovered after several minutes.

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.0.1-0.1.el7
redhat-virtualization-host-4.1-20170120
imgbased-0.9.6-0.1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install a RHVH 4.1
2. Reboot the RHVH
3. Add the host to the engine

Actual results:
1. After step #3, adding the host to the engine failed.
2. After adding the host failed, waiting about 2 minutes, the RHVH was recovered automatically and its status was Up on the RHVM side.

Expected results:
1. After step #3, the host can be added to the engine successfully without any error.

Additional info:
This error was also caught after modifying "VDSM/disableNetworkManager=bool:False" in /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf.
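For reference, the override discussed above is an otopi-style configuration drop-in. A sketch of the file is below; the `[environment:default]` section header is an assumption about the otopi drop-in format, not copied from the bug:

```ini
; /etc/ovirt-host-deploy.conf.d/90-ngn-do-not-keep-networkmanager.conf
; (section header assumed from the otopi KEY=type:value convention)
[environment:default]
; bool:False keeps the NetworkManager daemon running after host deploy;
; bool:True lets host deploy disable it. Both settings hit this bug.
VDSM/disableNetworkManager=bool:False
```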