Bug 1476815
Summary: | removing defaultRoute role from mgmt networks is very slow (>60s) if using DHCP | |
---|---|---|---
Product: | [oVirt] ovirt-engine | Reporter: | Michael Burman <mburman>
Component: | BLL.Network | Assignee: | Edward Haas <edwardh>
Status: | CLOSED WONTFIX | QA Contact: | Meni Yakove <myakove>
Severity: | high | Docs Contact: |
Priority: | low | |
Version: | 4.2.0 | CC: | alkaplan, bugs, danken, mburman, ylavi
Target Milestone: | --- | Flags: | sbonazzo: ovirt-4.3-
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-08-08 07:54:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1200963 | |
Attachments: | | |
Description
Michael Burman
2017-07-31 14:14:42 UTC
Created attachment 1307109 [details]
evm log
Hi Michael,
The fact that the network is marked as out-of-sync until setup networks ends is not a bug. Until setup networks ends successfully, the network is indeed out-of-sync (desired state: default route; actual state: not default route). Also, if setup networks fails, it should stay out-of-sync.

What bothers me is why you had timeouts. Can you please attach the vdsm log?

(In reply to Alona Kaplan from comment #2)
> Hi Michael,
> The fact that the network is marked as out-of-sync until setup networks
> ends is not a bug. Until setup networks ends successfully, the
> network is indeed out-of-sync (desired state: default route; actual state:
> not default route). Also, if setup networks fails, it should stay
> out-of-sync.
>
> What bothers me is why you had timeouts. Can you please attach the vdsm log?

A lot of those timeouts -

2017-08-02 13:35:11,832+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:11,833+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured

Attaching vdsm log and engine log

Created attachment 1308076 [details]
vdsm and engine logs
I believe this is somehow 'caused' by vdsm and dhcp. What times out is the polling of the host...

Edy, the situation here is quite vague to me. Is it a slow dhcp server again?

Summarizing the examination of this issue:
When switching the ovirtmgmt default route to a different network in the cluster, the action takes a long time to complete. This shows up mainly when ovirtmgmt is configured with DHCP.
The logs show that the DHCP server response takes around 40 seconds, and another 20 seconds are spent on network teardown, creation, and recovery of the RPC connection (between Engine and VDSM).
Optimization options exist to lower this time, but they require a wider investment.

Optimization options:
- Detect that only the default route changed and apply that modification only. Special care will be required to preserve the gateway received in the DHCP response.
- Do not allow changing the management network IP address and its method.

I would not like to add validations for the management network. If any, we should remove Engine-side validations. If we end up with connectivity, we are fine; if we have no connectivity after the timeout, we should roll back.

Closing old bugs. Please reopen if still needed. In any case, patches are welcome.
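
For anyone picking this up, the first optimization option in the summary above (detect that only the default route changed) can be sketched roughly as follows. This is a minimal illustration and not VDSM or Engine code: the function name `only_default_route_changed` is hypothetical, and the attribute dictionaries with keys such as `bootproto` and `nic` are assumed shapes chosen for the example. Only the `defaultRoute` role itself comes from this bug.

```python
def only_default_route_changed(running: dict, desired: dict) -> bool:
    """Return True when the two attribute dicts differ only in 'defaultRoute'.

    Hypothetical helper: 'running' and 'desired' are assumed to be flat
    dicts of network attributes, which is a simplification for this sketch.
    """
    all_keys = set(running) | set(desired)
    # Collect every attribute whose value differs (or exists on one side only).
    changed = {key for key in all_keys if running.get(key) != desired.get(key)}
    return changed == {"defaultRoute"}


# Assumed example data: a DHCP-configured management network that loses
# the default route role while everything else stays the same.
running = {"bootproto": "dhcp", "nic": "eth0", "defaultRoute": True}
desired = {"bootproto": "dhcp", "nic": "eth0", "defaultRoute": False}

if only_default_route_changed(running, desired):
    # A real implementation could skip the teardown and DHCP renewal here,
    # while preserving the gateway received in the earlier DHCP response.
    print("fast path: update the default route only")
else:
    print("slow path: full network reconfiguration")
```

If a check like this were used, the DHCP request (around 40 seconds in the logs above) would be avoided only when no other attribute changed; any other difference would still take the full reconfiguration path.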