Bug 1476815

Summary: removing defaultRoute role from mgmt networks is very slow (>60s) if using DHCP
Product: [oVirt] ovirt-engine
Reporter: Michael Burman <mburman>
Component: BLL.Network
Assignee: Edward Haas <edwardh>
Status: CLOSED WONTFIX
QA Contact: Meni Yakove <myakove>
Severity: high
Docs Contact:
Priority: low
Version: 4.2.0
CC: alkaplan, bugs, danken, mburman, ylavi
Target Milestone: ---
Flags: sbonazzo: ovirt-4.3-
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-08 07:54:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1200963
Attachments:
- evm log (flags: none)
- vdsm and engine logs (flags: none)

Description Michael Burman 2017-07-31 14:14:42 UTC
Description of problem:
Setting a non-mgmt network as the default route role via the 'Clusters' > 'Logical Networks' > Manage Networks flow takes too long and hits timeouts.

Setting a non-mgmt network that is attached to the host as the default route role via the 'Clusters' > 'Logical Networks' > Manage Networks flow takes around 2-3 minutes (when ovirtmgmt uses DHCP) and around a minute (when ovirtmgmt is static). There are timeouts during this time and both networks are reported as out-of-sync.

This must be improved. 

These are the timeouts:

2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:12,437+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos


How reproducible:
100%

Steps to Reproduce:
1. Attach non-mgmt network to host and set bootproto static or dhcp
2. Via 'Clusters' > 'Logical Networks' > Manage Networks > set the non-mgmt network as the default route.

Actual results:
There are timeouts and errors in the engine log. This flow takes around 2-3 minutes, and meanwhile both networks are out of sync.

Expected results:
Should be much faster. No errors.

Additional info:
Note that the second flow,
'Networks' > 'Logical Networks' > Manage Networks > set the non-mgmt network as default route,
ends with a different result: it hits BZ 1443292 and fails with 'Only a single default route network is allowed.'.
Both networks will then stay out-of-sync forever.

Comment 1 Michael Burman 2017-07-31 14:35:34 UTC
Created attachment 1307109 [details]
evm log

Comment 2 Alona Kaplan 2017-08-02 09:20:39 UTC
Hi Michael,
The fact that the network is marked as out-of-sync until setup networks ends is not a bug. Until setup networks ends successfully, the network is indeed out-of-sync (desired state: default route, actual: not default route). Also, if setup networks fails, it should stay out-of-sync.

What bothers me is why you had timeouts. Can you please attach the vdsm log?

Comment 3 Michael Burman 2017-08-02 10:37:30 UTC
(In reply to Alona Kaplan from comment #2)
> Hi Michael,
> The fact that the network is marked as out-of-sync until setup networks
> ends is not a bug. Until setup networks ends successfully, the
> network is indeed out-of-sync (desired state: default route, actual: not
> default route). Also, if setup networks fails, it should stay
> out-of-sync.
> 
> What bothers me is why you had timeouts. Can you please attach the vdsm log?

A lot of those timeouts - 

2017-08-02 13:35:11,832+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:11,833+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured

Attaching vdsm log and engine log

Comment 4 Michael Burman 2017-08-02 10:39:43 UTC
Created attachment 1308076 [details]
vdsm and engine logs

Comment 5 Martin Mucha 2017-09-14 13:47:25 UTC
I believe this is somehow 'caused' by vdsm and dhcp. What times out is the polling of the host...

Comment 6 Dan Kenigsberg 2017-11-09 13:44:20 UTC
Edy, the situation here is quite vague to me. Is it a slow dhcp server again?

Comment 7 Edward Haas 2017-11-13 13:08:31 UTC
Summarizing the examination of this issue:

When switching the ovirtmgmt default route to a different network in the cluster, it takes a long time for the action to complete.
It mainly shows up while ovirtmgmt is defined using DHCP.

Examining the logs showed that the DHCP server response took around 40 sec, and another 20 sec were spent on the network teardown, creation and recovery of the RPC connection (between Engine and VDSM).

Optimization options exist to lower this time, but they require a wider investment.
Optimization options:
- Detect that only the default route changed and apply that modification only. Special care will be required for preserving the gateway received in the DHCP response.
- Do not allow changing the management network IP address and its method.
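
A minimal sketch of the first option, assuming the current and desired network attributes are available as plain dictionaries. The function names and the sample attribute values are hypothetical and are not taken from the VDSM or Engine code; the sketch only illustrates detecting that nothing but the default route role differs, so that a narrower change could be applied instead of a full teardown and re-creation.

```python
def changed_keys(current, desired):
    """Return the names of attributes whose values differ."""
    keys = set(current) | set(desired)
    return {k for k in keys if current.get(k) != desired.get(k)}


def only_default_route_changed(current, desired):
    """True when the default route role is the only difference."""
    return changed_keys(current, desired) == {"defaultRoute"}


# Hypothetical attribute sets for the management network.
current = {"nic": "eth0", "bootproto": "dhcp", "defaultRoute": True}
desired = {"nic": "eth0", "bootproto": "dhcp", "defaultRoute": False}

if only_default_route_changed(current, desired):
    print("route-only change: the DHCP lease and gateway could be kept")
else:
    print("full network reconfiguration needed")
```

The gateway preservation mentioned in the first option is not covered here; it would still need to be handled by whatever applies the route-only change.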

Comment 8 Dan Kenigsberg 2017-11-15 21:34:28 UTC
I would not like to add validations for the management network. If anything, we should remove Engine-side validations. If we end up with connectivity, we are fine; if we have no connectivity after the timeout, we should roll back.
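
The apply-then-verify approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Engine or VDSM implementation: `apply`, `rollback` and `is_connected` are hypothetical callables supplied by the caller, and the timeout and polling interval are arbitrary values.

```python
import time


def apply_with_rollback(apply, rollback, is_connected,
                        timeout=5.0, interval=0.5):
    """Apply a change, then roll it back if connectivity does not return.

    Returns True if connectivity was confirmed before the timeout,
    False if the change was rolled back.
    """
    apply()
    deadline = time.monotonic() + timeout
    while True:
        if is_connected():
            return True
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    rollback()
    return False


# Usage with stand-in callables that only record what happened.
events = []
ok = apply_with_rollback(
    apply=lambda: events.append("apply"),
    rollback=lambda: events.append("rollback"),
    is_connected=lambda: False,  # simulate lost connectivity
    timeout=0.0,
)
print(ok, events)
```

No validation of the requested configuration happens up front in this sketch; the only criterion is whether connectivity is observed before the deadline.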

Comment 9 Yaniv Lavi 2018-08-08 07:54:27 UTC
Closing old bugs. Please reopen if still needed.
In any case, patches are welcome.