Bug 1476815 - removing defaultRoute role from mgmt networks is very slow (>60s)
Status: NEW
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
4.2.0
x86_64 Linux
Priority: low  Severity: high
: ovirt-4.3.0
: ---
Assigned To: Edward Haas
Meni Yakove
Depends On:
Blocks: 1200963
Reported: 2017-07-31 10:14 EDT by Michael Burman
Modified: 2017-11-15 16:34 EST (History)
4 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.3+


Attachments
evm log (114.62 KB, application/x-gzip)
2017-07-31 10:35 EDT, Michael Burman
vdsm and engine logs (2.28 MB, application/x-gzip)
2017-08-02 06:39 EDT, Michael Burman

Description Michael Burman 2017-07-31 10:14:42 EDT
Description of problem:
The timeout is too long when setting a non-mgmt network as the default route role via the 'Clusters' > 'Logical Networks' > Manage Networks flow.

Setting a non-mgmt network that is attached to the host as the default route via the 'Clusters' > 'Logical Networks' > Manage Networks flow takes around 2-3 minutes (when ovirtmgmt uses DHCP) and around a minute (when ovirtmgmt has a static address). There are timeouts during this time, and both networks are reported as out-of-sync.

This must be improved.

These are the timeouts:

2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:12,437+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos


How reproducible:
100%

Steps to Reproduce:
1. Attach non-mgmt network to host and set bootproto static or dhcp
2. Via 'Clusters' > 'Logical Networks' > Manage Networks > set the non-mgmt network as the default route.

Actual results:
There are timeouts and errors in the engine log. This flow takes around 2-3 minutes, and meanwhile both networks are out of sync.

Expected results:
Should be much faster. No errors.

Additional info:
Note that the second flow:
'Networks' > 'Logical Networks' > Manage Networks > set the non-mgmt network as default route
ends with a different result: it is affected by BZ 1443292 and fails with 'Only a single default route network is allowed.'
Both networks stay out-of-sync forever.
Comment 1 Michael Burman 2017-07-31 10:35 EDT
Created attachment 1307109 [details]
evm log
Comment 2 Alona Kaplan 2017-08-02 05:20:39 EDT
Hi Michael,
The fact that the network is marked as out-of-sync until setup networks ends is not a bug. Until setup networks finishes successfully, the network is indeed out-of-sync (desired state: default route; actual: not default route). Also, if setup networks fails, it should stay out-of-sync.

What bothers me is why you had timeouts. Can you please attach the vdsm log?
Comment 3 Michael Burman 2017-08-02 06:37:30 EDT
(In reply to Alona Kaplan from comment #2)
> Hi Michael,
> The fact that the network is marked as out-of-sync until setup networks
> ends is not a bug. Until setup networks finishes successfully, the network
> is indeed out-of-sync (desired state: default route; actual: not default
> route). Also, if setup networks fails, it should stay out-of-sync.
> 
> What bothers me is why you had timeouts. Can you please attach the vdsm log?

A lot of those timeouts:

2017-08-02 13:35:11,832+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:11,833+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured

Attaching vdsm log and engine log
Comment 4 Michael Burman 2017-08-02 06:39 EDT
Created attachment 1308076 [details]
vdsm and engine logs
Comment 5 Martin Mucha 2017-09-14 09:47:25 EDT
I believe this is somehow 'caused' by VDSM and DHCP. What times out is the polling of the host...
Comment 6 Dan Kenigsberg 2017-11-09 08:44:20 EST
Edy, the situation here is quite vague to me. Is it a slow dhcp server again?
Comment 7 Edward Haas 2017-11-13 08:08:31 EST
Summarizing the examination of this issue:

When switching the ovirtmgmt default route to a different network in the cluster, it takes a long time for the action to complete.
It mainly shows up when ovirtmgmt is defined using DHCP.

Examining the logs showed that the DHCP server response takes around 40 seconds, and another 20 seconds are spent on the network teardown, re-creation, and recovery of the RPC connection (between Engine and VDSM).
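For illustration, the Engine-side poll-until-responsive behavior that produces the repeated PollVDSCommand errors above can be sketched roughly as follows. This is a simplified sketch, not the actual vdsbroker code; the function name, interval, and timeout values are invented:

```python
import time

def poll_until_responsive(poll_fn, total_timeout=60.0, interval=2.0):
    """Keep polling the host until poll_fn() succeeds or the overall
    timeout elapses. Each failed attempt corresponds to one
    'Timeout during rpc call' line in the engine log."""
    deadline = time.monotonic() + total_timeout
    while time.monotonic() < deadline:
        try:
            return poll_fn()
        except TimeoutError:
            time.sleep(interval)  # back off before the next poll attempt
    raise TimeoutError("host did not respond within %.0fs" % total_timeout)
```

While the host is rewiring its default route (and waiting on the DHCP server), every poll attempt fails, which is why the log shows a burst of timeout errors a few seconds apart rather than a single failure.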

Optimization options exist to lower this time, but they require a wider investment:
- Detect that only the default route changed and apply that modification only. Special care will be required for preserving the gateway received in the DHCP response.
- Do not allow changing the management network IP address and its method.
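The first option above could be sketched like this (a hypothetical illustration; the attachment is modeled as a plain dict, and the field names are made up, not the real engine/VDSM schema):

```python
def only_default_route_changed(desired, actual):
    """Return True if the desired and actual network attachments differ
    solely in the 'default_route' role, meaning a full setup-networks
    (teardown + re-create, and a new DHCP round-trip) could be skipped."""
    for key in set(desired) | set(actual):
        if key == "default_route":
            continue
        if desired.get(key) != actual.get(key):
            return False  # something besides the role changed
    # the role itself must actually differ, otherwise nothing changed
    return desired.get("default_route") != actual.get("default_route")
```

If this check passed, the engine could in principle apply only the route change, taking the special care mentioned above to preserve the gateway received in the DHCP response.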
Comment 8 Dan Kenigsberg 2017-11-15 16:34:28 EST
I would not like to add validations for the management network. If anything, we should remove Engine-side validations. If we end up with connectivity, we are fine; if we have no connectivity after the timeout, we should roll back.
