Bug 1476815 - removing defaultRoute role from mgmt networks is very slow (>60s)
Status: NEW
Product: ovirt-engine
Classification: oVirt
Component: BLL.Network
4.2.0
x86_64 Linux
Priority: low  Severity: high
: ovirt-4.3.0
: ---
Assigned To: Edward Haas
Meni Yakove
Depends On:
Blocks: 1200963
Reported: 2017-07-31 10:14 EDT by Michael Burman
Modified: 2017-11-15 16:34 EST (History)
4 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.3+


Attachments
evm log (114.62 KB, application/x-gzip)
2017-07-31 10:35 EDT, Michael Burman
vdsm and engine logs (2.28 MB, application/x-gzip)
2017-08-02 06:39 EDT, Michael Burman

Description Michael Burman 2017-07-31 10:14:42 EDT
Description of problem:
The timeout is too long when setting a non-mgmt network as the default route role via the 'Clusters' > 'Logical Networks' > Manage Networks flow.

Setting a non-mgmt network that is attached to the host as the default route via the 'Clusters' > 'Logical Networks' > Manage Networks flow takes around 2-3 minutes (when ovirtmgmt uses DHCP) and around a minute (when ovirtmgmt has a static address). There are timeouts during this time, and both networks are reported as out-of-sync.

This must be improved.

These are the timeouts:

2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:12,437+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos


How reproducible:
100%

Steps to Reproduce:
1. Attach non-mgmt network to host and set bootproto static or dhcp
2. Via 'Clusters' > 'Logical Networks' > Manage Networks > set the non-mgmt network as the default route.

Actual results:
There are timeouts and errors in the engine log. This flow takes around 2-3 minutes, and meanwhile both networks are out of sync.

Expected results:
Should be much faster. No errors.

Additional info:
Note that the second flow:
'Networks' > 'Logical Networks' > Manage Networks > set the non-mgmt network as default route
ends with a different result: it is affected by BZ 1443292 and fails with 'Only a single default route network is allowed.'
Both networks stay out-of-sync forever.
Comment 1 Michael Burman 2017-07-31 10:35 EDT
Created attachment 1307109 [details]
evm log
Comment 2 Alona Kaplan 2017-08-02 05:20:39 EDT
Hi Michael,
The fact that the network is marked as out-of-sync until setup networks ends is not a bug. Until setup networks finishes successfully, the network is indeed out-of-sync (desired state: default route; actual: not default route). Also, if setup networks fails, it should stay out-of-sync.

What bothers me is why you had timeouts. Can you please attach the vdsm log?
Comment 3 Michael Burman 2017-08-02 06:37:30 EDT
(In reply to Alona Kaplan from comment #2)
> Hi Michael,
> The fact that the network is marked as out-of-sync until setup networks
> ends is not a bug. Until setup networks finishes successfully, the network
> is indeed out-of-sync (desired state: default route; actual: not default
> route). Also, if setup networks fails, it should stay out-of-sync.
> 
> What bothers me is why you had timeouts. Can you please attach the vdsm log?

A lot of those timeouts:

2017-08-02 13:35:11,832+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:11,833+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured

Attaching vdsm log and engine log
Comment 4 Michael Burman 2017-08-02 06:39 EDT
Created attachment 1308076 [details]
vdsm and engine logs
Comment 5 Martin Mucha 2017-09-14 09:47:25 EDT
I believe this is somehow 'caused' by VDSM and DHCP. What times out is the polling of the host...
Comment 6 Dan Kenigsberg 2017-11-09 08:44:20 EST
Edy, the situation here is quite vague to me. Is it a slow dhcp server again?
Comment 7 Edward Haas 2017-11-13 08:08:31 EST
Summarizing the examination of this issue:

When switching the ovirtmgmt default route to a different network in the cluster, it takes a long time for the action to complete.
It mainly shows up when ovirtmgmt is defined using DHCP.

Examining the logs showed that the DHCP server response takes around 40 seconds, and another 20 seconds are spent on the network teardown, re-creation, and recovery of the RPC connection (between Engine and VDSM).
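For illustration, the Engine-side poll-until-responsive behavior that produces the repeated PollVDSCommand errors above can be sketched roughly as follows. This is a simplified sketch, not the actual vdsbroker code; the function name, interval, and timeout values are invented:

```python
import time

def poll_until_responsive(poll_fn, total_timeout=60.0, interval=2.0):
    """Keep polling the host until poll_fn() succeeds or the overall
    timeout elapses. Each failed attempt corresponds to one
    'Timeout during rpc call' line in the engine log."""
    deadline = time.monotonic() + total_timeout
    while time.monotonic() < deadline:
        try:
            return poll_fn()
        except TimeoutError:
            time.sleep(interval)  # back off before the next poll attempt
    raise TimeoutError("host did not respond within %.0fs" % total_timeout)
```

While the host is rewiring its default route (and waiting on the DHCP server), every poll attempt fails, which is why the log shows a burst of timeout errors a few seconds apart rather than a single failure.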

Optimization options exist to lower this time, but they require a wider investment:
- Detect that only the default route changed and apply that modification only. Special care will be required for preserving the gateway received in the DHCP response.
- Do not allow changing the management network IP address and its method.
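The first option above could be sketched like this (a hypothetical illustration; the attachment is modeled as a plain dict, and the field names are made up, not the real engine/VDSM schema):

```python
def only_default_route_changed(desired, actual):
    """Return True if the desired and actual network attachments differ
    solely in the 'default_route' role, meaning a full setup-networks
    (teardown + re-create, and a new DHCP round-trip) could be skipped."""
    for key in set(desired) | set(actual):
        if key == "default_route":
            continue
        if desired.get(key) != actual.get(key):
            return False  # something besides the role changed
    # the role itself must actually differ, otherwise nothing changed
    return desired.get("default_route") != actual.get("default_route")
```

If this check passed, the engine could in principle apply only the route change, taking the special care mentioned above to preserve the gateway received in the DHCP response.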
Comment 8 Dan Kenigsberg 2017-11-15 16:34:28 EST
I would not like to add validations for the management network. If anything, we should remove Engine-side validations. If we end up with connectivity, we are fine; if we have no connectivity after the timeout, we should roll back.
