Description of problem:
Time out is too long when setting a non-mgmt network as the default route role via the 'Clusters' > 'Logical Networks' > 'Manage Networks' flow.

Setting a non-mgmt network that is attached to the host as the default route role via the 'Clusters' > 'Logical Networks' > 'Manage Networks' flow takes around 2-3 minutes (when ovirtmgmt uses DHCP) and around a minute (when ovirtmgmt is static). There are timeouts during this time and both networks are reported as out-of-sync. This must be improved.

These are the timeouts:

2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:07,349+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-07-31 16:49:09,906+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Timeout waiting for VDSM response: Internal timeout occured
2017-07-31 16:49:12,437+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-11) [c7476bd0-b699-45b4-92c1-a505e0a924d9] Command 'PollVDSCommand(HostName = mmucha_test_orchid-vds2.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='f5ebfb95-1860-442c-a0ca-db5dc512cc76'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call

Version-Release number of selected component (if applicable):
4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos

How reproducible:
100%

Steps to Reproduce:
1. Attach a non-mgmt network to the host and set its boot protocol to static or dhcp.
2. Via 'Clusters' > 'Logical Networks' > 'Manage Networks', set the non-mgmt network as the default route.

Actual results:
There are timeouts and errors in the engine log. The flow takes around 2-3 minutes, and meanwhile both networks are out-of-sync.

Expected results:
Should be much faster. No errors.

Additional info:
Note that the second flow ('Networks' > 'Logical Networks' > 'Manage Networks' > set the non-mgmt network as default route) ends with a different result: it hits BZ 1443292 and fails with 'Only a single default route network is allowed.'. Both networks then stay out-of-sync forever.
Created attachment 1307109 [details] evm log
Hi Michael,
The fact that the network is marked as out-of-sync until setup networks ends is not a bug. Until setup networks ends successfully, the network is indeed out-of-sync (desired state: default route; actual state: not default route). Also, if setup networks fails, the network should stay out-of-sync.

What bothers me is why you had timeouts. Can you please attach the vdsm log?
(In reply to Alona Kaplan from comment #2)
> Hi Michael,
> The fact that the network is marked as out-of-sync till the setup network
> ends is not a bug. Till the setup networks doesn't end successfully the
> network is indeed out-of-sync (desired state - default route, actual- not
> default route). Also, if the setup networks fails, it should stay
> out-of-sync.
>
> What bothers me is why you had timeouts. Can you please attach the vdsm log?

A lot of those timeouts:

2017-08-02 13:35:11,832+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:11,833+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Command 'PollVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{hostId='931c94f3-b452-4597-9989-8d6b48e232cb'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during rpc call
2017-08-02 13:35:14,340+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-default-Thread-12) [bd85ce13-ef2a-4f7c-b3f7-6bce9449a7e5] Timeout waiting for VDSM response: Internal timeout occured

Attaching vdsm log and engine log
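To quantify "a lot", the excerpts above can be summarized per correlation ID. The following is a rough sketch only: the regular expression is inferred from the few engine.log lines quoted in this bug, and `summarize_poll_timeouts` is a hypothetical helper name, not part of oVirt.

```python
import re
from datetime import datetime

# Pattern inferred from the quoted lines, e.g.
# "2017-08-02 13:35:11,832+03 ERROR [logger] (thread) [correlation-id] message"
# It is an assumption that other engine.log lines follow the same layout.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\S*\s+ERROR\s+"
    r"\[(?P<logger>[^\]]+)\]\s+\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s+(?P<msg>.*)$"
)


def summarize_poll_timeouts(lines):
    """Map correlation ID -> (count, first_ts, last_ts) of PollVDSCommand rpc timeouts."""
    summary = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not an ERROR line in the expected layout
        if not m["logger"].endswith("PollVDSCommand"):
            continue
        if "Timeout during rpc call" not in m["msg"]:
            continue  # skip the companion "Timeout waiting for VDSM response" line
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        count, first, last = summary.get(m["corr"], (0, ts, ts))
        summary[m["corr"]] = (count + 1, min(first, ts), max(last, ts))
    return summary
```

Calling it with the lines of engine.log (for example `summarize_poll_timeouts(open("engine.log"))`, where the path is whatever the attached log is saved as) gives the number of failed polls and the time between the first and the last one for each setup networks call.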
Created attachment 1308076 [details] vdsm and engine logs
I believe this is somehow 'caused' by vdsm and dhcp. What times out is the polling of the host...
Edy, the situation here is quite vague to me. Is it a slow dhcp server again?
Summarizing the examination of this issue:

When switching the default route role from ovirtmgmt to a different network in the cluster, the action takes a long time to complete. It shows up mainly when ovirtmgmt is configured using DHCP.

Examining the logs showed that the DHCP server response takes around 40 sec, and another 20 sec are spent on the network teardown, creation and recovery of the RPC connection (between Engine and VDSM).

Options exist to lower this time, but they require a wider investment.

Optimization options:
- Detect that only the default route changed and apply that modification only. Special care will be required for preserving the gateway received in the DHCP response.
- Do not allow changing the management network IP address and its method.
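The first option can be illustrated with a small sketch. Everything here is hypothetical: `NetworkAttachment` is a simplified stand-in for the real network attributes, and `plan_update` only shows the decision, not how VDSM would actually apply it. It assumes that a route-only change is possible only when the gateway (for example the one received in the DHCP response) is already known.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class NetworkAttachment:
    """Hypothetical, simplified view of one network's configuration on a host."""
    name: str
    bootproto: str            # "dhcp", "static" or "none"
    address: Optional[str]
    netmask: Optional[str]
    gateway: Optional[str]    # for DHCP: the gateway received in the lease
    default_route: bool


def changed_fields(current, desired):
    """Names of the attributes that differ between the two configurations."""
    cur, des = asdict(current), asdict(desired)
    return {key for key in cur if cur[key] != des[key]}


def plan_update(current, desired):
    """Return "noop", "route-only" or "full-reconfigure"."""
    diff = changed_fields(current, desired)
    if not diff:
        return "noop"
    if diff == {"default_route"}:
        # Gaining the default route needs a known gateway; without one
        # (e.g. no DHCP lease recorded) fall back to the slow path.
        if desired.default_route and current.gateway is None:
            return "full-reconfigure"
        return "route-only"
    return "full-reconfigure"
```

With such a check, the ovirtmgmt network in this flow would be classified as "route-only" (only its default route flag flips), so the DHCP teardown and re-request that account for most of the delay could be skipped.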
I would not like to add validations for the management network. If anything, we should remove Engine-side validations. If we end up with connectivity, we are fine; if we have no connectivity after the timeout, we should roll back.
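The behaviour described here, apply the change, wait for connectivity, and roll back on timeout, can be sketched as follows. This is an illustration only: `apply_with_rollback`, its callbacks and the default timeout and polling interval are assumptions for the example, not the actual Engine or VDSM implementation or their configured values.

```python
import time


def apply_with_rollback(apply, rollback, has_connectivity,
                        timeout=120.0, interval=2.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Apply a network change; roll it back if connectivity is not seen in time.

    apply, rollback and has_connectivity are caller-supplied callables
    (hypothetical hooks). Returns True if the change was kept.
    """
    apply()
    deadline = clock() + timeout
    while True:
        if has_connectivity():
            return True          # connectivity confirmed, keep the new config
        if clock() >= deadline:
            break                # timed out without connectivity
        sleep(interval)
    rollback()
    return False
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults are enough.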
Closing old bugs. Please reopen if still needed. In any case, patches are welcome.