Description of problem:

Version-Release number of selected component (if applicable):
RHEV-M 3.4.4

How reproducible:
On a customer site, always

Steps to Reproduce:
1. Add host to RHEV-M
2. Set up bond0 for the rhevm network
3. Set up a second bond with labels and a high number of VLANs (>=35)

Actual results:
2015-02-10 20:14:37,918 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] FINISH, SetupNetworksVDSCommand, log id: 638f39e2
2015-02-10 20:14:43,324 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:43,325 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:46,516 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:46,516 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:49,565 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:49,565 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:52,648 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:52,648 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:54,941 WARN [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (org.ovirt.thread.pool-4-thread-13) Executing a command: java.util.concurrent.FutureTask , but note that there are 1 tasks in the queue.

Expected results:
RHEV-M saves the network configuration

Additional info:
This may also reproduce with JSON-RPC, and we should look into that. Lior, is there a workaround you can suggest?
Before I get into what I was thinking about, it seems strange that it is the bonding operation that's taking so long (i.e. without it there's no issue, if I understand correctly). Dan, should the bonding operation slow down Setup Networks by this much? Is it something we'd expect?
Thread-14::DEBUG::2015-02-10 19:28:49,668::BindingXMLRPC::1067::vds::(wrapper) client [10.116.24.30]::call setupNetworks with (...)
Thread-14::DEBUG::2015-02-10 19:31:38,125::BindingXMLRPC::1074::vds::(wrapper) return setupNetworks with {'status': {'message': 'Done', 'code': 0}}

3 minutes for 54 bridged networks is more than I expected, but it works out to about 3 seconds per network, which is approximately what we see. I don't think that the bond per se slows us down much; it's the serial ifup'ing of 114 devices. Finding a specific culprit requires deeper profiling. For reference, can the reporter tell how much time `service network start` takes with 54 such networks?
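As a quick cross-check of the per-network figure, the elapsed time can be computed directly from the two log timestamps above (a sketch only; the 54-network count is taken from this comment):

```python
from datetime import datetime

# Timestamps from the setupNetworks call/return lines in the vdsm log.
start = datetime(2015, 2, 10, 19, 28, 49)
end = datetime(2015, 2, 10, 19, 31, 38)

elapsed = (end - start).total_seconds()  # 169 seconds total
per_network = elapsed / 54               # 54 bridged networks in one call

print(round(per_network, 1))  # -> 3.1 seconds per network
```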
Created attachment 997943 [details] supervdsm.log: setupNetwork with 54 nets
I may have misunderstood something then. Alexander, could you let us know if pressing "Refresh Capabilities" for the host also causes the networks to appear, after the original labelling operation?
Seems like this should be investigated on vdsm side (or maybe lower...).
What is the behavior on a fresh 3.5 cluster that uses JSON-RPC? Alona, can the timeout of setupNetworks be extended specifically, or is it just like any other VdsCommand?
Meni, is this something you can check?
It can, but only via the REST API: Connectivity_Timeout (in seconds) can be passed as a parameter to the setupNetworks command.
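For illustration, a request along these lines would pass the parameter (a sketch only; the URL path and element names are assumptions based on the oVirt REST API of that era, not taken from this report):

```xml
POST /api/hosts/{host_id}/setupnetworks HTTP/1.1
Content-Type: application/xml

<action>
  <host_nics>
    <!-- ... the desired NIC/network layout ... -->
  </host_nics>
  <check_connectivity>true</check_connectivity>
  <connectivity_timeout>120</connectivity_timeout>
</action>
```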
Sorry, Alona. Connectivity_Timeout is the time until *vdsm* decides to roll back its configuration, in case it did not hear from Engine. I am asking here about something else: on every verb, Engine waits for a response from Vdsm (typically 3-4 minutes). Here we have a use case for a longer timeout. Can that timeout be extended WITHOUT extending the timeout of SpmStart verbs etc.?
There is no way to change the timeout of the SetupNetworks verb without affecting the other verbs.
Is there any progress on this bug? Because this is really weird: I have hosts with over 50 VMs on them, each with a separate VLAN tag, on oVirt 3.3, and it works. I'm in the process of upgrading to 3.6, so I need to know whether this bug affects the current latest release, because it would be a blocker for my larger deployments. Thanks, Sven
(In reply to Sven Kieske from comment #26)
> is there any progress in this bug?
>
> because this is really weird.
>
> I have hosts with over 50 vms on them, all with separate vlan tag on ovirt
> 3.3.
> and it works.
>
> I'm in the process of upgrading to 3.6. so I need to know if this is a bug
> which affects the current latest release, because this would be a blocker
> for my larger deployments.
>
> thanks
>
> Sven

I don't think you will have any issues unless you have more than 120 virtual networks.
We are looking at this, but the fix is a bit tricky because we need to ensure backwards compatibility.
This will implicitly be fixed by RHEV-H Next, which is planned for RHEV 4.
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.
*** Bug 1296141 has been marked as a duplicate of this bug. ***
This bug still reproduces on vdsm-4.18.16-1.el7ev.x86_64 on top of the NGN image. danken recommended testing it on the RHEL image as well.
(In reply to Eldad Marciano from comment #43)
> this bug still reproduced on vdsm-4.18.16-1.el7ev.x86_64 on top of NGN image.
> danken recommended to test it for rhel image as well.

Are you testing with OVS or bridge?
Also, during setup networks there were firewall-cmd calls per network:

firewall-cmd --zone --remove=NW_Scale_x

Is this expected behavior?
(In reply to Yaniv Kaul from comment #44)
> (In reply to Eldad Marciano from comment #43)
> > this bug still reproduced on vdsm-4.18.16-1.el7ev.x86_64 on top of NGN image.
> > danken recommended to test it for rhel image as well.
>
> Are you testing with OVS or bridge?

danken and I tested it together via a plain network configuration, no bridge or bond, directly on the secondary interface.
Does this bug also reproduce on RHEL? I don't see a reason why this flow should be slower on NGN than on RHEL. (In vintage Node this path _was_ slower, because it involved persistence. In NGN however, the extra persistence call is not necessary anymore and thus not performed, which should lead to the same performance as on RHEL)
(In reply to Fabian Deutsch from comment #49)
> Does this bug also reproduce on RHEL?
> I don't see a reason why this flow should be slower on NGN than on RHEL.
>
> (In vintage Node this path _was_ slower, because it involved persistence. In
> NGN however, the extra persistence call is not necessary anymore and thus
> not performed, which should lead to the same performance as on RHEL)

Yes, it reproduces on RHEL, with a similar response time.
https://bugzilla.redhat.com/show_bug.cgi?id=1193083#c48
(In reply to Eldad Marciano from comment #46)
> (In reply to Yaniv Kaul from comment #44)
> > (In reply to Eldad Marciano from comment #43)
> > > this bug still reproduced on vdsm-4.18.16-1.el7ev.x86_64 on top of NGN image.
> > > danken recommended to test it for rhel image as well.
> >
> > Are you testing with OVS or bridge?
>
> me and danken were tested it together via plain network configuration, no
> bridge or bond.

Can you share the way you tested it? I want to understand the root cause.

> directly to the secondary interface.
Test profile:
1. Add network
2. Update label
3. Assign networks to cluster
x150

See also https://bugzilla.redhat.com/show_bug.cgi?id=1193083#c41

Danken, did I miss something in the flow?
Roy, the "root cause" is that Engine gives Vdsm 2 minutes to finish setupNetworks, yet setting up a single network via vdsm and ifcfg (Linux bridge) takes 4-5 seconds, which means that at most ~40 networks can be handled in a single setupNetworks command.
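The arithmetic behind that ceiling can be sketched as follows (the 2-minute engine timeout and the 3-5 seconds per network are the figures quoted in this thread, nothing more):

```python
# Back-of-the-envelope capacity of one setupNetworks call before the
# engine-side timeout fires.
ENGINE_TIMEOUT_S = 120  # engine waits ~2 minutes for the verb to return

for per_network_s in (3, 4, 5):
    max_networks = ENGINE_TIMEOUT_S // per_network_s
    print(f"{per_network_s}s per network -> at most {max_networks} networks")
# 3s -> 40, 4s -> 30, 5s -> 24
```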
To work around this problem, it is possible to temporarily increase the responsible timeout, called "vdsTimeout", that engine gives to vdsm on the host to finish setupNetworks. The timeout can be increased this way:
1. Read the initial value:
   engine-config --get=vdsTimeout
2. Increase the timeout to an appropriate value, in seconds:
   engine-config --set vdsTimeout=300
3. Restart engine to apply the new value:
   service ovirt-engine restart
4. Execute the long-running host configuration.
5. Reset the timeout to the initial value:
   engine-config --set vdsTimeout=180
6. Restart engine to apply the value:
   service ovirt-engine restart

Please find details about the engine configuration tool in https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/paged/administration-guide/182-the-engine-configuration-tool .
This bug has been (ab)used for OvS integration. But regardless of OvS, and as described in https://github.com/oVirt/ovirt-site/pull/1308, RHV 4.2 will see a considerable improvement in the number of networks we can concurrently attach to a host.
I have tested this on RHV version 4.2.2-0.1, vdsm 4.20.18-1, with RHEL 7.5.
Steps that were done:
1. Add host to RHV
2. Set up bond0 on 2 NICs for the rhevm network (ovirtmgmt)
3. Set up a second bond, bond1, on 2 additional NICs
4. Created 150 VLANs with a single label in the cluster
5. Added the label to bond1

The label was successfully attached to the host and 150 network interfaces were created, with no timeouts or issues.
Moving bug to verified.
(In reply to guy chen from comment #58)
> I have tested this on rhv version 4.2.2-0.1, vdsm 4.20.18-1 with rhel 7.5.
> Steps that where done :
> 1. Add host to RHV
> 2. Setup bond0 on 2 NICS for rhevm network (ovirtmgmt)
> 3. Setup second bond - bond1 on additional 2 NICS
> 4. Created 150 vlans with a single label in the cluster

Why 150? Is there some limit here? Can we test with 250? 500? How much time did it take for 150?

> 4. Added the label to bond1

Via the UI or API? (We need both, the UI is actually a bit more interesting!)

> label was successfully attached to the host and 150 network interfaces where
> created with no timeouts or issues.
> Moving bug to verified.
This was tested with the UI. As for the time, it takes 2m30s to attach the label, so in terms of task duration I think we are about at the limit, unless it's acceptable to go to a 4-5 minute duration for this scenario.
(In reply to guy chen from comment #60)
> This was tested with UI.
> As far as the time - it takes 2M30S to attached the label, so in term of
> task duration I think we are about the limit, unless it's acceptable to go
> to 4-5 minutes duration for the scenario.

What is the maximum number of networks that works with Engine's default timeout?
These are the numbers that Petr included in his blog post on ovirt.org: above 300, for that matter.
https://www.ovirt.org/blog/2017/11/setting-up-multiple-networks-is-going-to-be-much-faster-in-ovirt-4-2/
Guy, did you intentionally remove the needinfo from Comment 61 without supplying the needed info?
Hello Dan, I requested Guy to close this needinfo. We also agreed with Yaniv K. that finding the next limit of "maximum number of networks" is not a needinfo for this bug but actually a significant new task of its own. "Needinfo" should be used when some significant information is needed in order to understand and resolve the bug; we should not use it to create new tasks for the teams. We will add this task to our backlog and prioritize it against other tasks. I also suggest that the scale requirement for "maximum number of networks", say for RHV 4.3, should be defined by the PM and opened as an RFE so we can properly address it in our next testing efforts. Please feel free to contact me offline to talk about it if needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489
We will need to consider it and address it with the Closed-loop Ticket process.