Bug 1193083 - [scale] Network: RHEV fails to apply high number of networks on a new hypervisor: Timeout during xml-rpc call
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Petr Horáček
QA Contact: mlehrer
URL:
Whiteboard:
Duplicates: 1296141
Depends On: 1497759
Blocks: 1358501
 
Reported: 2015-02-16 14:35 UTC by akotov
Modified: 2020-12-11 12:24 UTC
CC List: 25 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-15 17:49:33 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments
supervdsm.log: setupNetwork with 54 nets (1.00 MB, text/plain)
2015-03-04 16:02 UTC, Dan Kenigsberg


Links
Red Hat Product Errata RHEA-2018:1489 (last updated 2018-05-15 17:51:04 UTC)

Description akotov 2015-02-16 14:35:00 UTC
Description of problem:


Version-Release number of selected component (if applicable):

RHEV-M 3.4.4

How reproducible:

Always (at a customer site).

Steps to Reproduce:
1. Add host to RHEV-M.
2. Set up bond0 for the rhevm network.
3. Set up a second bond with labels and a high number of VLANs (>=35).

Actual results:

2015-02-10 20:14:37,918 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] FINISH, SetupNetworksVDSCommand, log id: 638f39e2
2015-02-10 20:14:43,324 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:43,325 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:46,516 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:46,516 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:49,565 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:49,565 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:52,648 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Command PollVDSCommand(HostName = hrz-rhevnode-0018, HostId = 519772fc-a91a-484e-b882-2ff8959586d6) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2015-02-10 20:14:52,648 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (ajp-/127.0.0.1:8702-19) [6288b8ac] Timeout waiting for VDSM response. java.util.concurrent.TimeoutException
2015-02-10 20:14:54,941 WARN  [org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil] (org.ovirt.thread.pool-4-thread-13) Executing a command: java.util.concurrent.FutureTask , but note that there are 1 tasks in the queue.

 
Expected results:

RHEV-M saves network configuration


Additional info:

Comment 4 Yaniv Lavi 2015-03-04 12:56:02 UTC
This may also reproduce on JSON-RPC, and we should look into that.
Lior, is there a workaround you can suggest?

Comment 5 Lior Vernia 2015-03-04 14:08:06 UTC
Before I get into what I was thinking about, it seems strange that it is the bonding operation that's taking so long (i.e. without it there's no issue, if I understand correctly). Dan, should the bonding operation slow down Setup Networks by this much? Is it something we'd expect?

Comment 6 Dan Kenigsberg 2015-03-04 16:01:22 UTC
Thread-14::DEBUG::2015-02-10 19:28:49,668::BindingXMLRPC::1067::vds::(wrapper) client [10.116.24.30]::call setupNetworks with (...)
Thread-14::DEBUG::2015-02-10 19:31:38,125::BindingXMLRPC::1074::vds::(wrapper) return setupNetworks with {'status': {'message': 'Done', 'code': 0}}

3 minutes for 54 bridged networks is more than I expected, but it means 3 seconds per network, which is approximately what we see.

I don't think that the bond per se slows us down much; it's the serial ifup'ing of 114 devices. Finding a specific culprit requires deeper profiling.

For reference, can the reporter tell how much time `service network start` takes with 54 such networks?
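
A minimal way to take that measurement on the host (a sketch, assuming the legacy initscripts "network" service that these ifcfg-based configurations use):

   # bring everything down, then time the serial ifup of all configured devices
   service network stop
   time service network start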

Comment 7 Dan Kenigsberg 2015-03-04 16:02:29 UTC
Created attachment 997943 [details]
supervdsm.log: setupNetwork with 54 nets

Comment 8 Lior Vernia 2015-03-04 16:07:08 UTC
I may have misunderstood something then. Alexander, could you let us know if pressing "Refresh Capabilities" for the host also causes the networks to appear, after the original labelling operation?

Comment 10 Lior Vernia 2015-04-20 11:48:41 UTC
Seems like this should be investigated on vdsm side (or maybe lower...).

Comment 11 Dan Kenigsberg 2015-05-18 09:00:35 UTC
What is the behavior on a fresh 3.5 cluster that uses json-rpc?

Alona, can the timeout of setupNetworks be extended specifically, or is it just like any other VdsCommand?

Comment 12 Gil Klein 2015-05-19 11:44:24 UTC
Meni, is this something you can check?

Comment 13 Alona Kaplan 2015-05-19 12:49:17 UTC
It can, but only via the REST API.
Connectivity_Timeout (seconds) can be passed as a parameter to the setupNetworks command.

Comment 14 Dan Kenigsberg 2015-05-25 08:48:09 UTC
Sorry, Alona. Connectivity_Timeout is the time until *vdsm* decides to roll back its configuration, in case it has not heard back from Engine.

I am asking here about something else. On every verb, Engine waits for a response from Vdsm (typically 3-4 minutes). Here we have a use case for a longer timeout. Can that timeout be extended WITHOUT extending the timeout of SpmStart verbs etc.?

Comment 15 Alona Kaplan 2015-05-26 08:44:26 UTC
There is no way to change the timeout of the SetupNetworks verb without affecting the other verbs.

Comment 26 Sven Kieske 2016-03-14 12:19:10 UTC
Is there any progress on this bug?

Because this is really weird.

I have hosts with over 50 VMs on them, all with separate VLAN tags, on oVirt 3.3, and it works.

I'm in the process of upgrading to 3.6, so I need to know whether this bug affects the current latest release, because it would be a blocker for my larger deployments.

thanks

Sven

Comment 27 Yaniv Lavi 2016-03-14 14:23:26 UTC
(In reply to Sven Kieske from comment #26)
> Is there any progress on this bug?
> 
> Because this is really weird.
> 
> I have hosts with over 50 VMs on them, all with separate VLAN tags, on oVirt
> 3.3, and it works.
> 
> I'm in the process of upgrading to 3.6, so I need to know whether this bug
> affects the current latest release, because it would be a blocker for my
> larger deployments.
> 
> thanks
> 
> Sven

I don't think you will have any issues unless you have more than 120 virtual networks.

Comment 28 Fabian Deutsch 2016-03-22 19:05:51 UTC
We are looking at this. But the fix is a bit tricky, because we need to ensure backwards compatibility.

Comment 29 Fabian Deutsch 2016-04-19 18:10:00 UTC
This will implicitly be fixed by RHEV-H Next, which is planned for RHEV 4.

Comment 30 Yaniv Lavi 2016-05-09 10:59:26 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 37 Ryan Barry 2016-07-11 13:43:34 UTC
*** Bug 1296141 has been marked as a duplicate of this bug. ***

Comment 43 Eldad Marciano 2016-11-27 12:52:01 UTC
This bug still reproduces on vdsm-4.18.16-1.el7ev.x86_64 on top of the NGN image.
danken recommended testing it on the RHEL image as well.

Comment 44 Yaniv Kaul 2016-11-27 12:53:51 UTC
(In reply to Eldad Marciano from comment #43)
> This bug still reproduces on vdsm-4.18.16-1.el7ev.x86_64 on top of the NGN image.
> danken recommended testing it on the RHEL image as well.

Are you testing with OVS or bridge?

Comment 45 Eldad Marciano 2016-11-27 12:57:58 UTC
Also, during the setup network there were firewall-cmd calls per network:
firewall-cmd --zone --remove=NW_Scale_x

Is this expected behavior?

Comment 46 Eldad Marciano 2016-11-27 12:59:21 UTC
(In reply to Yaniv Kaul from comment #44)
> (In reply to Eldad Marciano from comment #43)
> > This bug still reproduces on vdsm-4.18.16-1.el7ev.x86_64 on top of the NGN image.
> > danken recommended testing it on the RHEL image as well.
> 
> Are you testing with OVS or bridge?

danken and I tested it together using a plain network configuration, no bridge or bond,

directly on the secondary interface.

Comment 49 Fabian Deutsch 2016-11-27 16:57:06 UTC
Does this bug also reproduce on RHEL?
I don't see a reason why this flow should be slower on NGN than on RHEL.

(In vintage Node this path _was_ slower, because it involved persistence. In NGN however, the extra persistence call is not necessary anymore and thus not performed, which should lead to the same performance as on RHEL)

Comment 50 Eldad Marciano 2016-11-27 17:18:16 UTC
(In reply to Fabian Deutsch from comment #49)
> Does this bug also reproduce on RHEL?
> I don't see a reason why this flow should be slower on NGN than on RHEL.
> 
> (In vintage Node this path _was_ slower, because it involved persistence. In
> NGN however, the extra persistence call is not necessary anymore and thus
> not performed, which should lead to the same performance as on RHEL)

Yes, it is reproduced on RHEL, with similar response times.
https://bugzilla.redhat.com/show_bug.cgi?id=1193083#c48

Comment 51 Roy Golan 2016-12-25 10:12:32 UTC
(In reply to Eldad Marciano from comment #46)
> (In reply to Yaniv Kaul from comment #44)
> > (In reply to Eldad Marciano from comment #43)
> > > This bug still reproduces on vdsm-4.18.16-1.el7ev.x86_64 on top of the NGN image.
> > > danken recommended testing it on the RHEL image as well.
> > 
> > Are you testing with OVS or bridge?
> 
> danken and I tested it together using a plain network configuration, no
> bridge or bond,

Can you share the way you tested it? I want to understand the root cause.

> 
> directly on the secondary interface.

Comment 52 Eldad Marciano 2016-12-25 10:56:09 UTC
Test profile:
1. Add network.
2. Update label.
3. Assign networks to cluster.

x150

see also https://bugzilla.redhat.com/show_bug.cgi?id=1193083#c41

Danken, did I miss something in the flow?
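
For reference, a rough sketch of scripting step 1 against the REST API (the /networks endpoint and XML element names follow the oVirt 4.x API and are assumptions to check against your version; the engine host, credentials, network names and VLAN IDs are example values, and the label and cluster steps are left out):

   # create 150 VLAN networks in the Default data center
   for i in $(seq 1 150); do
       curl -k -u admin@internal:PASSWORD -H 'Content-Type: application/xml' \
            -X POST 'https://engine.example.com/ovirt-engine/api/networks' \
            -d "<network><name>NW_Scale_$i</name>
                <data_center><name>Default</name></data_center>
                <vlan id=\"$((100 + i))\"/></network>"
   done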

Comment 53 Dan Kenigsberg 2016-12-25 11:26:40 UTC
Roy, the "root cause" is that Engine gives Vdsm 2 minutes to finish setupNetworks, yet setting up a single network via vdsm and ifcfg (Linux bridge) takes 4-5 seconds, so only up to ~40 networks can be handled in a single setupNetworks command.
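
In round numbers, that is 120 s divided by the ~3 s per network measured in comment 6, i.e. roughly 40 networks per setupNetworks call before Engine's timeout fires.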

Comment 54 Dominik Holler 2017-01-18 16:48:13 UTC
To work around this problem, it is possible to temporarily increase the responsible timeout, called "vdsTimeout", that engine gives to vdsm on the host to finish setupNetworks.
The timeout can be increased this way:
1. Read the initial value:
   engine-config --get=vdsTimeout
2. Increase the timeout to an appropriate value in seconds:
   engine-config --set vdsTimeout=300
3. Restart the engine to apply the new value:
   service ovirt-engine restart
4. Execute the long-running host configuration.
5. Reset the timeout to the initial value:
   engine-config --set vdsTimeout=180
6. Restart the engine to apply the value:
   service ovirt-engine restart

Please find details about the engine configuration tool in https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/paged/administration-guide/182-the-engine-configuration-tool.

Comment 55 Dan Kenigsberg 2017-11-05 15:25:40 UTC
This bug has been (ab)used for OvS integration. But regardless of OvS, and as described in https://github.com/oVirt/ovirt-site/pull/1308, RHV 4.2 will see a considerable improvement in the number of networks we can concurrently attach to a host.

Comment 58 guy chen 2018-03-29 08:55:32 UTC
I have tested this on RHV version 4.2.2-0.1, vdsm 4.20.18-1, with RHEL 7.5.
Steps that were done:
1. Added the host to RHV.
2. Set up bond0 on 2 NICs for the rhevm network (ovirtmgmt).
3. Set up a second bond, bond1, on 2 additional NICs.
4. Created 150 VLANs with a single label in the cluster.
5. Added the label to bond1.

The label was successfully attached to the host and 150 network interfaces were created with no timeouts or issues.
Moving the bug to VERIFIED.

Comment 59 Yaniv Kaul 2018-03-29 21:06:32 UTC
(In reply to guy chen from comment #58)
> I have tested this on RHV version 4.2.2-0.1, vdsm 4.20.18-1, with RHEL 7.5.
> Steps that were done:
> 1. Added the host to RHV.
> 2. Set up bond0 on 2 NICs for the rhevm network (ovirtmgmt).
> 3. Set up a second bond, bond1, on 2 additional NICs.
> 4. Created 150 VLANs with a single label in the cluster.

Why 150? Is there some limit here? Can we test with 250? 500?
How much time did it take for 150?

> 5. Added the label to bond1.

Via the UI or API? (We need both, the UI is actually a bit more interesting!)

> 
> The label was successfully attached to the host and 150 network interfaces
> were created with no timeouts or issues.
> Moving the bug to VERIFIED.

Comment 60 guy chen 2018-04-07 09:53:15 UTC
This was tested via the UI.
As for the time, it takes 2m30s to attach the label, so in terms of task duration I think we are about at the limit, unless it's acceptable to go to a 4-5 minute duration for the scenario.

Comment 61 Dan Kenigsberg 2018-04-11 07:36:35 UTC
(In reply to guy chen from comment #60)
> This was tested with UI.
> As far as the time - it takes 2M30S to attached the label, so in term of
> task duration I think we are about the limit, unless it's acceptable to go
> to 4-5 minutes duration for the scenario.

What is the maximum number of networks that works with Engine's default timeout?

Comment 62 Roy Golan 2018-04-16 08:07:39 UTC
These are the numbers that Petr included in his blog post on ovirt.org: above 300, for that matter.
https://www.ovirt.org/blog/2017/11/setting-up-multiple-networks-is-going-to-be-much-faster-in-ovirt-4-2/

Comment 65 Dan Kenigsberg 2018-04-29 14:05:28 UTC
Guy, did you intentionally remove the needinfo from Comment 61 without supplying the needed info?

Comment 66 Daniel Gur 2018-04-29 16:02:08 UTC
Hello Dan,
I asked Guy to close this needinfo.
We had also agreed with Yaniv K. that finding the next limit of "maximum number of networks" is not a needinfo for this bug but actually a significant new task of its own.

"Needinfo" should be used when some significant information is needed in order to understand and resolve the bug.
We should not use "needinfo" to create new tasks for the teams.

We will add this task to our backlog and prioritize it against the other tasks.
I also suggest that the scale requirement for, say, RHV 4.3 of "maximum number of networks" should be defined by the PM and opened as an RFE, so we can properly address it in our next testing efforts.

Please feel free to contact me offline to talk about it if needed.

Comment 69 errata-xmlrpc 2018-05-15 17:49:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 70 Franta Kust 2019-05-16 13:07:03 UTC
BZ<2>Jira Resync

Comment 72 Daniel Gur 2019-10-22 09:44:45 UTC
We will need to consider it and address it with the closed-loop ticket process.

