Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1766193

Summary: [Scale] oVirt should support up to 200 networks per host
Product: [oVirt] ovirt-engine Reporter: mlehrer
Component: BLL.Network    Assignee: Dominik Holler <dholler>
Status: CLOSED CURRENTRELEASE QA Contact: mlehrer
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.3.6.6    CC: bugs, dholler, edwardh, eraviv, gveitmic, mburman, michal.skrivanek
Target Milestone: ovirt-4.4.0    Keywords: Performance
Target Release: ---    Flags: pm-rhel: ovirt-4.4+
dholler: devel_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhv-4.4.0-31 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-20 20:03:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1723804, 1788081    
Bug Blocks: 1826425    
Attachments:
Description Flags
Network attachment cases via API and UI relevant logs
none
vdsm logfiles of attaching 300 networks in one shot
none

Description mlehrer 2019-10-28 14:43:31 UTC
Created attachment 1629815 [details]
Network attachment cases via API and UI relevant logs

Description of problem:

Setup networks fails to attach multiple networks to a single interface, both via the API (51 total networks) and via the UI (97 total networks).
It results in VDSErrorException: Failed to HostSetupNetworksVDS, error = Resource unavailable, code = 40 (Failed with error unavail and code 40), and a "worker blocked" exception.


The following was tested:

# Via ansible API
Case 1: Attempt to attach 300 networks to a single interface of a host running many VMs.
Result: failed after 51 networks attached.

Case 2: Attempt to attach 300 networks to a single interface of a host running no VMs.
Result: failed after 51 networks attached.

# Setup networks via UI
Case 3: Attempt to attach 70 networks via UI - successful, with exceptions.
Case 4: Attempt to attach 100 networks via UI - only 97 networks attached.

Version-Release number of selected component (if applicable):
vdsm-4.30.30-1.el7ev.x86_64
ovirt-engine-4.3.6.5-0.1.el7.noarch

How reproducible:
Reproduces consistently.

Steps to Reproduce via ansible:
1. Using the ansible ovirt_networks module, create 300 networks with VLANs.
2. Using the ansible ovirt_host_networks module, assign all 300 networks to 1 interface.
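For reference, a rough equivalent of these two steps sketched with the oVirt Python SDK (ovirtsdk4) rather than the ansible modules; the engine URL, credentials, data center, host name and NIC name below are placeholders, not values taken from this environment:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Assumed connection details - adjust to the environment under test.
conn = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
system = conn.system_service()

# Step 1: create 300 VLAN networks (network_1 .. network_300).
networks_service = system.networks_service()
for i in range(1, 301):
    networks_service.add(
        types.Network(
            name='network_%d' % i,
            data_center=types.DataCenter(name='Default'),
            vlan=types.Vlan(id=i),
        )
    )

# Step 2: attach all 300 networks to one NIC in a single
# Host.setupNetworks request, mirroring the failing batch attachment.
hosts_service = system.hosts_service()
host = hosts_service.list(search='name=myhost')[0]
hosts_service.host_service(host.id).setup_networks(
    modified_network_attachments=[
        types.NetworkAttachment(
            network=types.Network(name='network_%d' % i),
            host_nic=types.HostNic(name='em2'),
        )
        for i in range(1, 301)
    ],
    check_connectivity=True,
)
conn.close()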

Steps to Reproduce via UI:
1) Create 300 networks via ansible.
2) Manually drag the networks onto the interface in the Setup Host Networks UI dialog.

Actual results:
Setup networks fails.
The UI dialog does not scale with the number of diagram elements (connection arrows).
A maximum of 97 networks gets attached; further attempts to drag additional networks onto the interface do not work.


2019-10-27 15:18:29,587Z ERROR [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (default task-47) [66d91fb8-a452-48ae-8a56-c3ebca2188cf] Command 'org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Resource unavailable, code = 40 (Failed with error unavail and code 40)



Expected results:
Successful attachment of 300 networks to 1 interface.

Additional info:
Logs contain engine, supervdsm, and vdsm logs for each test case.

# Worker blocked exception
2019-10-28 14:01:59,612+0000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'bondings': {}, u'networks': {u'network_83': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'83', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_82': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'82', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_81': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'81', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_80': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'80', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_87': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'87', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_86': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'86', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_85': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'85', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_84': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'84', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_89': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'89', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_88': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'88', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_78': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'78', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_79': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'79', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_76': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'76', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_77': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'77', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_74': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'74', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_75': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'75', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_72': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'72', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_73': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'73', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_70': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'70', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_71': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'71', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_94': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'94', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': 
False, u'STP': u'no', u'bridged': u'true'}, u'network_95': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'95', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_96': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'96', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_90': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'90', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_91': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'91', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_92': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'92', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}, u'network_93': {u'ipv6autoconf': False, u'nic': u'em2', u'vlan': u'93', u'mtu': 1500, u'switch': u'legacy', u'dhcpv6': False, u'STP': u'no', u'bridged': u'true'}}, u'options': {u'connectivityCheck': u'true', u'connectivityTimeout': 120, u'commitOnSuccess': True}}, 'jsonrpc': '2.0', 'method': u'Host.setupNetworks', 'id': u'8b6da14d-c6fa-4f9e-941e-bb741f6ca0da'} at 0x7fb2c0705950> timeout=60, duration=60.00 at 0x7fb2c0705090> task#=192615 at 0x7fb3001654d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in setupNetworks
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1560, in setupNetworks
  supervdsm.getProxy().setupNetworks(networks, bondings, options)
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
  **kwargs)
File: "<string>", line 2, in setupNetworks
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)
2019-10-28 14:01:59,916+0000 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.confirmConnectivity succeeded in 0.00 seconds (__init__:312)

Comment 1 eraviv 2019-10-29 06:28:21 UTC
Running the setup using the API (as well as ansible) or the UI on ovirt-engine-4.3.z triggers a sequence of two actions:
1. setup networks
2. commit network changes done by (1)

The second action is started by the engine only after, and only if, the first completes successfully. Also, vdsm does not allow these two actions to be invoked concurrently. So when multiple setup-networks requests are issued, these two actions collide on vdsm, and vdsm rejects them with:

2019-10-27 15:20:03,560+0000 WARN  (jsonrpc/4) [vds] concurrent network verb already executing (API:1555)
2019-10-27 15:20:03,561+0000 INFO  (jsonrpc/4) [api.network] FINISH setupNetworks return={'status': {'message': 'Resource unavailable', 'code': 40}}

In ovirt-engine-4.4-git.a474be727 and onwards we have corrected this situation: if the commit-on-success flag is specified and set to true in the API (in the UI it is set to true by default), the separate commit-network-changes action does not occur, and these collisions should not occur at all.
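
A minimal sketch of the difference with the Python SDK (ovirtsdk4), assuming the SDK maps the API flag to a commit_on_success argument and using placeholder network/NIC names; with commit_on_success=True the commit happens inside the same setupNetworks verb, so no separate commit request can collide with a following setup (in 4.3 the engine issues the commit itself - the explicit commit below only illustrates the two-verb sequence):

import ovirtsdk4.types as types

def attach_with_commit_on_success(host_service, net_name, nic_name):
    # 4.4 behavior: one verb, the host persists the config on success.
    host_service.setup_networks(
        modified_network_attachments=[
            types.NetworkAttachment(
                network=types.Network(name=net_name),
                host_nic=types.HostNic(name=nic_name),
            )
        ],
        check_connectivity=True,
        commit_on_success=True,
    )

def attach_then_commit(host_service, net_name, nic_name):
    # Pre-fix flow: setup networks, then a separate commit; on vdsm the
    # commit can collide with the next setupNetworks request.
    host_service.setup_networks(
        modified_network_attachments=[
            types.NetworkAttachment(
                network=types.Network(name=net_name),
                host_nic=types.HostNic(name=nic_name),
            )
        ],
        check_connectivity=True,
    )
    host_service.commit_net_config()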

Network QE have decided not to backport this fix to 4.3.z for now.

Comment 2 mlehrer 2019-10-29 13:43:02 UTC
Updating that I synced with eraviv: by attaching networks one at a time instead of in a batch, I am able to work around this.
Updating the bug title to reflect that.
Currently, we have 198 networks on 1 interface on a nested host.
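
A hedged sketch of this workaround with the Python SDK (ovirtsdk4), assuming a host_service handle and placeholder NIC/network names - one setupNetworks request per network, so each setup (and the commit the engine sends after it) finishes before the next one starts:

import ovirtsdk4.types as types

def attach_one_at_a_time(host_service, network_names, nic_name='em2'):
    # Attach each network in its own setupNetworks call instead of one
    # large batch; the next request is only sent after the previous
    # call has returned.
    for name in network_names:
        host_service.setup_networks(
            modified_network_attachments=[
                types.NetworkAttachment(
                    network=types.Network(name=name),
                    host_nic=types.HostNic(name=nic_name),
                )
            ],
            check_connectivity=True,
        )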

Comment 3 eraviv 2019-10-29 13:44:28 UTC
In reference to my comment #1 and the initial logs that were attached to the bug:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. API invocations:
=================== 

1.1 The log extract in my comment #1 is from "case_1_api". It looks like the collision happened only once.
1.2 In the supervdsm.log of "case_1_api" it looks like all 300 networks were created (searching the logs for "adding network u'network_" yields 300 results), e.g.:

MainProcess|jsonrpc/7::INFO::2019-10-27 15:36:06,760::netconfpersistence::58::root::(setNetwork) Adding network network_299...
MainProcess|jsonrpc/7::INFO::2019-10-27 15:36:06,765::legacy_switch::223::root::(_add_network) Configuring device network_299...
MainProcess|jsonrpc/7::DEBUG::2019-10-27 15:36:06,788::ifcfg::471::root::(writeBackupFile) Persistently backed up /var/lib/vdsm/netconfback/ifcfg-eno4.299...

1.3 In "case_2_api" there are no 'adding network' events at all.

2. UI invocations:
================== 

2.1 In both engine logs from "case_3_UI" and from "case_4_UI" I found only one network error:


2019-10-28 09:26:55,459Z ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-70) [] Operation Failed: [Cannot setup Networks. Role (migration/display/gluster/default-route) network 'network_100' has no boot protocol assigned.]

It refers to a configuration error on the part of the user, and is in any case not related to setting up multiple networks.


2.2 "case_4_UI_97_networks": vdsm.log contains the worker blocked error quoted in the description above. It means that for some time a thread was not able to run on vdsm but is not a real problem: the error is reported on 2019-10-28 13:49:13,140 but in the supervdsm.log there are successful setups reported at a later timestamp, e.g. 2019-10-28 14:01:59,450 

2.3 "case_4_UI_97_networks": supervdsm.log contains 96 successful network creations and last network caps contains 96 networks

To summarize: I do not see anything in the UI invocation logs related to the attempt to add a 98th network, which, IIUC, Mordehai tried to add on its own after the setup of the initial 97 networks completed.

Comment 4 Michael Burman 2019-11-05 13:24:48 UTC
I suggest to move to 4.4, as this should be fixed already in 4.4

Comment 5 Dominik Holler 2019-11-05 13:29:53 UTC
(In reply to Michael Burman from comment #4)
> I suggest to move to 4.4, as this should be fixed already in 4.4

No idea, I am excited to know the performance of nmstate.
We should definitely address this soon.

Comment 7 Dominik Holler 2020-01-07 09:38:22 UTC
Is this blocked by bug 1788081?

Comment 8 eraviv 2020-01-07 09:54:28 UTC
(In reply to Dominik Holler from comment #7)
> Is this blocked by bug 1788081?

Here the problem is a concurrency problem inside the vdsm process due to concurrent setup networks actions (setup collides with commit). 
It has been resolved on 4.4 with BZ 1723804.

BZ 1788081 is about libnm having a policy of stopping handling setup networks requests after some time (35 sec).
So IMO not related.

Comment 9 eraviv 2020-01-07 09:56:19 UTC
(In reply to eraviv from comment #8)
> (In reply to Dominik Holler from comment #7)
> > Is this blocked by bug 1788081?
> 
> Here the problem is a concurrency problem inside the vdsm process due to
> concurrent setup networks actions (setup collides with commit). 
> It has been resolved on 4.4 with BZ 1723804.
> 
> BZ 1788081 is about libnm having a policy of stopping handling setup
> networks requests after some time (35 sec).
> So IMO not related.

Also, libnm issue is on 4.4 and this bug is on 4.3, so 1788081 is not a blocker.

Comment 10 Dominik Holler 2020-01-07 10:34:49 UTC
(In reply to eraviv from comment #9)
> (In reply to eraviv from comment #8)
> > (In reply to Dominik Holler from comment #7)
> > > Is this blocked by bug 1788081?
> > 
> > Here the problem is a concurrency problem inside the vdsm process due to
> > concurrent setup networks actions (setup collides with commit). 
> > It has been resolved on 4.4 with BZ 1723804.
> > 

This explains why this bug depends on bug 1723804.

> > BZ 1788081 is about libnm having a policy of stopping handling setup
> > networks requests after some time (35 sec).
> > So IMO not related.
> 
> Also, libnm issue is on 4.4 and this bug is on 4.3, so 1788081 is not a
> blocker.

This is all about implementation details.
From my understanding, the criterion for verifying this bug is to
attach 300 networks with VLAN via UI and ansible in a single shot.
Everything that prevents this blocks this bug.
Do you agree, or am I missing something?

Comment 11 eraviv 2020-01-07 11:18:40 UTC
(In reply to Dominik Holler from comment #10)
> (In reply to eraviv from comment #9)
> > (In reply to eraviv from comment #8)
> > > (In reply to Dominik Holler from comment #7)
> > > > Is this blocked by bug 1788081?
> > > 
> > > Here the problem is a concurrency problem inside the vdsm process due to
> > > concurrent setup networks actions (setup collides with commit). 
> > > It has been resolved on 4.4 with BZ 1723804.
> > > 
> 
> This explains why this bug depends on bug 1723804.
> 
> > > BZ 1788081 is about libnm having a policy of stopping handling setup
> > > networks requests after some time (35 sec).
> > > So IMO not related.
> > 
> > Also, libnm issue is on 4.4 and this bug is on 4.3, so 1788081 is not a
> > blocker.
> 
> This is all about implementation details.
> From my understanding, the criterion for verifying this bug is to
> attach 300 networks with VLAN via UI and ansible in a single shot.
> Everything that prevents this blocks this bug.
> Do you agree, or am I missing something?

If verification is on 4.4 and vdsm uses nmstate then yes - it is a blocker

Comment 12 Sandro Bonazzola 2020-04-08 14:56:17 UTC
This bug is targeted to oVirt 4.4.3 and is in POST state, but the referenced patch is included in the vdsm v4.40.13 tag. Can we move this to MODIFIED for ovirt-4.4.0?

Comment 13 Dominik Holler 2020-04-08 19:30:21 UTC
Created attachment 1677354 [details]
vdsm logfiles of attaching 300 networks in one shot

With fix from https://bugzilla.redhat.com/show_bug.cgi?id=1820009#c2, attaching 300 networks in one shot works.

Comment 14 Dominik Holler 2020-04-21 15:11:49 UTC
Adding 200 networks in batches of 100 networks,
removing them,
adding 200 other networks,
modifying up to 100 networks and
rebooting after this should work.

In Administration Portal, the Compute > Hosts > hostname > Network Interfaces tab and
the "Setup Host Networks" dialog should open in less than 10 seconds each.

Comment 15 mlehrer 2020-05-11 16:26:42 UTC
4.4.0-0.31.master.el8
HE environment with 200 nested hosts of which 150 hosts have 100 networks per host.
DWH separated, JVM set to 4G, and the engine configured with 200 pool connections / 250 DB connections.



# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT count(*) from vds_interface_view"
 count 
-------
 15825
(1 row)


Able to add 200 networks in batches of 100, when a timeout is added between the batches.
Able to remove them, re-add them, and reboot the VM.
Able to open setup networks in less than 12s.
Attaching 1 network to a host with 199 networks already attached took around 42s via Setup Host Networks in the UI.
Able to restart a host with 200 networks.
Time spent on restart:
[root@vhost ~]# systemd-analyze blame
         41.210s vdsm-network.service
         30.072s NetworkManager-wait-online.service
         27.847s kdump.service
          9.238s vdsmd.service
          7.086s vdsm-network-init.service
          1.818s firewalld.service
          1.635s initrd-switch-root.service
          1.492s tuned.service
          1.291s dracut-initqueue.service
          1.204s systemd-udev-settle.service
          1.108s ovirt-imageio.service
           986ms libvirtd.service
           974ms sssd.service

Comment 16 Sandro Bonazzola 2020-05-20 20:03:41 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.