Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 871722

Summary: 3.x [Scalability] Failed to attach more than ~3350 vlan-networks to host's nic due to unexpected exception in getVdsCaps
Product: Red Hat Enterprise Virtualization Manager Reporter: Rami Vaknin <rvaknin>
Component: vdsmAssignee: lpeer <lpeer>
Status: CLOSED WONTFIX QA Contact: meital avital <mavital>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0CC: acathrow, bazulay, danken, hateya, iheim, jkt, lpeer, yeylon
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard: network
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-12 12:23:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vdsm and engine logs none

Description Rami Vaknin 2012-10-31 08:18:55 UTC
Created attachment 635981 [details]
vdsm and engine logs

Versions:
vdsm-4.9.6-39.0.el6_3.x86_64
si22

Scenario:
1. Create 4096 vlan-networks in a data center and attach them to a cluster
2. Add the above vlan-networks to host's nic

Results:

I've managed to attach ~3350 networks to the host nic without any problem while using a bare-metal host (Intel Xeon E5645, 2.40GHz, 32GB RAM), then the next ~20 networks succedded/failed alternately, then all the rest network attachements failed.

Note that every ifconfig command (with output to /dev/null) on this order of host networks takes ~40 seconds to return while "ip addr" takes less than 1 second.


Additional Information:

From vdsm.log, same output dumped by manually running of getVdsCaps
===================================================================

Thread-132709::ERROR::2012-10-29 07:10:43,737::BindingXMLRPC::892::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 879, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 346, in delNetwork
    return api.delNetwork(bridge, vlan, bond, nics, options)
  File "/usr/share/vdsm/API.py", line 1164, in delNetwork
    _netinfo = netinfo.NetInfo()
  File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 369, in __init__
    _netinfo = get()
  File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 308, in get
    for bridge in bridges()])
KeyError: 'vlan3350'


Side effect in rhevm
====================

2012-10-29 06:43:17,498 INFO  [org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand] (ajp-/127.0.0.1:8702-12) [5c1ed607] Running command: AttachNetworkToVdsInterfaceCommand internal: false. Entities affected :  ID: b79e6836-1fad-11e2-81c1-8febf9096a31 Type: VDS
2012-10-29 06:43:17,501 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddNetworkVDSCommand] (ajp-/127.0.0.1:8702-12) [5c1ed607] START, AddNetworkVDSCommand(HostName = puma01, HostId = b79e6836-1fad-11e2-81c1-8febf9096a31, networkName=vlan3346, oldNetworkName=null, hostAddr=null, checkConnectivity=false, connectionTimeout=0, vlanId=3346, bondName=null, nics=[eth3], inetAddr=, networkMask=null, gateway=, stp=false, bondingOptions=null, bootProtocol=None, vmNetwork=true, network=vlan3346 {id=2c2747be-f33f-445e-871e-12cb1f91bbd8, description=null, subnet=null, gateway=null, type=null, vlan_id=3346, stp=false, storage_pool_id=5058a6dd-37cb-44ed-b811-d84aee9aae4c, mtu=0, vmNetwork=true, cluster=network_cluster {id={clusterId=null, networkId=null}, status=NonOperational, is_display=false, required=true}}), log id: 2e839d41
2012-10-29 06:46:17,504 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (ajp-/127.0.0.1:8702-12) [5c1ed607] XML RPC error in command AddNetworkVDS ( HostName = puma01 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-29 06:46:17,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddNetworkVDSCommand] (ajp-/127.0.0.1:8702-12) [5c1ed607] FINISH, AddNetworkVDSCommand, log id: 2e839d41
2012-10-29 06:46:17,504 ERROR [org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand] (ajp-/127.0.0.1:8702-12) [5c1ed607] Command org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSNetworkException: 
2012-10-29 06:47:13,825 ERROR [org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand] (ajp-/127.0.0.1:8702-12) [5c1ed607] Transaction rolled-back for command: org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand.

Comment 2 Dan Kenigsberg 2012-11-17 22:48:17 UTC
The vdsm.log excerpt is from a call to delNetwork(vlan3350)

Thread-132709::DEBUG::2012-10-29 07:09:41,213::BindingXMLRPC::874::vds::(wrapper) client [10.35.160.91]::call delNetwork with ('vlan3350', '3350', '', ['eth3']) {} flowID [37e06cbd]

Any idea why this was attempted, or why it has no trace in engine.log?

Comment 3 Rami Vaknin 2012-11-18 12:48:47 UTC
According to the correlation id, it's probably operation done by engine's transaction roll-back:

2012-10-29 07:06:41,164 INFO  [org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand] (ajp-/127.0.0.1:8702-3) [37e06cbd] Running command: AttachNetworkToVdsInterfaceCommand internal: false. Entities affected :  ID: b79e6836-1fad-11e2-81c1-8febf9096a31 Type: VDS
2012-10-29 07:06:41,166 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddNetworkVDSCommand] (ajp-/127.0.0.1:8702-3) [37e06cbd] START, AddNetworkVDSCommand(HostName = puma01, HostId = b79e6836-1fad-11e2-81c1-8febf9096a31, networkName=vlan3350, oldNetworkName=null, hostAddr=null, checkConnectivity=false, connectionTimeout=0, vlanId=3350, bondName=null, nics=[eth3], inetAddr=, networkMask=null, gateway=, stp=false, bondingOptions=null, bootProtocol=None, vmNetwork=true, network=vlan3350 {id=71bc913c-a1a8-459b-9285-69d0362a058e, description=null, subnet=null, gateway=null, type=null, vlan_id=3350, stp=false, storage_pool_id=5058a6dd-37cb-44ed-b811-d84aee9aae4c, mtu=0, vmNetwork=true, cluster=network_cluster {id={clusterId=null, networkId=null}, status=NonOperational, is_display=false, required=true}}), log id: 2a320cb6
2012-10-29 07:09:41,168 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (ajp-/127.0.0.1:8702-3) [37e06cbd] XML RPC error in command AddNetworkVDS ( HostName = puma01 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-29 07:09:41,169 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddNetworkVDSCommand] (ajp-/127.0.0.1:8702-3) [37e06cbd] FINISH, AddNetworkVDSCommand, log id: 2a320cb6
2012-10-29 07:09:41,169 ERROR [org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand] (ajp-/127.0.0.1:8702-3) [37e06cbd] Command org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSNetworkException: 
2012-10-29 07:10:00,000 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-87) Autorecovering hosts is disabled, skipping
2012-10-29 07:10:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-87) Autorecovering storage domains is disabled, skipping
2012-10-29 07:10:43,738 ERROR [org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand] (ajp-/127.0.0.1:8702-3) [37e06cbd] Transaction rolled-back for command: org.ovirt.engine.core.bll.AttachNetworkToVdsInterfaceCommand.
2012-10-29 07:10:43,758 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-3) Operation Failed: [Network error during communication with the Host.]

Comment 4 Dan Kenigsberg 2012-11-18 13:32:19 UTC
Yes, there has been a network error when vdsm attempted to respond

MainProcess|Thread-132580::DEBUG::2012-10-29 07:09:43,411::configNetwork::248::root::(writeBackupFile) Persistently backed up /var/lib/vdsm/netconfback/logicalnetworks/vlan3350 (until next 'set safe config')
Thread-132580::DEBUG::2012-10-29 07:09:43,511::BindingXMLRPC::880::vds::(wrapper) return addNetwork with {'status': {'message': 'Done', 'code
': 0}}
Thread-132580::ERROR::2012-10-29 07:09:43,516::SecureXMLRPCServer::73::root::(handle_error) client ('10.35.160.91', 36988)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 69, in finish_request
    client_address)
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
    method()
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 494, in do_POST
    self.wfile.write(response)
  File "/usr/lib64/python2.6/socket.py", line 324, in write
    self.flush()
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
    self._sock.sendall(buffer(data, write_offset, buffer_size))
  File "/usr/lib64/python2.6/ssl.py", line 203, in sendall
    v = self.send(data[count:])
  File "/usr/lib64/python2.6/ssl.py", line 174, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe

Comment 5 lpeer 2014-01-12 12:23:48 UTC
Over 3k vlans on a single host should be ok given that physical switches scale only to a few hundreds.