Bug 1623851 - [VDSM] supervdsm network_caps fails with "IOError: [Errno 5] Out of memory"
Summary: [VDSM] supervdsm network_caps fails with "IOError: [Errno 5] Out of memory"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: SuperVDSM
Version: 4.20.31
Hardware: x86_64
OS: Linux
low
medium
Target Milestone: ovirt-4.4.3
: ---
Assignee: Dominik Holler
QA Contact: msheena
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-08-30 10:22 UTC by msheena
Modified: 2022-01-06 14:09 UTC (History)
4 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-09-15 13:52:56 UTC
oVirt Team: Network
Embargoed:
sbonazzo: ovirt-4.4?


Attachments (Terms of Use)
engine, vdsm and supervdsm logs (6.37 MB, application/x-gzip)
2018-08-30 10:22 UTC, msheena
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1466628 0 high CLOSED SetupNetworks failed after sometime with 'IOError: [Errno 5] Out of memory' and 'OSError: [Errno 9] Bad file descriptor' 2021-02-22 00:41:40 UTC

Description msheena 2018-08-30 10:22:10 UTC
Created attachment 1479750 [details]
engine, vdsm and supervdsm logs

Description of problem:
VDSM failed to report to engine 'GetCapabilitiesVDS' method after engine sent setupNetworks command

From engine.log -
 
tion_4ddf55e4-f9b8-4049] START, HostSetupNetworksVDSCommand(HostName = host_mixed_1, HostSetupNetworksVdsCommandParameters:{hostId='9838c152-9549-49bb-b8b6-20335643cd4c', vds='Host[host_mixed_1,9838c152-9549-49bb-b8b6-20335643cd4c]', rollbackOnFailure='true', connectivityTimeout='60', networks='[]', removedNetworks='[]', bonds='[CreateOrUpdateBond:{id='8f7c60d9-861d-460c-a793-973793ba174c', name='bond026', bondingOptions='mode=active-backup arp_ip_target=192.168.0.2 arp_interval=1', slaves='[dummy_14, dummy_13]'}]', removedBonds='[]', clusterSwitchType='LEGACY'}), log id: 27d862f8
2018-08-29 21:53:49,069+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (default task-38) [hosts_syncAction_4ddf55e4-f9b8-4049] FINISH, HostSetupNetworksVDSCommand, log id: 27d862f8
2018-08-29 21:53:55,398+03 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (default task-38) [hosts_syncAction_4ddf55e4-f9b8-4049] Unexpected return value: Status [code=-32603, message=Internal JSON-RPC error: {'reason': '[Errno 9] Bad file descriptor'}]
2018-08-29 21:53:55,399+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (default task-38) [hosts_syncAction_4ddf55e4-f9b8-4049] Failed in 'GetCapabilitiesVDS' method

From vdsm.log -
 
2018-08-29 21:53:55,392+0300 ERROR (jsonrpc/0) [jsonrpc.JsonRpcServer] Internal server error (__init__:611)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
    result = fn(*methodArgs)
  File "<string>", line 2, in getCapabilities
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities
    c = caps.get()
  File "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
    net_caps = supervdsm.getProxy().network_caps()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in network_caps
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
OSError: [Errno 9] Bad file descriptor
2018-08-29 21:53:55,393+0300 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities failed (error -32603) in 1.24 secon
ds (__init__:573)

From supervdsm.log - 

netlink/events::ERROR::2018-08-29 21:53:55,391::concurrent::201::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 140286
838302464)> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 186, in _scan
    libnl.nl_recvmsgs_default(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 454, in nl_recvmsgs_default
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 5] Out of memory
MainProcess|jsonrpc/0::ERROR::2018-08-29 21:53:55,391::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 318, in netcaps
    _add_speed_device_info(net_caps)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 353, in _add_speed_device_info
    timeout -= _wait_for_link_up(devname, timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 369, in _wait_for_link_up
    pass
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/waitfor.py", line 41, in waitfor_linkup
    return
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 145, in __exit__
    self.stop()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 198, in stop
    os.write(self._pipetrick[1], b'c')
OSError: [Errno 9] Bad file descriptor


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.6.4-0.1.el7ev.noarch
vdsm-4.20.39-1.el7ev.x86_64

How reproducible:
Once

Comment 1 Dan Kenigsberg 2018-12-25 10:40:29 UTC
This is very rare. We may be able to address this only if we have a reproducer.

Comment 2 msheena 2019-04-07 11:07:04 UTC
It appears that this issue is returning.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
ovirt-engine-4.4.0-0.0.master.20190327134249.git9e2deb5.el7.noarch
vdsm-4.30.12-1.el7.x86_64

How reproducible:
-----------------
30% (4/12 failed test runs in automation)

Additional notes:
-----------------
VDSM's NetLink buffer was increased to 1024 * 1024 (instead of the original 1024 * 512)

Comment 4 Dominik Holler 2019-04-08 14:16:28 UTC
Most relevant part of supervdsm.log seems to be:
netlink/events::ERROR::2019-04-07 10:34:32,210::concurrent::202::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 139722559039232)> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 188, in _scan
    libnl.nl_recvmsgs_default(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 455, in nl_recvmsgs_default
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 5] Out of memory
MainProcess|jsonrpc/7::ERROR::2019-04-07 10:34:32,211::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 58, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 367, in netcaps
    _add_speed_device_info(net_caps)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 402, in _add_speed_device_info
    timeout -= _wait_for_link_up(devname, timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 418, in _wait_for_link_up
    pass
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/waitfor.py", line 43, in waitfor_linkup
    return
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 147, in __exit__
    self.stop()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 200, in stop
    os.write(self._pipetrick[1], b'c')
OSError: [Errno 9] Bad file descriptor

Comment 5 Petr Horáček 2019-04-08 14:35:09 UTC
Moshe, could you please show how many interfaces there are on the node?

ip -o addr show | wc -l

On top of that, could you raise the memory even more (wonderful fix), maybe to something ridiculously big?

Comment 6 msheena 2019-04-10 11:09:56 UTC
ip -o addr show | wc -l produces 7 interfaces.
Will increasing the buffer 12 times be big enough?

Comment 7 Petr Horáček 2019-04-10 14:00:38 UTC
7? Hmmm, I wonder what eats the whole memory then. Please try it 12x, if it still fails, maybe 120x! But anyway, it shoudn't be that hungry...

Comment 8 msheena 2019-04-17 10:51:14 UTC
Failed with _NL_SOCKET_BUFF_SIZE = 12288 * 512 (which is 12x times), seeing the same error.
I believe there is a deeper issue here.

Comment 9 Dominik Holler 2020-06-22 06:31:10 UTC
netlink/events::DEBUG::2020-06-19 02:34:01,694::concurrent::258::root::(run) START thread <Thread(netlink/events, started daemon 140210834605824)> (func=<bound method Monitor._scan of <vdsm.network.netlink.monitor.Monitor object at 0x7f8584c18e10>>, args=(), kwargs={})
MainProcess|jsonrpc/4::ERROR::2020-06-19 02:34:02,210::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 58, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 367, in netcaps
    _add_speed_device_info(net_caps)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 402, in _add_speed_device_info
    timeout -= _wait_for_link_up(devname, timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 418, in _wait_for_link_up
    pass
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/waitfor.py", line 46, in waitfor_linkup
    return
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 147, in __exit__
    self.stop()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 200, in stop
    os.write(self._pipetrick[1], b'c')
OSError: [Errno 9] Bad file descriptor
netlink/events::ERROR::2020-06-19 02:34:02,211::concurrent::267::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 140210834605824)> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 188, in _scan
    libnl.nl_recvmsgs_default(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 455, in nl_recvmsgs_default
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 5] Out of memory

Comment 10 Michael Burman 2020-06-22 06:37:43 UTC
QE see this once again on tier2 4.3.11 - vdsm-4.30.48-1.el7ev.x86_64

Indeed rare, but still happens from time to time.

Comment 13 Sandro Bonazzola 2020-11-17 12:50:16 UTC
Re-targeting to 4.4.3 being closed current release with not being reproducible in latest 4.4.


Note You need to log in before you can comment on or make changes to this bug.