Created attachment 1479750 [details] engine, vdsm and supervdsm logs Description of problem: VDSM failed to report to engine 'GetCapabilitiesVDS' method after engine sent setupNetworks command From engine.log - tion_4ddf55e4-f9b8-4049] START, HostSetupNetworksVDSCommand(HostName = host_mixed_1, HostSetupNetworksVdsCommandParameters:{hostId='9838c152-9549-49bb-b8b6-20335643cd4c', vds='Host[host_mixed_1,9838c152-9549-49bb-b8b6-20335643cd4c]', rollbackOnFailure='true', connectivityTimeout='60', networks='[]', removedNetworks='[]', bonds='[CreateOrUpdateBond:{id='8f7c60d9-861d-460c-a793-973793ba174c', name='bond026', bondingOptions='mode=active-backup arp_ip_target=192.168.0.2 arp_interval=1', slaves='[dummy_14, dummy_13]'}]', removedBonds='[]', clusterSwitchType='LEGACY'}), log id: 27d862f8 2018-08-29 21:53:49,069+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (default task-38) [hosts_syncAction_4ddf55e4-f9b8-4049] FINISH, HostSetupNetworksVDSCommand, log id: 27d862f8 2018-08-29 21:53:55,398+03 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (default task-38) [hosts_syncAction_4ddf55e4-f9b8-4049] Unexpected return value: Status [code=-32603, message=Internal JSON-RPC error: {'reason': '[Errno 9] Bad file descriptor'}] 2018-08-29 21:53:55,399+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (default task-38) [hosts_syncAction_4ddf55e4-f9b8-4049] Failed in 'GetCapabilitiesVDS' method From vdsm.log - 2018-08-29 21:53:55,392+0300 ERROR (jsonrpc/0) [jsonrpc.JsonRpcServer] Internal server error (__init__:611) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request res = method(**params) File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod result = fn(*methodArgs) File "<string>", line 2, in getCapabilities File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities c = caps.get() File "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get net_caps = supervdsm.getProxy().network_caps() File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__ return callMethod() File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda> **kwargs) File "<string>", line 2, in network_caps File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod raise convert_to_error(kind, result) OSError: [Errno 9] Bad file descriptor 2018-08-29 21:53:55,393+0300 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities failed (error -32603) in 1.24 secon ds (__init__:573) From supervdsm.log - netlink/events::ERROR::2018-08-29 21:53:55,391::concurrent::201::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 140286 838302464)> failed Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 186, in _scan libnl.nl_recvmsgs_default(sock) File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 454, in nl_recvmsgs_default raise IOError(-err, nl_geterror(err)) IOError: [Errno 5] Out of memory MainProcess|jsonrpc/0::ERROR::2018-08-29 21:53:55,391::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper res = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps return netswitch.configurator.netcaps(compatibility=30600) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 318, in netcaps _add_speed_device_info(net_caps) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 353, in _add_speed_device_info timeout -= _wait_for_link_up(devname, timeout) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 369, in _wait_for_link_up pass File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/waitfor.py", line 41, in waitfor_linkup return File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 145, in __exit__ self.stop() File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 198, in stop os.write(self._pipetrick[1], b'c') OSError: [Errno 9] Bad file descriptor Version-Release number of selected component (if applicable): ovirt-engine-4.2.6.4-0.1.el7ev.noarch vdsm-4.20.39-1.el7ev.x86_64 How reproducible: Once
This is very rare. We may be able to address this only if we have a reproducer.
It appears that this issue is returning. Version-Release number of selected component (if applicable): ------------------------------------------------------------- ovirt-engine-4.4.0-0.0.master.20190327134249.git9e2deb5.el7.noarch vdsm-4.30.12-1.el7.x86_64 How reproducible: ----------------- 30% (4/12 failed test runs in automation) Additional notes: ----------------- VDSM's NetLink buffer was increased to 1024 * 1024 (instead of the original 1024 * 512)
Most relevant part of supervdsm.log seems to be: netlink/events::ERROR::2019-04-07 10:34:32,210::concurrent::202::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 139722559039232)> failed Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 188, in _scan libnl.nl_recvmsgs_default(sock) File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 455, in nl_recvmsgs_default raise IOError(-err, nl_geterror(err)) IOError: [Errno 5] Out of memory MainProcess|jsonrpc/7::ERROR::2019-04-07 10:34:32,211::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in network_caps Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper res = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 58, in network_caps return netswitch.configurator.netcaps(compatibility=30600) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 367, in netcaps _add_speed_device_info(net_caps) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 402, in _add_speed_device_info timeout -= _wait_for_link_up(devname, timeout) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 418, in _wait_for_link_up pass File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/waitfor.py", line 43, in waitfor_linkup return File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 147, in __exit__ self.stop() File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 200, in stop os.write(self._pipetrick[1], b'c') OSError: [Errno 9] Bad file descriptor
Moshe, could you please show how many interfaces there are on the node? ip -o addr show | wc -l On top of that, could you raise the memory even more (wonderful fix), maybe to something ridiculously big?
ip -o addr show | wc -l produces 7 interfaces. Will increasing the buffer 12 times be big enough?
7? Hmmm, I wonder what eats the whole memory then. Please try it 12x, if it still fails, maybe 120x! But anyway, it shoudn't be that hungry...
Failed with _NL_SOCKET_BUFF_SIZE = 12288 * 512 (which is 12x times), seeing the same error. I believe there is a deeper issue here.
netlink/events::DEBUG::2020-06-19 02:34:01,694::concurrent::258::root::(run) START thread <Thread(netlink/events, started daemon 140210834605824)> (func=<bound method Monitor._scan of <vdsm.network.netlink.monitor.Monitor object at 0x7f8584c18e10>>, args=(), kwargs={}) MainProcess|jsonrpc/4::ERROR::2020-06-19 02:34:02,210::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) Error in network_caps Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 101, in wrapper res = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 58, in network_caps return netswitch.configurator.netcaps(compatibility=30600) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 367, in netcaps _add_speed_device_info(net_caps) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 402, in _add_speed_device_info timeout -= _wait_for_link_up(devname, timeout) File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 418, in _wait_for_link_up pass File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/waitfor.py", line 46, in waitfor_linkup return File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 147, in __exit__ self.stop() File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 200, in stop os.write(self._pipetrick[1], b'c') OSError: [Errno 9] Bad file descriptor netlink/events::ERROR::2020-06-19 02:34:02,211::concurrent::267::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 140210834605824)> failed Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/monitor.py", line 188, in _scan libnl.nl_recvmsgs_default(sock) File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 455, in nl_recvmsgs_default raise IOError(-err, nl_geterror(err)) IOError: [Errno 5] Out of memory
QE see this once again on tier2 4.3.11 - vdsm-4.30.48-1.el7ev.x86_64 Indeed rare, but still happens from time to time.
Re-targeting to 4.4.3 being closed current release with not being reproducible in latest 4.4.