Description of problem:

Customer was testing network outages by pulling network cables. A cable was unplugged and the host was fenced by power management. Upon restarting the host the network came up, but supervdsm was unable to return network_caps(). There is a big gap in the logs; it seems supervdsm hung shortly after initialization and did not answer any vdsm requests. The problem was not solved until vdsm/supervdsm were restarted. It looks like supervdsm simply locked up or hung, or at least some thread/logging did. Supervdsm seems to have run the usual 'tc qdisc' and 'ovs' commands successfully, but it never returned network_caps() to vdsm, and shortly afterwards supervdsm stopped logging too. Details below.

1. After the network test, the host goes up:

2019-01-14 18:26:50,843Z INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-280398) [606d138b] EVENT_ID: FENCE_OPERATION_SUCCEEDED(9,018), Power management stop of Host <removed> succeeded.
2019-01-14 18:27:07,322Z INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-280398) [] EVENT_ID: FENCE_OPERATION_SUCCEEDED(9,018), Power management start of Host <removed> succeeded.

2. Host is activated:

2019-01-14 18:29:55,222Z INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (default task-21) [9990796c-7b4b-4a4b-b6b2-ba23dfbec2c8] Lock Acquired to object 'EngineLock:{exclusiveLocks='[7e78b3d7-dab1-424e-ad95-ab587f04c5cb=VDS]', sharedLocks=''}'
2019-01-14 18:29:55,380Z INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to <removed>/<removed>

3. But GetCaps fails over and over, as below:

2019-01-14 18:36:01,496Z ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80}

On the host side...

4. Host boots:

Jan 14 18:29:06 wuv-ishypv001b kernel: Linux version 3.10.0-862.11.6.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Aug 10 16:55:11 UTC 2018

5. Network is up:

Jan 14 18:29:32 wuv-ishypv001b kernel: bnx2x 0000:5c:00.1 eno4: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Jan 14 18:29:32 wuv-ishypv001b kernel: bnx2x 0000:5c:00.0 eno3: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Jan 14 18:29:33 wuv-ishypv001b kernel: i40e 0000:14:00.1 eno2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
Jan 14 18:29:33 wuv-ishypv001b kernel: i40e 0000:14:00.0 eno1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None

6. VDSM starts:

2019-01-14 18:29:42,463+0000 INFO (MainThread) [vds] (PID: 21673) I am the actual vdsm 4.20.39.1-1.el7ev <removed> (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)

7. Here comes the first GetCaps:

2019-01-14 18:29:55,852+0000 INFO (jsonrpc/7) [api.host] START getCapabilities() from=::ffff:<removed>,43994 (api:46)

8. But it is blocked on supervdsm:

2019-01-14 18:30:55,479+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'2b2c13c7-d548-4238-bb3e-bd77b93530d0'} at 0x7f235c2e9550> timeout=60, duration=60 at 0x7f235c2e9590> task#=2 at 0x7f235c3a1090>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__ self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__ self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__ return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda> **kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod kind, result = conn.recv()
(executor:363)

9. In the supervdsm log we see the request:

MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,084::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,127::routes::109::root::(get_gateway) The gateway <removed> is duplicated for the device ovirtmgmt
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,151::cmdutils::150::root::(exec_cmd) /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,160::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,911::cmdutils::150::root::(exec_cmd) /usr/share/openvswitch/scripts/ovs-ctl status (cwd None)
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,937::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,941::vsctl::67::root::(commit) Executing commands: /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,944::cmdutils::150::root::(exec_cmd) /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface (cwd None)
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,998::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0

10. But there is no return of network_caps in the supervdsm log, while vdsm was obviously waiting for the response.

11. And look at the time gap in the supervdsm log shortly after this network_caps that never returned: 2h and 30min with no logs.

MainProcess|jsonrpc/7::DEBUG::2019-01-14 18:29:57,091::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess::DEBUG::2019-01-14 21:03:01,322::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call multipath_status with () {}

12. It also took a while to see all the network_caps requests; it looks like they piled up to start with.

MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,084::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/1::DEBUG::2019-01-14 18:29:56,085::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2019-01-14 18:29:56,086::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2019-01-14 18:29:56,087::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/6::DEBUG::2019-01-14 18:29:56,088::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2019-01-14 18:29:56,089::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/7::DEBUG::2019-01-14 18:29:56,089::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/5::DEBUG::2019-01-14 18:29:56,099::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}

13. And it never returned any of these. It seems supervdsm was hung until 21:03:01, when it "resumed".

14. Note these 9215 seconds shortly after it started working again:

MainProcess::DEBUG::2019-01-14 21:03:01,323::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 9215 seconds (max pending: 24)

Version-Release number of selected component (if applicable):
vdsm-4.20.39.1-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64

How reproducible:
Unknown
What happens on

for i in seq 20; do vdsm-cli Host getCapabilities & done
(In reply to Dominik Holler from comment #6)
> What happens on
> for i in seq 20; do vdsm-cli Host getCapabilities & done

vdsm-4.20.39.1-1.el7.x86_64 is not confused by

for i in $(seq 100) ; do (vdsm-client Host getCapabilities &) ; done
Customer was able to reproduce with the following steps:

- Network removed at 12:57:05
- Network reconnected at 13:27:34
  -- Note that the host was fenced and booted with the network still disconnected. This was the case in the first test run also. After reconnecting the network, the host doesn't try to acquire an IP through DHCP. I assume this is due to not being run under NetworkManager; need to investigate that.
- Host restarted manually at 13:34:40 with power management and returned from boot in a non-responsive state.
- "Too many tasks" error at 13:46:48

Will attach the latest LogCollectors from the customer reproducer in case they are helpful.
Yup, the customer reproduced it again. Slightly different this time, as network_caps() worked for a while, but then the avalanche of network_caps() calls arrives and supervdsm seems to hang.

2019-01-15 13:37:44,499+0000 INFO (MainThread) [vds] (PID: 21472) I am the actual vdsm 4.20.39.1-1.el7ev <removed> (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)

It works for a while...

2019-01-15 13:37:59,240+0000 INFO (jsonrpc/1) [api.host] START getCapabilities() from=::ffff:127.0.0.1,44464 (api:46)
2019-01-15 13:38:00,629+0000 INFO (jsonrpc/1) [api.host] FINISH getCapabilities return

then it gets blocked:

2019-01-15 13:41:12,210+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__ self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__ self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__ return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda> **kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod kind, result = conn.recv()
(executor:363)

supervdsm received a ton of calls at once, like before, and didn't return anything:

MainProcess|jsonrpc/1::DEBUG::2019-01-15 13:40:12,206::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2019-01-15 13:40:12,207::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/5::DEBUG::2019-01-15 13:40:12,208::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/0::DEBUG::2019-01-15 13:40:12,210::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/6::DEBUG::2019-01-15 13:40:12,211::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/7::DEBUG::2019-01-15 13:40:12,214::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2019-01-15 13:40:12,215::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2019-01-15 13:40:12,215::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}

vdsm logs for those:

2019-01-15 13:40:12,206+0000 INFO (jsonrpc/1) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,206+0000 INFO (jsonrpc/3) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,208+0000 INFO (jsonrpc/5) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,210+0000 INFO (jsonrpc/0) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,210+0000 INFO (jsonrpc/4) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,210+0000 INFO (jsonrpc/6) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,214+0000 INFO (jsonrpc/7) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,215+0000 INFO (jsonrpc/2) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)

The last line in the supervdsm log is:

MainProcess|jsonrpc/3::DEBUG::2019-01-15 13:40:13,111::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0

While the date is:

$ cat date
Tue Jan 15 13:51:43 GMT 2019

So it seems that when the sosreport was taken, supervdsm had already been hung for 10 minutes.

Interestingly, it seems the get_caps() calls initiated internally (127.0.0.1) were returned, while the ones from the engine hung. This is difficult to confirm, as supervdsm does not seem to be aware of where vdsm received the call from. Might be just a coincidence.
Finally, note that after the restart there were also some tasks blocked on hosted-engine-ha:

2019-01-15 13:38:19,404+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f71dcef5290> at 0x7f71dcef52d0> timeout=15, duration=15 at 0x7f72200342d0> task#=3 at 0x7f71dcda6710>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__ self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__ self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__ stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats ret['haStats'] = _getHaInfo()
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo stats = instance.get_all_stats()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats stats = broker.get_stats_from_storage()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage result = self._proxy.get_stats()
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__ return self.__send(self.__name, args)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request verbose=self.__verbose
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request response = h.getresponse(buffering=True)
File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse response.begin()
File: "/usr/lib64/python2.7/httplib.py", line 444, in begin version, status, reason = self._read_status()
File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status line = self.fp.readline(_MAXLINE + 1)
File: "/usr/lib64/python2.7/socket.py", line 476, in readline data = self._sock.recv(self._rbufsize)
(executor:363)

The ha-broker seems to be leaking FDs and therefore hitting "Too many open files" when doing get_stats_from_storage(). I'll open another bug for that.
https://bugzilla.redhat.com/show_bug.cgi?id=1666553
> MainProcess::DEBUG::2019-01-14 21:03:01,323::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 9215 seconds (max pending: 24)

We log this every 60 seconds, so this means that the logging handler thread writing logs to storage was blocked for 9215 seconds. "max pending: 24" means that during that interval there were only up to 24 messages in the queue, so we can be pretty sure that the entire process was blocked for this duration.

This usually happens when we call into a C extension (it could be a function from the standard library) and the C code does not release the GIL while calling out to some blocking syscall that then remains blocked, here for about 9215 seconds.

If you can reproduce this, please provide a core dump of the supervdsm process while it is in this state.
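To illustrate the failure mode described above (this is not the supervdsm code path, just a self-contained sketch): a C call made while holding the GIL freezes every Python thread in the process, including a logging handler thread. ctypes.PyDLL is documented to keep the GIL held for the duration of the foreign call, so a toy like the following reproduces the symptom; libc.so.6 and the 10-second sleep are only illustrative values.

import ctypes
import threading
import time

# PyDLL keeps the GIL held during the foreign call, unlike CDLL which
# releases it. Any blocking call made this way stalls every other
# Python thread in the process.
libc_gil_held = ctypes.PyDLL("libc.so.6")

def background_logger():
    # Stand-in for the ThreadedHandler thread: it cannot run while the
    # GIL is held by the blocked foreign call below.
    for _ in range(10):
        print("logger tick %s" % time.strftime("%H:%M:%S"))
        time.sleep(1)

t = threading.Thread(target=background_logger)
t.daemon = True
t.start()

time.sleep(2)             # let the "logger" produce a couple of ticks
libc_gil_held.sleep(10)   # blocks in C with the GIL held: no ticks for ~10s
t.join()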
The long-term solution for such issues is to use the systemd watchdog feature for supervdsmd. If supervdsmd gets stuck for a long time, systemd will kill it, creating a core dump. This will both recover the system and reveal the reason for the blocking in the core dump. In the worst case supervdsmd is in D state and cannot be killed; we cannot do anything in that case. We have bug 1522916 for vdsm; I think we can add supervdsm to the same bug.
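As a rough illustration of the watchdog idea (not the actual vdsm patch): the unit file declares WatchdogSec=, and the daemon must ping systemd over the notify socket from its main loop; if the main loop (or the whole process) freezes, the pings stop and systemd kills the service, which can then be core-dumped. The 30/10-second values and the hand-rolled sd_notify helper below are assumptions for the sketch, not supervdsmd's configuration.

# Unit file side (assumed values):
#   [Service]
#   WatchdogSec=30
#   NotifyAccess=main
#
# Daemon side: a minimal sd_notify ping without external dependencies.
import os
import socket
import time

def sd_notify(message):
    # NOTIFY_SOCKET is set by systemd when the watchdog/notify feature is used.
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return
    if path.startswith("@"):          # abstract socket namespace
        path = "\0" + path[1:]
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(b"" + message, path)
    finally:
        sock.close()

sd_notify(b"READY=1")
while True:
    # Sent from the main loop, so a stuck process stops pinging and gets
    # killed (and core-dumped) by systemd once WatchdogSec elapses.
    sd_notify(b"WATCHDOG=1")
    time.sleep(10)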
Had to stop halfway to work on another case; this is what I found so far, I can continue on Monday.

(gdb) info threads
  Id   Target Id         Frame
  13   Thread 0x7fa8149df740 (LWP 4430) 0x00007fa813802e9d in poll () at ../sysdeps/unix/syscall-template.S:81
  12   Thread 0x7fa7d37fe700 (LWP 62937) 0x00007fa81380e113 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
  11   Thread 0x7fa7d3fff700 (LWP 51738) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  10   Thread 0x7fa7f4ff9700 (LWP 51336) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  9    Thread 0x7fa7f57fa700 (LWP 50685) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  8    Thread 0x7fa7f5ffb700 (LWP 50540) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  7    Thread 0x7fa7f67fc700 (LWP 50514) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  6    Thread 0x7fa7f6ffd700 (LWP 50432) 0x00007fa8141f2aab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fa7e8002720) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  5    Thread 0x7fa7f77fe700 (LWP 50314) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7fa7f7fff700 (LWP 50241) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7fa7fcd97700 (LWP 49929) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7fa7fd698700 (LWP 4574) 0x00007fa8141f3a2d in accept () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7fa80a9a8700 (LWP 4476) 0x00007fa8141f2aab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fa804000f60) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
(gdb)

- Thread 1 is the logger and is waiting at [1].
- Thread 6 is doing the hung network_caps() requested by a client and is stuck in _wait_for_link_up [2] for device eno1 with a timeout of 2. Up the stack this thread is waiting at [3], i.e. it is waiting to retrieve something from a collections.deque that holds netlink events read from the kernel. This ends up in a semaphore at 0x7fa7e8002720.
- Thread 12 is the thread reading netlink events, waiting at [4] for netlink events.
- Thread 13 is the main thread, waiting for signals [5].
- The other threads seem to be stopped at receive, handling clients (accept_connection, serve_client).

So for the network_caps() hang, it seems the problem is Thread 6 waiting for Thread 12 to put netlink events on the deque for processing.

[1] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/common/logutils.py#L324
[2] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netswitch/configurator.py#L368
[3] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netlink/monitor.py#L126
[4] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netlink/monitor.py#L174
[5] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/supervdsm_server.py#L324
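For readers not familiar with the monitor code referenced in [3] and [4], the pattern is roughly the following. This is a simplified sketch, not the vdsm implementation: one thread reads netlink events from the kernel and appends them to a deque, while consumers such as _wait_for_link_up block until an event arrives or their timeout expires.

import threading
from collections import deque


class EventMonitor(object):
    """Simplified stand-in for the netlink monitor's queue handling."""

    def __init__(self):
        self._events = deque()
        self._cond = threading.Condition()

    def put(self, event):
        # Called by the single netlink-reading thread (cf. [4]).
        with self._cond:
            self._events.append(event)
            self._cond.notify()

    def wait_for_event(self, timeout):
        # Called by consumers such as _wait_for_link_up (cf. [2], [3]):
        # block until the reader thread delivers an event, or give up
        # after `timeout` seconds and return None.
        with self._cond:
            if not self._events:
                self._cond.wait(timeout)
            return self._events.popleft() if self._events else None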
(In reply to Nir Soffer from comment #18)
> > MainProcess::DEBUG::2019-01-14 21:03:01,323::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 9215 seconds (max pending: 24)
>
> We log this every 60 seconds, so this means that the logging handler
> thread writing logs to storage was blocked for 9215 seconds.

I need to correct this. We do not log every 60 seconds. We log while handling log messages, but not more often than every 60 seconds. What this log means is that no log message was handled by the logger in the last 9215 seconds.

Checking supervdsm on my test system:

$ grep 'ThreadedHandler is ok in the last' /var/log/vdsm/supervdsm.log | tail
MainProcess|jsonrpc/0::DEBUG::2019-02-10 16:40:01,967::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 147 seconds (max pending: 1)
MainProcess|jsonrpc/3::DEBUG::2019-02-10 17:32:59,024::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 3177 seconds (max pending: 3)
MainThread::DEBUG::2019-02-10 17:49:42,862::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 1003 seconds (max pending: 3)
MainProcess|jsonrpc/7::DEBUG::2019-02-10 17:51:47,926::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 123 seconds (max pending: 19)
MainProcess|vm/eb2789af::DEBUG::2019-02-10 17:54:18,554::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 150 seconds (max pending: 3)
MainProcess|jsonrpc/5::DEBUG::2019-02-10 18:07:34,395::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 795 seconds (max pending: 2)
MainProcess|jsonrpc/7::DEBUG::2019-02-10 18:09:58,741::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 144 seconds (max pending: 2)
MainProcess|jsonrpc/5::DEBUG::2019-02-10 18:13:07,719::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 188 seconds (max pending: 1)
MainProcess|dhclient-monitor::DEBUG::2019-02-10 23:32:09,682::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 19141 seconds (max pending: 2)
MainProcess|libvirt/events::DEBUG::2019-02-11 01:45:00,431::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 7970 seconds (max pending: 2)

Supervdsm is running without any issue on this system, so I don't think this log alone can tell us whether the supervdsm process is stuck. However, blocked workers waiting on supervdsm calls, combined with a long delay between log messages, do likely indicate a stuck supervdsm process.
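In other words, the interval in the message is just the time since the previous report, and a report is only emitted while messages are flowing. A rough sketch of that rate-limited reporting logic (simplified, not the actual vdsm logutils code; the 60-second constant is taken from the comment above):

import time

REPORT_INTERVAL = 60.0  # report at most once per minute


class ReportingHandler(object):
    def __init__(self):
        self._last_report = time.time()
        self._max_pending = 0

    def handle(self, record, pending):
        self._max_pending = max(self._max_pending, pending)
        now = time.time()
        elapsed = now - self._last_report
        # The report is emitted only while records are being handled, so a
        # very large `elapsed` only means no record was handled in that
        # period, not necessarily that the handler itself was stuck.
        if elapsed >= REPORT_INTERVAL:
            print("ThreadedHandler is ok in the last %d seconds "
                  "(max pending: %d)" % (elapsed, self._max_pending))
            self._last_report = now
            self._max_pending = 0
        # ... actually write `record` to the log file here ...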
We should analyze the second core, which I am uploading shortly. From the logs and the capture time, this one is right in the middle of the hang.

Why do we have a bunch of threads trying to get a netlink socket from the pool while no other thread seems to be doing work that would eventually put a socket back into the pool?

(gdb) info threads
  Id   Target Id         Frame
  13   Thread 0x7fb572e39740 (LWP 4254) 0x00007fb571c5ce9d in poll () at ../sysdeps/unix/syscall-template.S:81
  12   Thread 0x7fb53a7fc700 (LWP 43762) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb52400eb40) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  11   Thread 0x7fb53affd700 (LWP 43205) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb530011350) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  10   Thread 0x7fb53b7fe700 (LWP 43010) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb52c0064d0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  9    Thread 0x7fb53bfff700 (LWP 42918) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb534001000) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  8    Thread 0x7fb5589ec700 (LWP 42725) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb5400013e0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  7    Thread 0x7fb5591ed700 (LWP 42688) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb53c004a70) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  6    Thread 0x7fb5599ee700 (LWP 40323) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb548006e10) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  5    Thread 0x7fb55a9f0700 (LWP 34373) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb550011710) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  4    Thread 0x7fb55a1ef700 (LWP 21323) 0x00007fb57264d7ad in read () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7fb55b1f1700 (LWP 20993) 0x00007fb57264d7ad in read () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7fb55baf2700 (LWP 4395) 0x00007fb57264da2d in accept () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7fb568e02700 (LWP 4299) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb564000f60) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43

 1 : logging, waiting on semaphore 0x7fb564000f60
 2 : handling client
 3 : handling client
 4 : handling client
 5 : network_caps() stuck at _get_link (netdev eno3)
 6 : network_caps() stuck at _get_link (netdev eno1)
 7 : network_caps() stuck at _get_link (netdev eno2)
 8 : network_caps() stuck at _get_link (netdev eno1)
 9 : network_caps() stuck at getIpAddrs
10 : network_caps() stuck at _get_link (netdev eno2)
11 : network_caps() stuck at _get_link (eno2)
12 : network_caps() stuck at _get_link (netdev eno1)
13 : main thread

Threads 5 to 12 are stuck trying to acquire the semaphore, waiting to get a netlink socket from the pool:
https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netlink/__init__.py#L44

So the semaphore is currently at 0 with a bunch of threads waiting for it. The initial value was 5.

#11 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7fb550012170, for file /usr/lib64/python2.7/threading.py, line 471, in acquire (self=<_BoundedSemaphore(_Verbose__verbose=False, _Semaphore__value=0, _Semaphore__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fb572dccc70>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb572dccc70>, _Condition__waiters=[<thread.lock at remote 0x7fb55b1f55f0>, <thread.lock at remote 0x7fb55b1f57b0>, <thread.lock at remote 0x7fb55b1f5730>, <thread.lock at remote 0x7fb55b20f3b0>, <thread.lock at remote 0x7fb55b20fb50>, <thread.lock at remote 0x7fb55b1f5810>, <thread.lock at remote 0x7fb55b1f58b0>, <thread.lock at remote 0x7fb55b20f150>], release=<built-in method release of thread.lock object at remote 0x7fb572dccc70>) at remote 0x7fb56b85eed0>, _initial_value=5) at remote 0x7fb56b85ee10>, blocking=1, rc=False), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040

#18 0x00007fb57293a45b in PyEval_EvalFrameEx ( f=f@entry=Frame 0x1b40bd0, for file /usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py, line 45, in socket (self=<NLSocketPool(_semaphore=<_BoundedSemaphore(_Verbose__verbose=False, _Semaphore__value=0, _Semaphore__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fb572dccc70>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb572dccc70>, _Condition__waiters=[<thread.lock at remote 0x7fb55b1f55f0>, <thread.lock at remote 0x7fb55b1f57b0>, <thread.lock at remote 0x7fb55b1f5730>, <thread.lock at remote 0x7fb55b20f3b0>, <thread.lock at remote 0x7fb55b20fb50>, <thread.lock at remote 0x7fb55b1f5810>, <thread.lock at remote 0x7fb55b1f58b0>, <thread.lock at remote 0x7fb55b20f150>], release=<built-in method release of thread.lock object at remote 0x7fb572dccc70>) at remote 0x7fb56b85eed0>, _initial_value=5) at remote 0x7fb56b85ee10>, _sockets=<Queue(unfinished_tasks=543, queue=<collections.deque at remote 0x7fb56b85a210>, maxsize=5, all_tasks_done=<_Condition...(truncated), throwflag=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2928

This looks like a synchronization problem...
Hmmm, I think the problem is *reentrancy* in _get_link while losing the reference to the already acquired socket.

Here, in frame #31, _get_link is called with sock=None, so it tries to grab another socket. But up the stack, in frame #52, there is an outer get_link, and that one already has a socket!

#31 0x00007fb57293a45b in PyEval_EvalFrameEx ( f=f@entry=Frame 0x7fb53c00b7b0, for file /usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py, line 137, in _get_link (name=None, index=24, sock=None, NLE_NODEV=31), throwflag=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2928

#52 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7fb53c0074e0, for file /usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py, line 37, in get_link (name='eno2', sock=140416439554096, link=<c_void_p at remote 0x7fb570bd6830>), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040

It looks like we try to get a second socket by re-entering get_link through the following call chain: get_link -> _link_info -> _link_index_to_name -> _get_link. Given enough threads like this, we get stuck waiting for more sockets, as each of these threads needs 2 sockets to complete, not just 1.

Does this make sense?
def get_link(name):
    """Returns the information dictionary of the name specified link."""
    with _pool.socket() as sock:
        with _get_link(name=name, sock=sock) as link:
            if not link:
                raise IOError(errno.ENODEV,
                              '%s is not present in the system' % name)
            link_info = _link_info(link)  # <------ this can reach another
                                          # get_link through _link_info and
                                          # _link_index_to_name, acquiring a
                                          # second sock instead of re-using
                                          # the one above.
        return link_info
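For what it is worth, this failure mode is easy to reproduce outside vdsm. The sketch below is illustrative only (the class and helper names mimic the vdsm ones but are not the real implementation): a bounded pool plus a nested acquisition that ignores the already-held resource deadlocks as soon as there are more concurrent callers than pool slots.

import threading
import time
from contextlib import contextmanager

POOL_SIZE = 5  # same as the netlink socket pool's initial semaphore value


class SocketPool(object):
    """Bounded pool: socket() blocks while all slots are taken."""

    def __init__(self, size):
        self._semaphore = threading.BoundedSemaphore(size)

    @contextmanager
    def socket(self):
        self._semaphore.acquire()
        try:
            yield object()  # stand-in for a real netlink socket
        finally:
            self._semaphore.release()


_pool = SocketPool(POOL_SIZE)


def get_link(name):
    # Outer acquisition, like the get_link() above.
    with _pool.socket():
        time.sleep(0.5)  # widen the race window so the deadlock is reliable
        return _link_info(name)


def _link_info(name):
    # Nested acquisition, standing in for _link_index_to_name() calling
    # _get_link(sock=None): the already-held socket is not reused, so every
    # caller needs two pool slots to finish.
    with _pool.socket():
        return {'name': name}


# With more concurrent callers than pool slots, all slots end up held by
# threads that are each waiting for a *second* slot, so nobody can finish.
threads = [threading.Thread(target=get_link, args=('eno%d' % i,))
           for i in range(POOL_SIZE + 3)]
for t in threads:
    t.daemon = True
    t.start()
for t in threads:
    t.join(2)
print('threads still stuck: %d' % sum(t.is_alive() for t in threads))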
Thank you guys. This should fix it: https://bugzilla.redhat.com/show_bug.cgi?id=1666123
Hi Nir,

Could you please check with Michael about his concerns regarding adding this code? I believe there's a good reason for his nacking this.
Let's check if this is a candidate for 4.2.8-3
We are running a full regression now.
Verified on - vdsm-4.20.49-3.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1261