Bug 1666123 - supervdsm did not return network_caps() and stopped logging
Summary: supervdsm did not return network_caps() and stopped logging
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.7
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: ovirt-4.2.10
Assignee: Edward Haas
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1653258
Blocks:
 
Reported: 2019-01-15 00:30 UTC by Germano Veit Michel
Modified: 2019-09-11 10:03 UTC (History)
20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-23 11:31:44 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3813511 0 None None None 2019-01-17 06:54:50 UTC
Red Hat Product Errata RHBA-2019:1261 0 None None None 2019-05-23 11:31:46 UTC
oVirt gerrit 97731 0 ovirt-4.2 MERGED net: Avoid deadlock when using the netlink package 2021-01-22 19:10:04 UTC
oVirt gerrit 97774 0 ovirt-4.2 MERGED net: Avoid deadlock when using the netlink package 2021-01-22 19:10:04 UTC

Description Germano Veit Michel 2019-01-15 00:30:09 UTC
Description of problem:

Customer was testing network outages by pulling network cables. A cable was unplugged and the host was fenced by power management.

After the host was restarted the network came up, but supervdsm never returned network_caps(). There is a large gap in its logs: supervdsm appears to have hung shortly after initialization and stopped answering vdsm requests. The problem was not resolved until vdsm/supervdsm were restarted. It looks like supervdsm simply locked up, or at least some thread (and its logging) did.

Supervdsm seems to have run the usual 'tc qdisc' and 'ovs' commands successfully, but it never returned network_caps() to vdsm, and shortly afterwards it stopped logging altogether. See details below.

1. After the network test, the host comes back up:

2019-01-14 18:26:50,843Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-280398) [606d138b] EVENT_ID: FENCE_OPERATION_SUCCEEDED(9,018), Power management stop of Host <removed> succeeded.

2019-01-14 18:27:07,322Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-280398) [] EVENT_ID: FENCE_OPERATION_SUCCEEDED(9,018), Power management start of Host <removed> succeeded.

2. Host is activated

2019-01-14 18:29:55,222Z INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (default task-21) [9990796c-7b4b-4a4b-b6b2-ba23dfbec2c8] Lock Acquired to object 'EngineLock:{exclusiveLocks='[7e78b3d7-dab1-424e-ad95-ab587f04c5cb=VDS]', sharedLocks=''}'

2019-01-14 18:29:55,380Z INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to <removed>/<removed>

3. But GetCaps fails over and over, as below:

2019-01-14 18:36:01,496Z ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-29) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80}


On the host side...

4. Host boots:

Jan 14 18:29:06 wuv-ishypv001b kernel: Linux version 3.10.0-862.11.6.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Aug 10 16:55:11 UTC 2018

5. Network is up

Jan 14 18:29:32 wuv-ishypv001b kernel: bnx2x 0000:5c:00.1 eno4: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Jan 14 18:29:32 wuv-ishypv001b kernel: bnx2x 0000:5c:00.0 eno3: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Jan 14 18:29:33 wuv-ishypv001b kernel: i40e 0000:14:00.1 eno2: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
Jan 14 18:29:33 wuv-ishypv001b kernel: i40e 0000:14:00.0 eno1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None

6. VDSM starts

2019-01-14 18:29:42,463+0000 INFO  (MainThread) [vds] (PID: 21673) I am the actual vdsm 4.20.39.1-1.el7ev <removed> (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)

7. Here comes the first GetCaps

2019-01-14 18:29:55,852+0000 INFO  (jsonrpc/7) [api.host] START getCapabilities() from=::ffff:<removed>,43994 (api:46)

8. But it is blocked waiting on supervdsm:

2019-01-14 18:30:55,479+0000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'2b2c13c7-d548-4238-bb3e-bd77b93530d0'} at 0x7f235c2e9550> timeout=60, duration=60 at 0x7f235c2e9590> task#=2 at 0x7f235c3a1090>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities
  c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
  net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)

9. In the supervdsm logs we see the request:

MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,084::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}

MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,127::routes::109::root::(get_gateway) The gateway <removed> is duplicated for the device ovirtmgmt
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,151::cmdutils::150::root::(exec_cmd) /sbin/tc qdisc show (cwd None)
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,160::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,911::cmdutils::150::root::(exec_cmd) /usr/share/openvswitch/scripts/ovs-ctl status (cwd None)
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,937::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,941::vsctl::67::root::(commit) Executing commands: /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,944::cmdutils::150::root::(exec_cmd) /usr/bin/ovs-vsctl --timeout=5 --oneline --format=json -- list Bridge -- list Port -- list Interface (cwd None)
MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,998::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0

10. But there is no return of network_caps in the supervdsm logs, even though vdsm was clearly waiting for the response.

11. Also note the time gap in the supervdsm logs shortly after the network_caps call that never returned: 2h30min without a single log entry.

MainProcess|jsonrpc/7::DEBUG::2019-01-14 18:29:57,091::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess::DEBUG::2019-01-14 21:03:01,322::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call multipath_status with () {}

12. It also took a while for all the network_caps requests to show up; they appear to have piled up right from the start.

MainProcess|jsonrpc/0::DEBUG::2019-01-14 18:29:56,084::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/1::DEBUG::2019-01-14 18:29:56,085::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2019-01-14 18:29:56,086::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2019-01-14 18:29:56,087::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/6::DEBUG::2019-01-14 18:29:56,088::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2019-01-14 18:29:56,089::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/7::DEBUG::2019-01-14 18:29:56,089::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/5::DEBUG::2019-01-14 18:29:56,099::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}

13. None of these ever returned. It seems supervdsm was hung until 21:03:01, when it "resumed".

14. Note the 9215 seconds reported shortly after it started working again:

MainProcess::DEBUG::2019-01-14 21:03:01,323::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 9215 seconds (max pending: 24)

Version-Release number of selected component (if applicable):
vdsm-4.20.39.1-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64

How reproducible:
Unknown

Comment 6 Dominik Holler 2019-01-15 11:20:01 UTC
What happens on
for i in seq 20; do vdsm-cli Host getCapabilities & done

Comment 8 Dominik Holler 2019-01-15 13:52:45 UTC
(In reply to Dominik Holler from comment #6)
> What happens on
> for i in seq 20; do vdsm-cli Host getCapabilities & done

vdsm-4.20.39.1-1.el7.x86_64 is not confused by
for i in $(seq 100) ; do (vdsm-client Host getCapabilities  &)  ;  done

Comment 9 Neil Garrett 2019-01-15 21:35:14 UTC
Customer was able to reproduce with the following steps:

- Network removed at 12:57:05
- Network reconnected at 13:27:34 -- Note that the host was fenced and booted while the network was still disconnected, as in the first test run. After the network was reconnected, the host did not try to acquire an IP through DHCP; I assume this is because the interface is not managed by NetworkManager, which needs to be investigated.
- Host restarted manually at 13:34:40 via power management and came back from boot in a Non Responsive state.
- Too many tasks error at 13:46:48

Will attach latest LogCollectors from customer reproducer in case they are helpful.

Comment 10 Germano Veit Michel 2019-01-16 02:07:20 UTC
Yup, the customer reproduced it again.

Slightly different this time, as network_caps() worked for a while. But then the avalanche of network_caps() calls arrives and supervdsm seems to hang.

2019-01-15 13:37:44,499+0000 INFO  (MainThread) [vds] (PID: 21472) I am the actual vdsm 4.20.39.1-1.el7ev <removed> (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)

works for a while...

2019-01-15 13:37:59,240+0000 INFO  (jsonrpc/1) [api.host] START getCapabilities() from=::ffff:127.0.0.1,44464 (api:46)
2019-01-15 13:38:00,629+0000 INFO  (jsonrpc/1) [api.host] FINISH getCapabilities return

blocked:

2019-01-15 13:41:12,210+0000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities
  c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
  net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)

supervdsm received a ton of calls at once, like before, and didn't return anything:

MainProcess|jsonrpc/1::DEBUG::2019-01-15 13:40:12,206::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2019-01-15 13:40:12,207::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/5::DEBUG::2019-01-15 13:40:12,208::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/0::DEBUG::2019-01-15 13:40:12,210::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/6::DEBUG::2019-01-15 13:40:12,211::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/7::DEBUG::2019-01-15 13:40:12,214::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2019-01-15 13:40:12,215::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2019-01-15 13:40:12,215::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}

vdsm logs for those:

2019-01-15 13:40:12,206+0000 INFO  (jsonrpc/1) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,206+0000 INFO  (jsonrpc/3) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,208+0000 INFO  (jsonrpc/5) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,210+0000 INFO  (jsonrpc/0) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,210+0000 INFO  (jsonrpc/4) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,210+0000 INFO  (jsonrpc/6) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,214+0000 INFO  (jsonrpc/7) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)
2019-01-15 13:40:12,215+0000 INFO  (jsonrpc/2) [api.host] START getCapabilities() from=::ffff:<removed>,54396 (api:46)

Last line in supervdsm log is:

MainProcess|jsonrpc/3::DEBUG::2019-01-15 13:40:13,111::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0

While the date is:

$ cat date 
Tue Jan 15 13:51:43 GMT 2019

So it seems that when the sosreport was taken, supervdsm had been hung for about 10 minutes. Interestingly, the getCapabilities calls initiated locally (127.0.0.1) were returned, while the ones from the engine hung. This is hard to confirm, as supervdsm does not seem to be aware of where vdsm received the call from. It might just be a coincidence.

Finally, note that after the restart there were also some tasks blocked on hosted-engine-ha:

2019-01-15 13:38:19,404+0000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f71dcef5290> at 0x7f71dcef52d0> timeout=15, duration=15 at 0x7f72200342d0> task#=3 at 0x7f71dcda6710>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats
  ret['haStats'] = _getHaInfo()
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
  stats = instance.get_all_stats()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
  stats = broker.get_stats_from_storage()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
  result = self._proxy.get_stats()
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
  return self.__send(self.__name, args)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
  verbose=self.__verbose
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
  return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
  response = h.getresponse(buffering=True)
File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
  response.begin()
File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
  version, status, reason = self._read_status()
File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
  line = self.fp.readline(_MAXLINE + 1)
File: "/usr/lib64/python2.7/socket.py", line 476, in readline
  data = self._sock.recv(self._rbufsize) (executor:363)

The ha-broker seems to be leaking FDs and is therefore hitting "Too many open files" in get_stats_from_storage(). I'll open another bug for that.

Comment 11 Germano Veit Michel 2019-01-16 02:27:16 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1666553

Comment 18 Nir Soffer 2019-01-23 17:14:02 UTC
> MainProcess::DEBUG::2019-01-14 21:03:01,323::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 9215 seconds (max pending: 24)

We log this every 60 seconds, so this means that the logging handler
thread writing logs to storage was blocked for 9215 seconds.

max_pending: 24 means that at that interval there were only up to 24
messages in the queue, so we can be pretty sure that the entire process
was blocked for this duration.

This usually happens when we call a C extension (possibly a function from the standard library) and the C code does not release the GIL while calling out to a blocking syscall, which then remains blocked, here for about 9215 seconds.

If you can reproduce this, please provide a core dump of the supervdsm process while it is in this state.
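
For illustration, here is a tiny standalone demonstration (not vdsm code) of that failure mode. It assumes Linux with glibc available as 'libc.so.6', and uses ctypes.PyDLL precisely because it keeps the GIL held during the call; while the C call is blocked, every other Python thread in the process freezes, exactly like the quiet supervdsm logs above.

import ctypes
import threading
import time

# PyDLL deliberately does NOT release the GIL around calls (unlike CDLL).
libc_holding_gil = ctypes.PyDLL('libc.so.6')

ticks = []

def ticker():
    for _ in range(8):
        ticks.append(time.time())
        time.sleep(0.5)

t = threading.Thread(target=ticker)
t.start()
time.sleep(1)                 # let the ticker run normally for a moment
libc_holding_gil.sleep(3)     # C call holding the GIL: every other thread freezes
t.join()

# The gaps between ticks show a roughly 3 second hole while the GIL was held.
print([round(b - a, 2) for a, b in zip(ticks, ticks[1:])])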

Comment 23 Nir Soffer 2019-01-28 14:58:28 UTC
The long term solution for such issues is to use the systemd watchdog feature for supervdsmd. If supervdsmd gets stuck for a long time, systemd should kill it, creating a core dump.

This would both recover the system and reveal the reason for the blockage in the core dump. In the worst case supervdsmd is in D state and cannot be killed; there is nothing we can do in that case.

We have bug 1522916 for vdsm, I think we can add supervdsm to the same bug.
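
To make the idea concrete, a minimal sketch of a watchdog-aware main loop in Python, using the python-systemd bindings (systemd.daemon.notify). This is illustration only, not how vdsm actually integrates with systemd; serve_one_request and its timeout parameter are hypothetical placeholders.

import os

from systemd import daemon   # python-systemd bindings


def serve_forever(serve_one_request):
    # WATCHDOG_USEC is exported by systemd when WatchdogSec= is set in the unit.
    usec = os.environ.get('WATCHDOG_USEC')
    period = int(usec) / 1e6 if usec else None
    while True:
        # The loop must not block longer than the watchdog period; here each
        # iteration is bounded to half of it (hypothetical parameter).
        serve_one_request(timeout=period and period / 2)
        if period:
            # If this loop gets stuck (as supervdsm did here), the pings stop,
            # systemd kills and restarts the service, and with suitable unit
            # settings a core dump is produced for analysis.
            daemon.notify('WATCHDOG=1')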

Comment 28 Germano Veit Michel 2019-02-08 04:51:24 UTC
I had to stop halfway through to work on another case; this is what I found so far, and I can continue on Monday.

(gdb) info threads
  Id   Target Id         Frame 
  13   Thread 0x7fa8149df740 (LWP 4430) 0x00007fa813802e9d in poll () at ../sysdeps/unix/syscall-template.S:81
  12   Thread 0x7fa7d37fe700 (LWP 62937) 0x00007fa81380e113 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
  11   Thread 0x7fa7d3fff700 (LWP 51738) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  10   Thread 0x7fa7f4ff9700 (LWP 51336) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  9    Thread 0x7fa7f57fa700 (LWP 50685) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  8    Thread 0x7fa7f5ffb700 (LWP 50540) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  7    Thread 0x7fa7f67fc700 (LWP 50514) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  6    Thread 0x7fa7f6ffd700 (LWP 50432) 0x00007fa8141f2aab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fa7e8002720)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  5    Thread 0x7fa7f77fe700 (LWP 50314) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7fa7f7fff700 (LWP 50241) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7fa7fcd97700 (LWP 49929) 0x00007fa8141f37ad in read () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7fa7fd698700 (LWP 4574) 0x00007fa8141f3a2d in accept () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7fa80a9a8700 (LWP 4476) 0x00007fa8141f2aab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fa804000f60)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
(gdb) 

- Thread 1 is the logger and is waiting at [1].
- Thread 6 is running the hung network_caps() requested by a client and is stuck in _wait_for_link_up [2] for device eno1 with a timeout of 2.
  Further up the stack this thread is waiting at [3], i.e. waiting to retrieve something from a collections.deque that is fed with netlink events from the kernel.
  This ends up in a semaphore at 0x7fa7e8002720.
- Thread 12 is the thread reading netlink events, waiting at [4] for new events.
- Thread 13 is the main thread, waiting for signals [5].
- The other threads appear to be stopped in recv/read, handling clients (accept_connection, serve_client).

So for the network_caps() hang, it seems the problem is Thread 6 waiting for Thread 12 to put netlink events on the deque for processing.

[1] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/common/logutils.py#L324
[2] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netswitch/configurator.py#L368
[3] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netlink/monitor.py#L126
[4] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netlink/monitor.py#L174
[5] https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/supervdsm_server.py#L324
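
For readers not familiar with that code, a heavily simplified, hypothetical sketch of the pattern described above (not the real vdsm monitor): one reader thread feeds netlink events into a queue, and callers block on the queue until an event arrives or a timeout expires. If the reader stops producing, or the event the caller needs never arrives, the caller just sits in that wait.

try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2

import threading


class ToyNetlinkMonitor(object):
    """Toy model of the reader-thread/queue shape seen in the gdb output."""

    def __init__(self, read_one_event):
        self._read_one_event = read_one_event   # blocking read of one netlink event
        self._events = queue.Queue()
        self._reader = threading.Thread(target=self._reader_loop)
        self._reader.daemon = True
        self._reader.start()

    def _reader_loop(self):
        # Thread 12 in the analysis above.
        while True:
            self._events.put(self._read_one_event())

    def wait_for_event(self, timeout):
        # Thread 6 in the analysis above: blocks here until an event shows up
        # or the timeout expires.
        try:
            return self._events.get(timeout=timeout)
        except queue.Empty:
            return None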

Comment 35 Nir Soffer 2019-02-11 00:17:24 UTC
(In reply to Nir Soffer from comment #18)
> > MainProcess::DEBUG::2019-01-14 21:03:01,323::logutils::317::root::(_report_stats) ThreadedHandler is ok in the last 9215 seconds (max pending: 24)
> 
> We log this every 60 seconds, so this means that the logging handler
> thread writing logs to storage was blocked for 9215 seconds.

I need to correct this.

We do not log every 60 seconds. We log when handling log messages, but not more
than every 60 seconds.

What this log means is that no log was handled by the logger in the last 9215
seconds.

Checking supervdsm on my test system:

$ grep 'ThreadedHandler is ok in the last' /var/log/vdsm/supervdsm.log | tail
MainProcess|jsonrpc/0::DEBUG::2019-02-10 16:40:01,967::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 147 seconds (max pending: 1)
MainProcess|jsonrpc/3::DEBUG::2019-02-10 17:32:59,024::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 3177 seconds (max pending: 3)
MainThread::DEBUG::2019-02-10 17:49:42,862::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 1003 seconds (max pending: 3)
MainProcess|jsonrpc/7::DEBUG::2019-02-10 17:51:47,926::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 123 seconds (max pending: 19)
MainProcess|vm/eb2789af::DEBUG::2019-02-10 17:54:18,554::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 150 seconds (max pending: 3)
MainProcess|jsonrpc/5::DEBUG::2019-02-10 18:07:34,395::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 795 seconds (max pending: 2)
MainProcess|jsonrpc/7::DEBUG::2019-02-10 18:09:58,741::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 144 seconds (max pending: 2)
MainProcess|jsonrpc/5::DEBUG::2019-02-10 18:13:07,719::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 188 seconds (max pending: 1)
MainProcess|dhclient-monitor::DEBUG::2019-02-10 23:32:09,682::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 19141 seconds (max pending: 2)
MainProcess|libvirt/events::DEBUG::2019-02-11 01:45:00,431::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 7970 seconds (max pending: 2)

Supervdsm is running without any issue on this system, so I don't think this log line is useful for telling whether the supervdsm process is stuck.

However, blocked workers waiting on supervdsm calls, combined with a long delay between log messages, most likely do indicate a stuck supervdsm process.
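
To make the corrected behavior concrete, here is a toy sketch (hypothetical code, not the actual vdsm logutils implementation) of the reporting rule described above: stats are only emitted while a log message is being handled, and at most once per 60 seconds, so a long quiet period shows up afterwards as one report covering the whole gap.

import time


class ThrottledStatsReporter(object):

    def __init__(self, interval=60):
        self._interval = interval
        self._last_report = time.time()
        self._max_pending = 0

    def handle(self, pending_count, emit):
        # Called once per handled log message.
        self._max_pending = max(self._max_pending, pending_count)
        now = time.time()
        if now - self._last_report >= self._interval:
            emit("ThreadedHandler is ok in the last %d seconds (max pending: %d)"
                 % (int(now - self._last_report), self._max_pending))
            self._last_report = now
            self._max_pending = 0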

Comment 38 Germano Veit Michel 2019-02-13 05:57:23 UTC
We should analyze the second core, which I am uploading shortly. Judging from the logs and the capture time, this one was taken right in the middle of the hang.

Why do we have a bunch of threads trying to get a netlink socket from the pool, while no other thread seems to be doing work that would eventually put a socket back into the pool?

(gdb) info threads
  Id   Target Id         Frame 
  13   Thread 0x7fb572e39740 (LWP 4254) 0x00007fb571c5ce9d in poll () at ../sysdeps/unix/syscall-template.S:81
  12   Thread 0x7fb53a7fc700 (LWP 43762) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb52400eb40)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  11   Thread 0x7fb53affd700 (LWP 43205) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb530011350)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  10   Thread 0x7fb53b7fe700 (LWP 43010) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb52c0064d0)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  9    Thread 0x7fb53bfff700 (LWP 42918) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb534001000)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  8    Thread 0x7fb5589ec700 (LWP 42725) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb5400013e0)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  7    Thread 0x7fb5591ed700 (LWP 42688) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb53c004a70)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  6    Thread 0x7fb5599ee700 (LWP 40323) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb548006e10)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  5    Thread 0x7fb55a9f0700 (LWP 34373) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb550011710)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
  4    Thread 0x7fb55a1ef700 (LWP 21323) 0x00007fb57264d7ad in read () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7fb55b1f1700 (LWP 20993) 0x00007fb57264d7ad in read () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7fb55baf2700 (LWP 4395) 0x00007fb57264da2d in accept () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7fb568e02700 (LWP 4299) 0x00007fb57264caab in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fb564000f60)
    at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43


1  : logging, waiting on semaphore 0x7fb564000f60.
2  : handling client
3  : handling client
4  : handling client
5  : network_caps() stuck at _get_link (netdev eno3)
6  : network_caps() stuck at _get_link (netdev eno1)
7  : network_caps() stuck at _get_link (netdev eno2)
8  : network_caps() stuck at _get_link (netdev eno1)
9  : network_caps() stuck at getIpAddrs
10 : network_caps() stuck at _get_link (netdev eno2)
11 : network_caps() stuck at _get_link (netdev eno2)
12 : network_caps() stuck at _get_link (netdev eno1)
13 : main thread

Threads 5 to 12 are stuck trying to acquire the semaphore, waiting to get a netlink socket from the pool: https://github.com/oVirt/vdsm/blob/ovirt-4.2.6/lib/vdsm/network/netlink/__init__.py#L44

So the semaphore is currently at 0 with a bunch of threads waiting for it. Initial value was 5.

#11 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7fb550012170, for file /usr/lib64/python2.7/threading.py, line 471, in acquire (self=<_BoundedSemaphore(_Verbose__verbose=False, _Semaphore__value=0, _Semaphore__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fb572dccc70>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb572dccc70>, _Condition__waiters=[<thread.lock at remote 0x7fb55b1f55f0>, <thread.lock at remote 0x7fb55b1f57b0>, <thread.lock at remote 0x7fb55b1f5730>, <thread.lock at remote 0x7fb55b20f3b0>, <thread.lock at remote 0x7fb55b20fb50>, <thread.lock at remote 0x7fb55b1f5810>, <thread.lock at remote 0x7fb55b1f58b0>, <thread.lock at remote 0x7fb55b20f150>], release=<built-in method release of thread.lock object at remote 0x7fb572dccc70>) at remote 0x7fb56b85eed0>, _initial_value=5) at remote 0x7fb56b85ee10>, blocking=1, rc=False), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040

#18 0x00007fb57293a45b in PyEval_EvalFrameEx (
    f=f@entry=Frame 0x1b40bd0, for file /usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py, line 45, in socket (self=<NLSocketPool(_semaphore=<_BoundedSemaphore(_Verbose__verbose=False, _Semaphore__value=0, _Semaphore__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fb572dccc70>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb572dccc70>, _Condition__waiters=[<thread.lock at remote 0x7fb55b1f55f0>, <thread.lock at remote 0x7fb55b1f57b0>, <thread.lock at remote 0x7fb55b1f5730>, <thread.lock at remote 0x7fb55b20f3b0>, <thread.lock at remote 0x7fb55b20fb50>, <thread.lock at remote 0x7fb55b1f5810>, <thread.lock at remote 0x7fb55b1f58b0>, <thread.lock at remote 0x7fb55b20f150>], release=<built-in method release of thread.lock object at remote 0x7fb572dccc70>) at remote 0x7fb56b85eed0>, _initial_value=5) at remote 0x7fb56b85ee10>, _sockets=<Queue(unfinished_tasks=543, queue=<collections.deque at remote 0x7fb56b85a210>, maxsize=5, all_tasks_done=<_Condition...(truncated), throwflag=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2928

Looks like a synchronization problem...

Comment 39 Germano Veit Michel 2019-02-13 06:06:53 UTC
Hmmm, I think the problem is *reentrancy* into _get_link while losing the reference to the already acquired socket.

Here, in frame #31, _get_link is called with sock=None, so it tries to grab another socket from the pool.

But further up the stack, in frame #52, there is a get_link call that already holds a socket!

#31 0x00007fb57293a45b in PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7fb53c00b7b0, for file /usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py, line 137, in _get_link (name=None, index=24, sock=None, NLE_NODEV=31), throwflag=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2928

#52 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7fb53c0074e0, for file /usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py, line 37, in get_link (name='eno2', sock=140416439554096, link=<c_void_p at remote 0x7fb570bd6830>), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040

Looks like we try to get a second socket by reentering the link code on the following call chain: get_link -> _link_info -> _link_index_to_name -> _get_link.

Given enough threads like this, we get stuck waiting for sockets, as each of these threads needs 2 sockets to complete, not just 1.
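
The shape of the deadlock is easy to reproduce outside vdsm. The toy program below (illustration only; the names are made up) uses a pool of 5 slots guarded by a BoundedSemaphore, like the netlink socket pool, and 8 workers that each try to hold two slots at once, like the jsonrpc workers above; it hangs just like the threads in the core dump.

import threading

POOL_SIZE = 5     # like the netlink socket pool's initial semaphore value
WORKERS = 8       # like the 8 jsonrpc workers calling network_caps()

_pool = threading.BoundedSemaphore(POOL_SIZE)


def worker():
    with _pool:          # outer acquisition, like get_link() taking a pool socket
        # Once 5 workers hold a slot each, the pool is empty, yet every holder
        # still needs a second slot: nobody can proceed or release.
        with _pool:      # inner acquisition, like the reentered _get_link()
            pass


threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()             # hangs forever once all slots are held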

Does this make sense?

Comment 41 Germano Veit Michel 2019-02-13 06:14:31 UTC
def get_link(name):
    """Returns the information dictionary of the name specified link."""
    with _pool.socket() as sock:
        with _get_link(name=name, sock=sock) as link:
            if not link:
                raise IOError(errno.ENODEV, '%s is not present in the system' %
                              name)
            # This can reach another get_link through _link_info and
            # _link_index_to_name, acquiring a second sock from the pool
            # instead of re-using the one already held above.
            link_info = _link_info(link)
        return link_info
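
One possible direction for a fix, sketched here purely for illustration: thread the already-held socket down the call chain so the nested lookup reuses it instead of going back to the pool. This assumes the internal helpers (_link_info, _link_index_to_name, _get_link) can take an optional sock argument; the actual merged patch ("net: Avoid deadlock when using the netlink package", gerrit 97731/97774) may solve it differently.

def get_link(name):
    """Returns the information dictionary of the name specified link."""
    with _pool.socket() as sock:
        with _get_link(name=name, sock=sock) as link:
            if not link:
                raise IOError(errno.ENODEV, '%s is not present in the system' %
                              name)
            # Hypothetical: reuse the socket we already hold so that
            # _link_info -> _link_index_to_name -> _get_link does not acquire
            # a second one from the pool.
            link_info = _link_info(link, sock=sock)
        return link_info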

Comment 42 Edward Haas 2019-02-13 12:49:20 UTC
Thank you guys.
This should fix it: https://bugzilla.redhat.com/show_bug.cgi?id=1666123

Comment 45 Raz Tamir 2019-02-13 13:36:07 UTC
Hi Nir,

Could you please check with Michael about his concerns with adding this code?
I believe there's a good reason for nacking it.

Comment 49 Dominik Holler 2019-02-14 08:28:25 UTC
Let's check if this is a candidate for 4.2.8-3

Comment 59 Daniel Gur 2019-05-19 10:42:12 UTC
We are running full regression now.

Comment 60 Michael Burman 2019-05-22 13:36:17 UTC
Verified on - vdsm-4.20.49-3.el7ev.x86_64

Comment 62 errata-xmlrpc 2019-05-23 11:31:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1261


Note You need to log in before you can comment on or make changes to this bug.