Description of problem:

It would be nice if the executor's _check_task did a thread dump of the blocked task, so we can have a better idea of why it is blocked.

For example, the log below is useful, but it doesn't tell where HostMonitor is blocked:

2017-10-06 12:29:40,618-0400 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7fbb74011790> at 0x7fbb740117d0> timeout=15, duration=2025 at 0x7fbb74011b10> task#=1 at 0x7fbb7401d950> (executor:342)

HostMonitor does a ton of stuff, and from the log above it's not immediately clear where it is blocked. Perhaps something like sys._current_frames() could help.
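For illustration, a minimal sketch of the idea (not vdsm's actual implementation; the worker attribute mentioned in the comment is an assumption): sys._current_frames() maps each thread ident to its currently executing frame, and traceback.format_stack() can render that frame as a readable stack.

import sys
import traceback

def format_thread_stack(thread_ident):
    """Return the current Python stack of the thread with the given ident,
    or a placeholder if the thread is gone."""
    frame = sys._current_frames().get(thread_ident)
    if frame is None:
        return "<thread not found>"
    return "".join(traceback.format_stack(frame))

# In _check_task() the executor could then log something like:
#   log.warning("Worker blocked: %s, traceback:\n%s",
#               worker, format_thread_stack(worker.thread_ident))
# where worker.thread_ident is a hypothetical attribute recorded by the
# worker thread when it starts.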
(In reply to Germano Veit Michel from comment #0)
> Description of problem:
>
> It would be nice if the executor's _check_task did a thread dump of the
> blocked task, so we can have a better idea of why it is blocked.
>
> For example, the log below is useful, but it doesn't tell where HostMonitor
> is blocked:
>
> 2017-10-06 12:29:40,618-0400 WARN (vdsm.Scheduler) [Executor] Worker
> blocked: <Worker name=periodic/1 running <Task <Operation
> action=<vdsm.virt.sampling.HostMonitor object at 0x7fbb74011790> at
> 0x7fbb740117d0> timeout=15, duration=2025 at 0x7fbb74011b10> task#=1 at
> 0x7fbb7401d950> (executor:342)
>
> HostMonitor does a ton of stuff, and from the log above it's not immediately
> clear where it is blocked. Perhaps something like sys._current_frames()
> could help.

In general I'm always happy to improve the debuggability of Vdsm; I spent some time playing with an idea much like this over the past few months. It would help indeed. My only concern is whether we can do this in a safe way: threading in Python is a delicate matter, full of pitfalls.
The feature is definitely useful, but as discussed offline, threading in Python is very tricky. Targeting 4.3 for now; once we have a fix we will evaluate the possibility of a backport.
It is easy to do; we have an example here:
https://github.com/ionelmc/python-manhole/blob/master/src/manhole/__init__.py#L618

I'm not sure we want a multi-line traceback on each warning; it may be too noisy, and I'm not sure the log is needed for debugging. You can use gdb to extract a Python traceback from a running vdsm process using py-bt.

If we see that this log is too noisy, we can enable it with a configuration flag.
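For reference, a rough example of the gdb approach mentioned above, assuming the python debuginfo package is installed so that gdb loads the py-bt helper; "thread apply all py-bt" prints the Python-level stack of every thread in the process:

gdb -p <vdsm-pid>
(gdb) thread apply all py-bt
(gdb) detach
(gdb) quit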
Hi, I can print tracebacks for threads, and the output looks like the print below (https://gerrit.ovirt.org/#/c/82876/). It is quite long, so I guess you would say: print only the thread ID that relates to the worker. I wonder how to correlate our task_number with the actual thread ID from the thread frame; it sounds like I need to keep the ident as part of the worker for that. Francesco, am I right? Maybe it's already there and I missed it.

---------------------
  return _libpthread.pthread_cond_wait(self._cond, m.mutex())

######### ProcessID=8101, ThreadID=139943359203136 #########
File: "/usr/share/vdsm/vdsmd", line 12, in <module>
  vdsmd.main()
File: "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 267, in main
  run()
File: "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 155, in run
  serve_clients(log)
File: "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 116, in serve_clients
  sigutils.wait_for_signal()
File: "/usr/lib/python2.7/site-packages/vdsm/common/sigutils.py", line 89, in wait_for_signal
  cleanup = [] != _signal_poller.poll(timeout)

######### ProcessID=8101, ThreadID=139941401044736 #########
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 233, in process_requests
  count=1,
File: "/usr/lib64/python2.7/asyncore.py", line 220, in loop
  poll_fun(timeout, map)
File: "/usr/lib64/python2.7/asyncore.py", line 192, in poll2
  r = pollster.poll(timeout)
#############################################
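One possible way to do the correlation (a sketch only; the attribute and method names here are assumptions, not the actual patch): have the worker thread record its own ident when it starts, and let _check_task use that ident to pick the matching entry from sys._current_frames():

import sys
import threading
import traceback

class Worker(object):
    def __init__(self, name):
        self.name = name
        self.thread_ident = None  # filled in by the worker thread itself

    def _run(self):
        # First thing the worker thread does: remember its own ident so the
        # scheduler thread can look up this worker's frame later.
        self.thread_ident = threading.current_thread().ident
        # ... then run tasks as usual ...

def format_worker_traceback(worker):
    frame = sys._current_frames().get(worker.thread_ident)
    if frame is None:
        return "<no traceback available>"
    return "".join(traceback.format_stack(frame))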
Can you provide reproduction steps please?
Change the log level to INFO (in /etc/vdsm/logger.conf) and block one of the storage domains; this will also get the monitoring worker stuck. After a short period you'll see messages in this format:

2018-02-06 17:40:21,351+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 waiting task#=1 at 0x2377310>, traceback: ...

Thanks
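For reference, the log level change in /etc/vdsm/logger.conf is roughly the following; this is only a sketch, and the exact section names may differ between versions, so adjust the section that already exists in the file:

[logger_root]
level=INFO
; keep the existing "handlers=" line unchanged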
ok, vdsm-4.20.19-1.el7ev.x86_64

2018-02-26 16:10:46,851+0100 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f1dd4633b90> at 0x7f1dd4633bd0> timeout=15, duration=15 at 0x7f1dd4633ed0> task#=33 at 0x7f1dd45cae90>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats
  avail, commit = _memUsageInfo(cif)
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo
  mem_info = v.memory_info()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 827, in memory_info
  dom_stats = self._dom.memoryStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1510, in memoryStats
  ret = libvirtmod.virDomainMemoryStats(self._o) (executor:363)

2018-02-26 16:10:50,118+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/str-03s1.example.com:_jbelka_jbelka-engine-rhv42-1/161d57c4-d159-4ca9-b5b6-c81ab55bf9e0/dom_md/metadata' is blocked for 20.01 seconds (check:278)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489