Bug 1501012

Summary: [RFE] Show thread dump for blocked VDSM workers in the log to show details what exactly is blocked
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: vdsm
Assignee: Dan Kenigsberg <danken>
Status: CLOSED ERRATA
QA Contact: Jiri Belka <jbelka>
Severity: medium
Docs Contact:
Priority: high
Version: 4.1.6
CC: fromani, lsurette, lsvaty, mgoldboi, mperina, nsoffer, pkliczew, srevivo, ycui, ykaul
Target Milestone: ovirt-4.2.2
Keywords: FutureFeature
Target Release: ---
Flags: jbelka: testing_plan_complete-
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 17:52:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2017-10-12 02:33:38 UTC
Description of problem:

It would be nice if the executor's _check_task did a thread dump of the blocked task, so we can get a better idea of why it's blocked.

For example, the log below is useful, but it doesn't tell where HostMonitor is blocked:

2017-10-06 12:29:40,618-0400 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7fbb74011790> at 0x7fbb740117d0> timeout=15, duration=2025 at 0x7fbb74011b10> task#=1 at 0x7fbb7401d950> (executor:342)

HostSample does a lot of work; from the log above it's not immediately clear where it's blocked. Perhaps something like sys._current_frames() could help.
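A minimal sketch of the sys._current_frames() idea, for illustration only: a watchdog looks up the blocked thread's frame by its ident and formats that stack. The format_thread_stack() helper and the toy blocked worker are hypothetical names, not vdsm's actual executor API.

```python
import sys
import threading
import time
import traceback


def format_thread_stack(ident):
    """Return the current stack of the thread with the given ident,
    or None if no such thread exists (hypothetical helper)."""
    frame = sys._current_frames().get(ident)
    if frame is None:
        return None
    return "".join(traceback.format_stack(frame))


stop = threading.Event()


def blocked_worker():
    stop.wait()  # stands in for a stuck sampling call


t = threading.Thread(target=blocked_worker)
t.start()
time.sleep(0.1)  # let the worker reach its blocking call
dump = format_thread_stack(t.ident)
print(dump)  # the dump shows the worker sitting inside stop.wait()
stop.set()
t.join()
```

The dump pinpoints the exact line the worker is blocked on, which is exactly the information missing from the "Worker blocked" warning above.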

Comment 1 Francesco Romani 2017-10-12 08:05:57 UTC
(In reply to Germano Veit Michel from comment #0)
> Description of problem:
> 
> Would be nice if the executor's _check_task did a thread dump of the blocked
> task, so we can have a better idea on why it's blocked.
> 
> For example, the log below is useful, but it doesn't tell where HostMonitor
> is blocked:
> 
> 2017-10-06 12:29:40,618-0400 WARN  (vdsm.Scheduler) [Executor] Worker
> blocked: <Worker name=periodic/1 running <Task <Operation
> action=<vdsm.virt.sampling.HostMonitor object at 0x7fbb74011790> at
> 0x7fbb740117d0> timeout=15, duration=2025 at 0x7fbb74011b10> task#=1 at
> 0x7fbb7401d950> (executor:342)
> 
> HostSample does a ton of stuff, from the logs above it's not immediately
> clear where it's blocked. Perhaps something like sys._current_frames() could
> help.

In general I'm always happy to improve the debuggability of Vdsm.

I spent some time in the past months playing with an idea much like this.
It would indeed help; my only concern is whether we can do this safely.
Threading in Python is a delicate matter, full of pitfalls.

Comment 2 Martin Perina 2017-10-12 08:45:39 UTC
The feature is definitely useful, but as discussed offline, threading in Python is very tricky. Targeting 4.3 for now, but once we have a fix we will evaluate the possibility of a backport.

Comment 3 Nir Soffer 2017-10-17 12:42:22 UTC
It is easy to do; we have an example here:
https://github.com/ionelmc/python-manhole/blob/master/src/manhole/__init__.py#L618

I'm not sure we want a multi-line traceback on each warning; it may be too
noisy, and I'm not sure the log is needed for debugging. You can use gdb to
extract a Python traceback from a running vdsm process using py-bt.

If we see that this log is too noisy we can enable it with a configuration flag.

Comment 4 Yaniv Bronhaim 2017-10-17 13:05:15 UTC
Hi, I can print tracebacks for all threads; the output comes out as the print below (https://gerrit.ovirt.org/#/c/82876/). Quite long.

So I guess you would say: print only the thread that relates to the blocked worker. I wonder how to correlate our task number with the actual thread ID from the thread frame. It sounds like I need to keep the thread ID as part of the worker state. Francesco, am I right? Maybe it's there already and I missed it.


---------------------

  return _libpthread.pthread_cond_wait(self._cond, m.mutex())

######### ProcessID=8101, ThreadID=139943359203136 #########
File: "/usr/share/vdsm/vdsmd", line 12, in <module>
  vdsmd.main()
File: "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 267, in main
  run()
File: "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 155, in run
  serve_clients(log)
File: "/usr/lib/python2.7/site-packages/vdsm/vdsmd.py", line 116, in serve_clients
  sigutils.wait_for_signal()
File: "/usr/lib/python2.7/site-packages/vdsm/common/sigutils.py", line 89, in wait_for_signal
  cleanup = [] != _signal_poller.poll(timeout)

######### ProcessID=8101, ThreadID=139941401044736 #########
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 191, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 233, in process_requests
  count=1,
File: "/usr/lib64/python2.7/asyncore.py", line 220, in loop
  poll_fun(timeout, map)
File: "/usr/lib64/python2.7/asyncore.py", line 192, in poll2
  r = pollster.poll(timeout)
#############################################
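One way to do the correlation asked about above can be sketched as follows: the worker records threading.current_thread().ident when it starts running, and the watchdog then dumps only that thread's frame from sys._current_frames() instead of all threads. The Worker class and dump_stack() method below are hypothetical illustrations, not vdsm's actual executor code.

```python
import sys
import threading
import time
import traceback


class Worker:
    """Sketch: a worker that remembers its own thread ident so a
    watchdog can dump just this thread's stack later."""

    def __init__(self, name):
        self.name = name
        self.ident = None

    def run(self, func):
        # Record the ident of whatever thread executes this worker.
        self.ident = threading.current_thread().ident
        func()

    def dump_stack(self):
        frame = sys._current_frames().get(self.ident)
        if frame is None:
            return "<no frame for worker %s>" % self.name
        return "".join(traceback.format_stack(frame))


stop = threading.Event()
w = Worker("periodic/1")
t = threading.Thread(target=w.run, args=(stop.wait,))
t.start()
time.sleep(0.1)  # let the worker reach its blocking call
dump = w.dump_stack()  # this worker's stack only, not all threads
print(dump)
stop.set()
t.join()
```

This avoids the noise of a full all-threads dump: the "Worker blocked" warning can append just the one traceback that matters.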

Comment 5 Lukas Svaty 2018-02-05 13:59:58 UTC
Can you provide reproduction steps please?

Comment 6 Yaniv Bronhaim 2018-02-06 15:41:40 UTC
Change the log level to INFO (in /etc/vdsm/logger.conf) and block one of the storage domains; this will also get the monitoring worker stuck. After a short period you'll see prints in the following format:
2018-02-06 17:40:21,351+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 waiting task#=1 at 0x2377310>, traceback:
...

Thanks

Comment 8 Jiri Belka 2018-02-26 15:12:18 UTC
ok, vdsm-4.20.19-1.el7ev.x86_64

2018-02-26 16:10:46,851+0100 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f1dd4633b90> at 0x7f1dd4633bd0> timeout=15, duration=15 at 0x7f1dd4633ed0> task#=33 at 0x7f1dd45cae90>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats
  avail, commit = _memUsageInfo(cif)
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo
  mem_info = v.memory_info()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 827, in memory_info
  dom_stats = self._dom.memoryStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1510, in memoryStats
  ret = libvirtmod.virDomainMemoryStats(self._o) (executor:363)
2018-02-26 16:10:50,118+0100 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/str-03s1.example.com:_jbelka_jbelka-engine-rhv42-1/161d57c4-d159-4ca9-b5b6-c81ab55bf9e0/dom_md/metadata' is blocked for 20.01 seconds (check:278)

Comment 14 errata-xmlrpc 2018-05-15 17:52:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 15 Franta Kust 2019-05-16 13:03:38 UTC
BZ<2>Jira Resync