Bug 1719125 - [downstream clone - 4.3.5] lshw can take more than 15 seconds to execute depending on the system
Summary: [downstream clone - 4.3.5] lshw can take more than 15 seconds to execute depe...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.8
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: ovirt-4.3.5
: 4.3.5
Assignee: Edward Haas
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1709628
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-06-11 06:23 UTC by RHV bug bot
Modified: 2019-08-12 11:53 UTC (History)
7 users (show)

Fixed In Version: vdsm-4.30.19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1709628
Environment:
Last Closed: 2019-08-12 11:53:49 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4134861 0 Troubleshoot None VDSM periodically logs Worker blocked warning messages in vdsm.virt.sampling.HostMonitor 2019-06-11 06:23:25 UTC
Red Hat Product Errata RHBA-2019:2432 0 None None None 2019-08-12 11:53:57 UTC
oVirt gerrit 100193 0 master MERGED disable dpdk by default 2020-08-06 08:18:43 UTC
oVirt gerrit 100614 0 ovirt-4.3 MERGED disable dpdk by default 2020-08-06 08:18:43 UTC

Description RHV bug bot 2019-06-11 06:23:15 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1709628 +++
======================================================================

Description of problem:

This is seen very frequently in vdsm logs during host monitoring:

vdsm[121675]: WARN Worker blocked: <Worker name=periodic/82 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f94e443fe50> at 0x7f94e443fe90> timeout=15,
                                         File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
                                           self.__bootstrap_inner()
                                         File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
                                           self.run()
                                         File: "/usr/lib64/python2.7/threading.py", line 765, in run
                                           self.__target(*self.__args, **self.__kwargs)
                                         File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
                                           ret = func(*args, **kwargs)
                                         File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
                                           self._execute_task()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
                                           task()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
                                           self._callable()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
                                           self._func()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576, in __call__
                                           sample = HostSample(self._pid)
                                         File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240, in __init__
                                           self.interfaces = _get_interfaces_and_samples()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200, in _get_interfaces_and_samples
                                           for link in ipwrapper.getLinks():
                                         File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line 267, in getLinks
                                           in six.viewitems(dpdk.get_dpdk_devices()))
                                         File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 47, in get_dpdk_devices
                                           dpdk_devices = _get_dpdk_devices()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in __call__
                                           value = self.func(*args)
                                         File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 111, in _get_dpdk_devices
                                           devices = _lshw_command()
                                         File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 123, in _lshw_command
                                           rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
                                         File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in exec_sync
                                           retcode, out, err = exec_sync_bytes(cmds)
                                         File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156, in exec_cmd
                                           out, err = p.communicate()
                                         File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
                                           stdout, stderr = self._communicate(input, endtime, timeout)
                                         File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
                                           orig_timeout)
                                         File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
                                           ready = poller.poll(self._remaining_time(endtime))

It is because the lshw command used to get the network devices can taker a longer time apparently depending on the number of devices present on the system.
Even with the improvements from "dpdk: update lshw command", which filters out several classes of devices, it can still take more than 15s on some systems.

$ time lshw -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo
real    0m23.723s    
user    0m23.521s    
sys     0m0.176s  

Version-Release number of selected component (if applicable):
vdsm-4.20.43-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
# ip link add name dummy type dummy
# for i in {1..300} ; do ip link add link dummy type macvtap; done
# time lshw -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo

Actual results:
lshw time goes up with the number of devices (such as VM vnets), it can take more than 15s

Expected results:
Perhaps something can be done more efficiently by  _get_dpdk_devices? Or allow for a greater time...

(Originally by Germano Veit Michel)

Comment 4 Michael Burman 2019-06-17 05:27:17 UTC
Verified on - vdsm-4.30.19-1.el7ev.x86_64

Comment 6 errata-xmlrpc 2019-08-12 11:53:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2432


Note You need to log in before you can comment on or make changes to this bug.