Bug 918541 - The VM Channels Listener thread appears to stall, blocking all communication between VDSM and the hosted guest agents.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assigned To: Vinzenz Feenstra [evilissimo]
QA Contact: Jiri Belka
Whiteboard: virt
Keywords: ZStream
Duplicates: 967136
Depends On:
Blocks: 947883 964195
Reported: 2013-03-06 08:40 EST by Lee Yarwood
Modified: 2015-09-22 09 EDT
CC: 26 users

See Also:
Fixed In Version: vdsm-4.10.2-16.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, the VM Channels Listener thread stalled, which blocked communication between VDSM and the hosted guest agents. A patch prevents the VM Channels Listener thread from stalling, removing the block in communication between VDSM and the hosted guest agents.
Story Points: ---
Clone Of:
Clones: 947883
Environment:
Last Closed: 2013-06-10 16:42:20 EDT
Type: Bug
Regression: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 325943 None None None Never
oVirt gerrit 13093 None None None Never
Red Hat Product Errata RHSA-2013:0886 normal SHIPPED_LIVE Moderate: rhev 3.2 - vdsm security and bug fix update 2013-06-10 20:25:02 EDT

Description Lee Yarwood 2013-03-06 08:40:05 EST
Description of problem:

The VM Channels Listener thread appears to stall, causing communication between VDSM and the hosted guest agents to stop.

For example, after restarting vdsmd we can see the listener connect to a number of guest agents:

~~~
 26 vdsm.log.2.xz:31928:MainThread::INFO::2013-03-05 11:10:13,573::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64)
 27 vdsm.log.2.xz:31964:MainThread::INFO::2013-03-05 11:10:17,089::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
 28 vdsm.log.2.xz:31965:VM Channels Listener::INFO::2013-03-05 11:10:17,090::vmChannels::127::vds::(run) Starting VM channels listener thread.
 29 vdsm.log.2.xz:32312:Thread-69::DEBUG::2013-03-05 11:19:49,070::vmChannels::144::vds::(register) Add fileno 17 to listener's channels.
 30 vdsm.log.2.xz:32322:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels.
 31 vdsm.log.2.xz:32323:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
 32 vdsm.log.2.xz:32324:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::guestIF::95::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Attempting connection to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm
 33 vdsm.log.2.xz:32325:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::guestIF::97::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Connected to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm
 34 vdsm.log.2.xz:32326:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed.
~~~

Some time later we see a final _do_del_channels() call, after which the thread appears to stall:

~~~
 81 vdsm.log.2.xz:38891:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,513::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 82 vdsm.log.2.xz:38893:VM Channels Listener::DEBUG::2013-03-05 12:06:23,514::vmChannels::88::vds::(_do_del_channels) fileno 21 was removed from listener.
 83 vdsm.log.2.xz:38896:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,582::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 84 vdsm.log.2.xz:38906:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,598::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 85 vdsm.log.2.xz:39312:Thread-2002::DEBUG::2013-03-05 12:10:15,956::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 86 vdsm.log.2.xz:39318:Thread-2002::DEBUG::2013-03-05 12:10:16,759::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 87 vdsm.log.2.xz:39330:libvirtEventLoop::DEBUG::2013-03-05 12:10:16,787::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 88 vdsm.log.2.xz:39536:Thread-2015::DEBUG::2013-03-05 12:10:36,529::vmChannels::144::vds::(register) Add fileno 21 to listener's channels.
 89 vdsm.log.2.xz:41495:Thread-2600::DEBUG::2013-03-05 12:22:44,969::vmChannels::144::vds::(register) Add fileno 46 to listener's channels.
 90 vdsm.log.2.xz:41747:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,828::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 91 vdsm.log.2.xz:41750:Thread-2604::DEBUG::2013-03-05 12:23:19,841::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 92 vdsm.log.2.xz:41754:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,877::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 93 vdsm.log.2.xz:41764:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,883::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 94 vdsm.log.2.xz:42615:Thread-2956::DEBUG::2013-03-05 12:29:27,364::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 95 vdsm.log.2.xz:42620:Thread-2956::DEBUG::2013-03-05 12:29:28,169::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
~~~ 

This results in the webadmin removing the guest's IP address, etc. from the UI, and any shutdown requests falling back to the libvirt ACPI method.

This was corrected the next day by once again restarting vdsmd:

~~~
155 vdsm.log:66815:MainThread::INFO::2013-03-06 12:23:48,966::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64)
156 vdsm.log:66851:MainThread::INFO::2013-03-06 12:23:51,878::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
157 vdsm.log:66852:VM Channels Listener::INFO::2013-03-06 12:23:51,879::vmChannels::127::vds::(run) Starting VM channels listener thread.
158 vdsm.log:66912:Thread-13::DEBUG::2013-03-06 12:23:53,973::vmChannels::144::vds::(register) Add fileno 17 to listener's channels.
159 vdsm.log:66917:Thread-15::DEBUG::2013-03-06 12:23:54,517::vmChannels::144::vds::(register) Add fileno 20 to listener's channels.
160 vdsm.log:66925:Thread-17::DEBUG::2013-03-06 12:23:54,794::vmChannels::144::vds::(register) Add fileno 22 to listener's channels.
161 vdsm.log:66928:VM Channels Listener::DEBUG::2013-03-06 12:23:54,891::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels.
162 vdsm.log:66929:VM Channels Listener::DEBUG::2013-03-06 12:23:54,896::vmChannels::71::vds::(_do_add_channels) fileno 20 was added to unconnected channels.
163 vdsm.log:66930:VM Channels Listener::DEBUG::2013-03-06 12:23:54,901::vmChannels::71::vds::(_do_add_channels) fileno 22 was added to unconnected channels.
164 vdsm.log:66932:VM Channels Listener::DEBUG::2013-03-06 12:23:54,934::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
165 vdsm.log:66934:VM Channels Listener::DEBUG::2013-03-06 12:23:54,936::guestIF::95::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Attempting connection to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm
166 vdsm.log:66935:VM Channels Listener::DEBUG::2013-03-06 12:23:54,939::guestIF::97::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Connected to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm
167 vdsm.log:66936:VM Channels Listener::DEBUG::2013-03-06 12:23:54,940::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed.
~~~

I *think* the following _epoll.poll(1) call may be causing the overall VM Channels Listener thread to stall here, but I have no hard evidence to prove this as yet. Have we seen this before?

vdsm/vdsm/vmChannels.py

    def _wait_for_events(self):
        """ Wait for an epoll event and handle channels' timeout. """
        events = self._epoll.poll(1)
        for (fileno, event) in events:
            self._handle_event(fileno, event)
        else:
            self._update_channels()
            if (self._timeout is not None) and (self._timeout > 0):
                self._handle_timeouts()
            self._handle_unconnected()

    def run(self):
        """ The listener thread's function. """
        self.log.info("Starting VM channels listener thread.")
        self._quit = False
        while not self._quit:
            self._wait_for_events()

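For reference, here is a minimal standalone sketch (not vdsm code; the socket and the prints are made up purely for illustration) that mirrors the shape of _wait_for_events() above, just to show how the poll(1) call and the for/else branch behave in isolation:

~~~
import select
import socket

# Standalone illustration only -- none of this is vdsm code. It mirrors the
# structure of _wait_for_events(): an epoll object, a 1-second poll, and a
# for/else housekeeping branch.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setblocking(False)
listener.bind(("127.0.0.1", 0))
listener.listen(5)

epoll = select.epoll()
epoll.register(listener.fileno(), select.EPOLLIN)

for _ in range(3):
    # poll(1) waits at most one second and returns [] if nothing is ready,
    # so the call itself should not block the thread indefinitely.
    events = epoll.poll(1)
    for (fileno, event) in events:
        print("event on fileno %d (mask 0x%x)" % (fileno, event))
    else:
        # A for/else "else" runs whenever the loop finishes without break,
        # i.e. on every pass here, even when events is empty.
        print("housekeeping pass: timeouts, unconnected channels, ...")

epoll.unregister(listener.fileno())
epoll.close()
listener.close()
~~~
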
Version-Release number of selected component (if applicable):
4.10.2-1.6

How reproducible:
No clear steps at present; the customer reports this reproducing every 30 minutes or so after vdsmd restarts.

Steps to Reproduce:
N/A
  
Actual results:
vdsmd is unable to connect to, disconnect from, or communicate with any guest agent.

Expected results:
vdsmd restarts the stalled VM Channels Listener thread and is once again able to connect to, disconnect from, and communicate with all guest agents.

Additional info:

vdsm.log.edited was collected with the following one-liner:

# xzegrep -in '(vmChannels|VM Channels Listener|I am)' vdsm.{log.2.xz,log.1.xz,log} > vdsm.log.edited

vdsm.logs.xz contains the full vdsm.log, vdsm.log.1.xz, vdsm.log.2.xz and ovirt-guest.

I also have direct access to the customer environment and am attempting to reproduce this with additional debugging in both vmChannels.py and guestIF.py.
Comment 6 Jason Dillaman 2013-03-15 16:00:36 EDT
Customer is able to periodically re-create this issue by using CloudForms and ManageIQ EVM VM self-service provisioning.  

I am able to recreate the issue by repeatedly starting and stopping VMs on a host.  For example, assign 10 VMs to a single hypervisor host and, using the web admin, concurrently start and stop the VMs repeatedly.
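
A rough sketch of that start/stop churn, driven through the RHEV-M REST API rather than the web admin. Everything below (base URL, credentials, VM ids, timings) is a placeholder, and the loop is sequential rather than truly concurrent, so treat it as a starting point rather than a validated reproducer:

~~~
import time
import requests

# Placeholder values -- substitute a real manager, credentials and VM ids.
BASE = "https://rhevm.example.com/api"
AUTH = ("admin@internal", "password")
VM_IDS = ["<vm-id-1>", "<vm-id-2>"]  # the ~10 VMs assigned to the one host

def vm_action(vm_id, action):
    """POST an empty <action/> to /vms/<id>/<action> (start or stop)."""
    return requests.post(
        "%s/vms/%s/%s" % (BASE, vm_id, action),
        data="<action/>",
        headers={"Content-Type": "application/xml"},
        auth=AUTH,
        verify=False,  # lab only; verify certificates anywhere real
    )

def churn(vm_ids, cycles=50, pause=30):
    """Repeatedly start and then stop every VM, pausing between rounds."""
    for _ in range(cycles):
        for vm_id in vm_ids:
            vm_action(vm_id, "start")
        time.sleep(pause)
        for vm_id in vm_ids:
            vm_action(vm_id, "stop")
        time.sleep(pause)

if __name__ == "__main__":
    churn(VM_IDS)
~~~
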
Comment 10 Vinzenz Feenstra [evilissimo] 2013-04-03 09:14:10 EDT
Merged u/s as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=9dd359178dd3164eb7e425c81200d9440894436c
Comment 16 Jiri Belka 2013-04-30 09:04:22 EDT
Please provide steps for verification; the description is very vague.
Comment 20 Jiri Belka 2013-05-22 05:03:36 EDT
ok, sf17, cannot reproduce the behavior based on the actions described in comment #6.
Comment 21 Bryan Yount 2013-05-28 18:44:18 EDT
*** Bug 967136 has been marked as a duplicate of this bug. ***
Comment 23 errata-xmlrpc 2013-06-10 16:42:20 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html
