Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 918541

Summary: The VM Channels Listener thread appears to stall, blocking all communication between VDSM and the hosted guest agents.
Product: Red Hat Enterprise Virtualization Manager Reporter: Lee Yarwood <lyarwood>
Component: vdsm    Assignee: Vinzenz Feenstra [evilissimo] <vfeenstr>
Status: CLOSED ERRATA QA Contact: Jiri Belka <jbelka>
Severity: high Docs Contact:
Priority: high    
Version: 3.1.3    CC: bazulay, byount, carlos.molina.ext, chchen, cshao, gouyang, hadong, hateya, huiwa, iheim, italkohe, jdillama, leiwang, lpeer, mburns, michal.skrivanek, mkalinin, ofrenkel, pep, pstehlik, pzhukov, sgrinber, vchoudha, ycui, ykaul, zdover
Target Milestone: ---    Keywords: ZStream
Target Release: 3.2.0   
Hardware: All   
OS: Linux   
Whiteboard: virt
Fixed In Version: vdsm-4.10.2-16.0.el6ev Doc Type: Bug Fix
Doc Text:
Previously, the VM Channels Listener thread stalled, which blocked communication between VDSM and the hosted guest agents. A patch prevents the VM Channels Listener thread from stalling, removing the block in communication between VDSM and the hosted guest agents.
Story Points: ---
Clone Of:
: 947883 (view as bug list) Environment:
Last Closed: 2013-06-10 20:42:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 947883, 964195    

Description Lee Yarwood 2013-03-06 13:40:05 UTC
Description of problem:

The VM Channels Listener thread appears to stall, causing communication between VDSM and the hosted guest agents to stop.

For example, after restarting vdsmd we can see the listener connect to a number of guest agents:

~~~
 26 vdsm.log.2.xz:31928:MainThread::INFO::2013-03-05 11:10:13,573::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64)
 27 vdsm.log.2.xz:31964:MainThread::INFO::2013-03-05 11:10:17,089::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
 28 vdsm.log.2.xz:31965:VM Channels Listener::INFO::2013-03-05 11:10:17,090::vmChannels::127::vds::(run) Starting VM channels listener thread.
 29 vdsm.log.2.xz:32312:Thread-69::DEBUG::2013-03-05 11:19:49,070::vmChannels::144::vds::(register) Add fileno 17 to listener's channels.
 30 vdsm.log.2.xz:32322:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels.
 31 vdsm.log.2.xz:32323:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
 32 vdsm.log.2.xz:32324:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::guestIF::95::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Attempting connection to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm
 33 vdsm.log.2.xz:32325:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::guestIF::97::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Connected to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm
 34 vdsm.log.2.xz:32326:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed.
~~~

Some time later we see a final _do_del_channels() call, after which the thread appears to stall:

~~~
 81 vdsm.log.2.xz:38891:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,513::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 82 vdsm.log.2.xz:38893:VM Channels Listener::DEBUG::2013-03-05 12:06:23,514::vmChannels::88::vds::(_do_del_channels) fileno 21 was removed from listener.
 83 vdsm.log.2.xz:38896:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,582::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 84 vdsm.log.2.xz:38906:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,598::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 85 vdsm.log.2.xz:39312:Thread-2002::DEBUG::2013-03-05 12:10:15,956::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 86 vdsm.log.2.xz:39318:Thread-2002::DEBUG::2013-03-05 12:10:16,759::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 87 vdsm.log.2.xz:39330:libvirtEventLoop::DEBUG::2013-03-05 12:10:16,787::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 88 vdsm.log.2.xz:39536:Thread-2015::DEBUG::2013-03-05 12:10:36,529::vmChannels::144::vds::(register) Add fileno 21 to listener's channels.
 89 vdsm.log.2.xz:41495:Thread-2600::DEBUG::2013-03-05 12:22:44,969::vmChannels::144::vds::(register) Add fileno 46 to listener's channels.
 90 vdsm.log.2.xz:41747:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,828::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 91 vdsm.log.2.xz:41750:Thread-2604::DEBUG::2013-03-05 12:23:19,841::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 92 vdsm.log.2.xz:41754:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,877::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 93 vdsm.log.2.xz:41764:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,883::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
 94 vdsm.log.2.xz:42615:Thread-2956::DEBUG::2013-03-05 12:29:27,364::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
 95 vdsm.log.2.xz:42620:Thread-2956::DEBUG::2013-03-05 12:29:28,169::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
~~~ 

This results in the webadmin removing the guest's IP address and other details from the UI, and in any shutdown requests falling back to the libvirt ACPI method.

This was corrected the next day by once again restarting vdsmd:

~~~
155 vdsm.log:66815:MainThread::INFO::2013-03-06 12:23:48,966::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64)
156 vdsm.log:66851:MainThread::INFO::2013-03-06 12:23:51,878::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
157 vdsm.log:66852:VM Channels Listener::INFO::2013-03-06 12:23:51,879::vmChannels::127::vds::(run) Starting VM channels listener thread.
158 vdsm.log:66912:Thread-13::DEBUG::2013-03-06 12:23:53,973::vmChannels::144::vds::(register) Add fileno 17 to listener's channels.
159 vdsm.log:66917:Thread-15::DEBUG::2013-03-06 12:23:54,517::vmChannels::144::vds::(register) Add fileno 20 to listener's channels.
160 vdsm.log:66925:Thread-17::DEBUG::2013-03-06 12:23:54,794::vmChannels::144::vds::(register) Add fileno 22 to listener's channels.
161 vdsm.log:66928:VM Channels Listener::DEBUG::2013-03-06 12:23:54,891::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels.
162 vdsm.log:66929:VM Channels Listener::DEBUG::2013-03-06 12:23:54,896::vmChannels::71::vds::(_do_add_channels) fileno 20 was added to unconnected channels.
163 vdsm.log:66930:VM Channels Listener::DEBUG::2013-03-06 12:23:54,901::vmChannels::71::vds::(_do_add_channels) fileno 22 was added to unconnected channels.
164 vdsm.log:66932:VM Channels Listener::DEBUG::2013-03-06 12:23:54,934::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
165 vdsm.log:66934:VM Channels Listener::DEBUG::2013-03-06 12:23:54,936::guestIF::95::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Attempting connection to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm
166 vdsm.log:66935:VM Channels Listener::DEBUG::2013-03-06 12:23:54,939::guestIF::97::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Connected to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm
167 vdsm.log:66936:VM Channels Listener::DEBUG::2013-03-06 12:23:54,940::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed.
~~~

I *think* the following _epoll.poll(1) call may be causing the overall VM Channels Listener thread to stall here, but I have no hard evidence to prove this as yet. Have we seen this before?

vdsm/vdsm/vmChannels.py

~~~
114     def _wait_for_events(self):
115         """ Wait for an epoll event and handle channels' timeout. """
116         events = self._epoll.poll(1)
117         for (fileno, event) in events:
118             self._handle_event(fileno, event)
119         else:
120             self._update_channels()
121             if (self._timeout is not None) and (self._timeout > 0):
122                 self._handle_timeouts()
123             self._handle_unconnected()
124
125     def run(self):
126         """ The listener thread's function. """
127         self.log.info("Starting VM channels listener thread.")
128         self._quit = False
129         while not self._quit:
130             self._wait_for_events()
~~~

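For reference, the polling pattern quoted above can be exercised in isolation. This is a minimal standalone sketch (Linux-only, since it uses select.epoll; the names are illustrative demo code, not vdsm itself). Two details are worth noting: epoll.poll(1) blocks for at most one second and returns a list of (fileno, eventmask) pairs (an empty list on timeout), and a `for ... else` clause whose loop body contains no `break` runs its `else` block on every call, whether or not any events were returned.

```python
import select
import socket

def wait_for_events(ep, handled, housekeeping):
    """Mirrors the shape of _wait_for_events(): poll, dispatch, housekeep."""
    events = ep.poll(1)  # blocks at most 1 second; returns [] on timeout
    for (fileno, event) in events:
        handled.append(fileno)
    else:
        # No `break` in the loop body, so this clause runs on every call,
        # whether or not poll() returned any events.
        housekeeping.append(True)

ep = select.epoll()
a, b = socket.socketpair()
ep.register(a.fileno(), select.EPOLLIN)

handled, housekeeping = [], []
b.send(b"ping")                              # make `a` readable
wait_for_events(ep, handled, housekeeping)   # dispatches a.fileno()
a.recv(4)                                    # drain the socket
wait_for_events(ep, handled, housekeeping)   # times out; housekeeping still runs
```

After the two calls, `handled` holds one entry (the readable fileno) while `housekeeping` holds two, showing that the housekeeping path runs even on a poll timeout.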
Version-Release number of selected component (if applicable):
4.10.2-1.6

How reproducible:
No clear steps at present; the customer reports this reproducing every 30 minutes or so after a vdsmd restart.

Steps to Reproduce:
N/A
  
Actual results:
vdsmd is unable to connect to, disconnect from, or communicate with any guest agent.

Expected results:
vdsmd restarts the stalled VM Channels Listener thread and is once again able to connect to, disconnect from, and communicate with all guest agents.

Additional info:

vdsm.log.edited collected with the following one-liner:

# xzegrep -in '(vmChannels|VM Channels Listener|I am)' vdsm.{log.2.xz,log.1.xz,log} > vdsm.log.edited

vdsm.logs.xz contains the full vdsm.log, vdsm.log.1.xz, vdsm.log.2.xz and ovirt-guest.

I also have direct access to the customer environment and am attempting to reproduce this with additional debugging in both vmChannels.py and guestIF.py.

Comment 6 Jason Dillaman 2013-03-15 20:00:36 UTC
Customer is able to periodically re-create this issue by using CloudForms and ManageIQ EVM VM self-service provisioning.  

I am able to recreate the issue by repeatedly starting and stopping VMs on a host.  For example, assign 10 VMs to a single hypervisor host and, using the web admin, concurrently start and stop the VMs repeatedly.

Comment 10 Vinzenz Feenstra [evilissimo] 2013-04-03 13:14:10 UTC
Merged u/s as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=9dd359178dd3164eb7e425c81200d9440894436c

Comment 16 Jiri Belka 2013-04-30 13:04:22 UTC
Please provide steps for verification, the description is very vague.

Comment 20 Jiri Belka 2013-05-22 09:03:36 UTC
ok, sf17, cannot reproduce the behavior based on the actions described in comment #6.

Comment 21 Bryan Yount 2013-05-28 22:44:18 UTC
*** Bug 967136 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2013-06-10 20:42:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html