Bug 918541 - The VM Channels Listener thread appears to stall, blocking all communication between VDSM and the hosted guest agents.
Summary: The VM Channels Listener thread appears to stall, blocking all communication...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: Jiri Belka
URL:
Whiteboard: virt
Duplicates: 967136 (view as bug list)
Depends On:
Blocks: 947883 964195
 
Reported: 2013-03-06 13:40 UTC by Lee Yarwood
Modified: 2022-07-09 05:57 UTC
26 users

Fixed In Version: vdsm-4.10.2-16.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, the VM Channels Listener thread stalled, which blocked communication between VDSM and the hosted guest agents. A patch prevents the VM Channels Listener thread from stalling, restoring communication between VDSM and the hosted guest agents.
Clone Of:
Clones: 947883 (view as bug list)
Environment:
Last Closed: 2013-06-10 20:42:20 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 871616 0 unspecified CLOSED Guest agent information is missing after few VM's migrations 2022-07-09 05:59:49 UTC
Red Hat Bugzilla 927143 0 unspecified CLOSED [vdsm] ShutdownVM fails after plugging shared disk to 2 vms at once due to 'Bad File Descriptor' in vdsm 2022-07-09 06:15:02 UTC
Red Hat Issue Tracker RHV-47071 0 None None None 2022-07-09 05:57:53 UTC
Red Hat Knowledge Base (Solution) 325943 0 None None None Never
Red Hat Product Errata RHSA-2013:0886 0 normal SHIPPED_LIVE Moderate: rhev 3.2 - vdsm security and bug fix update 2013-06-11 00:25:02 UTC
oVirt gerrit 13093 0 None None None Never

Internal Links: 871616 927143

Description Lee Yarwood 2013-03-06 13:40:05 UTC
Description of problem:

The VM Channels Listener thread appears to stall, causing communication between VDSM and the hosted guest agents to stop.

For example, after restarting vdsmd, we can see the listener connect to a number of guest agents:

~~~
vdsm.log.2.xz:31928:MainThread::INFO::2013-03-05 11:10:13,573::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64)
vdsm.log.2.xz:31964:MainThread::INFO::2013-03-05 11:10:17,089::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
vdsm.log.2.xz:31965:VM Channels Listener::INFO::2013-03-05 11:10:17,090::vmChannels::127::vds::(run) Starting VM channels listener thread.
vdsm.log.2.xz:32312:Thread-69::DEBUG::2013-03-05 11:19:49,070::vmChannels::144::vds::(register) Add fileno 17 to listener's channels.
vdsm.log.2.xz:32322:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels.
vdsm.log.2.xz:32323:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
vdsm.log.2.xz:32324:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::guestIF::95::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Attempting connection to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm
vdsm.log.2.xz:32325:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::guestIF::97::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Connected to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm
vdsm.log.2.xz:32326:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed.
~~~

Some time later we see a final _do_del_channels() call, after which the thread appears to stall:

~~~
vdsm.log.2.xz:38891:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,513::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
vdsm.log.2.xz:38893:VM Channels Listener::DEBUG::2013-03-05 12:06:23,514::vmChannels::88::vds::(_do_del_channels) fileno 21 was removed from listener.
vdsm.log.2.xz:38896:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,582::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
vdsm.log.2.xz:38906:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,598::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
vdsm.log.2.xz:39312:Thread-2002::DEBUG::2013-03-05 12:10:15,956::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:39318:Thread-2002::DEBUG::2013-03-05 12:10:16,759::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:39330:libvirtEventLoop::DEBUG::2013-03-05 12:10:16,787::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:39536:Thread-2015::DEBUG::2013-03-05 12:10:36,529::vmChannels::144::vds::(register) Add fileno 21 to listener's channels.
vdsm.log.2.xz:41495:Thread-2600::DEBUG::2013-03-05 12:22:44,969::vmChannels::144::vds::(register) Add fileno 46 to listener's channels.
vdsm.log.2.xz:41747:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,828::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:41750:Thread-2604::DEBUG::2013-03-05 12:23:19,841::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:41754:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,877::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:41764:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,883::vmChannels::152::vds::(unregister) Delete fileno 46 from listener.
vdsm.log.2.xz:42615:Thread-2956::DEBUG::2013-03-05 12:29:27,364::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
vdsm.log.2.xz:42620:Thread-2956::DEBUG::2013-03-05 12:29:28,169::vmChannels::152::vds::(unregister) Delete fileno 21 from listener.
~~~ 

This results in the webadmin removing the guest's IP address and other agent-reported details from the UI, and any shutdown requests fall back to the libvirt ACPI method.

This was corrected the next day by once again restarting vdsmd:

~~~
vdsm.log:66815:MainThread::INFO::2013-03-06 12:23:48,966::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64)
vdsm.log:66851:MainThread::INFO::2013-03-06 12:23:51,878::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
vdsm.log:66852:VM Channels Listener::INFO::2013-03-06 12:23:51,879::vmChannels::127::vds::(run) Starting VM channels listener thread.
vdsm.log:66912:Thread-13::DEBUG::2013-03-06 12:23:53,973::vmChannels::144::vds::(register) Add fileno 17 to listener's channels.
vdsm.log:66917:Thread-15::DEBUG::2013-03-06 12:23:54,517::vmChannels::144::vds::(register) Add fileno 20 to listener's channels.
vdsm.log:66925:Thread-17::DEBUG::2013-03-06 12:23:54,794::vmChannels::144::vds::(register) Add fileno 22 to listener's channels.
vdsm.log:66928:VM Channels Listener::DEBUG::2013-03-06 12:23:54,891::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels.
vdsm.log:66929:VM Channels Listener::DEBUG::2013-03-06 12:23:54,896::vmChannels::71::vds::(_do_add_channels) fileno 20 was added to unconnected channels.
vdsm.log:66930:VM Channels Listener::DEBUG::2013-03-06 12:23:54,901::vmChannels::71::vds::(_do_add_channels) fileno 22 was added to unconnected channels.
vdsm.log:66932:VM Channels Listener::DEBUG::2013-03-06 12:23:54,934::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17.
vdsm.log:66934:VM Channels Listener::DEBUG::2013-03-06 12:23:54,936::guestIF::95::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Attempting connection to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm
vdsm.log:66935:VM Channels Listener::DEBUG::2013-03-06 12:23:54,939::guestIF::97::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Connected to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm
vdsm.log:66936:VM Channels Listener::DEBUG::2013-03-06 12:23:54,940::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed.
~~~

I *think* the following _epoll.poll(1) call may be causing the overall VM Channels Listener thread to stall here, but I have no hard evidence to prove this yet. Have we seen this before?

vdsm/vdsm/vmChannels.py

~~~
    def _wait_for_events(self):
        """ Wait for an epoll event and handle channels' timeout. """
        events = self._epoll.poll(1)
        for (fileno, event) in events:
            self._handle_event(fileno, event)
        else:
            self._update_channels()
            if (self._timeout is not None) and (self._timeout > 0):
                self._handle_timeouts()
            self._handle_unconnected()

    def run(self):
        """ The listener thread's function. """
        self.log.info("Starting VM channels listener thread.")
        self._quit = False
        while not self._quit:
            self._wait_for_events()
~~~
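
A note on the for/else shape above, since it is easy to misread: in Python, a for loop's else branch runs whenever the loop finishes without hitting break, so the housekeeping calls (_update_channels, _handle_timeouts, _handle_unconnected) run on every pass, including passes where poll(1) timed out and returned nothing. The poll(1) call itself blocks for at most one second, which makes a permanent stall inside poll alone seem unlikely; a handler blocking would be another candidate, though that is as speculative as the paragraph above. A minimal stdlib-only sketch (Linux-only, not vdsm code) demonstrating both behaviors:

~~~
# Stdlib-only sketch, not vdsm code: shows epoll.poll(timeout) returning an
# empty list after the timeout, and a for/else's else-branch running whenever
# the loop finishes without `break`.
import select

epoll = select.epoll()  # nothing registered, so poll() always times out

for _ in range(3):
    events = epoll.poll(1)  # blocks for at most 1 second, then returns []
    for fileno, event in events:
        print("event on fd", fileno)
    else:
        # Runs on every pass (the inner loop never breaks), mirroring how the
        # housekeeping calls in _wait_for_events() are reached even when no
        # events arrived.
        print("housekeeping pass, events =", events)

epoll.close()
~~~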

Version-Release number of selected component (if applicable):
4.10.2-1.6

How reproducible:
No clear steps at present; the customer reports this reproducing every 30 minutes or so after a vdsmd restart.

Steps to Reproduce:
N/A
  
Actual results:
vdsmd is unable to connect to, disconnect from, or communicate with any guest agent.

Expected results:
vdsmd restarts the stalled VM channels listener thread and is once again able to connect to, disconnect from, and communicate with all guest agents.
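
(For illustration only: this vdsm version has no such self-healing, and none of the names below are vdsm APIs. One hypothetical shape for the expected behavior is a heartbeat plus watchdog: the listener stamps a timestamp each pass, and a monitor replaces the thread when the stamp goes stale.)

~~~
# Hypothetical watchdog sketch -- invented names, not vdsm code.
import threading
import time


class HeartbeatListener(threading.Thread):
    """Stamps a heartbeat each pass so a watchdog can tell it is alive."""

    def __init__(self):
        super(HeartbeatListener, self).__init__()
        self.daemon = True
        self.heartbeat = time.time()
        self._quit = False

    def run(self):
        while not self._quit:
            self.heartbeat = time.time()  # prove this pass completed
            time.sleep(1)                 # stand-in for poll/handle work

    def stop(self):
        self._quit = True


def watchdog(stale_after=30):
    listener = HeartbeatListener()
    listener.start()
    while True:
        time.sleep(stale_after)
        if time.time() - listener.heartbeat > stale_after:
            listener.stop()                 # best effort; the thread may be stuck
            listener = HeartbeatListener()  # replace rather than join a stuck thread
            listener.start()
~~~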

Additional info:

vdsm.log.edited was collected with the following one-liner:

# xzegrep -in '(vmChannels|VM Channels Listener|I am)' vdsm.{log.2.xz,log.1.xz,log} > vdsm.log.edited

vdsm.logs.xz contains the full vdsm.log, vdsm.log.1.xz, vdsm.log.2.xz and ovirt-guest.

I also have direct access to the customer environment and am attempting to reproduce this with additional debugging in both vmChannels.py and guestIF.py.

Comment 6 Jason Dillaman 2013-03-15 20:00:36 UTC
The customer is able to periodically re-create this issue using CloudForms / ManageIQ EVM VM self-service provisioning.

I am able to recreate the issue by repeatedly starting and stopping VMs on a host.  For example, assign 10 VMs to a single hypervisor host and, using the web admin, concurrently start and stop the VMs repeatedly.
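
A rough scripted version of that churn (placeholders throughout: the engine URL, credentials, and VM UUIDs are made up, and the /api/vms/<id>/start and /api/vms/<id>/stop action endpoints are the 3.x-era REST API as I recall it, so verify against your engine's /api before relying on this):

~~~
# Start/stop churn sketch against the RHEV-M REST API -- treat endpoints and
# payloads as assumptions, not verified facts.
import time
import concurrent.futures

import requests

ENGINE = "https://engine.example.com"    # placeholder
AUTH = ("admin@internal", "password")    # placeholder
VM_IDS = ["<vm-uuid-1>", "<vm-uuid-2>"]  # ~10 VMs pinned to one host


def action(vm_id, verb):
    # POST an empty <action/> body to the VM's start/stop action URL.
    resp = requests.post(
        "%s/api/vms/%s/%s" % (ENGINE, vm_id, verb),
        data="<action/>",
        headers={"Content-Type": "application/xml"},
        auth=AUTH,
        verify=False,
    )
    resp.raise_for_status()


def churn(cycles=50, settle=60):
    # Concurrently start, then concurrently stop, all VMs, repeatedly,
    # while watching the host's vdsm.log for the listener going quiet.
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        for _ in range(cycles):
            list(pool.map(lambda vm: action(vm, "start"), VM_IDS))
            time.sleep(settle)
            list(pool.map(lambda vm: action(vm, "stop"), VM_IDS))
            time.sleep(settle)


if __name__ == "__main__":
    churn()
~~~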

Comment 10 Vinzenz Feenstra [evilissimo] 2013-04-03 13:14:10 UTC
Merged u/s as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=9dd359178dd3164eb7e425c81200d9440894436c

Comment 16 Jiri Belka 2013-04-30 13:04:22 UTC
Please provide steps for verification; the description is very vague.

Comment 20 Jiri Belka 2013-05-22 09:03:36 UTC
ok, sf17, cannot reproduce the behavior based on the actions described in comment #6.

Comment 21 Bryan Yount 2013-05-28 22:44:18 UTC
*** Bug 967136 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2013-06-10 20:42:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html

