Description of problem: The VM Channels Listener thread appears to stall causing communication between VDSM and the hosted guest agents to stop. For example, after restarting vdsmd we can see the listener connect to a number of guest agents : ~~~ 26 vdsm.log.2.xz:31928:MainThread::INFO::2013-03-05 11:10:13,573::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64) 27 vdsm.log.2.xz:31964:MainThread::INFO::2013-03-05 11:10:17,089::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds. 28 vdsm.log.2.xz:31965:VM Channels Listener::INFO::2013-03-05 11:10:17,090::vmChannels::127::vds::(run) Starting VM channels listener thread. 29 vdsm.log.2.xz:32312:Thread-69::DEBUG::2013-03-05 11:19:49,070::vmChannels::144::vds::(register) Add fileno 17 to listener's channels. 30 vdsm.log.2.xz:32322:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels. 31 vdsm.log.2.xz:32323:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17. 32 vdsm.log.2.xz:32324:VM Channels Listener::DEBUG::2013-03-05 11:19:49,339::guestIF::95::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Attempting connection to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm 33 vdsm.log.2.xz:32325:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::guestIF::97::vm.Vm::(_connect) vmId=`31404679-cef1-4291-b9fa-e979b515b98c`::Connected to /var/lib/libvirt/qemu/channels/pwallend-test.com.redhat.rhevm.vdsm 34 vdsm.log.2.xz:32326:VM Channels Listener::DEBUG::2013-03-05 11:19:49,340::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed. ~~~ Some time later we see a final _do_del_channels() call after which the thread appears to stall : ~~~ 81 vdsm.log.2.xz:38891:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,513::vmChannels::152::vds::(unregister) Delete fileno 21 from listener. 82 vdsm.log.2.xz:38893:VM Channels Listener::DEBUG::2013-03-05 12:06:23,514::vmChannels::88::vds::(_do_del_channels) fileno 21 was removed from listener. 83 vdsm.log.2.xz:38896:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,582::vmChannels::152::vds::(unregister) Delete fileno 21 from listener. 84 vdsm.log.2.xz:38906:libvirtEventLoop::DEBUG::2013-03-05 12:06:23,598::vmChannels::152::vds::(unregister) Delete fileno 21 from listener. 85 vdsm.log.2.xz:39312:Thread-2002::DEBUG::2013-03-05 12:10:15,956::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 86 vdsm.log.2.xz:39318:Thread-2002::DEBUG::2013-03-05 12:10:16,759::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 87 vdsm.log.2.xz:39330:libvirtEventLoop::DEBUG::2013-03-05 12:10:16,787::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 88 vdsm.log.2.xz:39536:Thread-2015::DEBUG::2013-03-05 12:10:36,529::vmChannels::144::vds::(register) Add fileno 21 to listener's channels. 89 vdsm.log.2.xz:41495:Thread-2600::DEBUG::2013-03-05 12:22:44,969::vmChannels::144::vds::(register) Add fileno 46 to listener's channels. 90 vdsm.log.2.xz:41747:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,828::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 91 vdsm.log.2.xz:41750:Thread-2604::DEBUG::2013-03-05 12:23:19,841::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 92 vdsm.log.2.xz:41754:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,877::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 93 vdsm.log.2.xz:41764:libvirtEventLoop::DEBUG::2013-03-05 12:23:19,883::vmChannels::152::vds::(unregister) Delete fileno 46 from listener. 94 vdsm.log.2.xz:42615:Thread-2956::DEBUG::2013-03-05 12:29:27,364::vmChannels::152::vds::(unregister) Delete fileno 21 from listener. 95 vdsm.log.2.xz:42620:Thread-2956::DEBUG::2013-03-05 12:29:28,169::vmChannels::152::vds::(unregister) Delete fileno 21 from listener. ~~~ This results in the webadmin removing the guests IP etc from the UI and any shutdown requests falling back to the libvirt acpi method. This was corrected the next day by once again restarting vdsmd : ~~~ 155 vdsm.log:66815:MainThread::INFO::2013-03-06 12:23:48,966::vdsm::88::vds::(run) I am the actual vdsm 4.10-1.6 h1-test.ec.dev (2.6.32-358.0.1.el6.x86_64) 156 vdsm.log:66851:MainThread::INFO::2013-03-06 12:23:51,878::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds. 157 vdsm.log:66852:VM Channels Listener::INFO::2013-03-06 12:23:51,879::vmChannels::127::vds::(run) Starting VM channels listener thread. 158 vdsm.log:66912:Thread-13::DEBUG::2013-03-06 12:23:53,973::vmChannels::144::vds::(register) Add fileno 17 to listener's channels. 159 vdsm.log:66917:Thread-15::DEBUG::2013-03-06 12:23:54,517::vmChannels::144::vds::(register) Add fileno 20 to listener's channels. 160 vdsm.log:66925:Thread-17::DEBUG::2013-03-06 12:23:54,794::vmChannels::144::vds::(register) Add fileno 22 to listener's channels. 161 vdsm.log:66928:VM Channels Listener::DEBUG::2013-03-06 12:23:54,891::vmChannels::71::vds::(_do_add_channels) fileno 17 was added to unconnected channels. 162 vdsm.log:66929:VM Channels Listener::DEBUG::2013-03-06 12:23:54,896::vmChannels::71::vds::(_do_add_channels) fileno 20 was added to unconnected channels. 163 vdsm.log:66930:VM Channels Listener::DEBUG::2013-03-06 12:23:54,901::vmChannels::71::vds::(_do_add_channels) fileno 22 was added to unconnected channels. 164 vdsm.log:66932:VM Channels Listener::DEBUG::2013-03-06 12:23:54,934::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 17. 165 vdsm.log:66934:VM Channels Listener::DEBUG::2013-03-06 12:23:54,936::guestIF::95::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Attempting connection to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm 166 vdsm.log:66935:VM Channels Listener::DEBUG::2013-03-06 12:23:54,939::guestIF::97::vm.Vm::(_connect) vmId=`1b755f3a-c511-4092-811e-ebfb90c0ecba`::Connected to /var/lib/libvirt/qemu/channels/fw3007.com.redhat.rhevm.vdsm 167 vdsm.log:66936:VM Channels Listener::DEBUG::2013-03-06 12:23:54,940::vmChannels::106::vds::(_handle_unconnected) Connect fileno 17 was succeed. ~~~ I *think* the following _epoll.poll(1) may be causing the overall VM Channels Listener thread to stall here but have no hard evidence to prove this as yet. Have we seen this before? vdsm/vdsm/vmChannels.py 114 def _wait_for_events(self): 115 """ Wait for an epoll event and handle channels' timeout. """ 116 events = self._epoll.poll(1) 117 for (fileno, event) in events: 118 self._handle_event(fileno, event) 119 else: 120 self._update_channels() 121 if (self._timeout is not None) and (self._timeout > 0): 122 self._handle_timeouts() 123 self._handle_unconnected() 124 125 def run(self): 126 """ The listener thread's function. """ 127 self.log.info("Starting VM channels listener thread.") 128 self._quit = False 129 while not self._quit: 130 self._wait_for_events() Version-Release number of selected component (if applicable): 4.10.2-1.6 How reproducible: No clear steps at present, the customer reports this reproducing every 30mins or so after a vdsmd restarts. Steps to Reproduce: N/A Actual results: vdsmd unable to connect, disconnect or communicate with any guest agent. Expected results: vdsmd restarts the stalled vm listener thread and is once again able to connect, disconnect and communicate with all guest agents. Additional info: vdsm.log.edited collected with the following one liner : # xzegrep -in '(vmChannels|VM Channels Listener|I am)' vdsm.{log.2.xz,log.1.xz,log} > vdsm.log.edited vdsm.logs.xz contains the full vdsm.log, vdsm.log.1.xz, vdsm.log.2.xz and ovirt-guest. I also have direct access to the customer environment and am attempting to reproduce this with additional debugging in both vmChannels.py and guestIF.py.
Customer is able to periodically re-create this issue by using CloudForms and ManageIQ EVM VM self-service provisioning. I am able to recreate the issue by repeatedly starting and stopping VMs on a host. For example, assign 10 VMs to a single hypervisor host and, using the web admin, concurrently start and stop the VMs repeatedly.
Merged u/s as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=9dd359178dd3164eb7e425c81200d9440894436c
Please provide steps for verification, the description is very vague.
ok, sf17, cannot reproduce the behavior becase on actions described in comment#6.
*** Bug 967136 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0886.html