Description of problem:
If a VM has been powered down while its host was in "Preparing for Maintenance" state, vdsm consumes 100% of CPU after the host is reactivated.

Version-Release number of selected component (if applicable):
vdsm-4.16.13.1-1.el6ev.x86_64

How reproducible:
Unknown yet

Steps to Reproduce:
1. Two-host environment. Put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail)
2. Wait until the migration fails
3. Power the VM(s) off
4. Activate the host

Actual results:
vdsm consumes 100% of CPU; the vdsm log is flooded with [1]

Expected results:
The vmchannel listener should not monitor stopped VMs / deleted fds

Additional info:
[1]
VM Channels Listener::INFO::2015-06-01 12:20:17,429::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 84
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
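For context, a minimal standalone sketch of the mechanism behind the 100% CPU (not vdsm code; it only assumes Linux and Python's select.epoll): once the peer end of a registered fd goes away, epoll.poll() returns immediately with a hangup event on every call, so a listener loop that never unregisters the fd spins forever.

import select
import socket

# Emulate a VM channel whose peer (the VM) has gone away while the fd
# is still registered in the listener's epoll set.
listener_side, vm_side = socket.socketpair()
epoll = select.epoll()
epoll.register(listener_side.fileno(), select.EPOLLIN)

vm_side.close()  # the VM is powered off; its end of the channel disappears

for _ in range(3):
    # Each call returns immediately with EPOLLIN|EPOLLHUP for the dead fd
    # (the 00000019 in the vdsm log additionally carries EPOLLERR).
    print(epoll.poll(1))

# The cure is to drop the fd from the watched set once the peer is gone.
epoll.unregister(listener_side.fileno())
listener_side.close()
epoll.close()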
We need the rest of the vdsm.log; would you please attach it?
Created attachment 1035078 [details]
vdsm log

The archive was not completed because the system was overloaded, but it can still be extracted.
This happened at the end of the migration (from the source's point of view). Can you attach libvirt and qemu logs from both sides? Is it reproducible? It seems the migration was not properly cancelled: I see the source VM got suspended, and then the qemu process went crazy.
will add defensive code
NACKed by upstream; will try one last push for a solution.
Since we do not know the cause of the CPU consumption and are not happy with the defensive code of restarting vdsm (it could create additional issues), closing this one until we can get a reproducer.
After some investigation we have found the root cause of this and can now handle it appropriately. The issue comes from a wrong assumption made in some earlier changes. Originally it was assumed that we can rely on epoll to automatically unregister fds once they have been closed. However, when forking a process (which is the case when starting child processes), the application's open file descriptors are shared with the child, and unless both the child and the parent close them, epoll does not consider those fds closed and does not automatically remove them from the watched set. Usually this problem is solved by passing the SOCK_CLOEXEC flag to the socket, so that when a child process is spawned (fork+exec) those descriptors are closed automatically in the child. However, this flag only exists in Python 3.2 and higher. So we should at least set the close-on-exec flag, even though we know it is racy, and we should handle errors on the fd by unregistering it from epoll; and in case we are watching that socket and hold a handle to it, also close it.
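A rough sketch of the two mitigations described above (illustrative only; the function names are hypothetical, not the actual vdsm patch): mark channel sockets close-on-exec so forked children do not keep them alive, and on error/hangup explicitly unregister the fd from epoll and close our handle to it.

import errno
import fcntl
import select


def set_cloexec(sock):
    # socket.SOCK_CLOEXEC is only available on Python >= 3.2, so set
    # FD_CLOEXEC via fcntl instead. This is still racy: a fork can happen
    # between socket creation and this call, which is why the error
    # handling below is needed as well.
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def handle_event(epoll, fd, event, tracked_channels):
    # Do not rely on epoll dropping a closed fd automatically: a forked
    # child may still hold a duplicate of it. On error/hangup, unregister
    # the fd ourselves and, if we still track the socket, close it too.
    if event & (select.EPOLLERR | select.EPOLLHUP):
        try:
            epoll.unregister(fd)
        except EnvironmentError as e:
            if e.errno != errno.ENOENT:
                raise
        sock = tracked_channels.pop(fd, None)
        if sock is not None:
            sock.close()
        return
    # ... normal EPOLLIN handling of the channel would go here ...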
Verified with:
RHEVM: 3.6.3-0.1.el6
vdsm: vdsm-4.17.19-0.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Steps to Reproduce:
1. Two-host environment. Put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail)
2. Wait until the migration fails
3. Power the VM(s) off
4. Activate the host

Actual results:
The VM is down; the host is up and running. No 'vmchannels' prints in the vdsm log.

Check PASS
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html