Bug 1226911 - vmchannel thread consumes 100% of CPU
Summary: vmchannel thread consumes 100% of CPU
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks: 1297414
 
Reported: 2015-06-01 12:23 UTC by Pavel Zhukov
Modified: 2019-10-10 09:50 UTC
CC: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In some cases after migration fails, many error messages flooded the VDSM log, which caused one of the VDSM threads to consume 100% CPU. This was caused by incorrect usage of epoll, which has been fixed in this update.
Clone Of:
: 1297414
Environment:
Last Closed: 2016-03-09 19:40:50 UTC
oVirt Team: Virt
Target Upstream Version:


Attachments
vdsm log (4.09 MB, application/x-xz)
2015-06-05 08:08 UTC, Pavel Zhukov


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2543181 None None None 2016-08-23 04:17:38 UTC
Red Hat Product Errata RHBA-2016:0362 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 51521 master MERGED virt: Correct epoll unregistration usage in vmchannels 2016-01-14 10:43:20 UTC
oVirt gerrit 51840 ovirt-3.6 MERGED virt: Correct epoll unregistration usage in vmchannels 2016-01-15 13:49:03 UTC
oVirt gerrit 51877 ovirt-3.6 MERGED virt: Set cloexec flag on channel sockets 2016-01-15 13:48:37 UTC

Description Pavel Zhukov 2015-06-01 12:23:35 UTC
Description of problem:
If a VM was powered down while its host was in the "preparing for maintenance" state, vdsm consumes 100% of CPU after the host is reactivated.

Version-Release number of selected component (if applicable):
vdsm-4.16.13.1-1.el6ev.x86_64

How reproducible:
unknown yet

Steps to Reproduce:
1. In a two-host environment, put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail)
2. Wait until the migration fails
3. Power the VM(s) off
4. Activate the host

Actual results:
vdsm consumes 100% of CPU
vdsm log is flooded with [1]

Expected results:
The vmchannel listener should not keep monitoring stopped VMs / deleted fds


Additional info:


[1]
VM Channels Listener::INFO::2015-06-01 12:20:17,429::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 84
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
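The `00000019` in the messages above is a hex epoll event mask. As a quick sanity check (a standalone sketch, not vdsm code), it decodes to EPOLLIN | EPOLLERR | EPOLLHUP, i.e. the fd is simultaneously readable, in error, and hung up; because the fd is never unregistered, epoll reports it again on every wakeup and the listener thread spins:

```python
import select

# 0x19 is the event mask logged by the VM Channels Listener above.
mask = 0x19
flags = {
    "EPOLLIN": select.EPOLLIN,    # 0x01: fd readable (reads would return EOF here)
    "EPOLLERR": select.EPOLLERR,  # 0x08: error condition on the fd
    "EPOLLHUP": select.EPOLLHUP,  # 0x10: peer hung up
}
decoded = [name for name, bit in flags.items() if mask & bit]
print(decoded)  # ['EPOLLIN', 'EPOLLERR', 'EPOLLHUP']
```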

Comment 1 Michal Skrivanek 2015-06-04 09:51:41 UTC
We need the rest of the vdsm.log; could you add it, please?

Comment 2 Pavel Zhukov 2015-06-05 08:08:53 UTC
Created attachment 1035078 [details]
vdsm log

The archive is incomplete because the system was overloaded, but it can still be extracted.

Comment 3 Michal Skrivanek 2015-06-05 12:28:48 UTC
This happened at the end of the migration (from the source's point of view).
Can you add libvirt and qemu logs from both sides? Is it reproducible?
It seems the migration was not properly cancelled: I see the source VM got suspended, and then the qemu process went crazy.

Comment 4 Michal Skrivanek 2015-06-23 08:28:32 UTC
will add defensive code

Comment 5 Michal Skrivanek 2015-07-30 08:19:07 UTC
NACKed by upstream; will try one last push for a solution.

Comment 10 Moran Goldboim 2015-09-17 10:44:28 UTC
Since we don't know the cause of the CPU consumption and are not happy with the defensive code of restarting vdsm (it could create additional issues), closing this one until we can get a reproducer.

Comment 14 Vinzenz Feenstra [evilissimo] 2015-12-14 14:50:12 UTC
After some investigation we have found the root cause of this and can now handle it appropriately.

The issue stems from a wrong assumption made in some earlier changes.
Originally it was assumed that we could rely on epoll to automatically unregister fds once they have been closed.
However, when a process forks (which is the case when starting child processes), the application's open file handles are shared with the child, and unless both the child and the parent close them, epoll does not consider those fds closed and does not automatically unregister them from the watched set.
Usually this problem is solved by passing the SOCK_CLOEXEC flag when creating a socket, so that the handles are automatically closed in a child process when it execs. However, this flag only exists in Python 3.2 and higher.
So we should at least set the close-on-exec flag, even though we know it is raceful, and we should handle errors on the fd by unregistering it from epoll; and if we are watching that socket and have a handle to it, also close it.
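A minimal sketch of the two measures described above (function and variable names here are illustrative, not vdsm's actual code): set FD_CLOEXEC via fcntl, since the SOCK_CLOEXEC socket flag is unavailable before Python 3.2, and explicitly unregister an fd from epoll on error/hangup instead of relying on epoll to drop it:

```python
import fcntl
import select
import socket

def set_cloexec(sock):
    """Mark the socket close-on-exec so child processes don't keep it alive
    across exec (fcntl fallback for Pythons without socket.SOCK_CLOEXEC)."""
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

def handle_event(epoll, fileno, event, channels):
    """On error/hangup, explicitly unregister the fd from epoll and close
    our handle to it, instead of assuming epoll drops closed fds itself."""
    if event & (select.EPOLLERR | select.EPOLLHUP):
        epoll.unregister(fileno)
        sock = channels.pop(fileno, None)
        if sock is not None:
            sock.close()
```

With this, a hung-up channel is removed from the watched set on the first error event, so the listener no longer spins on events for dead fds.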

Comment 17 Israel Pinto 2016-02-07 12:41:04 UTC
Verified with:
RHEVM: 3.6.3-0.1.el6
vdsm: vdsm-4.17.19-0.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Steps to Reproduce:
1. In a two-host environment, put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail)
2. Wait until the migration fails
3. Power the VM(s) off
4. Activate the host

Actual results:
The VM is down, Host is up and running.
No 'vmchannels' print in vdsm log.
Check PASS

Comment 19 errata-xmlrpc 2016-03-09 19:40:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

