Bug 1226911 - vmchannel thread consumes 100% of CPU
Summary: vmchannel thread consumes 100% of CPU
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks: 1297414
 
Reported: 2015-06-01 12:23 UTC by Pavel Zhukov
Modified: 2019-10-10 09:50 UTC
CC: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In some cases after migration fails, many error messages flooded the VDSM log, which caused one of the VDSM threads to consume 100% CPU. This was caused by incorrect usage of epoll, which has been fixed in this update.
Clone Of:
: 1297414
Environment:
Last Closed: 2016-03-09 19:40:50 UTC
oVirt Team: Virt
Target Upstream Version:


Attachments
vdsm log (4.09 MB, application/x-xz)
2015-06-05 08:08 UTC, Pavel Zhukov


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2543181 None None None 2016-08-23 04:17:38 UTC
Red Hat Product Errata RHBA-2016:0362 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 51521 master MERGED virt: Correct epoll unregistration usage in vmchannels 2016-01-14 10:43:20 UTC
oVirt gerrit 51840 ovirt-3.6 MERGED virt: Correct epoll unregistration usage in vmchannels 2016-01-15 13:49:03 UTC
oVirt gerrit 51877 ovirt-3.6 MERGED virt: Set cloexec flag on channel sockets 2016-01-15 13:48:37 UTC

Description Pavel Zhukov 2015-06-01 12:23:35 UTC
Description of problem:
If a VM was powered down while its host was in the "preparing for maintenance" state, vdsm consumes 100% of CPU after the host is reactivated.

Version-Release number of selected component (if applicable):
vdsm-4.16.13.1-1.el6ev.x86_64

How reproducible:
unknown yet

Steps to Reproduce:
1. In a two-host environment, put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail)
2. Wait until the migration fails
3. Power the VM(s) off
4. Activate the host

Actual results:
vdsm consumes 100% of CPU
vdsm log is flooded with [1]

Expected results:
The vmchannel listener should not keep monitoring stopped VMs / deleted fds


Additional info:


[1]
VM Channels Listener::INFO::2015-06-01 12:20:17,429::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 84
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
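The `00000019` in the messages above is a hex epoll event mask. As a quick sanity check (a standalone sketch, not vdsm code), it decodes to EPOLLIN | EPOLLERR | EPOLLHUP, i.e. the fd is simultaneously readable, in error, and hung up; because the fd is never unregistered, epoll reports it again on every wakeup and the listener thread spins:

```python
import select

# 0x19 is the event mask logged by the VM Channels Listener above.
mask = 0x19
flags = {
    "EPOLLIN": select.EPOLLIN,    # 0x01: fd readable (reads would return EOF here)
    "EPOLLERR": select.EPOLLERR,  # 0x08: error condition on the fd
    "EPOLLHUP": select.EPOLLHUP,  # 0x10: peer hung up
}
decoded = [name for name, bit in flags.items() if mask & bit]
print(decoded)  # ['EPOLLIN', 'EPOLLERR', 'EPOLLHUP']
```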

Comment 1 Michal Skrivanek 2015-06-04 09:51:41 UTC
We need the rest of the vdsm.log; could you add it, please?

Comment 2 Pavel Zhukov 2015-06-05 08:08:53 UTC
Created attachment 1035078 [details]
vdsm log

The archive is incomplete because the system was overloaded, but it can still be extracted.

Comment 3 Michal Skrivanek 2015-06-05 12:28:48 UTC
This happened at the end of the migration (from the source's point of view).
Can you add libvirt and qemu logs from both sides? Is it reproducible?
It seems the migration was not properly cancelled: I see the source VM got suspended, and then the qemu process went crazy.

Comment 4 Michal Skrivanek 2015-06-23 08:28:32 UTC
will add defensive code

Comment 5 Michal Skrivanek 2015-07-30 08:19:07 UTC
NACKed by upstream; will try one last push for a solution.

Comment 10 Moran Goldboim 2015-09-17 10:44:28 UTC
Since we don't know the cause of the CPU consumption and are not happy with the defensive code of restarting vdsm (it could create additional issues), closing this one until we can get a reproducer.

Comment 14 Vinzenz Feenstra [evilissimo] 2015-12-14 14:50:12 UTC
After some investigation we have found the root cause of this and can now handle it appropriately.

The issue stems from a wrong assumption made in some earlier changes.
Originally it was assumed that we could rely on epoll to automatically unregister fds once they have been closed.
However, when a process forks (which is the case when starting child processes), the application's open file handles are shared with the child, and unless both the child and the parent close them, epoll does not consider those fds closed and does not automatically unregister them from the watched set.
Usually this problem is solved by passing the SOCK_CLOEXEC flag when creating a socket, so that the handles are automatically closed in a child process when it execs. However, this flag only exists in Python 3.2 and higher.
So we should at least set the close-on-exec flag, even though we know it is raceful, and we should handle errors on the fd by unregistering it from epoll; and if we are watching that socket and have a handle to it, also close it.
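A minimal sketch of the two measures described above (function and variable names here are illustrative, not vdsm's actual code): set FD_CLOEXEC via fcntl, since the SOCK_CLOEXEC socket flag is unavailable before Python 3.2, and explicitly unregister an fd from epoll on error/hangup instead of relying on epoll to drop it:

```python
import fcntl
import select
import socket

def set_cloexec(sock):
    """Mark the socket close-on-exec so child processes don't keep it alive
    across exec (fcntl fallback for Pythons without socket.SOCK_CLOEXEC)."""
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

def handle_event(epoll, fileno, event, channels):
    """On error/hangup, explicitly unregister the fd from epoll and close
    our handle to it, instead of assuming epoll drops closed fds itself."""
    if event & (select.EPOLLERR | select.EPOLLHUP):
        epoll.unregister(fileno)
        sock = channels.pop(fileno, None)
        if sock is not None:
            sock.close()
```

With this, a hung-up channel is removed from the watched set on the first error event, so the listener no longer spins on events for dead fds.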

Comment 17 Israel Pinto 2016-02-07 12:41:04 UTC
Verified with:
RHEVM: 3.6.3-0.1.el6
vdsm: vdsm-4.17.19-0.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Steps to Reproduce:
1. In a two-host environment, put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail)
2. Wait until the migration fails
3. Power the VM(s) off
4. Activate the host

Actual results:
The VM is down, Host is up and running.
No 'vmchannels' print in vdsm log.
Check PASS

Comment 19 errata-xmlrpc 2016-03-09 19:40:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

