Bug 1226911

Summary: vmchannel thread consumes 100% of CPU
Product: Red Hat Enterprise Virtualization Manager
Reporter: Pavel Zhukov <pzhukov>
Component: vdsm
Assignee: Vinzenz Feenstra [evilissimo] <vfeenstr>
Status: CLOSED ERRATA
QA Contact: Israel Pinto <ipinto>
Severity: urgent
Priority: high
Docs Contact:
Version: 3.5.1
CC: bazulay, erik-fedora, gklein, gveitmic, kyulee, lpeer, lsurette, mavital, mgoldboi, michal.skrivanek, ofrenkel, pdwyer, pzhukov, sbonazzo, s.kieske, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Keywords: Reopened, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text: In some cases after migration fails, many error messages flooded the VDSM log, which caused one of the VDSM threads to consume 100% CPU. This was caused by incorrect usage of epoll, which has been fixed in this update.
Story Points: ---
Clone Of:
Clones: 1297414 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:40:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1297414
Attachments: vdsm log (flags: none)

Description Pavel Zhukov 2015-06-01 12:23:35 UTC
Description of problem:
If a VM was powered down while its host was in the "preparing for maintenance" state, vdsm consumes 100% of CPU after the host is reactivated.

Version-Release number of selected component (if applicable):
vdsm-4.16.13.1-1.el6ev.x86_64

How reproducible:
unknown yet

Steps to Reproduce:
1. Two-host environment. Put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail).
2. Wait until the migration fails.
3. Power the VM(s) off.
4. Activate the host.

Actual results:
vdsm consumes 100% of CPU
vdsm log is flooded with [1]

Expected results:
The vmchannel listener should not monitor stopped VMs / deleted fds.


Additional info:


[1]
VM Channels Listener::INFO::2015-06-01 12:20:17,429::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,430::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 84
VM Channels Listener::DEBUG::2015-06-01 12:20:17,430::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 82
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
VM Channels Listener::INFO::2015-06-01 12:20:17,431::vmchannels::54::vds::(_handle_event) Received 00000019 on fileno 13
VM Channels Listener::DEBUG::2015-06-01 12:20:17,431::vmchannels::59::vds::(_handle_event) Received 00000019. On fd removed by epoll.
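
For reference, the 00000019 event mask in the messages above decodes, under the standard epoll constants, to EPOLLIN | EPOLLERR | EPOLLHUP, i.e. epoll keeps reporting hang-up/error events for file descriptors that are no longer backed by a live channel. A minimal, purely illustrative check:

import select

# 0x19 == EPOLLIN (0x01) | EPOLLERR (0x08) | EPOLLHUP (0x10)
assert 0x19 == select.EPOLLIN | select.EPOLLERR | select.EPOLLHUP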

Comment 1 Michal Skrivanek 2015-06-04 09:51:41 UTC
We need the rest of the vdsm.log; would you please add it?

Comment 2 Pavel Zhukov 2015-06-05 08:08:53 UTC
Created attachment 1035078 [details]
vdsm log

The archive was not completed because the system was overloaded, but it can still be extracted.

Comment 3 Michal Skrivanek 2015-06-05 12:28:48 UTC
This happened at the end of the migration (from the source's point of view).
Can you add libvirt and qemu logs from both sides? Is it reproducible?
It seems the migration was not properly cancelled, as I can see the source VM got suspended and then the qemu process went crazy.

Comment 4 Michal Skrivanek 2015-06-23 08:28:32 UTC
will add defensive code

Comment 5 Michal Skrivanek 2015-07-30 08:19:07 UTC
NACKed by upstream; will try one last push for a solution.

Comment 10 Moran Goldboim 2015-09-17 10:44:28 UTC
Since we don't know the cause of the CPU consumption and are not happy with the defensive code of restarting vdsm (it could create additional issues), closing this one until we can get a reproducer.

Comment 14 Vinzenz Feenstra [evilissimo] 2015-12-14 14:50:12 UTC
After some investigation we have found the root cause of this and can now handle it appropriately.

The issue comes from a wrong assumption made in some earlier changes.
Originally it was assumed that we could rely on epoll to automatically unregister fds once they had been closed.
However, when forking a process (which is the case when starting child processes), the application's open file handles are shared with the child, and unless both the child and the parent close them, epoll will not consider those fds closed and will not automatically unregister them from the watched set.
Usually this problem is solved by passing the SOCK_CLOEXEC flag when creating the socket, so that in case of a fork those handles get automatically closed in the child process. However, this flag only exists in Python 3.2 and higher.
So we should at least set the close-on-exec flag, even though we know doing so is raceful, and we should handle errors on the fd by unregistering it from epoll; and if we are watching that socket and have a handle to it, also close it.
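
To illustrate the approach described above, here is a minimal Python sketch. It is not the actual vdsm vmchannels code; the class and helper names (ChannelListener, _mark_cloexec, _channels) are hypothetical. The channel socket is marked close-on-exec after creation via fcntl (raceful, since socket.SOCK_CLOEXEC is only available from Python 3.2), and on EPOLLERR/EPOLLHUP the fd is explicitly unregistered from epoll and the tracked socket, if any, is closed.

import errno
import fcntl
import select


def _mark_cloexec(sock):
    # Python < 3.2 has no socket.SOCK_CLOEXEC, so the flag is set after
    # creation via fcntl; this is inherently raceful, because a fork can
    # happen between socket creation and this call.
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


class ChannelListener(object):
    def __init__(self):
        self._epoll = select.epoll()
        self._channels = {}  # fileno -> socket object we are watching

    def register(self, sock):
        _mark_cloexec(sock)
        self._channels[sock.fileno()] = sock
        self._epoll.register(sock.fileno(), select.EPOLLIN)

    def _handle_event(self, fileno, event):
        if event & (select.EPOLLERR | select.EPOLLHUP):
            # Do not rely on epoll dropping the fd by itself: a forked
            # child may still hold a duplicate, so unregister explicitly
            # and close our own handle if we track one.
            try:
                self._epoll.unregister(fileno)
            except (IOError, OSError) as e:
                if e.errno != errno.ENOENT:
                    raise
            sock = self._channels.pop(fileno, None)
            if sock is not None:
                sock.close()
            return
        # ... normal EPOLLIN handling for live channels would go here ...

With this kind of handling the listener stops polling dead channel fds instead of spinning on the EPOLLHUP/EPOLLERR events seen in the log above.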

Comment 17 Israel Pinto 2016-02-07 12:41:04 UTC
Verified with:
RHEVM: 3.6.3-0.1.el6
vdsm: vdsm-4.17.19-0.el7ev
libvirt: libvirt-1.2.17-13.el7_2.2

Steps to Reproduce:
1. Two-host environment. Put one host into maintenance mode (the second host does not have enough resources to run all VMs, so the migration(s) fail).
2. Wait until the migration fails.
3. Power the VM(s) off.
4. Activate the host.

Actual results:
The VM is down; the host is up and running.
No 'vmchannels' messages in the vdsm log.
Check PASS

Comment 19 errata-xmlrpc 2016-03-09 19:40:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html