Description of problem:
When issuing multiple Nova deletes, some guests go into ShutOff state instead of being deleted.

Version-Release number of selected component (if applicable):
[root@pcloud16 ~]# libvirtd --version
libvirtd (libvirt) 0.10.2
[root@pcloud16 ~]# nova --version
2.15.0
[root@pcloud16 ~]# rpm -qa | grep nova
openstack-nova-compute-2013.2.2-2.el6ost.noarch
python-novaclient-2.15.0-2.el6ost.noarch
python-nova-2013.2.2-2.el6ost.noarch
openstack-nova-common-2013.2.2-2.el6ost.noarch
[root@pcloud16 ~]#

How reproducible:
Depends on the volume of deletes happening concurrently.

Steps to Reproduce:
0. RHEL OSP env: 1x controller, 1x neutron server, 1x compute node
1. Launch multiple guests (I see this at 40 guests on a single compute node)
2. Connect via SSH to the guests (I am running netperf between two guests)
3. Delete all guests

Actual results:
Guest gets put into ShutOff state.

libvirtd logs:
30610: warning : qemuProcessKill:4174 : Timed out waiting after SIGTERM to process 26111, sending SIGKILL
2014-03-05 07:31:42.466+0000: 30610: warning : qemuProcessKill:4206 : Timed out waiting after SIGKILL to process 26111
2014-03-05 07:31:42.466+0000: 30610: error : qemuDomainDestroyFlags:2098 : operation failed: failed to kill qemu process with SIGTERM

Nova log:
2855 ERROR nova.virt.libvirt.driver [req-0e54e859-0a4e-42e9-a75b-4850efe4e6d1 313a479fb18348f980cc6870e5051104 fabc2431f2e2412b96459c918f749f54] [instance: cc4f67e0-fc8e-495b-a507-649699d46ed7] Error from libvirt during destroy. Code=9 Error=operation failed: failed to kill qemu process with SIGTERM

Expected results:
Nova should delete the guest, or keep trying to remove it.

Additional info:
This doesn't look like a bug in qemu-kvm. It looks more like the host is so busy that processes don't die quickly enough, even after SIGKILL. I believe deleting a guest in Nova means destroying the domain and then undefining it. It seems that Nova calls virDomainDestroyFlags() and, since that call fails (due to the timeout), skips the rest of the deletion. Once the qemu-kvm process actually dies, Nova then reports the guest as ShutOff. A possible solution would be to move the rest of the deletion process into a VIR_DOMAIN_EVENT_STOPPED handler, which would ensure the domain is deleted once it stops. Another option would be to introduce a new flag for virDomainDestroyFlags() that makes it wait until the process dies instead of giving up after 15 seconds.
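The event-handler idea above can be sketched roughly as follows. This is a minimal illustration only, not Nova's actual driver code: `DeletionTracker`, `request_delete`, and `on_lifecycle_event` are made-up names, and a plain integer stands in for libvirt's real VIR_DOMAIN_EVENT_STOPPED constant. The point is just to show cleanup being deferred until the domain is observed to stop, instead of being skipped when destroy times out.

```python
# Sketch: defer the undefine/cleanup step to a lifecycle-event handler
# instead of doing it inline after virDomainDestroyFlags() returns.
# All names here are illustrative, not taken from Nova or libvirt-python.

VIR_DOMAIN_EVENT_STOPPED = 5  # stand-in for the real libvirt constant


class DeletionTracker:
    """Remembers which guests are pending deletion so that an event
    handler can finish the job once the qemu process finally dies."""

    def __init__(self):
        self._pending = set()

    def request_delete(self, domain_name, destroy):
        """Attempt the destroy; on failure, leave the guest queued."""
        self._pending.add(domain_name)
        try:
            destroy()  # may time out while the host is overloaded
        except RuntimeError:
            # Do NOT give up: the guest stays in _pending and the
            # lifecycle-event handler completes the deletion later.
            pass

    def on_lifecycle_event(self, domain_name, event, undefine):
        """Called from the event loop when a domain changes state."""
        if event == VIR_DOMAIN_EVENT_STOPPED and domain_name in self._pending:
            undefine()
            self._pending.discard(domain_name)

    def is_pending(self, domain_name):
        return domain_name in self._pending
```

In the failing scenario from the logs, `destroy()` raises because SIGTERM/SIGKILL timed out; the guest stays queued, and when the STOPPED event eventually arrives the `undefine` callback runs and the deletion completes rather than leaving the guest in ShutOff.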
Dan, can you take a look at this and see if nova's libvirt driver should be changed per Jiri's comment #4?
It seems Nova could simply retry the destroy operation, and/or undefine the guest regardless of the destroy result.
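The retry suggestion can be sketched like this. `destroy_with_retry` is a hypothetical helper, not a Nova API; real code would pass a callable that invokes `destroyFlags()` on a libvirt domain object and would catch libvirt's own error type rather than `RuntimeError`.

```python
import time


def destroy_with_retry(destroy, attempts=5, delay=1.0, sleep=time.sleep):
    """Retry a destroy callable that may fail while the host is too
    loaded for the qemu process to die within libvirt's timeout.

    Re-raises the last error if every attempt fails, so the caller
    can still surface the failure instead of silently giving up.
    """
    for attempt in range(1, attempts + 1):
        try:
            destroy()
            return True
        except RuntimeError:
            if attempt == attempts:
                raise
            sleep(delay * attempt)  # simple linear backoff between tries
    return False
```

On an overloaded host the first attempt or two may hit the 15-second timeout, but once the process reaper catches up a later attempt succeeds and deletion can proceed normally.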
By email we discussed whether or not this was still an issue with a newer version of Nova. I don't think I heard back on that. If it is, I think this is something that should just be tracked upstream. Please report a bug on launchpad if you're able to reproduce. Thanks!
Eduard, we haven't received any further information from you about the customer's issue in response to Dan's comment. As a result I am closing this bug.
A bug has been opened with the upstream OpenStack community: https://bugs.launchpad.net/nova/+bug/1387950