Description of problem:
VM shutdown / reboot failed with the following error and left the instance in ERROR state:

  Error: Failed to launch instance "vminfo.casl.gov": Please try again later [Error: Failed to terminate process 4260 with SIGKILL: Device or resource busy].

Version-Release number of selected component (if applicable):
openstack-nova-compute-2014.1.1-4.el7ost.noarch

How reproducible:
Not reproducible after the compute node was rebooted on the customer site.

Steps to Reproduce:
1.
2.
3.

Actual results:
Instance shutdown / reboot fails and leaves the instance in ERROR state.

Expected results:
Instance shutdown / reboot completes successfully.

Additional info:
A similar bug was raised for RHOS 6: https://bugzilla.redhat.com/show_bug.cgi?id=1188609

Noticed the following on the compute node:

  # nova show 4b4f943b-8b27-4651-bc67-b6e2f14dbd07 | grep fault
  | fault | {"message": "Failed to terminate process 4260 with SIGKILL: Device or resource busy", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 290, in decorated_function |

Notice a defunct qemu process:

  # ps -f -p 108082
  UID   PID   PPID  C  STIME  TTY  TIME        CMD
  qemu  4260  1     3  2014   ?    2-12:49:36  [qemu-kvm] <defunct>

Also notice these syslog messages on the hypervisor console, although it is not clear whether they are related:

  Message from syslogd@comp01 at Mar 3 11:41:53 ...
   kernel:BUG: soft lockup - CPU#14 stuck for 22s! [ovs-vswitchd:2124]
A few things:

- See the rationale here for _why_ a SIGKILL (kill unconditionally; the receiving process will not get a chance to clean up) is issued: https://bugzilla.redhat.com/show_bug.cgi?id=1188609#c9

- From the bug description, the reporter says it's not clearly reproducible. So, effectively, this bug is in NEEDINFO until a proper reproducer is provided with contextual libvirt and Nova debug logs.

- If it's reproducible consistently, when obtaining logs please follow the steps described in https://bugzilla.redhat.com/show_bug.cgi?id=1188609#c4, with one change: the 'log_filters' should be as below (ensure it is on a single line):

  log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util 1:qemu_monitor"
FYI, the EBUSY error code is actually one that's reported by libvirt when the process fails to die in an acceptable amount of time. The EBUSY isn't directly related to anything in the OS / storage stack. There are two reasons why this might happen:

- The host is so overloaded that the kernel was not able to clean up the process in the time that libvirt was prepared to wait. If this is the case, the process should eventually go away on its own after a short while longer and everything should return to normal.

- There is some problem causing the process to get stuck in an uninterruptible wait state. This is usually due to something going wrong in the storage stack, causing some I/O read/write operation to hang in kernel space. In this case the process will stay around in the zombie state forever, or until the storage problem is resolved.

Assuming the defunct process is not going away of its own accord, the second scenario sounds more likely here. This isn't really a bug in the shutdown / reboot call in Nova or libvirt: there's nothing they can do if the process is stuck in kernel space.
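As a quick way to tell the two scenarios apart on the hypervisor, the stuck process's kernel state can be read from /proc. Below is a minimal sketch (not part of any Nova/libvirt tooling); the PID is simply the one quoted in the report:

import sys

def proc_state(pid):
    """Return the single-character state field from /proc/<pid>/stat."""
    with open('/proc/%d/stat' % pid) as f:
        # Format is "pid (comm) state ...". The comm field may contain
        # spaces, so split after the closing parenthesis.
        return f.read().rsplit(')', 1)[1].split()[0]

if __name__ == '__main__':
    pid = 4260  # PID of the defunct qemu-kvm process from the report
    state = proc_state(pid)
    if state == 'D':
        print("PID %d is in uninterruptible sleep (stuck in kernel space, "
              "typically on I/O)" % pid)
    elif state == 'Z':
        print("PID %d is defunct/zombie" % pid)
    else:
        print("PID %d is in state %s" % (pid, state))
    sys.exit(0)

A process that stays in 'D' (or remains defunct) across repeated checks points at the second scenario described above.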
I'm inclined to close this bug as "CANTFIX" per the analysis in comment #13; note specifically the last paragraph there. If this can be reliably triggered by a Nova reproducer, feel free to reopen it.
Also, there's a fix merged upstream (which I also backported to the upstream stable/kilo branch) that should help alleviate this problem once it makes it into RHOS Nova as part of the next rebase:

commit dc6af6bf861b510834122aa75750fd784578e197
Author: Matt Riedemann <mriedem.com>
Date:   Sun May 10 18:46:37 2015 -0700

    libvirt: handle code=38 + sigkill (ebusy) in destroy()

    Handle the libvirt error during destroy when the sigkill fails due to
    an EBUSY. This is taken from a comment by danpb in the bug report as
    a potential workaround.

    Co-authored-by: Daniel Berrange (berrange)
    Closes-Bug: #1353939

    Conflicts:
            nova/tests/unit/virt/libvirt/test_driver.py

    NOTE (kashyapc): 'stable/kilo' branch doesn't have the 'libvirt_guest'
    object, so adjust the below unit tests accordingly:

        test_private_destroy_ebusy_timeout
        test_private_destroy_ebusy_multiple_attempt_ok

    Change-Id: I128bf6b939fbbc85df521fd3fe23c3c6f93b1b2c
    (cherry picked from commit 3907867601d1044eaadebff68a590d176abff6cf)
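For context, the idea behind the fix is roughly: catch the libvirt failure (VIR_ERR_SYSTEM_ERROR, i.e. code=38, carrying EBUSY) raised by destroy() and retry a few times before giving up, instead of immediately failing the instance. The snippet below is only an illustrative sketch of that idea, not the actual Nova patch; the function name, retry count and wait time are made up for the example:

import errno
import time

import libvirt  # python-libvirt bindings


def destroy_with_ebusy_retry(domain, attempts=3, wait=10):
    """Sketch: retry domain.destroy() when libvirt reports that SIGKILL
    failed with EBUSY (VIR_ERR_SYSTEM_ERROR / code=38).
    """
    for attempt in range(1, attempts + 1):
        try:
            domain.destroy()
            return
        except libvirt.libvirtError as e:
            errcode = e.get_error_code()
            # libvirt raises this when the qemu process did not die after
            # SIGTERM + SIGKILL within the time libvirt was prepared to wait.
            if (errcode == libvirt.VIR_ERR_SYSTEM_ERROR
                    and e.get_int1() == errno.EBUSY
                    and attempt < attempts):
                time.sleep(wait)
                continue
            raise

Here 'domain' would be a libvirt.virDomain object obtained from an open connection. If the process is genuinely stuck in kernel space (comment #13's second scenario), the retries will still fail; the fix only helps the transient "host too busy" case.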
Verified as follows: no errors observed while shutting down / rebooting instances.

**************
Version
**************

[root@rhos-compute-node-02 nova(keystone_admin)]# yum list installed | grep openstack-nova
openstack-nova-api.noarch        2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-cert.noarch       2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-common.noarch     2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-compute.noarch    2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-conductor.noarch  2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-console.noarch    2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-novncproxy.noarch 2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-scheduler.noarch  2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle

*******
Logs
*******

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | ACTIVE | -          | Running     | public=172.24.4.229 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova stop vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+---------+------------+-------------+---------------------+
| ID                                   | Name | Status  | Task State | Power State | Networks            |
+--------------------------------------+------+---------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | SHUTOFF | -          | Shutdown    | public=172.24.4.229 |
+--------------------------------------+------+---------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova start vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | ACTIVE | -          | Running     | public=172.24.4.229 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova suspend vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+-----------+------------+-------------+---------------------+
| ID                                   | Name | Status    | Task State | Power State | Networks            |
+--------------------------------------+------+-----------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | SUSPENDED | -          | Shutdown    | public=172.24.4.229 |
+--------------------------------------+------+-----------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova resume vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | ACTIVE | -          | Running     | public=172.24.4.229 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 6     instance-00000003              running

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh shutdown 6
Domain 6 is being shutdown

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000003              shut off

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh start instance-00000003
Domain instance-00000003 started

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 10    instance-00000003              running

[root@rhos-compute-node-02 nova(keystone_admin)]# grep "SIGKILL " /var/log/nova/*
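For anyone re-running this check, the same stop/start cycle could also be scripted with python-novaclient rather than typed interactively. This is only a hypothetical sketch: the credentials and auth URL below are placeholders, and the exact client constructor arguments vary between novaclient releases.

import time

from novaclient import client  # python-novaclient

# Placeholder credentials/endpoint; adjust for the environment under test.
nova = client.Client('2', 'admin', 'PASSWORD', 'admin',
                     'http://keystone.example.com:5000/v2.0')


def wait_for_status(server_id, wanted, timeout=120):
    """Poll the server until it reaches the wanted status, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = nova.servers.get(server_id).status
        if status == wanted:
            return status
        if status == 'ERROR':
            raise RuntimeError('server %s went to ERROR' % server_id)
        time.sleep(5)
    raise RuntimeError('timed out waiting for status %s' % wanted)


server = nova.servers.find(name='vm1')
nova.servers.stop(server)
wait_for_status(server.id, 'SHUTOFF')
nova.servers.start(server)
wait_for_status(server.id, 'ACTIVE')
print('stop/start cycle completed without the instance entering ERROR')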
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0361.html