Description of problem:
VM shutdown / reboot failed with the following error and left the instance in ERROR state:

  Error: Failed to launch instance "vminfo.casl.gov": Please try again later [Error: Failed to terminate process 4260 with SIGKILL: Device or resource busy].

Version-Release number of selected component (if applicable):
openstack-nova-compute-2014.1.1-4.el7ost.noarch

How reproducible:
Not reproducible after the compute node was rebooted on the customer site.

Steps to Reproduce:
1.
2.
3.

Actual results:
Instance shutdown / reboot fails and leaves the instance in ERROR state.

Expected results:
Instance shutdown / reboot completes successfully.

Additional info:
A similar bug was raised for RHOS 6: https://bugzilla.redhat.com/show_bug.cgi?id=1188609

Noticed the following on the compute node:

  # nova show 4b4f943b-8b27-4651-bc67-b6e2f14dbd07 | grep fault
  | fault | {"message": "Failed to terminate process 4260 with SIGKILL: Device or resource busy", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 290, in decorated_function |

Notice a defunct qemu process:

  # ps -f -p 108082
  UID   PID   PPID  C  STIME  TTY  TIME        CMD
  qemu  4260  1     3  2014   ?    2-12:49:36  [qemu-kvm] <defunct>

Also notice these syslog messages on the hypervisor console, although it is not clear whether they are related:

  Message from syslogd@comp01 at Mar 3 11:41:53 ...
   kernel:BUG: soft lockup - CPU#14 stuck for 22s! [ovs-vswitchd:2124]
A few things:

- See the rationale here for _why_ a SIGKILL (kill unconditionally; the receiving process will not get a chance to clean up) is issued: https://bugzilla.redhat.com/show_bug.cgi?id=1188609#c9

- From the bug description, the reporter says it's not clearly reproducible. So, effectively, this bug is in NEEDINFO until a proper reproducer is provided with contextual libvirt and Nova debug logs.

- If it's reproducible consistently, when obtaining logs please follow the steps described in https://bugzilla.redhat.com/show_bug.cgi?id=1188609#c4, with one change: the 'log_filters' should be as below (ensure it is on a single line):

  log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util 1:qemu_monitor"
FYI, the EBUSY error code is actually one that's reported by libvirt when the process fails to die in an acceptable amount of time. The EBUSY isn't directly related to anything in the OS / storage stack. There are two reasons why this might happen:

- The host is so overloaded that the kernel was not able to clean up the process in the time that libvirt was prepared to wait. If this is the case, the process should eventually go away on its own after a short while longer and everything should return to normal.

- There is some problem causing the process to get stuck in an uninterruptible wait state. This is usually due to something going wrong in the storage stack, causing some I/O read/write operation to hang in kernel space. In this case the process will stay around in the zombie state forever, or until the storage problem is resolved.

Assuming the defunct process is not going away of its own accord, the second scenario sounds more likely here. This isn't really a bug in the shutdown / reboot call in Nova or libvirt: there's nothing they can do if the process is stuck in kernel space.
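As a quick way to tell the two scenarios apart on the hypervisor, the stuck process's kernel state can be read from /proc. Below is a minimal sketch (not part of any Nova/libvirt tooling); the PID is simply the one quoted in the report:

import sys

def proc_state(pid):
    """Return the single-character state field from /proc/<pid>/stat."""
    with open('/proc/%d/stat' % pid) as f:
        # Format is "pid (comm) state ...". The comm field may contain
        # spaces, so split after the closing parenthesis.
        return f.read().rsplit(')', 1)[1].split()[0]

if __name__ == '__main__':
    pid = 4260  # PID of the defunct qemu-kvm process from the report
    state = proc_state(pid)
    if state == 'D':
        print("PID %d is in uninterruptible sleep (stuck in kernel space, "
              "typically on I/O)" % pid)
    elif state == 'Z':
        print("PID %d is defunct/zombie" % pid)
    else:
        print("PID %d is in state %s" % (pid, state))
    sys.exit(0)

A process that stays in 'D' (or remains defunct) across repeated checks points at the second scenario described above.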
I'm inclined to close this bug as "CANTFIX" per the analysis in comment #13; note specifically the last paragraph there. If this can be reliably triggered by a Nova reproducer, feel free to reopen it.
Also, there's a fix merged upstream (which I also backported to the upstream stable/kilo branch) that should help alleviate this problem once it makes it into RHOS Nova as part of the next rebase:

commit dc6af6bf861b510834122aa75750fd784578e197
Author: Matt Riedemann <mriedem.com>
Date:   Sun May 10 18:46:37 2015 -0700

    libvirt: handle code=38 + sigkill (ebusy) in destroy()

    Handle the libvirt error during destroy when the sigkill fails due to
    an EBUSY. This is taken from a comment by danpb in the bug report as
    a potential workaround.

    Co-authored-by: Daniel Berrange (berrange)
    Closes-Bug: #1353939

    Conflicts:
            nova/tests/unit/virt/libvirt/test_driver.py

    NOTE (kashyapc): 'stable/kilo' branch doesn't have the 'libvirt_guest'
    object, so adjust the below unit tests accordingly:

        test_private_destroy_ebusy_timeout
        test_private_destroy_ebusy_multiple_attempt_ok

    Change-Id: I128bf6b939fbbc85df521fd3fe23c3c6f93b1b2c
    (cherry picked from commit 3907867601d1044eaadebff68a590d176abff6cf)
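For context, the idea behind the fix is roughly: catch the libvirt failure (VIR_ERR_SYSTEM_ERROR, i.e. code=38, carrying EBUSY) raised by destroy() and retry a few times before giving up, instead of immediately failing the instance. The snippet below is only an illustrative sketch of that idea, not the actual Nova patch; the function name, retry count and wait time are made up for the example:

import errno
import time

import libvirt  # python-libvirt bindings


def destroy_with_ebusy_retry(domain, attempts=3, wait=10):
    """Sketch: retry domain.destroy() when libvirt reports that SIGKILL
    failed with EBUSY (VIR_ERR_SYSTEM_ERROR / code=38).
    """
    for attempt in range(1, attempts + 1):
        try:
            domain.destroy()
            return
        except libvirt.libvirtError as e:
            errcode = e.get_error_code()
            # libvirt raises this when the qemu process did not die after
            # SIGTERM + SIGKILL within the time libvirt was prepared to wait.
            if (errcode == libvirt.VIR_ERR_SYSTEM_ERROR
                    and e.get_int1() == errno.EBUSY
                    and attempt < attempts):
                time.sleep(wait)
                continue
            raise

Here 'domain' would be a libvirt.virDomain object obtained from an open connection. If the process is genuinely stuck in kernel space (comment #13's second scenario), the retries will still fail; the fix only helps the transient "host too busy" case.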
Verified as follows: no errors observed while shutting down / rebooting instances.

**************
Version
**************

[root@rhos-compute-node-02 nova(keystone_admin)]# yum list installed | grep openstack-nova
openstack-nova-api.noarch        2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-cert.noarch       2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-common.noarch     2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-compute.noarch    2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-conductor.noarch  2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-console.noarch    2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-novncproxy.noarch 2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle
openstack-nova-scheduler.noarch  2014.1.5-27.el7ost  @rhelosp-5.0-el7-puddle

*******
Logs
*******

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | ACTIVE | -          | Running     | public=172.24.4.229 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova stop vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+---------+------------+-------------+---------------------+
| ID                                   | Name | Status  | Task State | Power State | Networks            |
+--------------------------------------+------+---------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | SHUTOFF | -          | Shutdown    | public=172.24.4.229 |
+--------------------------------------+------+---------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova start vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | ACTIVE | -          | Running     | public=172.24.4.229 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova suspend vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+-----------+------------+-------------+---------------------+
| ID                                   | Name | Status    | Task State | Power State | Networks            |
+--------------------------------------+------+-----------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | SUSPENDED | -          | Shutdown    | public=172.24.4.229 |
+--------------------------------------+------+-----------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# nova resume vm1

[root@rhos-compute-node-02 nova(keystone_admin)]# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 2d956abd-d4c6-406f-a627-55455bb32513 | vm1  | ACTIVE | -          | Running     | public=172.24.4.229 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 6     instance-00000003              running

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh shutdown 6
Domain 6 is being shutdown

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     instance-00000003              shut off

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh start instance-00000003
Domain instance-00000003 started

[root@rhos-compute-node-02 nova(keystone_admin)]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 10    instance-00000003              running

[root@rhos-compute-node-02 nova(keystone_admin)]# grep "SIGKILL " /var/log/nova/*
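For anyone re-running this check, the same stop/start cycle could also be scripted with python-novaclient rather than typed interactively. This is only a hypothetical sketch: the credentials and auth URL below are placeholders, and the exact client constructor arguments vary between novaclient releases.

import time

from novaclient import client  # python-novaclient

# Placeholder credentials/endpoint; adjust for the environment under test.
nova = client.Client('2', 'admin', 'PASSWORD', 'admin',
                     'http://keystone.example.com:5000/v2.0')


def wait_for_status(server_id, wanted, timeout=120):
    """Poll the server until it reaches the wanted status, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = nova.servers.get(server_id).status
        if status == wanted:
            return status
        if status == 'ERROR':
            raise RuntimeError('server %s went to ERROR' % server_id)
        time.sleep(5)
    raise RuntimeError('timed out waiting for status %s' % wanted)


server = nova.servers.find(name='vm1')
nova.servers.stop(server)
wait_for_status(server.id, 'SHUTOFF')
nova.servers.start(server)
wait_for_status(server.id, 'ACTIVE')
print('stop/start cycle completed without the instance entering ERROR')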
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0361.html