Bug 1723881 - [RHOS-15] User configurable time out for instance destroy to prevent error: libvirtError: Failed to terminate process <pid> with SIGKILL: Device or resource busy
Summary: [RHOS-15] User configurable time out for instance destroy to prevent error: ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 15.0 (Stein)
Assignee: Kashyap Chamarthy
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks: 1636190 1759125 1789339
Reported: 2019-06-25 15:39 UTC by Kashyap Chamarthy
Modified: 2023-03-21 19:18 UTC (History)

Fixed In Version: openstack-nova-19.0.2-0.20190701170413.b01bc2f.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-21 11:23:34 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 639091 0 None MERGED libvirt: Rework 'EBUSY' (SIGKILL) error handling code path 2020-03-24 16:14:04 UTC
OpenStack gerrit 667389 0 None MERGED libvirt: Rework 'EBUSY' (SIGKILL) error handling code path 2020-03-24 16:14:00 UTC
Red Hat Product Errata RHEA-2019:2811 0 None None None 2019-09-21 11:23:51 UTC

Description Kashyap Chamarthy 2019-06-25 15:39:31 UTC
[This bug is a clone of the RHOS-10 bug #1636190]

-------------------------------------------------------------------------------
Description of problem:
In an environment with large compute nodes (3 TB of RAM) and high instance add/delete churn, the current timeouts for instance destroy may not be sufficient. The destroy timeouts should be user-configurable to support such environments.

This environment sees the following instance destroy failures daily.

2018-09-25 14:50:25.251 250438 WARNING nova.virt.libvirt.driver [req-9c05c3c1-ab08-4a14-b036-ad10b987b8e5 ad64ce5e9890b9596163edd10c8a4da2bca62c4f84f720b59cf30d20903c60ab b297db4812004ed9938eae3c776467ad - - -] [instance: ee0cd616-c3da-41fa-9af4-4c3173f85764] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 203650 with SIGKILL: Device or resource busy; attempt 3 of 3

The instances eventually terminate, so allowing a longer timeout would prevent this error.

References:

https://bugzilla.redhat.com/show_bug.cgi?id=1205647
https://github.com/libvirt/libvirt/blob/9a4e4b942df0474503e7524ea427351a46c0eabe/src/util/virprocess.c#L349
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L843

[...]
-------------------------------------------------------------------------------
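To gauge how often the failure above occurs on a compute node, the WARNING line can be matched mechanically. The following is only a sketch: the pattern is modelled on the single log line quoted in the description, the names DESTROY_EBUSY and parse_destroy_failure are hypothetical, and other Nova releases may word the message differently.

```python
import re

# Pattern modelled on the one WARNING line quoted in this bug; it is an
# assumption that other occurrences follow the same wording.
DESTROY_EBUSY = re.compile(
    r"\[instance: (?P<instance>[0-9a-f-]{36})\] "
    r"Error from libvirt during destroy\. "
    r"Code=(?P<code>\d+) "
    r"Error=Failed to terminate process (?P<pid>\d+) with SIGKILL: "
    r"Device or resource busy; "
    r"attempt (?P<attempt>\d+) of (?P<attempts>\d+)"
)


def parse_destroy_failure(line):
    """Return the fields of an EBUSY destroy warning, or None if no match."""
    match = DESTROY_EBUSY.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("code", "pid", "attempt", "attempts"):
        fields[key] = int(fields[key])
    # True when this was the last attempt Nova was going to make.
    fields["exhausted"] = fields["attempt"] == fields["attempts"]
    return fields


# Excerpt of the log line from the description (request prefix omitted).
SAMPLE = (
    "[instance: ee0cd616-c3da-41fa-9af4-4c3173f85764] "
    "Error from libvirt during destroy. Code=38 "
    "Error=Failed to terminate process 203650 with SIGKILL: "
    "Device or resource busy; attempt 3 of 3"
)

if __name__ == "__main__":
    print(parse_destroy_failure(SAMPLE))
```

Counting the lines where "exhausted" is true gives the number of destroy calls that ran out of retries, which is the case a longer timeout is meant to avoid.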

Comment 2 Kashyap Chamarthy 2019-06-25 15:41:42 UTC
This is already merged in this commit upstream[*]:

    commit 10d50ca4e210039aeae84cb9bd5d18895948af54
    Author: Kashyap Chamarthy <kchamart>
    Date:   Mon Feb 25 13:26:24 2019 +0100

        libvirt: Rework 'EBUSY' (SIGKILL) error handling code path
    
        Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
        _destroy()) handled the case where a QEMU process "refuses to die" within
        a given timeout period set by libvirt.
    
        Originally, libvirt sent SIGTERM (allowing the process to clean-up
        resources), then waited 10 seconds, if the guest didn't go away.  Then
        it sent, the more lethal, SIGKILL and waited another 5 seconds for it to
        take effect.
    
        From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
        for a guest hard shutdown to complete.  It now waits for 30 seconds for
        SIGKILL to work (instead of 5).  Also, additional wait time is added if
        there are assigned PCI devices, as some of those tend to slow things
        down.
    
        In this change:
    
          - Increment the counter to retry the _destroy() call from 3 to 6, thus
            increasing the total time from 15 to 30 seconds, before SIGKILL
            takes effect.  And it matches the (more graceful) behaviour of
            libvirt v4.7.0.  This also gives breathing room for Nova instances
            running in environments with large compute nodes with high instance
            creation or delete churn, where the current timeout may not be
            sufficient.
    
          - Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
            than 4.7.0.
    
        [1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
            (process: wait longer 5->30s on hard shutdown)
        [2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
            wait longer on kill per assigned Hostdev")
    
        Related-bug: #1353939
    
        Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
        Signed-off-by: Kashyap Chamarthy <kchamart>


[*] https://opendev.org/openstack/nova/commit/10d50ca4e2
    — "libvirt: Rework 'EBUSY' (SIGKILL) error handling code path"
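
The two bullet points of the commit message can be illustrated with a small, self-contained sketch. This is not the Nova code: DestroyBusyError, destroy_with_retry and FlakyGuest are hypothetical stand-ins, and libvirt versions are represented as plain tuples. The sketch assumes that each destroy() call blocks for libvirt's own SIGKILL wait (5 seconds before v4.7.0), so six attempts give roughly the 30 seconds mentioned above without any extra sleep.

```python
# All names below are hypothetical; this is not Nova's implementation.
LIBVIRT_LONG_SIGKILL_WAIT = (4, 7, 0)  # from here on libvirt waits 30s itself
MAX_DESTROY_ATTEMPTS = 6               # raised from 3 by the change


class DestroyBusyError(Exception):
    """Stand-in for the libvirt error (Code=38, 'Device or resource busy')."""


def destroy_with_retry(destroy, libvirt_version,
                       max_attempts=MAX_DESTROY_ATTEMPTS):
    """Call destroy(); on EBUSY, retry only when libvirt is older than 4.7.0.

    Returns the number of attempts made. Re-raises DestroyBusyError once the
    attempts are used up, or immediately on libvirt 4.7.0 and later.
    """
    if libvirt_version < LIBVIRT_LONG_SIGKILL_WAIT:
        attempts_allowed = max_attempts
    else:
        attempts_allowed = 1
    for attempt in range(1, attempts_allowed + 1):
        try:
            destroy()
            return attempt
        except DestroyBusyError:
            if attempt == attempts_allowed:
                raise


class FlakyGuest:
    """Test double: destroy() fails with EBUSY a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def destroy(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise DestroyBusyError("Failed to terminate process with SIGKILL")


if __name__ == "__main__":
    guest = FlakyGuest(failures=4)
    print(destroy_with_retry(guest.destroy, (4, 5, 0)))
```

With an older libvirt, a guest that refuses to die four times is destroyed on the fifth attempt; with the old limit of three attempts the same guest would have produced the "attempt 3 of 3" warning from the description.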

Comment 8 errata-xmlrpc 2019-09-21 11:23:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

