Bug 1636190 - [OSP-10] Wait longer before issuing SIGKILL on instance destroy, to handle "libvirtError: Failed to terminate process <pid> with SIGKILL: Device or resource busy"
Summary: [OSP-10] Wait longer before issuing SIGKILL on instance destroy, to handle "libvirtError: Failed to terminate process <pid> with SIGKILL: Device or resource busy"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: z14
Target Release: 10.0 (Newton)
Assignee: Kashyap Chamarthy
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 1489980
Depends On: 1723881
Blocks: 1759125 1789339
Reported: 2018-10-04 16:44 UTC by Matt Flusche
Modified: 2024-03-25 15:08 UTC
CC List: 12 users

Fixed In Version: openstack-nova-14.1.0-58.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1759125 1789339
Environment:
Last Closed: 2019-12-17 16:52:34 UTC
Target Upstream Version:
Embargoed:




Links
  OpenStack gerrit 639091: MERGED - "libvirt: Rework 'EBUSY' (SIGKILL) error handling code path" (last updated 2021-01-28 02:14:07 UTC)
  Red Hat Issue Tracker OSP-23387 (last updated 2023-03-21 19:04:38 UTC)
  Red Hat Product Errata RHBA-2019:4299 (last updated 2019-12-17 16:52:45 UTC)

Description Matt Flusche 2018-10-04 16:44:01 UTC
Description of problem:
In an environment with large compute nodes (3 TB of RAM) and high instance add/delete churn, the current wait timeouts for instance destroy may not be sufficient.  The destroy timeouts should be user-configurable to support such an environment.

This environment sees the following instance destroy failures daily.

2018-09-25 14:50:25.251 250438 WARNING nova.virt.libvirt.driver [req-9c05c3c1-ab08-4a14-b036-ad10b987b8e5 ad64ce5e9890b9596163edd10c8a4da2bca62c4f84f720b59cf30d20903c60ab b297db4812004ed9938eae3c776467ad - - -] [instance: ee0cd616-c3da-41fa-9af4-4c3173f85764] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 203650 with SIGKILL: Device or resource busy; attempt 3 of 3

The instances eventually terminate, and allowing a longer timeout would prevent this error.
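
For illustration, here is a minimal sketch (not the actual Nova driver code; the helper name and the back-off sleep are hypothetical) of the destroy-and-retry pattern around this error, using the standard libvirt Python bindings. The hard-coded attempt count is the limit hit in the warning above, and it is what this report asks to make longer or configurable:

    import time
    import libvirt

    MAX_DESTROY_ATTEMPTS = 3   # matches "attempt 3 of 3" in the warning above

    def destroy_with_retries(dom, max_attempts=MAX_DESTROY_ATTEMPTS):
        """Call virDomainDestroy, retrying while libvirt reports that the
        QEMU process could not be terminated yet (SIGKILL + EBUSY)."""
        for attempt in range(1, max_attempts + 1):
            try:
                dom.destroy()   # blocks while libvirt SIGTERMs, then SIGKILLs, QEMU
                return
            except libvirt.libvirtError as err:
                busy = (err.get_error_code() == libvirt.VIR_ERR_SYSTEM_ERROR
                        and 'Failed to terminate process' in str(err))
                if not busy or attempt == max_attempts:
                    raise
                # The process is still being reaped; back off briefly and retry.
                time.sleep(1)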

References:

https://bugzilla.redhat.com/show_bug.cgi?id=1205647
https://github.com/libvirt/libvirt/blob/9a4e4b942df0474503e7524ea427351a46c0eabe/src/util/virprocess.c#L349
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L843

Version-Release number of selected component (if applicable):
OSP 10
openstack-nova-compute-14.1.0-22.el7ost.noarch
libvirt-daemon-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.3.x86_64

How reproducible:
Daily in this specific environment


Additional info:
I'll provide additional environment details and logs

Comment 2 Matthew Booth 2018-10-11 10:21:10 UTC
Matt's random thoughts: 

* do we have any way to ask libvirt if the shutdown is still in progress and expected to complete eventually?
* are there circumstances in which a shutdown will never complete, but also not fail?
  * i.e. Can we just remove the timeout and just handle failure?

Comment 4 Daniel Berrangé 2018-10-12 12:12:43 UTC
(In reply to Matthew Booth from comment #2)
> Matt's random thoughts: 
> 
> * do we have any way to ask libvirt if the shutdown is still in progress and
> expected to complete eventually?

In theory there is a "Shutting down" state but I don't think we use that in the QEMU driver in libvirt. It just remains "running" until it goes to "shutoff".

A few weeks ago, though, we did majorly increase the time we wait for shutdown to complete in libvirt. Originally we sent SIGTERM, waited 10 seconds, then sent SIGKILL and waited another 5 seconds.

With the new code we wait 30 seconds for SIGKILL to work instead of 5. We also add an even longer wait if there are PCI devices assigned, as some of those slow things down a lot.

commit 9a4e4b942df0474503e7524ea427351a46c0eabe
Author: Christian Ehrhardt <christian.ehrhardt>
Date:   Mon Aug 6 12:10:38 2018 +0200

    process: wait longer 5->30s on hard shutdown
    
    In cases where virProcessKillPainfully already reailizes that
    SIGTERM wasn't enough we are partially on a bad path already.
    Maybe the system is overloaded or having serious trouble to free and
    reap resources in time.
    
    In those case give the SIGKILL that was sent after 10 seconds some more
    time to take effect if force was set (only then we are falling back to
    SIGKILL anyway).
    
    Signed-off-by: Christian Ehrhardt <christian.ehrhardt>
    Reviewed-by: Daniel P. Berrangé <berrange>

commit be2ca0444728edd12a000653d3693d68a5c9102f
Author: Christian Ehrhardt <christian.ehrhardt>
Date:   Thu Aug 2 09:05:18 2018 +0200

    process: wait longer on kill per assigned Hostdev

    
    It was found that in cases with host devices virProcessKillPainfully
    might be able to send signal zero to the target PID for quite a while
    with the process already being gone from /proc/<PID>.
    
    That is due to cleanup and reset of devices which might include a
    secondary bus reset that on top of the actions taken has a 1s delay
    to let the bus settle. Due to that guests with plenty of Host devices
    could easily exceed the default timeouts.
    
    To solve that, this adds an extra delay of 2s per hostdev that is associated
    to a VM.
    
    Reviewed-by: Daniel P. Berrangé <berrange>
    Signed-off-by: Christian Ehrhardt <christian.ehrhardt>



> * are there circumstances in which a shutdown will never complete, but also
> not fail?
>   * i.e. Can we just remove the timeout and just handle failure?

I'm not sure what you mean by "not fail"?

If the process does not die after we send it SIGKILL, then we'll return an error from virDomainDestroy after the timeout (5 secs, now 30 secs).

If the system was merely busy, the QEMU might still die after that. If the QEMU was stuck in kernel space, e.g. due to a dead storage path, it might be stuck forever (until host reboot).

We can't easily distinguish which of these two scenarios applies, but you can't reuse resources for another VM until the original QEMU has gone completely.
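
To put numbers on the timeouts described above, a rough back-of-the-envelope helper (an illustration of comment 4 and the two commits, not actual libvirt code) for the worst-case time libvirt (v4.7.0 and later) spends trying to kill the QEMU process before virDomainDestroy gives up:

    def worst_case_kill_wait(num_hostdevs=0, force=True):
        """Approximate upper bound, in seconds, on how long libvirt waits
        for QEMU to exit before virDomainDestroy fails with EBUSY."""
        sigterm_grace = 10                  # wait after SIGTERM before escalating
        sigkill_grace = 30 if force else 0  # was 5s before commit 9a4e4b94
        hostdev_extra = 2 * num_hostdevs    # commit be2ca044: +2s per assigned hostdev
        return sigterm_grace + sigkill_grace + hostdev_extra

    # e.g. a guest with 4 assigned PCI devices: 10 + 30 + 2*4 = 48 seconds
    print(worst_case_kill_wait(num_hostdevs=4))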

Comment 5 Matthew Booth 2018-10-22 10:21:29 UTC
Can't say I'm a huge fan of this kind of tuning knob, but based on comment 4 I can't think of a better solution. I'll bring it up in the team meeting.

Comment 8 Alex Stupnikov 2018-12-25 09:02:16 UTC
Hello. Could we please have some update on this one? BR, Alex.

Comment 10 Kashyap Chamarthy 2019-02-25 10:29:01 UTC
(In reply to Daniel Berrange from comment #4)

[...]

> We few weeks ago though we did majorly increase the time we wait for
> shutdown to complete in libvirt. Originally we send SIGTERM, then wait 10
> seconds, and sent SIGKILL and wait another 5 seconds.
> 
> With the new code we wait 30 seconds for SIGKILL to work instead of 5. We
> also add even longer wait if there are PCI devices assigned as some of those
> slow things down alot.
> 
> commit 9a4e4b942df0474503e7524ea427351a46c0eabe
> Author: Christian Ehrhardt <christian.ehrhardt>
> Date:   Mon Aug 6 12:10:38 2018 +0200
> 
>     process: wait longer 5->30s on hard shutdown
>     
>     In cases where virProcessKillPainfully already reailizes that
>     SIGTERM wasn't enough we are partially on a bad path already.
>     Maybe the system is overloaded or having serious trouble to free and
>     reap resources in time.
>     
>     In those case give the SIGKILL that was sent after 10 seconds some more
>     time to take effect if force was set (only then we are falling back to
>     SIGKILL anyway).
>     
>     Signed-off-by: Christian Ehrhardt <christian.ehrhardt>
>     Reviewed-by: Daniel P. Berrangé <berrange>
> 
> commit be2ca0444728edd12a000653d3693d68a5c9102f
> Author: Christian Ehrhardt <christian.ehrhardt>
> Date:   Thu Aug 2 09:05:18 2018 +0200
> 
>     process: wait longer on kill per assigned Hostdev
> 
>     
>     It was found that in cases with host devices virProcessKillPainfully
>     might be able to send signal zero to the target PID for quite a while
>     with the process already being gone from /proc/<PID>.
>     
>     That is due to cleanup and reset of devices which might include a
>     secondary bus reset that on top of the actions taken has a 1s delay
>     to let the bus settle. Due to that guests with plenty of Host devices
>     could easily exceed the default timeouts.
>     
>     To solve that, this adds an extra delay of 2s per hostdev that is
> associated
>     to a VM.
>     
>     Reviewed-by: Daniel P. Berrangé <berrange>
>     Signed-off-by: Christian Ehrhardt <christian.ehrhardt>


I wonder if it's reasonable to request backporting the above two 
commits (available from libvirt v4.10.0 onwards) to RHEL 7.6.  FWIW, 
the `diffstat` looks small, and doesn't look very risky to my eyes.

The customer is using RHEL 7.5.  And _assuming_ the new timeouts are 
sufficient, maybe they are willing to update to RHEL 7.6.

Comment 11 Kashyap Chamarthy 2019-02-25 11:07:48 UTC
(In reply to Kashyap Chamarthy from comment #10)

> I wonder if it's a reasonable to request to backport the above two 
> commits (available from libvirt v4.10.0 onwards) to RHEL 7.6. 

I mixed up the libvirt version from which the two libvirt patches 
above are available:

    "v4.10.0" --> "v4.7.0"

[...]


Based on discussion with DanPB and Matt on IRC, we are leaning towards
the following solution:

    Increase the retry counter for calling the destroy() API (when EBUSY
    hits) in Nova from 3 to 6 (so that it matches what libvirt upstream
    does).  We don't want to add yet another config attribute; we already
    have far too many.
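
In terms of the illustrative destroy_with_retries() sketch under the bug description, the agreed direction is simply a larger hard-coded retry budget rather than a new configuration option (a sketch of the idea, not the final merged patch, which is linked in comment 12):

    # Previously 3; bumped so Nova keeps retrying for roughly as long as
    # libvirt itself (>= v4.7.0) is now prepared to wait (see comment 4).
    MAX_DESTROY_ATTEMPTS = 6

    destroy_with_retries(dom, max_attempts=MAX_DESTROY_ATTEMPTS)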

Comment 12 Kashyap Chamarthy 2019-06-03 15:16:00 UTC
The upstream patch has merged:

    https://opendev.org/openstack/nova/commit/10d50ca4e2
    — "libvirt: Rework 'EBUSY' (SIGKILL) error handling code path"

    (https://review.opendev.org/#/c/639091/)

Comment 16 Lee Yarwood 2019-10-28 13:24:12 UTC
*** Bug 1489980 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2019-12-17 16:52:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4299

