Bug 1789339

Summary: [OSP-10] Wait longer before issuing SIGKILL on instance destroy, to handle "libvirtError: Failed to terminate process <pid> with SIGKILL: Device or resource busy"
Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Status: CLOSED CANTFIX
Severity: medium
Priority: medium
Hardware: x86_64
OS: Linux
Reporter: Andreas Karis <akaris>
Assignee: OSP DFG:Compute <osp-dfg-compute>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: astupnik, dasmith, ebarrera, eglynn, jhakimra, kchamart, lyarwood, mbooth, mflusche, nova-maint, sbauza, sgordon, smykhail, stephenfin, vromanso, vumrao
Keywords: Reopened
Clone Of: 1636190
Bug Depends On: 1636190, 1723881
Bug Blocks: 1759125
Last Closed: 2020-12-11 14:55:23 UTC

Comment 2 Andreas Karis 2020-01-09 12:02:27 UTC
Hi,

Cloning https://bugzilla.redhat.com/show_bug.cgi?id=1636190 and forwarding this here from the customer:

---------------------------------------------------------------------------------------------

Created By: Igor Tiunov  (1/9/2020 9:40 AM)

I don't agree with this errata, and I will try to challenge the proposed solution. Perhaps I will succeed.

The first timeout was implemented on the libvirt side [1]. When libvirt kills the QEMU process, it sends SIGTERM first and waits 10 seconds. Those 10 seconds were chosen arbitrarily, without any justification.
There is another timeout already implemented for the hostdev problem [2], which is 2 seconds, for some reason.

As you can see from the comments in the source code, there are already several magic numbers [3] (2 seconds, 10 seconds, 30 seconds), all intended to wait for process termination. We never know how long a process needs to finish its I/O, so the purpose of all these numbers is to keep the CI tests green, not to solve the core issue.

Moving on, we have another magic number on the nova side [4]: the nova-compute service will retry three more times to stop the instance. As you can see, these three retries were introduced because of a CI issue, not because of real problems from the field.

Now an engineer suggests a workaround [5], and this change increases the timeout up to six times. So we have another magic number, pulled out of the sky and unrelated to reality.

Suppose you receive another such request this year; will the solution then be an increase to up to 9 times?

We have found that the solution can be a graceful shutdown of the instance, so that all I/O processes are stopped inside the VM's operating system. Such a graceful shutdown is implemented by the QEMU guest agent [6].

If the number of magic numbers in the project code keeps growing, we get an over-complicated system of workarounds rather than a well-thought-out design.

I hope I've made my point, that you'll pass it on to the engineers and project managers, and that we'll find the right solution now rather than an endless increase in timeouts.

[1] https://github.com/libvirt/libvirt/commit/9a4e4b942df0474503e7524ea427351a46c0eabe
[2] https://github.com/libvirt/libvirt/commit/be2ca0444728edd12a000653d3693d68a5c9102f
[3] https://github.com/libvirt/libvirt/blob/master/src/util/virprocess.c#L359
[4] https://review.opendev.org/#/c/181781/
[5] https://review.opendev.org/#/c/639091/
[6] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.3/html/virtual_machine_management_guide/installing_guest_agents_and_drivers_linux


---------------------------------------------------------------------------------------------
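
To make the escalation the customer describes concrete, here is a minimal Python sketch of the SIGTERM-then-SIGKILL pattern with bounded waits. This is not libvirt's implementation (the real code is C, in virprocess.c); the timeout constants are placeholders standing in for the "magic numbers" under discussion:

~~~
import os
import signal
import time

# Placeholder timeouts standing in for the magic numbers discussed above.
SIGTERM_WAIT = 10   # seconds to wait after SIGTERM
SIGKILL_WAIT = 30   # seconds to wait after SIGKILL


def _is_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    return True


def kill_with_escalation(pid):
    """Send SIGTERM, wait, then escalate to SIGKILL and wait again."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + SIGTERM_WAIT
    while time.monotonic() < deadline:
        if not _is_alive(pid):
            return True
        time.sleep(0.2)

    os.kill(pid, signal.SIGKILL)
    deadline = time.monotonic() + SIGKILL_WAIT
    while time.monotonic() < deadline:
        if not _is_alive(pid):
            return True
        time.sleep(0.2)

    # Still alive: typically the process is stuck in uninterruptible sleep
    # waiting for I/O, which is exactly the failure mode in this bug.
    return False
~~~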

Comment 3 Andreas Karis 2020-01-09 12:21:43 UTC
One correction from me, though, regarding the customer's comment about the 6-retry wait timeout for nova:
https://review.opendev.org/#/c/639091/11/nova/virt/libvirt/driver.py
~~~
# If the host has this libvirt version, then we skip the retry loop of
# instance destroy() call, as libvirt itself increased the wait time
# before the SIGKILL signal takes effect.
MIN_LIBVIRT_BETTER_SIGKILL_HANDLING = (4, 7, 0)
(...)
                        # TODO(kchamart): Once MIN_LIBVIRT_VERSION
                        # reaches v4.7.0, (a) rewrite the above note,
                        # and (b) remove the following code that retries
                        # _destroy() API call (which gives SIGKILL 30
                        # seconds to take effect) -- because from v4.7.0
                        # onwards, libvirt _automatically_ increases the
                        # timeout to 30 seconds.  This was added in the
                        # following libvirt commits:
                        #
                        #   - 9a4e4b942 (process: wait longer 5->30s on
                        #     hard shutdown)
                        #
                        #   - be2ca0444 (process: wait longer on kill
                        #     per assigned Hostdev)
~~~

We can see from the code that _destroy() only calls itself if the minimum libvirt version is not detected:
~~~
    def _destroy(self, instance, attempt=1):
 (...)
                        with excutils.save_and_reraise_exception() as ctxt:
                            if not self._host.has_min_version(
                                    MIN_LIBVIRT_BETTER_SIGKILL_HANDLING):
                                LOG.warning('Error from libvirt during '
                                            'destroy. Code=%(errcode)s '
                                            'Error=%(e)s; attempt '
                                            '%(attempt)d of 6 ',
                                            {'errcode': errcode, 'e': e,
                                             'attempt': attempt},
                                            instance=instance)
                                # Try up to 6 times before giving up.
                                if attempt < 6:
                                    ctxt.reraise = False
                                    self._destroy(instance, attempt + 1)
                                    return
~~~

Starting with libvirt 4.7.0, libvirt itself waits longer ([1], [2]), and in that case nova will not retry:
[1] https://github.com/libvirt/libvirt/commit/9a4e4b942df0474503e7524ea427351a46c0eabe
[2] https://github.com/libvirt/libvirt/commit/be2ca0444728edd12a000653d3693d68a5c9102f

So, starting with libvirt 4.7.0, nova only waits for libvirt once and does not repeatedly call _destroy(). Before libvirt 4.7.0, nova will now retry 6 times instead of 3.
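
In rough pseudocode terms (this is a sketch, not nova's actual code), the decision being made is:

~~~
# Sketch of the retry decision described above; not nova's actual code.
MIN_LIBVIRT_BETTER_SIGKILL_HANDLING = (4, 7, 0)
MAX_DESTROY_ATTEMPTS = 6


def should_retry_destroy(libvirt_version, attempt):
    # From libvirt 4.7.0 onwards libvirt itself gives SIGKILL longer to take
    # effect, so nova does not add its own retry loop on top.
    if libvirt_version >= MIN_LIBVIRT_BETTER_SIGKILL_HANDLING:
        return False
    # On older libvirt, retry the _destroy() call up to 6 times in total.
    return attempt < MAX_DESTROY_ATTEMPTS
~~~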

I don't think this changes the customer's point, though: we are just using and tuning timeouts to tape things together instead of providing a real solution.

Comment 4 Andreas Karis 2020-01-10 09:10:03 UTC
Hi,

The customer has found that the solution can be a graceful shutdown of the instance, so that all I/O processes are stopped inside the VM's operating system. Such a graceful shutdown is implemented by the QEMU guest agent.

This fixes the issue in their environment.
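
For reference, a guest-agent-assisted graceful shutdown can be requested through libvirt-python roughly as sketched below. This is not what nova does internally; it assumes the qemu-guest-agent is installed in the guest and the agent channel is configured in the domain XML, and the domain name is just an example:

~~~
import time

import libvirt

# Assumption: connect to the local system libvirtd.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('my-instance')  # example domain name

# Ask the guest agent (rather than ACPI) to shut the guest down, so the
# guest OS can flush its I/O before the QEMU process goes away.
dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)

# Wait for the domain to actually stop.
while dom.isActive():
    time.sleep(1)

conn.close()
~~~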

- Andreas

Comment 5 Artom Lifshitz 2020-09-23 15:55:23 UTC
(In reply to Andreas Karis from comment #4)
> Hi,
> 
> The customer has found that the solution can be the graceful shutdown of
> instance, so, all I/O processes killed inside the VM operating system. Such
> a graceful shutdown implemented by the qemu guest agent.
> 
> This fixes the issue in their environment.

So they're asking for a qemu/libvirt feature wherein libvirt waits for the qemu guest agent to report that all guest IO has stopped (how would the agent even know that?) before killing the VM? This would presumably be configured by some XML that the orchestrator (Nova in this case) would pass to libvirt?

Comment 7 Artom Lifshitz 2020-11-05 00:53:29 UTC
I have closed this bug as it has been waiting for more info for at least 4 weeks. We only do this to ensure that we don't accumulate stale bugs which can't be addressed. If you are able to provide the requested information, please feel free to re-open this bug.

Comment 9 Artom Lifshitz 2020-11-27 16:17:02 UTC
There isn't much we can do on the Nova side if libvirt is unable to kill the guest's qemu process with a SIGKILL. If the guest kernel is stuck sleeping, waiting for IO that will never arrive, no amount of retries on either libvirt's or Nova's part can help. When Nova hits the retry limit, the instance should go into the ERROR state, at which point the recovery procedure is to reset its state and attempt the reboot/delete again.
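
For completeness, that recovery procedure can be scripted roughly like this with python-novaclient (a sketch only; the auth options and the instance UUID are placeholders for your own environment):

~~~
from keystoneauth1 import loading, session
from novaclient import client

# Placeholder credentials; in practice these come from your cloud's RC file.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://keystone.example.com:5000/v3',
                                username='admin',
                                password='secret',
                                project_name='admin',
                                user_domain_name='Default',
                                project_domain_name='Default')
nova = client.Client('2.1', session=session.Session(auth=auth))

server_id = '11111111-2222-3333-4444-555555555555'  # placeholder UUID

# Reset the instance out of the ERROR state, then retry the delete.
nova.servers.reset_state(server_id, 'active')
nova.servers.delete(server_id)
~~~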

I suppose in this case the guest's IO should eventually die down (ie, it's not a deadlock), so we could make the Nova retry limit configurable, but that'd be a bandaid at best. There's no good way to address the root cause.

Comment 10 Artom Lifshitz 2020-12-11 14:55:23 UTC
(In reply to Artom Lifshitz from comment #9)
> I suppose in this case the guest's IO should eventually die down (ie, it's
> not a deadlock), so we could make the Nova retry limit configurable, but
> that'd be a bandaid at best. There's no good way to address the root cause.

So for soft shutdown this already exists, in the form of the [DEFAULT]/shutdown_timeout option, which can also be specified per image with the `os_shutdown_timeout` image property. It gets passed as the `timeout` parameter to the libvirt driver's power_off() method. The way it is used is a bit odd (it is the number of times Nova sleeps for 1 second before retrying the soft shutdown of the guest), but the effect is the same. In other words, in deployments where the workloads are known to generate a lot of IO, [DEFAULT]/shutdown_timeout (or `os_shutdown_timeout` as an image property) can be increased to accommodate guests that take longer to flush their IO before shutting down.
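
As an illustration of that pattern (this is not nova's actual power_off() code, and `guest` is a hypothetical wrapper around the libvirt domain), the option effectively bounds a loop like this:

~~~
import time


def wait_for_soft_shutdown(guest, timeout):
    """Ask the guest to shut down, then poll once per second for up to
    `timeout` seconds; the caller falls back to a hard destroy on False."""
    guest.shutdown()
    for _ in range(timeout):
        if not guest.is_active():
            return True   # the guest shut itself down cleanly
        time.sleep(1)
    return False          # timed out; a hard destroy follows
~~~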

I'm going to close this as CANTFIX (to indicate that Nova has limited options when it comes to guests with heavy IO). Feel free to re-open if the concerns haven't been addressed properly.

Comment 11 Red Hat Bugzilla 2023-09-18 00:19:39 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days