Bug 1789339
| Summary: | [OSP-10] Wait longer before issuing SIGKILL on instance destroy, to handle "libvirtError: Failed to terminate process <pid> with SIGKILL: Device or resource busy" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Andreas Karis <akaris> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED CANTFIX | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 10.0 (Newton) | CC: | astupnik, dasmith, ebarrera, eglynn, jhakimra, kchamart, lyarwood, mbooth, mflusche, nova-maint, sbauza, sgordon, smykhail, stephenfin, vromanso, vumrao |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1636190 | Environment: | |
| Last Closed: | 2020-12-11 14:55:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1636190, 1723881 | | |
| Bug Blocks: | 1759125 | | |
Comment 2
Andreas Karis
2020-01-09 12:02:27 UTC
Actually one correction from me though wrt the customer's comment about the 6-times wait timeout for nova: https://review.opendev.org/#/c/639091/11/nova/virt/libvirt/driver.py

~~~
# If the host has this libvirt version, then we skip the retry loop of
# instance destroy() call, as libvirt itself increased the wait time
# before the SIGKILL signal takes effect.
MIN_LIBVIRT_BETTER_SIGKILL_HANDLING = (4, 7, 0)

(...)

# TODO(kchamart): Once MIN_LIBVIRT_VERSION
# reaches v4.7.0, (a) rewrite the above note,
# and (b) remove the following code that retries
# _destroy() API call (which gives SIGKILL 30
# seconds to take effect) -- because from v4.7.0
# onwards, libvirt _automatically_ increases the
# timeout to 30 seconds. This was added in the
# following libvirt commits:
#
#   - 9a4e4b942 (process: wait longer 5->30s on
#     hard shutdown)
#
#   - be2ca0444 (process: wait longer on kill
#     per assigned Hostdev)
~~~

We can see from the code that _destroy() only calls itself if the minimum libvirt version is not detected:

~~~
def _destroy(self, instance, attempt=1):
    (...)
            with excutils.save_and_reraise_exception() as ctxt:
                if not self._host.has_min_version(
                        MIN_LIBVIRT_BETTER_SIGKILL_HANDLING):
                    LOG.warning('Error from libvirt during '
                                'destroy. Code=%(errcode)s '
                                'Error=%(e)s; attempt '
                                '%(attempt)d of 6 ',
                                {'errcode': errcode, 'e': e,
                                 'attempt': attempt},
                                instance=instance)
                    # Try up to 6 times before giving up.
                    if attempt < 6:
                        ctxt.reraise = False
                        self._destroy(instance, attempt + 1)
                        return
~~~

Starting with libvirt 4.7.0, libvirt itself waits longer ([1][2]), and in that case nova will not retry:

[1] https://github.com/libvirt/libvirt/commit/9a4e4b942df0474503e7524ea427351a46c0eabe
[2] https://github.com/libvirt/libvirt/commit/be2ca0444728edd12a000653d3693d68a5c9102f

So, starting with libvirt 4.7.0, nova only waits for libvirt once and doesn't repeatedly call _destroy(). Before libvirt 4.7.0, nova will indeed now retry 6 times, instead of 3.

I think, though, that this doesn't change anything about the customer's opinion that we are just using and tuning timeouts to tape this together, instead of having a real solution.

Comment 4
Andreas Karis

Hi,

The customer has found that the solution can be a graceful shutdown of the instance, so that all I/O processes are stopped inside the VM operating system. Such a graceful shutdown is implemented by the qemu guest agent.

This fixes the issue in their environment.

- Andreas

(In reply to Andreas Karis from comment #4)
> Hi,
>
> The customer has found that the solution can be a graceful shutdown of the
> instance, so that all I/O processes are stopped inside the VM operating
> system. Such a graceful shutdown is implemented by the qemu guest agent.
>
> This fixes the issue in their environment.

So they're asking for a qemu/libvirt feature wherein libvirt waits for the qemu guest agent to report that all guest IO has stopped (how would the agent even know that?) before killing the VM? This would presumably be configured by some XML that the orchestrator (Nova in this case) would pass to libvirt?

I have closed this bug as it has been waiting for more info for at least 4 weeks. We only do this to ensure that we don't accumulate stale bugs which can't be addressed. If you are able to provide the requested information, please feel free to re-open this bug.

There isn't much we can do on the Nova side if libvirt is unable to kill the guest's qemu process with a SIGKILL. If the guest kernel is stuck in sleep waiting for IO that will never arrive, no amount of retries on either libvirt's or Nova's part can help.
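For what it's worth, the guest-agent-assisted shutdown the customer described can be exercised directly against libvirt. Below is a minimal, illustrative python-libvirt sketch, not Nova code: the domain name and grace period are made-up values, and it assumes the qemu guest agent is running in the guest and its channel is defined in the domain XML.

~~~
import time

import libvirt  # python-libvirt bindings

DOMAIN_NAME = 'instance-00000001'  # hypothetical libvirt domain name
GRACE_SECONDS = 300                # hypothetical grace period

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName(DOMAIN_NAME)

# Ask the qemu guest agent to shut the guest down from the inside, so
# in-guest processes are stopped and their I/O flushed before the
# domain disappears.
dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)

deadline = time.time() + GRACE_SECONDS
while dom.isActive() and time.time() < deadline:
    time.sleep(1)

if dom.isActive():
    # Last resort: hard-stop the domain (libvirt will SIGTERM, then
    # SIGKILL, the qemu process).
    dom.destroy()
~~~

This only illustrates the ordering (agent shutdown first, hard destroy as a fallback); it does not change the underlying problem that a guest stuck in uninterruptible IO sleep cannot be killed on any timeout.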
When Nova hits the retry limit, the instance should go into error, at which point the recovery procedure is to reset its state and attempt the reboot/delete again.

I suppose in this case the guest's IO should eventually die down (ie, it's not a deadlock), so we could make the Nova retry limit configurable, but that'd be a bandaid at best. There's no good way to address the root cause.

(In reply to Artom Lifshitz from comment #9)
> I suppose in this case the guest's IO should eventually die down (ie, it's
> not a deadlock), so we could make the Nova retry limit configurable, but
> that'd be a bandaid at best. There's no good way to address the root cause.

So for soft shutdown, this already exists, in the form of the [DEFAULT]/shutdown_timeout option, which can also be specified as an image property in the shape of `os_shutdown_timeout`. This gets passed as the `timeout` parameter to the libvirt driver's power_off() method. How it's used is a bit weird - namely, it's the number of times Nova sleeps for 1 second before retrying to soft shutdown the guest - but the effect is the same.

In other words, in deployments where the workloads are known to generate lots of IO, [DEFAULT]/shutdown_timeout (or `os_shutdown_timeout` as an image property) can be increased in order to accommodate guests that take longer to flush their IO before shutting down.

I'm going to close this as CANTFIX (to indicate that Nova has limited options when it comes to guests with heavy IO). Feel free to re-open if the concerns haven't been addressed properly.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
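To make the "bit weird" timeout usage described above concrete, here is a rough, simplified paraphrase of the shape of that soft-shutdown loop. The function and variable names are illustrative only; this is not the actual Nova code.

~~~
import time

import libvirt  # python-libvirt bindings


def clean_shutdown(dom, timeout, retry_interval=1):
    """Ask the guest to power off and poll until it does or `timeout` expires.

    `timeout` plays roughly the role of [DEFAULT]/shutdown_timeout /
    os_shutdown_timeout: the number of one-second waits before giving
    up on a clean shutdown.
    """
    try:
        dom.shutdown()  # send the ACPI power-button event to the guest
    except libvirt.libvirtError:
        pass  # e.g. the domain is already in the middle of shutting down

    waited = 0
    while dom.isActive() and waited < timeout:
        time.sleep(retry_interval)
        waited += retry_interval

    return not dom.isActive()  # True if the guest powered off in time
~~~

In practice the knobs mentioned above are `shutdown_timeout` under `[DEFAULT]` in nova.conf and the `os_shutdown_timeout` image property (e.g. `openstack image set --property os_shutdown_timeout=180 <image>`); raising either gives IO-heavy guests more one-second iterations to flush and power off cleanly.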