Bug 1091389 - releaseVm() takes too much time when VM doesn't want to die
Summary: releaseVm() takes too much time when VM doesn't want to die
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.4
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 3.5.0
Assignee: Francesco Romani
QA Contact: Pavel Novotny
URL:
Whiteboard: virt
Depends On:
Blocks: 1081962
TreeView+ depends on / blocked
 
Reported: 2014-04-25 13:37 UTC by Michal Skrivanek
Modified: 2016-02-10 19:50 UTC (History)
13 users (show)

Fixed In Version: v4.16.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-17 12:20:40 UTC
oVirt Team: Virt
Embargoed:


Attachments (Terms of Use)
qemu-kvm shell wrapper (96 bytes, text/plain)
2014-08-04 15:49 UTC, Pavel Novotny
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 27175 0 None MERGED vm: drop wasteful time.sleep() in releaseVm Never

Description Michal Skrivanek 2014-04-25 13:37:42 UTC
the current releaseVm() procedure does:

                    try:
                        self._dom.destroyFlags(
                            libvirt.VIR_DOMAIN_DESTROY_GRACEFUL)
                    except libvirt.libvirtError as e:
                        if (e.get_error_code() ==
                                libvirt.VIR_ERR_OPERATION_FAILED):
                            self.log.warn("Failed to destroy VM '%s' "
                                          "gracefully", self.conf['vmId'])
                            time.sleep(30)
                            self._dom.destroy()

The sleep time can cause problems since the VDSM destroy() API call is synchronous and here we can wait for too long for VM to die…causing a timeouts on the caller side.
Moreover libvirt does it's own waiting on its own (10s TERM + 5s KILL, hardcoded)


Also we should think about what to do better/different when the destroy() fails. We should probably really have something like "unknown" state in VDSM

Comment 1 Francesco Romani 2014-04-29 07:50:43 UTC
Taking the bug. Let's starting by the biggest and worst offender, this time.sleep(30)

Comment 2 Pavel Novotny 2014-08-04 15:48:03 UTC
Verified in vdsm-4.16.0-42.git3bfad86.el6.x86_64 (oVirt 3.5 beta2).

Verification steps:
1. Rename qemu-kvm binary to qemu-kvm-real and replace qemu-kvm by a shell wrapper which ignores SIGTERM signals (see attachment).
2. Run a VM on the host.
3. Power off the VM, wait until is down and check vdsm.log for how long it took.

Result:
It takes now 25 seconds from the first sent SIGTERM till the final SIGKILL. From vdsm.log:

2014-08-04 15:08:05.822+0000: 2264: debug : virDomainDestroyFlags:2238 : dom=0x7f3940066f30, (VM: name=simple-vm, uuid=8171a0ff-7fce-4df3-ae88-b3cac2ddc87a), flags=1
2014-08-04 15:08:05.822+0000: 2264: debug : qemuProcessKill:4160 : vm=simple-vm pid=27686 flags=0
2014-08-04 15:08:20.831+0000: 2264: warning : qemuProcessKill:4224 : Timed out waiting after SIGTERM to process 27686
2014-08-04 15:08:20.831+0000: 2264: error : qemuDomainDestroyFlags:2094 : operation failed: failed to kill qemu process with SIGTERM
2014-08-04 15:08:20.831+0000: 2264: debug : virDomainFree:2281 : dom=0x7f3940066f30, (VM: name=simple-vm, uuid=8171a0ff-7fce-4df3-ae88-b3cac2ddc87a)
2014-08-04 15:08:20.833+0000: 2261: debug : virDomainDestroy:2172 : dom=0x7f391c05e890, (VM: name=simple-vm, uuid=8171a0ff-7fce-4df3-ae88-b3cac2ddc87a)
2014-08-04 15:08:20.833+0000: 2261: debug : qemuProcessKill:4160 : vm=simple-vm pid=27686 flags=1
2014-08-04 15:08:30.840+0000: 2261: warning : qemuProcessKill:4192 : Timed out waiting after SIGTERM to process 27686, sending SIGKILL

Comment 3 Pavel Novotny 2014-08-04 15:49:15 UTC
Created attachment 923929 [details]
qemu-kvm shell wrapper

Comment 4 Sandro Bonazzola 2014-10-17 12:20:40 UTC
oVirt 3.5 has been released and should include the fix for this issue.


Note You need to log in before you can comment on or make changes to this bug.