Bug 1091389

Summary: releaseVm() takes too much time when VM doesn't want to die
Product: [Retired] oVirt Reporter: Michal Skrivanek <michal.skrivanek>
Component: vdsmAssignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE QA Contact: Pavel Novotny <pnovotny>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.4CC: amureini, bazulay, bugs, gklein, iheim, jentrena, mavital, mgoldboi, mkalinin, rbalakri, rhodain, s.kieske, yeylon
Target Milestone: ---Keywords: Triaged
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: v4.16.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:20:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1081962    
Attachments:
Description Flags
qemu-kvm shell wrapper none

Description Michal Skrivanek 2014-04-25 13:37:42 UTC
the current releaseVm() procedure does:

                    try:
                        self._dom.destroyFlags(
                            libvirt.VIR_DOMAIN_DESTROY_GRACEFUL)
                    except libvirt.libvirtError as e:
                        if (e.get_error_code() ==
                                libvirt.VIR_ERR_OPERATION_FAILED):
                            self.log.warn("Failed to destroy VM '%s' "
                                          "gracefully", self.conf['vmId'])
                            time.sleep(30)
                            self._dom.destroy()

The sleep time can cause problems since the VDSM destroy() API call is synchronous and here we can wait for too long for VM to dieā€¦causing a timeouts on the caller side.
Moreover libvirt does it's own waiting on its own (10s TERM + 5s KILL, hardcoded)


Also we should think about what to do better/different when the destroy() fails. We should probably really have something like "unknown" state in VDSM

Comment 1 Francesco Romani 2014-04-29 07:50:43 UTC
Taking the bug. Let's starting by the biggest and worst offender, this time.sleep(30)

Comment 2 Pavel Novotny 2014-08-04 15:48:03 UTC
Verified in vdsm-4.16.0-42.git3bfad86.el6.x86_64 (oVirt 3.5 beta2).

Verification steps:
1. Rename qemu-kvm binary to qemu-kvm-real and replace qemu-kvm by a shell wrapper which ignores SIGTERM signals (see attachment).
2. Run a VM on the host.
3. Power off the VM, wait until is down and check vdsm.log for how long it took.

Result:
It takes now 25 seconds from the first sent SIGTERM till the final SIGKILL. From vdsm.log:

2014-08-04 15:08:05.822+0000: 2264: debug : virDomainDestroyFlags:2238 : dom=0x7f3940066f30, (VM: name=simple-vm, uuid=8171a0ff-7fce-4df3-ae88-b3cac2ddc87a), flags=1
2014-08-04 15:08:05.822+0000: 2264: debug : qemuProcessKill:4160 : vm=simple-vm pid=27686 flags=0
2014-08-04 15:08:20.831+0000: 2264: warning : qemuProcessKill:4224 : Timed out waiting after SIGTERM to process 27686
2014-08-04 15:08:20.831+0000: 2264: error : qemuDomainDestroyFlags:2094 : operation failed: failed to kill qemu process with SIGTERM
2014-08-04 15:08:20.831+0000: 2264: debug : virDomainFree:2281 : dom=0x7f3940066f30, (VM: name=simple-vm, uuid=8171a0ff-7fce-4df3-ae88-b3cac2ddc87a)
2014-08-04 15:08:20.833+0000: 2261: debug : virDomainDestroy:2172 : dom=0x7f391c05e890, (VM: name=simple-vm, uuid=8171a0ff-7fce-4df3-ae88-b3cac2ddc87a)
2014-08-04 15:08:20.833+0000: 2261: debug : qemuProcessKill:4160 : vm=simple-vm pid=27686 flags=1
2014-08-04 15:08:30.840+0000: 2261: warning : qemuProcessKill:4192 : Timed out waiting after SIGTERM to process 27686, sending SIGKILL

Comment 3 Pavel Novotny 2014-08-04 15:49:15 UTC
Created attachment 923929 [details]
qemu-kvm shell wrapper

Comment 4 Sandro Bonazzola 2014-10-17 12:20:40 UTC
oVirt 3.5 has been released and should include the fix for this issue.