Bug 1045626 - Vdsm silently drops a VM that crashed while Vdsm was down
Summary: Vdsm silently drops a VM that crashed while Vdsm was down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.5.0
Assignee: Francesco Romani
QA Contact: Nikolai Sednev
URL:
Whiteboard: virt
Depends On:
Blocks: 1026441 rhev3.5beta 1156165
 
Reported: 2013-12-20 21:26 UTC by Dan Kenigsberg
Modified: 2015-02-16 15:39 UTC
CC List: 7 users

Fixed In Version: ovirt-3.5.0-alpha1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-16 13:38:21 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:




Links
System        ID     Private  Priority  Status  Summary  Last Updated
oVirt gerrit  25275  0        None      None    None     Never
oVirt gerrit  25276  0        None      None    None     Never

Description Dan Kenigsberg 2013-12-20 21:26:05 UTC
Description of problem:
Assume Vdsm is running a VM and is then stopped. While it is down, the VM crashes. When Vdsm is restarted, it tries to reconnect to the VM, fails, logs the fact, but does not report the Down state to Engine. No explicit destroy() call is ever made, and resources allocated during startup may not be freed.

Version-Release number of selected component (if applicable):
vdsm-4.13 (but actually, the bug has always been present)

How reproducible:
100%

Steps to Reproduce:
1. start a VM
2. stop Vdsm
3. pkill qemu
4. restart Vdsm

Actual results:
No VM is reported.

Expected results:
The VM should be reported as Down. A subsequent destroy() call should release anything allocated during startup and, in particular, trigger the after_vm_destroy hook.
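
For illustration, here is a minimal Python sketch of the expected flow. The names (RecoveredVm, release_resources, fire_hook) are hypothetical, not actual Vdsm APIs: the crashed VM is kept around in Down state so that Engine's eventual destroy() call can free resources and run the after_vm_destroy hook.

# Hypothetical sketch of the expected behavior; names are illustrative,
# not the real Vdsm classes or helpers.

DOWN = 'Down'

class RecoveredVm(object):
    """A VM whose qemu process died while Vdsm was down."""

    def __init__(self, vm_id, exit_code, time_offset):
        self.id = vm_id
        self.status = DOWN          # reported to Engine on the next status query
        self.exit_code = exit_code
        self.time_offset = time_offset

    def destroy(self):
        # Called by Engine once it sees the Down VM: free whatever was
        # allocated at startup, then fire the lifecycle hook.
        release_resources(self.id)               # hypothetical cleanup helper
        fire_hook('after_vm_destroy', self.id)   # hypothetical hook runner

def release_resources(vm_id):
    print('releasing resources of %s' % vm_id)

def fire_hook(name, vm_id):
    print('running %s hook for %s' % (name, vm_id))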

Comment 1 Michal Skrivanek 2014-01-27 09:16:45 UTC
I think in this case vdsm shouldn't even try to recover anything.
Why is it trying to connect on startup? I thought that was only for running VMs.

Comment 2 Dan Kenigsberg 2014-01-27 17:55:53 UTC
I do not understand your question, Michal. On startup, Vdsm finds those *.recovery files, and tries to re-attach to their respective qemu processes.

If the process is found - no problem.

This bug discusses the case where libvirt no longer reports the VM. Current behavior is to silently accept this.

That's bad. It means that Engine has to handle the case where a VM miraculously disappeared from a host. It means that the exit code and new timeOffset, which should have been reported back to Engine, are lost. And it also means that no destroy() call is sent to Vdsm, which leads to resource leaks in certain cases.

A vdsmd restart should not affect the state reported by Vdsm (except for things that are explicitly requested to change, like the generationID).
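
To make the mechanism concrete, here is a rough sketch of such a recovery pass. The recovery-file location and the vm_container helpers are assumptions made for illustration; only the libvirt calls are real:

# Rough sketch only -- directory layout and the vm_container API are assumed.
import glob
import os

import libvirt

RECOVERY_DIR = '/var/run/vdsm'  # assumed location of the *.recovery files

def recover_vms(vm_container):
    conn = libvirt.open('qemu:///system')
    for path in glob.glob(os.path.join(RECOVERY_DIR, '*.recovery')):
        vm_id = os.path.basename(path)[:-len('.recovery')]
        try:
            dom = conn.lookupByUUIDString(vm_id)  # re-attach to the running qemu
        except libvirt.libvirtError:
            # The domain is gone: do not drop it silently.  Keep a Down VM
            # so Engine sees the crash and eventually calls destroy().
            vm_container.add_down_vm(vm_id)       # hypothetical helper
            continue
        vm_container.reattach(vm_id, dom)         # hypothetical helper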

Comment 3 Michal Skrivanek 2014-01-30 10:30:02 UTC
I still tend to think that in these cases it's better not to do anything smart. We have to deal with the situation where someone intentionally stops vdsm and then does something manually via virsh (e.g. migrates the VM away); in that case the first update after vdsm is started may be confusing to the engine.

Looking at the current handling in removeVmsFromCache (VdsUpdateRunTimeInfo.java), it does handle this situation (logging "Could not find VM %s on host, assuming it went down unexpectedly"), which seems to me adequate behavior for your scenario.
So it seems to me we should rather fail the recovery in this case.

But then we still have the after_vm_destroy hook problem... hmm... still, if vdsm is intentionally shut down, I don't think anyone can expect hooks on lifecycle events to be performed.

Do you have a real-world scenario? Did this happen during automated tests or anything?

Comment 4 Dan Kenigsberg 2014-01-30 12:29:57 UTC
I was not considering the case of an evil admin migrating VMs away, but much more mundane cases.

Vdsm can crash due to a python/libvirt/m2crypto bug. It can be killed by spmprotect, or by the OOM killer. When it starts up again, it should keep reporting the Down VMs, and not silently drop them.

I know that Engine handles the case of VMs disappearing after a vdsmd restart. It always has, since this is a VERY old Vdsm bug. But as noted above, this handling is not flawless: we lose the exitCode and the timeOffset, and fail to free up local resources.

Comment 5 Michal Skrivanek 2014-01-31 10:58:53 UTC
we do lose the exit code (though since the VM died in the meantime we can safely assume this was an exceptional crash and not a normal shutdown)
timeOffset is propagated immediately after a clock change, no? Either way we're not using it anymore when starting a VM (we do want to display it though)
vdsm resources - so wouldn't it be enough to not recover such VMs?

we have basically 2 options IIUC:
1 - create a hollow VM object just for the sake of storing the exitCode (which we would make up anyway), and Engine is going to destroy it once it connects
2 - do not create a VM object at all, and ignore the recovery file when the QEMU process is not there anymore

Comment 6 Michal Skrivanek 2014-02-28 10:25:11 UTC
(In reply to Michal Skrivanek from comment #5)
Since that crash is exceptional and the exitCode & stats should have been collected already, I'd be in favor of (2).
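
For comparison, a sketch of what option (2) could look like in the recovery path; the helpers are made up for illustration, and the engine-side handling relies on removeVmsFromCache as described in comment 3:

# Illustrative sketch of option (2): skip recovery entirely when the qemu
# process is gone.  Helpers are hypothetical.
import os

import libvirt

def maybe_recover(conn, recovery_path, vm_id, vm_container):
    try:
        dom = conn.lookupByUUIDString(vm_id)
    except libvirt.libvirtError:
        # qemu died while Vdsm was down: create no VM object, just remove
        # the stale recovery file and let Engine's removeVmsFromCache
        # handle the disappearance.
        os.unlink(recovery_path)
        return None
    return vm_container.reattach(vm_id, dom)      # hypothetical helper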

Comment 7 Francesco Romani 2014-05-02 11:54:34 UTC
Both changes committed to vdsm master.

Comment 8 Nikolai Sednev 2014-08-12 10:47:32 UTC
I tested a system with 2 VMs running on one of two hosts, managed from a hosted engine; the VMs' statuses were reported as "down" via the WebUI and in the engine logs.

System components were as follows:
On engine:
Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
ovirt-engine-setup-base-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
libvirt-0.10.2-29.el6_5.10.x86_64


On hosts:
sanlock-2.8-1.el6.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64

Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014

