Description of problem:

Assume Vdsm is running a VM and is then stopped. While Vdsm is down, the VM crashes. When Vdsm is restarted, it tries to reconnect to the VM, fails, logs the fact, but does not report the Down state to Engine. No explicit destroy() call is made, and resources allocated during startup may not be freed.

Version-Release number of selected component (if applicable):
vdsm-4.13 (but actually, since ever)

How reproducible:
100%

Steps to Reproduce:
1. start a VM
2. stop Vdsm
3. pkill qemu
4. restart Vdsm

Actual results:
No VM is reported.

Expected results:
The VM should be reported as Down. A subsequent call to destroy() should release anything allocated during startup, and in particular trigger the after_vm_destroy hook.
I think in this case vdsm shouldn't even try to recover anything. Why is it trying to connect on startup? I thought that was only for running VMs.
I do not understand your question, Michal. On startup, Vdsm finds the *.recovery files and tries to re-attach to their respective qemu processes. If the process is found - no problem. This bug discusses the case where libvirt no longer reports the VM.

The current behavior is to silently accept this. That's bad. It means that Engine must handle the case where a VM miraculously disappeared from a host. It means that the exit code and new timeOffset, which should have been reported back to Engine, are lost. And it also means that no destroy() call is sent to Vdsm, which leads to resource leaks in certain cases.

A vdsmd restart should not affect the reported state of Vdsm (except for things that are explicitly requested to change, like the generationID).
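The recovery flow described above can be sketched roughly as follows. This is an illustration only, not actual Vdsm code: the directory, the `lookup_domain` callback (standing in for the libvirt query) and `report_down` (standing in for whatever notifies Engine) are all hypothetical names.

```python
import glob
import os


def recover_vms(recovery_dir, lookup_domain, report_down):
    """Sketch of the desired recovery loop: re-attach to every VM that
    left a *.recovery file; when libvirt no longer knows the domain,
    report it as Down instead of silently dropping it."""
    recovered, down = [], []
    for path in sorted(glob.glob(os.path.join(recovery_dir, "*.recovery"))):
        vm_id = os.path.basename(path)[:-len(".recovery")]
        if lookup_domain(vm_id) is not None:
            recovered.append(vm_id)   # qemu process still there: re-attach
        else:
            report_down(vm_id)        # the fix: expose the Down state
            down.append(vm_id)
    return recovered, down
```

The point of the sketch is the `else` branch: today that branch effectively does nothing, which is exactly the silent drop this bug complains about.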
I still tend to think that in these cases it's better not to do anything smart. We have to deal with the situation where someone intentionally stops vdsm and then does something manually via virsh (e.g. migrates the VM away); then the first update after vdsm is started may be confusing to the engine.

Looking at the current handling in removeVmsFromCache (VdsUpdateRunTimeInfo.java), it does handle the situation (logging "Could not find VM %s on host, assuming it went down unexpectedly"), which seems to me adequate behavior for your scenario. So it seems to me we should rather fail the recovery in this case.

But then we still have the after_vm_destroy hook problem... hmm... Still, if vdsm is intentionally shut down, I don't think anyone can expect hooks to fire on lifecycle events. Do you have a real-world scenario? Did this happen during automated tests or anything?
I was not considering the case of an evil admin migrating VMs away, but much more mundane cases. Vdsm can crash due to a python/libvirt/m2crypto bug. It can be killed by spmprotect. Or by the oom killer. When it starts up again, it should keep reporting the Down VMs, and not silently drop them.

I know that Engine handles the case of disappearing VMs after a vdsmd restart. It always has, since this is a VERY old Vdsm bug. But as noted above, this handling is not flawless: we lose the exitCode and the timeOffset, and fail to free up local resources.
We do lose the exit code (though since the VM died in the meantime we can safely assume this was an exceptional crash and not a normal shutdown). timeOffset is propagated immediately after a clock change, no? Either way, we're not using it anymore when starting a VM (we do want to display it, though). As for vdsm resources - wouldn't it be enough to simply not recover such VMs?

We basically have 2 options, IIUC:
1. create a hollow VM object just for the sake of storing the exitCode (which we would make up anyway), and Engine is going to destroy it once it connects
2. do not create a VM object at all, and ignore the recovery file when the QEMU process is no longer there
(In reply to Michal Skrivanek from comment #5) Since that crash is exceptional, and the exitCode & stats should have been collected already, I'd be in favor of (2).
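Option (2) could look something like the following sketch. Again, this is a hedged illustration and not the committed patch: `try_recover` and `lookup_domain` are hypothetical names, and the only point is that the stale recovery file is dropped and no VM object is created when the QEMU process is gone.

```python
import os


def try_recover(recovery_file, lookup_domain):
    """Option (2) from the discussion: when the QEMU process is gone,
    clean up the stale recovery file and create no VM object at all,
    instead of building a hollow Down VM just to carry a made-up
    exitCode."""
    vm_id = os.path.basename(recovery_file)[:-len(".recovery")]
    if lookup_domain(vm_id) is None:
        os.unlink(recovery_file)   # drop the stale state file
        return None                # no VM object is created
    return vm_id                   # caller re-attaches to the live domain
```

Engine's existing removeVmsFromCache handling then takes over: it notices the VM is absent from the host's report and assumes it went down unexpectedly.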
Both changes committed to vdsm master.
I tested a system with 2 VMs running on one of two hosts, managed from hosted engine; I saw the VMs' statuses as "down" via the WebUI and in the engine logs.

System components were as follows:

On engine:
Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
ovirt-engine-setup-base-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
libvirt-0.10.2-29.el6_5.10.x86_64

On hosts:
sanlock-2.8-1.el6.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
Linux version 2.6.32-431.23.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014