Bug 1051854
| Summary: | VM stuck in "Migrating From" status forever when libvirt is stopped on the source host | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Ilanit Stein <istein> |
| Component: | ovirt-engine | Assignee: | Shahar Havivi <shavivi> |
| Status: | CLOSED WORKSFORME | QA Contact: | Ilanit Stein <istein> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.3.0 | CC: | fromani, gklein, iheim, istein, lpeer, mavital, michal.skrivanek, rbalakri, rgolan, Rhev-m-bugs, sherold, yeylon |
| Target Milestone: | --- | ||
| Target Release: | 3.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | virt | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-03-08 08:41:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Ilanit Stein
2014-01-12 08:31:56 UTC
Created attachment 848845 [details]
engine.log
Created attachment 848846 [details]
host_1 logs (hostname: host19-rack06..., time 2 hours behind rhevm)
Created attachment 848847 [details]
host_2 logs (hostname: host20-rack06..., time 2 hours behind rhevm)
Logs from host 1 seem to be corrupted (or wrong: I see the date 6.1., but host 2 shows 12.1.). Also, please specify which VM it was, to filter out the noise.

Created attachment 849465 [details]
host_1 logs (hostname: host19-rack06..., time 2 hours behind rhevm)
Adding the correct host 1 (host19) logs.
VM name: POOLTEST-1

Migration start line in engine.log:

2014-01-12 09:37:42,284 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-46) [3ffbb2e7] Correlation ID: 3ffbb2e7, Job ID: 3cd1a039-3e2f-4234-969a-fe4838757d55, Call Stack: null, Custom Event ID: -1, Message: Migration started (VM: POOLTEST-1, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com, User: admin@internal).

We need to address the case where libvirt stays in the "recovering from crash" state forever, by moving the VM status to Unknown after some time. Since the scenario is artificial (after a libvirt crash it would normally restart immediately), I'd plan this for 3.5.

*** Bug 1051847 has been marked as a duplicate of this bug. ***

The list of VMs returned by VDSM contains the VMs with their "last known statuses", i.e. the last status that was received from libvirt before it crashed. So in this case we keep the VM in MigratingFrom even though we don't get statistics from VDSM, due to the broken connection with libvirt. We should probably add logic stating that while we don't get statistics from VDSM, the VMs should switch to UNKNOWN, as part of the upcoming refactoring of VdsUpdateRuntimeInfo.

We should really check why the host didn't go non-operational in that case, which should cover us here.

VDSM at that point must expose an Unknown status for that VM so the engine and VDSM stay in sync (as opposed to MigratingFrom on VDSM and Unknown on the engine). Francesco, what do you think? Will VDSM return an answer for getVmStats while libvirt is in an error state?

Doesn't http://gerrit.ovirt.org/25276 fix this? It seems so to me. Francesco, please confirm.

(In reply to Michal Skrivanek from comment #12)
> doesn't http://gerrit.ovirt.org/25276 fix this? seems to me so. Francesco
> please confirm

It is similar but not really the same case.
25276 addressed the case in which libvirtd stays up but VDSM crashes. In more general terms, 25276 improved VDSM's recovery after a crash; that is, it covers the scenario where VDSM failed while the rest of the stack didn't. Case in point (and the main point of that change): a migration completes while VDSM is down, so QEMU, libvirt, and everything else but VDSM behaved correctly. Here, if I understand correctly, another part of the stack is failing (libvirt, as per the report) while VDSM is not. Detecting a crashed libvirtd looks like a very different business at first glance. I'll need to check, however.

(In reply to Roy Golan from comment #11)
> VDSM at that point must expose unknown status for that vm so the engine and
> vdsm will be synced. (as opposed to migratingFrom on vdsm and unknown on
> engine)
>
> francesco what do you think? will vdsm will return an answer for getVmStats
> while libvirt is in error state?

I need to check carefully. I don't recall having seen cases like this recently, but at the very least I'd expect VDSM to report all VMs as unresponsive.

(In reply to Roy Golan from comment #10)
> we should really check why the host didn't go non-operational in that case
> which should cover us in this case

AFAIU this is the only thing that remains: VDSM was reporting "recovering from crash" and no VM updates went through. The engine should have moved the host to Non Responsive, but it didn't; I believe that's the bug.

Tested on VDSM 4.17 several times: the VM moves to Unknown status. When starting libvirt and restarting VDSM, the VM returns to an Up status. Once, I encountered the VM moving to Down; I suspect this is because of the timing of the migration (near the end).
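The two behaviors discussed in this thread — VDSM degrading the per-VM status to Unknown while libvirt is unreachable, and the engine moving VMs to Unknown when no statistics arrive for too long — can be sketched together in Python. This is an illustrative sketch only, not actual VDSM or ovirt-engine code (the engine is Java); the names `get_vm_stats`, `VmMonitor`, and `STATS_TIMEOUT` are invented for illustration:

```python
import time

# Hypothetical grace period before statistics are considered stale.
STATS_TIMEOUT = 60.0  # seconds


def get_vm_stats(vm_id, libvirt_alive, last_known):
    """VDSM side (sketch): degrade the reported status when libvirt is down.

    Instead of echoing the stale pre-crash status (e.g. "MigratingFrom"),
    report Unknown so the engine and VDSM stay in sync.
    """
    if not libvirt_alive:
        return {"vmId": vm_id, "status": "Unknown", "monitorResponse": -1}
    return {"vmId": vm_id,
            "status": last_known.get(vm_id, "Unknown"),
            "monitorResponse": 0}


class VmMonitor:
    """Engine side (sketch): move VMs with stale statistics to Unknown."""

    def __init__(self):
        self.status = {}      # vm_id -> status string
        self.last_stats = {}  # vm_id -> timestamp of last statistics update

    def on_stats(self, vm_id, status, now=None):
        # Record a fresh statistics update for this VM.
        now = time.monotonic() if now is None else now
        self.status[vm_id] = status
        self.last_stats[vm_id] = now

    def refresh(self, now=None):
        # VMs whose statistics are older than the timeout go to Unknown.
        now = time.monotonic() if now is None else now
        for vm_id, last in self.last_stats.items():
            if now - last > STATS_TIMEOUT:
                self.status[vm_id] = "Unknown"
```

A real implementation would also have to decide when the whole host moves to Non Responsive or Non Operational, which is the remaining open question in this report.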