Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1051854

Summary: VM stuck in "migrating from" status forever when libvirt is stopped on the source host.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Ilanit Stein <istein>
Component: ovirt-engine
Assignee: Shahar Havivi <shavivi>
Status: CLOSED WORKSFORME
QA Contact: Ilanit Stein <istein>
Severity: medium
Priority: medium
Docs Contact:
Version: 3.3.0
CC: fromani, gklein, iheim, istein, lpeer, mavital, michal.skrivanek, rbalakri, rgolan, Rhev-m-bugs, sherold, yeylon
Target Milestone: ---
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-08 08:41:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
engine.log
host_1 logs (hostname: host19-rack06..., time 2 hours behind rhevm)
host_2 logs (hostname: host20-rack06..., time 2 hours behind rhevm)
host_1 logs (hostname: host19-rack06..., time 2 hours behind rhevm)

Description Ilanit Stein 2014-01-12 08:31:56 UTC
Description of problem:
After a user starts a VM migration, while the migration is in progress, run `initctl stop libvirtd` on the source host.
The VM stays in the "migrating from" status forever (after ~1 hour the VM still had the same status).

Version-Release number of selected component (if applicable):
is30

Expected results:
Migration should fail after some timeout.

Comment 1 Ilanit Stein 2014-01-12 08:43:54 UTC
Created attachment 848845 [details]
engine.log

Comment 2 Ilanit Stein 2014-01-12 08:44:37 UTC
Created attachment 848846 [details]
host_1 logs (hostname: host19-rack06..., time 2 hours behind rhevm)

Comment 3 Ilanit Stein 2014-01-12 08:45:26 UTC
Created attachment 848847 [details]
host_2 logs (hostname: host20-rack06..., time 2 hours behind rhevm)

Comment 4 Michal Skrivanek 2014-01-13 08:32:47 UTC
Logs from host 1 seem to be corrupted (or wrong: I see date 6.1., but host 2 is 12.1.).
Also, please specify which VM it was, to filter out the noise.

Comment 5 Ilanit Stein 2014-01-13 15:45:25 UTC
Created attachment 849465 [details]
host_1 logs (hostname: host19-rack06..., time 2 hours behind rhevm)

Adding the correct host 1 (host19) logs.

Comment 6 Ilanit Stein 2014-01-14 08:07:38 UTC
VM name: POOLTEST-1

Migration start line in engine.log:
2014-01-12 09:37:42,284 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-46) [3ffbb2e7] Correlation ID: 3ffbb2e7, Job ID: 3cd1a039-3e2f-4234-969a-fe4838757d55, Call Stack: null, Custom Event ID: -1, Message: Migration started (VM: POOLTEST-1, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com, User: admin@internal).

Comment 7 Michal Skrivanek 2014-01-14 14:18:55 UTC
We need to address the case where libvirt stays in the "recovering from crash" state forever, by moving the VM status to Unknown after some time.

Since the scenario is artificial (after a libvirt crash it would normally restart immediately), I'd plan this for 3.5.

Comment 8 Michal Skrivanek 2014-01-14 14:19:18 UTC
*** Bug 1051847 has been marked as a duplicate of this bug. ***

Comment 9 Arik 2014-03-31 11:05:03 UTC
The list of VMs returned by VDSM contains the VMs with their "last known statuses", i.e. the last status received from libvirt before it crashed. So in this case we keep the VM in MigratingFrom even though we don't get statistics from VDSM, due to the broken connection with libvirt.

We should probably add logic so that while we don't get statistics from VDSM, the VMs switch to UNKNOWN, as part of the upcoming refactoring of VdsUpdateRuntimeInfo.
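The logic proposed above could be sketched roughly as follows. This is an illustrative sketch only, not the actual VdsUpdateRuntimeInfo code: the class name `VmMonitor`, the method names, and the 60-second timeout are all hypothetical.

```python
import time

# Hypothetical sketch of the proposal above: if no fresh statistics have
# arrived from VDSM within a timeout, stop trusting the "last known
# status" and report the VM as Unknown instead.

STATS_TIMEOUT = 60.0  # seconds without fresh stats before giving up (assumed value)

class VmMonitor:
    def __init__(self, stats_timeout=STATS_TIMEOUT):
        self.stats_timeout = stats_timeout
        self.last_stats = {}  # vm_id -> (status, timestamp of last fresh report)

    def on_stats(self, vm_id, status, now=None):
        """Record a fresh status report received from VDSM."""
        self.last_stats[vm_id] = (status, now if now is not None else time.time())

    def effective_status(self, vm_id, now=None):
        """Return the status the engine should display for this VM."""
        now = now if now is not None else time.time()
        status, ts = self.last_stats.get(vm_id, ("Unknown", 0.0))
        if now - ts > self.stats_timeout:
            # Stale data: libvirt/VDSM stopped reporting, e.g. mid-migration,
            # so the "last known status" (MigratingFrom) can no longer be trusted.
            return "Unknown"
        return status
```

With this shape of logic, a VM whose last report said MigratingFrom would flip to Unknown once the reports stop coming, instead of staying in MigratingFrom forever.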

Comment 10 Roy Golan 2014-04-07 08:36:55 UTC
We should really check why the host didn't go non-operational in that case; that would cover us here.

Comment 11 Roy Golan 2014-08-13 14:53:37 UTC
VDSM at that point must expose an Unknown status for that VM, so that the engine and VDSM stay in sync (as opposed to MigratingFrom on VDSM and Unknown on the engine).

Francesco, what do you think? Will VDSM return an answer for getVmStats while libvirt is in an error state?

Comment 12 Michal Skrivanek 2014-08-14 13:50:51 UTC
Doesn't http://gerrit.ovirt.org/25276 fix this? It seems so to me. Francesco, please confirm.

Comment 13 Francesco Romani 2014-08-14 14:08:51 UTC
(In reply to Michal Skrivanek from comment #12)
> doesn't http://gerrit.ovirt.org/25276 fix this? seems to me so. Francesco
> please confirm

It is similar, but not really the same case.

25276 addressed the case in which libvirtd stays up but VDSM crashes.
In more general terms, 25276 improved how VDSM recovers its view of the system after a crash;
that is, VDSM failed while the rest of the stack didn't.

Case in point (and the main point of that change): migration completes while VDSM is down (so QEMU/libvirt/etc., everything but VDSM, behaved correctly).

Here, IIUC, another part of the stack is failing (libvirt, as per the report) while VDSM is not. Detecting a crashed libvirtd looks like a very different business at first glance. I'll need to check, however.

(In reply to Roy Golan from comment #11)
> VDSM at that point must expose unknown status for that vm so the engine and
> vdsm will be synced. (as opposed to migratingFrom on vdsm and unknown on
> engine)
> 
> francesco what do you think? will vdsm will return an answer for getVmStats
> while libvirt is in error state?

I need to check carefully; I don't recall having seen cases like this recently.
But I'd expect, at the very least, VDSM to report all VMs as unresponsive.
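The expected degradation could look something like this. A minimal sketch, assuming a VDSM-like stats call that knows whether its libvirt connection is alive; the names `get_vm_stats` and `libvirt_alive` are hypothetical, not the real VDSM API, and the `monitorResponse`-style field is used here only as an illustration of an "unresponsive" marker.

```python
# Illustrative sketch only: how a VDSM-like stats call could degrade
# gracefully when its libvirt connection is down, reporting every VM
# as Unknown/unresponsive instead of echoing a stale status.

def get_vm_stats(vms, libvirt_alive):
    """Return per-VM stats; mark all VMs unresponsive if libvirt is down.

    vms: mapping of vm_id -> last status received from libvirt.
    libvirt_alive: whether the libvirt connection is currently usable.
    """
    stats = []
    for vm_id, last_status in vms.items():
        if not libvirt_alive:
            # Don't echo a stale status (e.g. MigratingFrom); admit we
            # cannot know, so the engine and VDSM stay in sync.
            stats.append({"vmId": vm_id, "status": "Unknown",
                          "monitorResponse": "-1"})
        else:
            stats.append({"vmId": vm_id, "status": last_status,
                          "monitorResponse": "0"})
    return stats
```

This is the behavior Roy asked about in comment 11: with a down libvirt, the call still answers, but every VM is flagged as unresponsive rather than left in its last known state.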

Comment 14 Michal Skrivanek 2014-08-14 15:12:57 UTC
(In reply to Roy Golan from comment #10)
> we should really check why the host didn't go non-operational in that case
> which should cover us in this case

AFAIU this is the only thing which remains:
VDSM was reporting "recovering from crash" and no VM updates went through.
The engine should have moved the host to Non Responsive, but it didn't; I believe that's the bug.
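The missing engine-side check could be sketched as follows. This is a hypothetical illustration, not actual oVirt engine code; the function name `host_state` and the 300-second grace period are assumptions.

```python
import time

# Hypothetical sketch of the missing check described above: if a host
# keeps answering "recovering from crash" (so no VM updates go through)
# beyond a grace period, the engine should mark it NonResponsive rather
# than leave it, and its VMs, in their last known state forever.

RECOVERY_GRACE = 300.0  # seconds; assumed value, not an actual engine setting

def host_state(recovering_since, now=None, grace=RECOVERY_GRACE):
    """Return 'Up' or 'NonResponsive' for a host possibly stuck in recovery.

    recovering_since: timestamp when "recovering from crash" was first
    seen, or None if the host is reporting normally.
    """
    now = now if now is not None else time.time()
    if recovering_since is not None and now - recovering_since > grace:
        return "NonResponsive"
    return "Up"
```

Under such a rule, the scenario in this bug (libvirt stopped mid-migration, host stuck in recovery) would eventually move the host to Non Responsive, which in turn would resolve the VM's MigratingFrom status.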

Comment 15 Shahar Havivi 2015-03-03 14:33:15 UTC
Tested on VDSM 4.17 several times: the VM moves to Unknown status.
When starting libvirt and restarting VDSM, the VM returns to an Up status.
Once I saw the VM move to Down; I suspect this was because of the timing of the migration (near the end).