Bug 970645

Summary: migration_timeout not honoured, live migration goes on beyond it
Product: Red Hat Enterprise Virtualization Manager
Reporter: Julio Entrena Perez <jentrena>
Component: vdsm
Assignee: Vinzenz Feenstra [evilissimo] <vfeenstr>
Status: CLOSED ERRATA
QA Contact: Lukas Svaty <lsvaty>
Severity: medium
Docs Contact:
Priority: high
Version: 3.1.4
CC: acathrow, bazulay, eedri, flo_bugzilla, iheim, jentrena, jkt, lbopf, lpeer, lsvaty, lyarwood, mavital, michal.skrivanek, pbandark, pstehlik, sbonazzo, sputhenp, vfeenstr, yeylon
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 3.4.0
Hardware: All
OS: Linux
Whiteboard: virt
Fixed In Version: ovirt-3.4.0-beta2
Doc Type: Bug Fix
Doc Text:
Live migration operations now respect the 300 second limit and no longer continue beyond it.
Story Points: ---
Clone Of:
Clones: 1069220 (view as bug list)
Environment:
Last Closed: 2014-06-09 13:24:50 UTC
Type: Bug
Bug Depends On: 1015887    
Bug Blocks: 1069220, 1069731, 1078909, 1142926    

Comment 2 Saveliev Peter 2013-06-05 13:14:26 UTC
The confusion is caused by variable naming.

Actually, migration_timeout is counted not from the migration start, but from the moment the migration is stalled, so here it worked as designed.

But the issue raises more than the question of variable naming; that part is easy and will be fixed. More serious is the behaviour of the destination host, which is completely wrong. That is being investigated.
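
To illustrate the stall-based semantics described above, here is a minimal Python sketch (illustrative only, not the actual vdsm code; get_progress is a hypothetical callable returning the completed percentage, or None once the migration finishes). The timeout clock only runs while no progress is observed and resets whenever progress resumes:

import time

def wait_for_migration(get_progress, stall_timeout=300, poll_interval=10):
    # Abort only when the migration makes no progress for stall_timeout
    # seconds; the total migration time is otherwise unbounded.
    last_progress = 0
    stalled_since = time.monotonic()
    while True:
        time.sleep(poll_interval)
        progress = get_progress()
        if progress is None:
            return True                       # migration finished
        if progress > last_progress:
            last_progress = progress          # progress resumed:
            stalled_since = time.monotonic()  # reset the stall clock
        elif time.monotonic() - stalled_since >= stall_timeout:
            return False                      # stalled longer than the timeout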

Comment 3 Julio Entrena Perez 2013-06-05 13:39:47 UTC
(In reply to Saveliev Peter from comment #2)
> The confusion is caused by variable naming.

According to /usr/share/doc/vdsm-4.10.2/vdsm.conf.sample:

# Maximum time the destination waits for migration to end. Source
# waits twice as long (to avoid races).
# migration_timeout = 300

> 
> Actually, migration_timeout is counted not from the migration start, but
> from the moment the migration is stalled, so here it worked as designed.

If that's the case, we still need to rephrase that comment in vdsm.conf.sample (and document the behaviour around migration_timeout properly somewhere).
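
For reference, the relationship described in that sample comment (the destination waits migration_timeout seconds, the source twice as long to avoid races) can be sketched as follows. This is only an illustration that assumes the option sits in the [vars] section of /etc/vdsm/vdsm.conf; it is not the actual vdsm code:

from configparser import ConfigParser

cfg = ConfigParser()
cfg.read('/etc/vdsm/vdsm.conf')

# Destination-side limit, defaulting to the 300 seconds documented above.
destination_timeout = cfg.getint('vars', 'migration_timeout', fallback=300)

# The source waits twice as long as the destination to avoid races.
source_timeout = 2 * destination_timeout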

Comment 4 Saveliev Peter 2013-06-05 16:28:19 UTC
Yes, surely. It will be done as well.

Comment 5 Michal Skrivanek 2013-07-03 04:03:39 UTC
also need to address/verify the engine error on timeout, as it seems the migration fails with "Migration failed due to Error: Internal Engine Error (VM: dev31bc4a, Source Host: devrhev06)".

Comment 6 Saveliev Peter 2013-07-09 14:32:38 UTC
(In reply to Michal Skrivanek from comment #5)
> also need to address/verify engine error on timeout as it seems the
> migration fails with Migration failed due to Error: Internal Engine Error
> (VM: dev31bc4a, Source Host: devrhev06)."

Ok.

Comment 7 Martin Kletzander 2013-08-15 14:05:18 UTC
*** Bug 965172 has been marked as a duplicate of this bug. ***

Comment 10 Vinzenz Feenstra [evilissimo] 2013-11-06 09:07:49 UTC
The internal error happened due to a 'ClassCastException' in the vdsbroker:

2013-05-17 12:34:00,569 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (pool-3-thread-49) START, MigrateStatusVDSCommand(HostName = i-mpapp3, HostId = 1a62f776-695e-11e2-a97a-fb8bf5530f36, vmId=d6446340-b00a-4068-8778-2227f89776fd), log id: 3b3e8edd
2013-05-17 12:34:00,607 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (pool-3-thread-49) Failed in MigrateStatusVDS method, for vds: i-mpapp3; host: 10.204.125.31
2013-05-17 12:34:00,607 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command MigrateStatusVDS execution failed. Exception: ClassCastException: java.util.HashMap cannot be cast to java.lang.Integer
2013-05-17 12:34:00,607 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (pool-3-thread-49) FINISH, MigrateStatusVDSCommand, log id: 3b3e8edd
2013-05-17 12:34:00,781 INFO  [org.ovirt.engine.core.bll.VdsSelector] (pool-3-thread-49)  VDS i-mpapp1 419a3eb6-4452-11e2-ab96-575e82ebec1e is not in up status or belongs to the VM's cluster VDS i-mpapp4 2bb65ff4-5bd0-11e2-8088-8f3b14835353 have failed running this VM in the current selection cycle VDS jtest02 1948e33c-490b-11e2-8443-1b53e1383a1a is not in up status or belongs to the VM's cluster VDS i-mpweb2 33ff1c5e-7a9e-11e2-ab5e-170d2d7c2bd6 is not in up status or belongs to the VM's cluster VDS jtest01 c5ea366a-43a0-11e2-b207-ff9e163144da is not in up status or belongs to the VM's cluster VDS i-mpapp2 3550eabc-5b43-11e2-af4e-5b3ed4fe7828 is not in up status or belongs to the VM's cluster VDS i-mpweb1 92af67dc-4938-11e2-baf4-eb85f55b5ed5 is not in up status or belongs to the VM's cluster
2013-05-17 12:34:00,781 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-3-thread-49) CanDoAction of action MigrateVm failed. Reasons:ACTION_TYPE_FAILED_VDS_VM_CLUSTER,VAR__ACTION__MIGRATE,VAR__TYPE__VM

This is most likely due to VDSM returning a different value (probably an error message) than the one the engine expected.
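
As an illustration of that failure mode (hypothetical field names and reply shapes, not the actual vdsm or vdsbroker code): a field that the consumer expects to hold an integer instead carries a mapping, so the value has to be type-checked rather than cast blindly. A minimal Python sketch:

def read_progress(response):
    # Return the migration progress as an int, or None when the reply
    # carries something unexpected (e.g. an error mapping instead of a number).
    value = response.get('progress')
    if isinstance(value, int):
        return value
    # Casting anything else blindly is the Python equivalent of the
    # ClassCastException seen in the engine log above.
    return None

# A reply of the second shape is what would have triggered the unguarded cast.
ok_reply = {'progress': 42}
bad_reply = {'progress': {'message': 'Fatal error during migration'}}
print(read_progress(ok_reply), read_progress(bad_reply))   # -> 42 None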

Comment 11 Michal Skrivanek 2013-11-19 09:13:28 UTC
bug 1015887 is supposedly fixing comment #10

Comment 15 Eyal Edri 2014-02-10 10:31:34 UTC
Moving to 3.3.2 since 3.3.1 was built and moved to QE.
Please make sure to backport into z-stream.

Comment 19 Lukas Svaty 2014-02-27 15:29:54 UTC
FailedQA

Changing migration_max_time_per_gib_mem to a smaller value (5) makes the migration time out.

An appropriate message about this should be displayed in the event log. Instead, we get two errors:

2014-Feb-27, 16:22
Migration failed due to Error: Migration not in progress (VM: a, Source: host1, Destination: host2).
		
2014-Feb-27, 16:22
Migration failed due to Error: Migration not in progress. Trying to migrate to another Host (VM: a, Source: host1, Destination: host2).

"Message like migration timed out after %d seconds." should be displayed instead.

Comment 22 Michal Skrivanek 2014-02-28 11:45:56 UTC
The error message is tracked as bug 1071260. Moving back to ON_QA as the functionality is not affected.

Comment 23 Lukas Svaty 2014-02-28 15:32:49 UTC
Functionality working, moving to VERIFIED.

Comment 24 errata-xmlrpc 2014-06-09 13:24:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html