Bug 1312909 - Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS" [NEEDINFO]
Summary: Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.3.3
Target Release: 4.3.0
Assignee: Benny Zlotnik
QA Contact: Shir Fishbain
URL:
Whiteboard: storage
Duplicates: 1413128 1509030 1520546 1522851 1572895 (view as bug list)
Depends On: 1547768
Blocks: 1520566 Gluster-HC-2 1434105 1509030 1520546
 
Reported: 2016-02-29 13:57 UTC by Gilad Lazarovich
Modified: 2019-09-09 14:55 UTC
CC List: 19 users

Fixed In Version: ovirt-engine-4.3.3.1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1413128 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:36:47 UTC
oVirt Team: Storage
tnisan: needinfo? (amarchuk)


Attachments
engine and vdsm logs (5.36 MB, application/octet-stream)
2016-02-29 13:57 UTC, Gilad Lazarovich
2016-03-08 engine and vdsm logs (11.43 MB, application/octet-stream)
2016-03-08 06:46 UTC, Gilad Lazarovich
new logs (3.74 MB, application/x-gzip)
2016-03-24 18:46 UTC, Raz Tamir


Links
Red Hat Product Errata RHEA-2019:1085 (Last Updated: 2019-05-08 12:37:29 UTC)
oVirt gerrit 98552 (master, MERGED): core: resend diskReplicateFinish if it's still in progress (Last Updated: 2019-03-18 14:16:08 UTC)
Red Hat Knowledge Base (Solution) 2877381 (Last Updated: 2017-12-11 04:36:57 UTC)
Red Hat Knowledge Base (Solution) 3231051 (Last Updated: 2018-04-23 15:25:11 UTC)

Description Gilad Lazarovich 2016-02-29 13:57:35 UTC
Created attachment 1131533 [details]
engine and vdsm logs

Description of problem:
Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS", error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Version-Release number of selected component (if applicable):
3.6.3.3-0.1

How reproducible:
Occurs once in 2-3 tier 2 runs (each consisting of about 50 tests)

Steps to Reproduce:
1. Live migrate a 1 GB disk from an NFS domain to an NFS domain
2. Repeat this process in a loop

Actual results:
The live migrate fails noting:
2016-02-29 04:18:53,511 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Expected results:
Live disk migration completes successfully.

Additional info:
Please see attached engine and vdsm logs

Comment 1 Yaniv Kaul 2016-03-01 13:18:04 UTC
Anything suspicious on the VDSM side? Clearly it's complaining about something.

Comment 2 Gilad Lazarovich 2016-03-02 08:45:29 UTC
Yaniv, I looked through the VDSM logs but didn't find anything that seemed to explain this failure. Can you please have a look at the logs and see if you can find a clue?

Comment 3 Yaniv Kaul 2016-03-02 08:57:38 UTC
Are you sure you have the logs from the correct host?
This is what I see in the engine log, and there is nothing relevant on the host (for example, not even the relevant VM):
2016-02-29 04:18:52,446 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91'}), log id: 7a0d82f6
2016-02-29 04:18:53,455 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Failed in 'VmReplicateDiskFinishVDS' method
2016-02-29 04:18:53,468 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [50c82473] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command failed: Resource unavailable
2016-02-29 04:18:53,469 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=40, message=Resource unavailable]]'
2016-02-29 04:18:53,469 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] HostName = host_mixed_2

So we need logs from host_mixed_2, to see what the issue is.

Even the create VM command, that took place just before that:
2016-02-29 04:16:51,955 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 44086246
2016-02-29 04:16:51,963 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVDSCommand(HostName = host_mixed_2, CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 202717eb
2016-02-29 04:16:51,984 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmInfoBuilderBase] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] Bootable disk '0c5cc7be-8df0-47ea-91d2-94ea18354faa' set to index '0'
2016-02-29 04:16:52,080 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand kvmEnable=true,smpThreadsPerCore=1,keyboardLayout=en-us,nice=0,pitReinjection=false,displayNetwork=ovirtmgmt,copyPasteEnable=true,timeOffset=0,transparentHugePages=true,vmId=634ce755-a823-4176-81ab-3c1544da2d91,acpiEnable=true,custom={},maxMemSize=4194304,spiceSslCipherSuite=DEFAULT,memSize=1024,smp=1,emulatedMachine=pc-i440fx-rhel7.2.0,vmType=kvm,guestNumaNodes=[{cpus=0, nodeIndex=0, memory=1024}],memGuaranteedSize=1024,display=qxl,maxMemSlots=16,smartcardEnable=false,bootMenuEnable=false,smpCoresPerSocket=1,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard,maxVCpus=16,devices=[{specParams={ram=65536, vram=8192, heads=1, vgamem=16384}, device=qxl, type=video, deviceId=fbfb040e-d108-439e-9714-bfbed4f8c050}, {specParams={copyPasteEnable=true, fileTransferEnable=true}, device=spice, type=graphics, deviceId=b7a5070e-f950-4f08-8f7d-d8d95a4a54bd}, {shared=false, iface=ide, index=2, specParams={path=}, path=, device=cdrom, type=disk, readonly=true, deviceId=388b70cf-0a63-4378-9dd4-1953df9c6e7d}, {shared=false, index=0, volumeID=14d54c2e-c98b-4f84-a845-53a2784e6034, propagateErrors=off, format=cow, type=disk, iface=virtio, bootOrder=1, domainID=10526a85-d74c-4fe2-92e3-f9e60522418b, imageID=0c5cc7be-8df0-47ea-91d2-94ea18354faa, specParams={}, optional=false, device=disk, poolID=a1dcb24f-6656-4864-892b-e9628dedf974, readonly=false, deviceId=0c5cc7be-8df0-47ea-91d2-94ea18354faa}, {nicModel=pv, specParams={outbound={}, inbound={}}, macAddr=00:1a:4a:16:81:f4, device=bridge, linkActive=true, type=interface, filter=vdsm-no-mac-spoofing, network=ovirtmgmt, deviceId=4af9b1e4-36c1-46f4-ba81-acf29c43b1cb}, {specParams={}, device=ich6, type=sound, deviceId=810b2f9d-c1e2-4891-8398-060fe3272ca8}, {specParams={model=virtio}, device=memballoon, type=balloon, deviceId=00a09059-37e9-47b7-a385-d7d12b6150cc}, {index=0, model=virtio-scsi, specParams={}, device=scsi, type=controller, deviceId=a0e6a7a6-6f53-4026-a6c4-68193aad5c18}, {specParams={}, device=virtio-serial, type=controller, deviceId=252e544a-8538-49c1-9ebe-f4e1124e31db}],vmName=vm_TestCase5992_REST_NFS_2016-02-29_04-16-08,fileTransferEnable=true,cpuType=Nehalem

It does not seem to be in the logs - so I assume these are the logs from the wrong host?

Comment 4 Gilad Lazarovich 2016-03-08 06:37:55 UTC
Yaniv, please note that our automated run failed to capture the vdsm logs, or I would have provided everything; I only picked up the vdsm logs from the host in question after the run. In any case, I managed to reproduce this again overnight, this time with iSCSI. The relevant error is as follows:

2016-03-08 03:05:46,216 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-7) [41133acc] FINISH, VmReplicateDiskFinishVDSCommand, log id: b1c81e9
2016-03-08 03:05:46,241 ERROR [org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler] (org.ovirt.thread.pool-6-thread-7) [41133acc] Failed VmReplicateDiskFinish (Disk 'b411d186-65d9-44a8-bb12-fdee27291f2d', VM '10fc4e36-2a6d-4bbe-a48d-21563d64d7aa')

I attached the full engine and vdsm logs for you.

Comment 5 Gilad Lazarovich 2016-03-08 06:46:13 UTC
Created attachment 1134051 [details]
2016-03-08 engine and vdsm logs

Please note that I excluded the libvirtd logs due to size; I can attach those separately if needed.

Comment 6 Yaniv Kaul 2016-03-08 10:42:05 UTC
Please note that you may be seeing bug 1306334 as well.

Comment 7 Yaniv Lavi 2016-03-21 12:36:02 UTC
Since it happens only 4-6% of the time, only in automation, and there is no customer ticket, moving to 4.0.

Comment 8 Raz Tamir 2016-03-24 18:44:41 UTC
Attaching new logs.
From engine log:

2016-03-23 23:24:12,301 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'}), log id: 64d96be9

2016-03-23 23:24:13,301 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Failed in 'VmReplicateDiskFinishVDS' method

2016-03-23 23:24:13,319 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40

From vdsm log:
{'c04940fe-a4c8-44b8-a552-ac2ec61ef185': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c04940fe-a4c8-44b8-a552-ac2ec61ef185'}}

So the task seems to finish successfully.
There is nothing unusual in the vdsm logs.

Comment 9 Raz Tamir 2016-03-24 18:46:16 UTC
Created attachment 1140098 [details]
new logs

Comment 10 Daniel Erez 2016-06-01 09:13:21 UTC
Hi Gilad,

A few questions for further investigation:
* Is it reproduced consistently in automation tests?
* Is it related to scale tests?
* Does it also happen when triggering the specific test manually?
* Is it also reproduced in 4.0 tests?

Comment 12 Raz Tamir 2016-06-15 13:27:52 UTC
Hi Daniel,
* No, it happens from time to time, I would say in ~10% of runs
* No, it has nothing to do with scale tests; it can happen even in the most basic flow of live migrating a single disk
* It is very hard to reproduce manually
* I saw it also in 4.0 executions

Comment 14 Liron Aravot 2016-06-26 11:37:41 UTC
(In reply to Gilad Lazarovich from comment #5)
> Created attachment 1134051 [details]
> 2016-03-08 engine and vdsm logs
> 
> Please note that I excluded the libvirtd logs due to size, I can attach
> those separately if needed

The LSM flow goes as follows:
1. Create a snapshot for the relevant disk(s)
2. Clone the entire image group structure
3. Call VmReplicateDiskStart to start the replication of the leaf/active volume
4. Start a task to sync the image group data (copies the data between the source and target volumes, except the leaf)
5. As soon as the sync task from step 4 ends, call vmReplicateDiskFinish

The engine doesn't actually track the replication job, but assumes that the copy of the entire chain (step 4) will always take longer than the leaf volume replication.
In the logs provided by Gilad in comment #5, it can be seen that there are extensions of the leaf volume, which means there are ongoing writes that cause the leaf replication not to finish before the chain data sync completes, which leads to the failure.

In bz 1325791 the logging for that scenario was improved.

I'm updating this BZ to be an RFE - the engine may track the job instead of assuming it will always take less time than the image group data sync.

Comment 15 Liron Aravot 2016-06-26 11:41:55 UTC
Raz, please see if the above aligns with the automation test (the vdsm logs from the correct time were missing from the logs provided in comment 9).

The automation test should take that limitation into account so it does not encounter the issue.

Comment 16 Raz Tamir 2016-06-26 12:46:30 UTC
Hi Liron,
Our automation failed on a timeout (1800 seconds): we wait for the relevant disks to be in status OK, for the VM not to be locked, and for the job to finish, and those checks were not met within the 30-minute time frame.

Comment 19 Liron Aravot 2017-02-05 14:19:44 UTC
*** Bug 1413128 has been marked as a duplicate of this bug. ***

Comment 24 Adam Litke 2017-03-28 19:36:00 UTC
*** Bug 1434105 has been marked as a duplicate of this bug. ***

Comment 25 Adam Litke 2017-03-28 19:38:09 UTC
Is this really an RFE and not a bug? It's easy to trigger just by trying to live migrate a Gluster disk that is under heavy I/O load.

Comment 26 Tal Nisan 2018-02-19 09:09:43 UTC
*** Bug 1509030 has been marked as a duplicate of this bug. ***

Comment 37 Sandro Bonazzola 2019-01-28 09:43:55 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 39 Tal Nisan 2019-03-01 16:00:21 UTC
*** Bug 1572895 has been marked as a duplicate of this bug. ***

Comment 40 Benny Zlotnik 2019-03-21 12:43:09 UTC
Steps to verify:
1. Create a vm with disks (with an installed OS)
2. Edit /usr/lib/python2.7/site-packages/vdsm/virt/vm.py#diskReplicateFinish on the relevant host
and add a timed sleep at the start (see the sketch after these steps)
3. Restart vdsm
4. Start LSM on the disk
5. Once "START, VmReplicateDiskFinishVDSCommand" appears in the engine.log, start a `dd` command
$ for ((;;)); do dd if=/dev/urandom of=img bs=1M; sleep 2; done
6. Wait for a log message in the engine stating: "Replication not finished yet, will retry in next polling cycle"
7. Abort the for-loop from step 5
LSM should finish successfully

Comment 41 RHV Bugzilla Automation and Verification Bot 2019-03-29 11:14:42 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 44 Shir Fishbain 2019-04-02 16:19:57 UTC
Verified - LSM finished successfully

ovirt-engine-4.3.3.1-0.1.el7.noarch
vdsm-4.30.12-1.el7ev.x86_64

2019-04-02 19:10:28,654+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [55009beb-71a9-4b56-b3f1-5e42bfe16fb5] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 1199fb4f
2019-04-02 19:10:28,654+03 WARN  [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [55009beb-71a9-4b56-b3f1-5e42bfe16fb5] Replication not finished yet, will retry in next polling cycle

Comment 45 Benny Zlotnik 2019-04-07 10:50:52 UTC
*** Bug 1520546 has been marked as a duplicate of this bug. ***

Comment 47 errata-xmlrpc 2019-05-08 12:36:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 48 Benny Zlotnik 2019-07-10 08:52:18 UTC
*** Bug 1522851 has been marked as a duplicate of this bug. ***

