Bug 1312909
Summary: Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS"
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gilad Lazarovich <glazarov>
Component: ovirt-engine
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA
QA Contact: Shir Fishbain <sfishbai>
Severity: urgent
Priority: unspecified
Version: 3.6.0
CC: acanan, aefrat, amarchuk, audgiri, bcholler, bugs, bzlotnik, derez, ebenahar, gwatson, lsurette, mkalinin, nashok, ratamir, Rhev-m-bugs, sasundar, smaudet, srevivo, tnisan
Target Milestone: ovirt-4.3.3
Keywords: Automation, Reopened
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: ovirt-engine-4.3.3.1
Doc Type: Bug Fix
Story Points: ---
Clones: 1413128 (view as bug list)
Last Closed: 2019-05-08 12:36:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Bug Depends On: 1547768
Bug Blocks: 1277939, 1434105, 1509030, 1520546, 1520566
Anything suspicious on the VDSM side? Clearly it's complaining about something.

Yaniv, I looked through the VDSM logs but didn't find anything that seemed to explain this failure. Can you please have a look at the logs and see if you can find a clue?

Are you sure you have the logs from the correct host? This is what I see in the engine, and nothing relevant on the host (for example, not even the relevant VM):

2016-02-29 04:18:52,446 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91'}), log id: 7a0d82f6
2016-02-29 04:18:53,455 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Failed in 'VmReplicateDiskFinishVDS' method
2016-02-29 04:18:53,468 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [50c82473] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command failed: Resource unavailable
2016-02-29 04:18:53,469 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=40, message=Resource unavailable]]'
2016-02-29 04:18:53,469 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] HostName = host_mixed_2

So we need logs from host_mixed_2 to see what the issue is.
Even the create VM command, which took place just before that:

2016-02-29 04:16:51,955 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 44086246
2016-02-29 04:16:51,963 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVDSCommand(HostName = host_mixed_2, CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 202717eb
2016-02-29 04:16:51,984 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmInfoBuilderBase] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] Bootable disk '0c5cc7be-8df0-47ea-91d2-94ea18354faa' set to index '0'
2016-02-29 04:16:52,080 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand kvmEnable=true,smpThreadsPerCore=1,keyboardLayout=en-us,nice=0,pitReinjection=false,displayNetwork=ovirtmgmt,copyPasteEnable=true,timeOffset=0,transparentHugePages=true,vmId=634ce755-a823-4176-81ab-3c1544da2d91,acpiEnable=true,custom={},maxMemSize=4194304,spiceSslCipherSuite=DEFAULT,memSize=1024,smp=1,emulatedMachine=pc-i440fx-rhel7.2.0,vmType=kvm,guestNumaNodes=[{cpus=0, nodeIndex=0, memory=1024}],memGuaranteedSize=1024,display=qxl,maxMemSlots=16,smartcardEnable=false,bootMenuEnable=false,smpCoresPerSocket=1,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard,maxVCpus=16,devices=[{specParams={ram=65536, vram=8192, heads=1, vgamem=16384}, device=qxl, type=video, deviceId=fbfb040e-d108-439e-9714-bfbed4f8c050}, {specParams={copyPasteEnable=true, fileTransferEnable=true}, device=spice, type=graphics, deviceId=b7a5070e-f950-4f08-8f7d-d8d95a4a54bd}, {shared=false, iface=ide, index=2, specParams={path=}, path=, device=cdrom, type=disk, readonly=true, deviceId=388b70cf-0a63-4378-9dd4-1953df9c6e7d}, {shared=false, index=0, volumeID=14d54c2e-c98b-4f84-a845-53a2784e6034, propagateErrors=off, format=cow, type=disk, iface=virtio, bootOrder=1, domainID=10526a85-d74c-4fe2-92e3-f9e60522418b, imageID=0c5cc7be-8df0-47ea-91d2-94ea18354faa, specParams={}, optional=false, device=disk, poolID=a1dcb24f-6656-4864-892b-e9628dedf974, readonly=false, deviceId=0c5cc7be-8df0-47ea-91d2-94ea18354faa}, {nicModel=pv, specParams={outbound={}, inbound={}}, macAddr=00:1a:4a:16:81:f4, device=bridge, linkActive=true, type=interface, filter=vdsm-no-mac-spoofing, network=ovirtmgmt, deviceId=4af9b1e4-36c1-46f4-ba81-acf29c43b1cb}, {specParams={}, device=ich6, type=sound, deviceId=810b2f9d-c1e2-4891-8398-060fe3272ca8}, {specParams={model=virtio}, device=memballoon, type=balloon, deviceId=00a09059-37e9-47b7-a385-d7d12b6150cc}, {index=0, model=virtio-scsi, specParams={}, device=scsi, type=controller, deviceId=a0e6a7a6-6f53-4026-a6c4-68193aad5c18}, {specParams={}, device=virtio-serial, type=controller, deviceId=252e544a-8538-49c1-9ebe-f4e1124e31db}],vmName=vm_TestCase5992_REST_NFS_2016-02-29_04-16-08,fileTransferEnable=true,cpuType=Nehalem

does not seem to be in the logs - so I assume these are the logs from the wrong host?
Yaniv, please note that our automated run failed to capture the vdsm logs, or I would have provided everything; I only picked up the vdsm logs from the host in question after the run. In any case, I managed to reproduce this again overnight, with iSCSI this time. The relevant error is as follows:

2016-03-08 03:05:46,216 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-7) [41133acc] FINISH, VmReplicateDiskFinishVDSCommand, log id: b1c81e9
2016-03-08 03:05:46,241 ERROR [org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler] (org.ovirt.thread.pool-6-thread-7) [41133acc] Failed VmReplicateDiskFinish (Disk 'b411d186-65d9-44a8-bb12-fdee27291f2d', VM '10fc4e36-2a6d-4bbe-a48d-21563d64d7aa')

I attached the full engine and vdsm logs for you.

Created attachment 1134051 [details]
2016-03-08 engine and vdsm logs
Please note that I excluded the libvirtd logs due to size; I can attach those separately if needed.
Please notice you may be seeing bug 1306334 as well.

Since it happens in 4-6% of runs, only in automation, and there is no customer ticket, moving to 4.0.

Attaching new logs. From the engine log:

2016-03-23 23:24:12,301 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'}), log id: 64d96be9
2016-03-23 23:24:13,301 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Failed in 'VmReplicateDiskFinishVDS' method
2016-03-23 23:24:13,319 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40

From the vdsm log:

{'c04940fe-a4c8-44b8-a552-ac2ec61ef185': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c04940fe-a4c8-44b8-a552-ac2ec61ef185'}}

So the task seems to finish successfully. There is nothing unusual in the vdsm logs.

Created attachment 1140098 [details]
new logs
Hi Gilad,

A few questions for further investigation:
* Is it reproduced constantly in automation tests?
* Is it related to scale tests?
* Does it happen also when triggering the specific test manually?
* Is it also reproduced on 4.0 tests?

Hi Daniel,
* No, this happened from time to time, I would say in ~10% of runs.
* No, it has nothing to do with scale tests; it can happen even in the most basic flow of live migrating a single disk.
* It is very hard to reproduce manually.
* I saw it also in 4.0 executions.

(In reply to Gilad Lazarovich from comment #5)
> Created attachment 1134051 [details]
> 2016-03-08 engine and vdsm logs
>
> Please note that I excluded the libvirtd logs due to size, I can attach
> those separately if needed

The LSM flow goes as follows:
1. Create a snapshot for the relevant disk(s).
2. Clone the entire image group structure.
3. Call VmReplicateDiskStart to start the replication of the leaf/active volume.
4. Start a task to sync the image group data (copies the data between the source and target volumes, except the leaf).
5. As soon as the sync task from step 4 ends, call VmReplicateDiskFinish.

The engine doesn't actually track the replication job, but assumes that the entire chain copy (step 4) will always take longer than the leaf volume replication. In the logs provided by Gilad in comment #5, it can be seen that there are extensions of the leaf volume, which means there are ongoing writes that cause the leaf replication not to finish before the chain data sync completes, and that leads to the failure. In bz 1325791 the logging for that scenario was improved. I'm updating this BZ to be an RFE - the engine may track the job instead of assuming it will always take less time than the image group data sync.

Raz, please see if the above aligns with the automation test (as the vdsm logs from the correct time were missing from the logs provided in comment 9). The automation test should be aligned with that limitation so it does not encounter the issue.

Hi Liron,
Our automation failed on a timeout (1800 seconds) - we wait for the relevant disks to be in status OK, for the VM not to be locked, and for the job to finish, and those checks were not met within the 30-minute time frame.

*** Bug 1413128 has been marked as a duplicate of this bug. ***

*** Bug 1434105 has been marked as a duplicate of this bug. ***

Is this really an RFE and not a bug? It's easy to trigger just by trying to live migrate a gluster disk which is under heavy I/O load.

*** Bug 1509030 has been marked as a duplicate of this bug. ***

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

*** Bug 1572895 has been marked as a duplicate of this bug. ***

Steps to verify:
1. Create a VM with disks (with an installed OS)
2. Edit /usr/lib/python2.7/site-packages/vdsm/virt/vm.py#diskReplicateFinish on the relevant host and add a timed sleep at the start (see the sketch after these steps)
3. Restart vdsm
4. Start LSM on the disk
5. Once "START, VmReplicateDiskFinishVDSCommand" appears in the engine.log, start a `dd` loop:
$ for ((;;)); do dd if=/dev/urandom of=img bs=1M; sleep 2; done
6. Wait for a log message in the engine stating: "Replication not finished yet, will retry in next polling cycle"
7. Abort the for-loop from step 5

LSM should finish successfully.
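For reference, a minimal sketch of the kind of edit step 2 describes, assuming the sleep is added as the first statement of diskReplicateFinish; the 300-second value and the reproduced method signature are illustrative and may differ between vdsm versions:

    # Hypothetical, temporary edit to /usr/lib/python2.7/site-packages/vdsm/virt/vm.py
    # on the host under test (remove it once verification is done).
    import time


    class Vm(object):
        # ... existing vdsm code ...

        def diskReplicateFinish(self, srcDisk, dstDisk):
            # Artificial delay so the engine's VmReplicateDiskFinishVDSCommand
            # still finds the replication in progress and logs
            # "Replication not finished yet, will retry in next polling cycle".
            time.sleep(300)
            # ... original method body continues unchanged ...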
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops

Verified - LSM finished successfully.

ovirt-engine-4.3.3.1-0.1.el7.noarch
vdsm-4.30.12-1.el7ev.x86_64

2019-04-02 19:10:28,654+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [55009beb-71a9-4b56-b3f1-5e42bfe16fb5] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 1199fb4f
2019-04-02 19:10:28,654+03 WARN [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [55009beb-71a9-4b56-b3f1-5e42bfe16fb5] Replication not finished yet, will retry in next polling cycle

*** Bug 1520546 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

*** Bug 1522851 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
Created attachment 1131533 [details]
engine and vdsm logs

Description of problem:
Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS", error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Version-Release number of selected component (if applicable):
3.6.3.3-0.1

How reproducible:
Occurs once in 2-3 tier 2 runs (each consisting of about 50 tests)

Steps to Reproduce:
1. Live migrate a 1 GB disk from an NFS domain to an NFS domain
2. Repeat this process in a loop (one way to script such a loop is sketched after this report)

Actual results:
The live migration fails, noting:

2016-02-29 04:18:53,511 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Expected results:
Live disk migration completes successfully

Additional info:
Please see attached engine and vdsm logs
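Not part of the original report - a minimal sketch of one way to drive such a reproduction loop with the oVirt Python SDK (ovirtsdk4), assuming the disk move action triggers live storage migration because the disk is attached to a running VM. The engine URL, credentials, disk name, domain names, iteration count, and polling interval are all placeholders:

    # Illustrative LSM reproduction loop: move a disk back and forth between
    # two NFS storage domains while its VM is running.
    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,  # lab setup only
    )

    system_service = connection.system_service()
    disks_service = system_service.disks_service()
    sds_service = system_service.storage_domains_service()

    disk = disks_service.list(search='name=lsm_test_disk')[0]
    disk_service = disks_service.disk_service(disk.id)

    # Resolve the two NFS domains once, by name.
    domain_ids = {
        name: sds_service.list(search='name=%s' % name)[0].id
        for name in ('nfs_domain_1', 'nfs_domain_2')
    }

    for i in range(50):
        target = 'nfs_domain_1' if i % 2 else 'nfs_domain_2'
        # Moving an attached disk of a running VM performs live storage
        # migration, the flow that intermittently failed in this bug.
        disk_service.move(storage_domain=types.StorageDomain(id=domain_ids[target]))
        # Wait for the disk to unlock before starting the next round.
        while disk_service.get().status != types.DiskStatus.OK:
            time.sleep(10)

    connection.close()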