Bug 1312909 - Live Storage Migration - engine can track the replication job instead of assuming it was ended
Status: NEW
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assigned To: Benny Zlotnik
QA Contact: Elad
Whiteboard: storage
Keywords: Automation
Duplicates: 1413128 1509030
Depends On:
Blocks: 1434105 1520566 Gluster-HC-2 1509030
 
Reported: 2016-02-29 08:57 EST by Gilad Lazarovich
Modified: 2018-05-18 12:30 EDT
CC List: 17 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1413128 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine and vdsm logs (5.36 MB, application/octet-stream)
2016-02-29 08:57 EST, Gilad Lazarovich
2016-03-08 engine and vdsm logs (11.43 MB, application/octet-stream)
2016-03-08 01:46 EST, Gilad Lazarovich
new logs (3.74 MB, application/x-gzip)
2016-03-24 14:46 EDT, Raz Tamir


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2877381 None None None 2017-12-10 23:36 EST
Red Hat Knowledge Base (Solution) 3231051 None None None 2018-04-23 11:25 EDT

Description Gilad Lazarovich 2016-02-29 08:57:35 EST
Created attachment 1131533 [details]
engine and vdsm logs

Description of problem:
Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS", error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Version-Release number of selected component (if applicable):
3.6.3.3-0.1

How reproducible:
Occurs once in 2-3 tier 2 runs (each consisting of about 50 tests)

Steps to Reproduce:
1. Live migrate a 1 GB disk from an NFS domain to an NFS domain
2. Repeat this process in a loop

Actual results:
The live migrate fails noting:
2016-02-29 04:18:53,511 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Expected results:


Additional info:
Please see attached engine and vdsm logs
Comment 1 Yaniv Kaul 2016-03-01 08:18:04 EST
Anything suspicious on the VDSM side? Clearly it's complaining about something.
Comment 2 Gilad Lazarovich 2016-03-02 03:45:29 EST
Yaniv, I looked through the VDSM logs but didn't find anything that seemed to explain this failure. Can you please have a look at the logs and see if you can find a clue?
Comment 3 Yaniv Kaul 2016-03-02 03:57:38 EST
Are you sure you have the logs from the correct host?
This is what I see in engine, and nothing relevant on the host (for example, not even the relevant VM):
2016-02-29 04:18:52,446 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91'}), log id: 7a0d82f6
2016-02-29 04:18:53,455 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Failed in 'VmReplicateDiskFinishVDS' method
2016-02-29 04:18:53,468 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [50c82473] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command failed: Resource unavailable
2016-02-29 04:18:53,469 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=40, message=Resource unavailable]]'
2016-02-29 04:18:53,469 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] HostName = host_mixed_2

So we need logs from host_mixed_2, to see what the issue is.

Even the create VM command, which took place just before that:
2016-02-29 04:16:51,955 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 44086246
2016-02-29 04:16:51,963 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVDSCommand(HostName = host_mixed_2, CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 202717eb
2016-02-29 04:16:51,984 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmInfoBuilderBase] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] Bootable disk '0c5cc7be-8df0-47ea-91d2-94ea18354faa' set to index '0'
2016-02-29 04:16:52,080 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand kvmEnable=true,smpThreadsPerCore=1,keyboardLayout=en-us,nice=0,pitReinjection=false,displayNetwork=ovirtmgmt,copyPasteEnable=true,timeOffset=0,transparentHugePages=true,vmId=634ce755-a823-4176-81ab-3c1544da2d91,acpiEnable=true,custom={},maxMemSize=4194304,spiceSslCipherSuite=DEFAULT,memSize=1024,smp=1,emulatedMachine=pc-i440fx-rhel7.2.0,vmType=kvm,guestNumaNodes=[{cpus=0, nodeIndex=0, memory=1024}],memGuaranteedSize=1024,display=qxl,maxMemSlots=16,smartcardEnable=false,bootMenuEnable=false,smpCoresPerSocket=1,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard,maxVCpus=16,devices=[{specParams={ram=65536, vram=8192, heads=1, vgamem=16384}, device=qxl, type=video, deviceId=fbfb040e-d108-439e-9714-bfbed4f8c050}, {specParams={copyPasteEnable=true, fileTransferEnable=true}, device=spice, type=graphics, deviceId=b7a5070e-f950-4f08-8f7d-d8d95a4a54bd}, {shared=false, iface=ide, index=2, specParams={path=}, path=, device=cdrom, type=disk, readonly=true, deviceId=388b70cf-0a63-4378-9dd4-1953df9c6e7d}, {shared=false, index=0, volumeID=14d54c2e-c98b-4f84-a845-53a2784e6034, propagateErrors=off, format=cow, type=disk, iface=virtio, bootOrder=1, domainID=10526a85-d74c-4fe2-92e3-f9e60522418b, imageID=0c5cc7be-8df0-47ea-91d2-94ea18354faa, specParams={}, optional=false, device=disk, poolID=a1dcb24f-6656-4864-892b-e9628dedf974, readonly=false, deviceId=0c5cc7be-8df0-47ea-91d2-94ea18354faa}, {nicModel=pv, specParams={outbound={}, inbound={}}, macAddr=00:1a:4a:16:81:f4, device=bridge, linkActive=true, type=interface, filter=vdsm-no-mac-spoofing, network=ovirtmgmt, deviceId=4af9b1e4-36c1-46f4-ba81-acf29c43b1cb}, {specParams={}, device=ich6, type=sound, deviceId=810b2f9d-c1e2-4891-8398-060fe3272ca8}, {specParams={model=virtio}, device=memballoon, type=balloon, deviceId=00a09059-37e9-47b7-a385-d7d12b6150cc}, {index=0, model=virtio-scsi, specParams={}, device=scsi, type=controller, deviceId=a0e6a7a6-6f53-4026-a6c4-68193aad5c18}, {specParams={}, device=virtio-serial, type=controller, deviceId=252e544a-8538-49c1-9ebe-f4e1124e31db}],vmName=vm_TestCase5992_REST_NFS_2016-02-29_04-16-08,fileTransferEnable=true,cpuType=Nehalem

Does not seem to be in the logs - so I assume these are logs from the wrong host?
Comment 4 Gilad Lazarovich 2016-03-08 01:37:55 EST
Yaniv, please note that our automated run failed to capture the vdsm logs, or I would have provided everything; I only picked up the vdsm logs from the host in question after the run. In any case, I managed to reproduce this again overnight, with iSCSI this time. The relevant error is as follows:

2016-03-08 03:05:46,216 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-7) [41133acc] FINISH, VmReplicateDiskFinishVDSCommand, log id: b1c81e9
2016-03-08 03:05:46,241 ERROR [org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler] (org.ovirt.thread.pool-6-thread-7) [41133acc] Failed VmReplicateDiskFinish (Disk 'b411d186-65d9-44a8-bb12-fdee27291f2d', VM '10fc4e36-2a6d-4bbe-a48d-21563d64d7aa')

I attached the full engine and vdsm logs for you.
Comment 5 Gilad Lazarovich 2016-03-08 01:46 EST
Created attachment 1134051 [details]
2016-03-08 engine and vdsm logs

Please note that I excluded the libvirtd logs due to size, I can attach those separately if needed
Comment 6 Yaniv Kaul 2016-03-08 05:42:05 EST
Please note that you may be seeing bug 1306334 as well.
Comment 7 Yaniv Lavi 2016-03-21 08:36:02 EDT
Since it happens 4-6% of the time, in automation, and there is no customer ticket, moving to 4.0.
Comment 8 Raz Tamir 2016-03-24 14:44:41 EDT
Attaching new logs.
From engine log:

2016-03-23 23:24:12,301 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'}), log id: 64d96be9

2016-03-23 23:24:13,301 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Failed in 'VmReplicateDiskFinishVDS' method

2016-03-23 23:24:13,319 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40

From vdsm log:
{'c04940fe-a4c8-44b8-a552-ac2ec61ef185': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c04940fe-a4c8-44b8-a552-ac2ec61ef185'}}

So the task seems to finish successfully.
There is nothing unusual in the vdsm logs.
Comment 9 Raz Tamir 2016-03-24 14:46 EDT
Created attachment 1140098 [details]
new logs
Comment 10 Daniel Erez 2016-06-01 05:13:21 EDT
Hi Gilad,

A few questions for further investigation:
* Is it reproduced consistently in automation tests?
* Is it related to scale tests?
* Does it also happen when triggering the specific test manually?
* Is it also reproduced on 4.0 tests?
Comment 12 Raz Tamir 2016-06-15 09:27:52 EDT
Hi Daniel,
* No, this happens from time to time, I would say ~10% of the time
* No, it has nothing to do with scale tests; it can happen even in the most basic flow of live migrating a single disk
* It is very hard to reproduce manually
* I saw it also in 4.0 executions
Comment 14 Liron Aravot 2016-06-26 07:37:41 EDT
(In reply to Gilad Lazarovich from comment #5)
> Created attachment 1134051 [details]
> 2016-03-08 engine and vdsm logs
> 
> Please note that I excluded the libvirtd logs due to size, I can attach
> those separately if needed

The LSM flow goes as follows:
1. Create a snapshot for the relevant disk(s).
2. Clone the entire structure of the image group.
3. Call VmReplicateDiskStart to start the replication of the leaf/active volume.
4. Start a task to sync the image group data (copies the data between the source and target volumes, except the leaf).
5. As soon as the sync task from step 4 ends, call VmReplicateDiskFinish.

The engine doesn't actually track the replication job, but assumes that the entire chain copy (step 4) will always take longer than the leaf volume replication.
In the logs provided by Gilad in comment #5, it can be seen that there are extensions of the leaf volume, which means there are ongoing writes that cause the leaf replication to not finish before the chain data sync completes, which leads to the failure.

In bz 1325791 the logging for that scenario was improved.

I'm updating this BZ to be an RFE - the engine may track the job instead of assuming it will always finish sooner than the image group data sync.
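A minimal sketch of the idea behind this RFE, assuming hypothetical helpers (get_replication_status, vm_replicate_disk_finish and the 'ready' state are placeholders, not real VDSM verbs named in this bug): instead of issuing VmReplicateDiskFinish as soon as the image-group data sync ends, the engine would poll the replication job and only pivot once it has converged:

import time

def finish_replication_when_converged(vm_id, disk_id,
                                      get_replication_status,
                                      vm_replicate_disk_finish,
                                      timeout=1800, poll_interval=10):
    # Poll the (hypothetical) replication job status and only call
    # VmReplicateDiskFinish once the leaf volume is fully mirrored,
    # instead of assuming the chain copy always outlasts it.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_replication_status(vm_id, disk_id) == 'ready':
            # Safe to pivot: the active layer is mirrored on the target domain
            return vm_replicate_disk_finish(vm_id, disk_id)
        time.sleep(poll_interval)
    # Never converged: fail the LSM instead of issuing a finish call that
    # VDSM would reject with "Resource unavailable" (code 40)
    raise RuntimeError('disk replication did not converge within %d seconds' % timeout)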
Comment 15 Liron Aravot 2016-06-26 07:41:55 EDT
Raz, please see if the above aligns with the automation test (as the vdsm logs from the correct time were missing from the log provided in comment 9).

The automation test should be aligned with that limitation so that it does not encounter the issue.
Comment 16 Raz Tamir 2016-06-26 08:46:30 EDT
Hi Liron,
Our automation failed on a timeout (1800 seconds) - we wait for the relevant disks to be in status OK, for the VM not to be locked, and for the job to finish, and those checks were not met within the 30-minute time frame.
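For reference, the waiting logic described above boils down to a bounded poll like the following sketch (the check_* callables are placeholders for whatever the automation framework actually queries; the 1800-second timeout matches the one mentioned in this comment):

import time

def wait_for_lsm_to_settle(check_disks_ok, check_vm_unlocked, check_job_finished,
                           timeout=1800, poll_interval=5):
    # Return True once the disks are OK, the VM is unlocked and the LSM job
    # has finished; return False if the conditions are not met within timeout.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_disks_ok() and check_vm_unlocked() and check_job_finished():
            return True
        time.sleep(poll_interval)
    return False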
Comment 19 Liron Aravot 2017-02-05 09:19:44 EST
*** Bug 1413128 has been marked as a duplicate of this bug. ***
Comment 24 Adam Litke 2017-03-28 15:36:00 EDT
*** Bug 1434105 has been marked as a duplicate of this bug. ***
Comment 25 Adam Litke 2017-03-28 15:38:09 EDT
Is this really an RFE and not a bug?  It's easy to trigger just by trying to live migrate a gluster disk which is under heavy I/O load.
Comment 26 Tal Nisan 2018-02-19 04:09:43 EST
*** Bug 1509030 has been marked as a duplicate of this bug. ***
