Bug 1312909
Summary: Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS"
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gilad Lazarovich <glazarov>
Component: ovirt-engine
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA
QA Contact: Shir Fishbain <sfishbai>
Severity: urgent
Priority: unspecified
Version: 3.6.0
CC: acanan, aefrat, amarchuk, audgiri, bcholler, bugs, bzlotnik, derez, ebenahar, gwatson, lsurette, mkalinin, nashok, ratamir, Rhev-m-bugs, sasundar, smaudet, srevivo, tnisan
Target Milestone: ovirt-4.3.3
Keywords: Automation, Reopened
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: ovirt-engine-4.3.3.1
Doc Type: Bug Fix
Story Points: ---
Clones: 1413128 (view as bug list)
Last Closed: 2019-05-08 12:36:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Bug Depends On: 1547768
Bug Blocks: 1277939, 1434105, 1509030, 1520546, 1520566
Anything suspicious on the VDSM side? Clearly it's complaining about something.

Yaniv, I looked through the VDSM logs but didn't find anything that seemed to explain this failure. Can you please have a look at the logs and see if you can find a clue?

Are you sure you have the logs from the correct host? This is what I see in the engine, and nothing relevant on the host (for example, not even the relevant VM):

2016-02-29 04:18:52,446 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91'}), log id: 7a0d82f6
2016-02-29 04:18:53,455 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Failed in 'VmReplicateDiskFinishVDS' method
2016-02-29 04:18:53,468 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [50c82473] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command failed: Resource unavailable
2016-02-29 04:18:53,469 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=40, message=Resource unavailable]]'
2016-02-29 04:18:53,469 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] HostName = host_mixed_2

So we need logs from host_mixed_2 to see what the issue is.
Even the create VM command, which took place just before that:

2016-02-29 04:16:51,955 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVmVDSCommand( CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 44086246
2016-02-29 04:16:51,963 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] START, CreateVDSCommand(HostName = host_mixed_2, CreateVmVDSCommandParameters:{runAsync='true', hostId='c6e25b09-2e4d-4816-b3b0-94222ff3c8ea', vmId='634ce755-a823-4176-81ab-3c1544da2d91', vm='VM [vm_TestCase5992_REST_NFS_2016-02-29_04-16-08]'}), log id: 202717eb
2016-02-29 04:16:51,984 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmInfoBuilderBase] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] Bootable disk '0c5cc7be-8df0-47ea-91d2-94ea18354faa' set to index '0'
2016-02-29 04:16:52,080 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (ajp-/127.0.0.1:8702-6) [vms_syncAction_9429f01b-236c-4719] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand kvmEnable=true,smpThreadsPerCore=1,keyboardLayout=en-us,nice=0,pitReinjection=false,displayNetwork=ovirtmgmt,copyPasteEnable=true,timeOffset=0,transparentHugePages=true,vmId=634ce755-a823-4176-81ab-3c1544da2d91,acpiEnable=true,custom={},maxMemSize=4194304,spiceSslCipherSuite=DEFAULT,memSize=1024,smp=1,emulatedMachine=pc-i440fx-rhel7.2.0,vmType=kvm,guestNumaNodes=[{cpus=0, nodeIndex=0, memory=1024}],memGuaranteedSize=1024,display=qxl,maxMemSlots=16,smartcardEnable=false,bootMenuEnable=false,smpCoresPerSocket=1,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,susbredir,ssmartcard,maxVCpus=16,devices=[{specParams={ram=65536, vram=8192, heads=1, vgamem=16384}, device=qxl, type=video, deviceId=fbfb040e-d108-439e-9714-bfbed4f8c050}, {specParams={copyPasteEnable=true, fileTransferEnable=true}, device=spice, type=graphics, deviceId=b7a5070e-f950-4f08-8f7d-d8d95a4a54bd}, {shared=false, iface=ide, index=2, specParams={path=}, path=, device=cdrom, type=disk, readonly=true, deviceId=388b70cf-0a63-4378-9dd4-1953df9c6e7d}, {shared=false, index=0, volumeID=14d54c2e-c98b-4f84-a845-53a2784e6034, propagateErrors=off, format=cow, type=disk, iface=virtio, bootOrder=1, domainID=10526a85-d74c-4fe2-92e3-f9e60522418b, imageID=0c5cc7be-8df0-47ea-91d2-94ea18354faa, specParams={}, optional=false, device=disk, poolID=a1dcb24f-6656-4864-892b-e9628dedf974, readonly=false, deviceId=0c5cc7be-8df0-47ea-91d2-94ea18354faa}, {nicModel=pv, specParams={outbound={}, inbound={}}, macAddr=00:1a:4a:16:81:f4, device=bridge, linkActive=true, type=interface, filter=vdsm-no-mac-spoofing, network=ovirtmgmt, deviceId=4af9b1e4-36c1-46f4-ba81-acf29c43b1cb}, {specParams={}, device=ich6, type=sound, deviceId=810b2f9d-c1e2-4891-8398-060fe3272ca8}, {specParams={model=virtio}, device=memballoon, type=balloon, deviceId=00a09059-37e9-47b7-a385-d7d12b6150cc}, {index=0, model=virtio-scsi, specParams={}, device=scsi, type=controller, deviceId=a0e6a7a6-6f53-4026-a6c4-68193aad5c18}, {specParams={}, device=virtio-serial, type=controller, deviceId=252e544a-8538-49c1-9ebe-f4e1124e31db}],vmName=vm_TestCase5992_REST_NFS_2016-02-29_04-16-08,fileTransferEnable=true,cpuType=Nehalem

does not seem to be in the logs - so I assume these are the logs from the wrong host?
Yaniv, please note that our automated run failed to capture the vdsm logs, or I would have provided everything; I only picked up the vdsm logs from the host in question after the run. In any case, I managed to reproduce this again overnight, with iSCSI this time. The relevant error is as follows:

2016-03-08 03:05:46,216 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-7) [41133acc] FINISH, VmReplicateDiskFinishVDSCommand, log id: b1c81e9
2016-03-08 03:05:46,241 ERROR [org.ovirt.engine.core.bll.AbstractSPMAsyncTaskHandler] (org.ovirt.thread.pool-6-thread-7) [41133acc] Failed VmReplicateDiskFinish (Disk 'b411d186-65d9-44a8-bb12-fdee27291f2d', VM '10fc4e36-2a6d-4bbe-a48d-21563d64d7aa')

I attached the full engine and vdsm logs for you.

Created attachment 1134051 [details]
2016-03-08 engine and vdsm logs
Please note that I excluded the libvirtd logs due to size; I can attach those separately if needed.
Please notice you may be seeing bug 1306334 as well.

Since it happens in 4-6% of runs, only in automation, and there is no customer ticket, moving to 4.0.

Attaching new logs. From the engine log:

2016-03-23 23:24:12,301 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'}), log id: 64d96be9
2016-03-23 23:24:13,301 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Failed in 'VmReplicateDiskFinishVDS' method
2016-03-23 23:24:13,319 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (org.ovirt.thread.pool-6-thread-21) [27911db6] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{runAsync='true', hostId='f6f6b24e-d948-409b-b45b-bb4e27f449e9', vmId='80789070-cf50-40c8-98bd-2122814c86d8'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40

From the vdsm log:

{'c04940fe-a4c8-44b8-a552-ac2ec61ef185': {'code': 0, 'message': '1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'c04940fe-a4c8-44b8-a552-ac2ec61ef185'}}

So the task seems to finish successfully. There is nothing unusual in the vdsm logs.

Created attachment 1140098 [details]
new logs
Hi Gilad,

A few questions for further investigation:
* Is it reproduced constantly in automation tests?
* Is it related to scale tests?
* Does it happen also when triggering the specific test manually?
* Is it also reproduced on 4.0 tests?

Hi Daniel,
* No, this happened from time to time, I would say in ~10% of runs.
* No, it has nothing to do with scale tests; it can happen even in the most basic flow of live migrating a single disk.
* It is very hard to reproduce manually.
* I saw it also in 4.0 executions.

(In reply to Gilad Lazarovich from comment #5)
> Created attachment 1134051 [details]
> 2016-03-08 engine and vdsm logs
>
> Please note that I excluded the libvirtd logs due to size, I can attach
> those separately if needed

The LSM flow goes as follows:
1. Create a snapshot for the relevant disk(s).
2. Clone the entire image group structure.
3. Call VmReplicateDiskStart to start the replication of the leaf/active volume.
4. Start a task to sync the image group data (copies the data between the source and target volumes, except the leaf).
5. As soon as the sync task from step 4 ends, call VmReplicateDiskFinish.

The engine doesn't actually track the replication job, but assumes that the entire chain copy (step 4) will always take longer than the leaf volume replication. In the logs provided by Gilad in comment #5, it can be seen that there are extensions of the leaf volume, which means there are ongoing writes that cause the leaf replication not to finish before the chain data sync completes, and that leads to the failure. In bz 1325791 the logging for that scenario was improved. I'm updating this BZ to be an RFE - the engine may track the job instead of assuming it will always take less time than the image group data sync.

Raz, please see if the above aligns with the automation test (as the vdsm logs from the correct time were missing from the logs provided in comment 9). The automation test should be aligned with that limitation so it does not encounter the issue.

Hi Liron,
Our automation failed on a timeout (1800 seconds) - we wait for the relevant disks to be in status OK, for the VM not to be locked, and for the job to finish, and those checks were not met within the 30-minute time frame.

*** Bug 1413128 has been marked as a duplicate of this bug. ***

*** Bug 1434105 has been marked as a duplicate of this bug. ***

Is this really an RFE and not a bug? It's easy to trigger just by trying to live migrate a gluster disk which is under heavy I/O load.

*** Bug 1509030 has been marked as a duplicate of this bug. ***

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

*** Bug 1572895 has been marked as a duplicate of this bug. ***

Steps to verify:
1. Create a VM with disks (with an installed OS)
2. Edit /usr/lib/python2.7/site-packages/vdsm/virt/vm.py#diskReplicateFinish on the relevant host and add a timed sleep at the start (see the sketch after these steps)
3. Restart vdsm
4. Start LSM on the disk
5. Once "START, VmReplicateDiskFinishVDSCommand" appears in the engine.log, start a `dd` loop:
$ for ((;;)); do dd if=/dev/urandom of=img bs=1M; sleep 2; done
6. Wait for a log message in the engine stating: "Replication not finished yet, will retry in next polling cycle"
7. Abort the for-loop from step 5

LSM should finish successfully.
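For reference, a minimal sketch of the kind of edit step 2 describes, assuming the sleep is added as the first statement of diskReplicateFinish; the 300-second value and the reproduced method signature are illustrative and may differ between vdsm versions:

    # Hypothetical, temporary edit to /usr/lib/python2.7/site-packages/vdsm/virt/vm.py
    # on the host under test (remove it once verification is done).
    import time


    class Vm(object):
        # ... existing vdsm code ...

        def diskReplicateFinish(self, srcDisk, dstDisk):
            # Artificial delay so the engine's VmReplicateDiskFinishVDSCommand
            # still finds the replication in progress and logs
            # "Replication not finished yet, will retry in next polling cycle".
            time.sleep(300)
            # ... original method body continues unchanged ...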
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops

Verified - LSM finished successfully.

ovirt-engine-4.3.3.1-0.1.el7.noarch
vdsm-4.30.12-1.el7ev.x86_64

2019-04-02 19:10:28,654+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [55009beb-71a9-4b56-b3f1-5e42bfe16fb5] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 1199fb4f
2019-04-02 19:10:28,654+03 WARN [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-43) [55009beb-71a9-4b56-b3f1-5e42bfe16fb5] Replication not finished yet, will retry in next polling cycle

*** Bug 1520546 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

*** Bug 1522851 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
Created attachment 1131533 [details]
engine and vdsm logs

Description of problem:
Live disk migration fails noting "Failed to VmReplicateDiskFinishVDS", error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Version-Release number of selected component (if applicable):
3.6.3.3-0.1

How reproducible:
Occurs once in 2-3 tier 2 runs (each consisting of about 50 tests)

Steps to Reproduce:
1. Live migrate a 1 GB disk from an NFS domain to an NFS domain
2. Repeat this process in a loop (one way to script such a loop is sketched after this report)

Actual results:
The live migration fails, noting:

2016-02-29 04:18:53,511 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-42) [50c82473] Command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40 (Failed with error unavail and code 40)

Expected results:
Live disk migration completes successfully

Additional info:
Please see attached engine and vdsm logs
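Not part of the original report - a minimal sketch of one way to drive such a reproduction loop with the oVirt Python SDK (ovirtsdk4), assuming the disk move action triggers live storage migration because the disk is attached to a running VM. The engine URL, credentials, disk name, domain names, iteration count, and polling interval are all placeholders:

    # Illustrative LSM reproduction loop: move a disk back and forth between
    # two NFS storage domains while its VM is running.
    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,  # lab setup only
    )

    system_service = connection.system_service()
    disks_service = system_service.disks_service()
    sds_service = system_service.storage_domains_service()

    disk = disks_service.list(search='name=lsm_test_disk')[0]
    disk_service = disks_service.disk_service(disk.id)

    # Resolve the two NFS domains once, by name.
    domain_ids = {
        name: sds_service.list(search='name=%s' % name)[0].id
        for name in ('nfs_domain_1', 'nfs_domain_2')
    }

    for i in range(50):
        target = 'nfs_domain_1' if i % 2 else 'nfs_domain_2'
        # Moving an attached disk of a running VM performs live storage
        # migration, the flow that intermittently failed in this bug.
        disk_service.move(storage_domain=types.StorageDomain(id=domain_ids[target]))
        # Wait for the disk to unlock before starting the next round.
        while disk_service.get().status != types.DiskStatus.OK:
            time.sleep(10)

    connection.close()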