Created attachment 1536879 [details]
engine.log and vdsm from each gluster node.

Description of problem:
I tried moving some disks from one gluster volume to another. 8 worked, 6 failed. I can't retry the move because ovirt says:

2019-02-14 21:36:49,450-08 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler4) [2d9789d1] BaseAsyncTask::logEndTaskFailure: Task '2a0e703b-0239-41f8-a920-50c1ae096590' (Parent Command 'CreateImagePlaceholder', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists: ('d33e8048-a4b4-4b85-bf44-20be65b854f2',)',

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.8.2-1.el7.centos.noarch
glusterfs-3.8.15-2.el7.x86_64
vdsm-4.19.43-1.el7.centos.x86_64

How reproducible:
Not sure; 8 worked, 6 failed, and it is not clear why some failed.

Steps to Reproduce:
1. Create a VM with a disk volume in a Gluster storage domain.
2. Move the disk to a different Gluster storage domain while the VM is running.

Actual results:
2019-02-14 21:34:03,079-08 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [7adfa09d-d0d6-4478-9a6a-c505535e325b] Command 'LiveMigrateVmDisks' id: '4e1628ad-3396-486d-ae69-5702a0173e6f' child commands '[6e15830b-3fb6-42a6-bb36-3251f8fd8c25, 2984f152-666f-4862-bbf2-a36ce7bd1985, 15606702-7708-4bac-8039-6c63005518e4]' executions were completed, status 'FAILED'

Expected results:
Successful move of the disk.

Additional info:
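For reference, the move in step 2 can also be driven programmatically instead of through the web UI. The following is a minimal sketch assuming the oVirt Python SDK v4 (ovirtsdk4) and its disk "move" action; the engine URL, credentials, disk name, and target domain name are hypothetical placeholders, not values from this report.

#!/usr/bin/env python
# Minimal sketch, assuming ovirtsdk4 is installed and the v4 API exposes a
# "move" action on the disk service. All names below are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical engine URL
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
try:
    disks_service = connection.system_service().disks_service()
    # Look up the disk by name (assumes the name is unique).
    disk = disks_service.list(search='name=mydisk_Disk1')[0]
    disk_service = disks_service.disk_service(disk.id)
    # Request the move to the other Gluster storage domain; with the VM
    # running, the engine is expected to perform a live storage migration.
    disk_service.move(storage_domain=types.StorageDomain(name='gluster-domain-2'))
finally:
    connection.close()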
Moved some more disks tonight: 15 worked fine, 1 failed. The main thing I need to know is how to clean up the failed disks so I can try the move again.
Since I haven't heard anything, I did some experimenting. I mounted the destination gluster volume via NFS and removed the previously failed move (37db52be-89bb-4867-9854-97f215ecd3a2, awsnms). Then I tried the disk move again. It failed:

2019-02-27 14:06:40,304-08 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler10) [a26c9ab5-425c-4d0d-8cfb-af5be48cb877] Command 'VmReplicateDiskFinishVDSCommand(HostName = ovirt9.j2noc.com, VmReplicateDiskParameters:{runAsync='true', hostId='ad1b0f7f-99b1-48eb-b113-acbc57ec280b', vmId='6cc68abc-4263-4b5f-81ab-c967fd4169e2', storagePoolId='00000001-0001-0001-0001-0000000002c5', srcStorageDomainId='22df0943-c131-4ed8-ba9c-05923afcf8e3', targetStorageDomainId='22df0943-c131-4ed8-ba9c-05923afcf8e3', imageGroupId='37db52be-89bb-4867-9854-97f215ecd3a2', imageId='426bd122-eb9c-4c2f-ac1f-229a0e207aec'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

vdsm.log:

2019-02-27 14:06:39,116-0800 ERROR (jsonrpc/6) [virt.vm] (vmId='6cc68abc-4263-4b5f-81ab-c967fd4169e2') Replication job not found (drive: 'vda', srcDisk: {u'device': u'disk', u'poolID': u'00000001-0001-0001-0001-0000000002c5', u'volumeID': u'426bd122-eb9c-4c2f-ac1f-229a0e207aec', u'domainID': u'22df0943-c131-4ed8-ba9c-05923afcf8e3', u'imageID': u'37db52be-89bb-4867-9854-97f215ecd3a2'}, job: {}) (vm:3828)
2019-02-27 14:06:39,128-0800 INFO (jsonrpc/6) [vdsm.api] FINISH diskReplicateFinish return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.144.110.101,52116, flow_id=a26c9ab5-425c-4d0d-8cfb-af5be48cb877 (api:52)

I don't know why it's marking the volume ILLEGAL:

2019-02-27 14:06:56,473-0800 INFO (merge/6cc68abc) [vdsm.api] START imageSyncVolumeChain(sdUUID=u'22df0943-c131-4ed8-ba9c-05923afcf8e3', imgUUID=u'37db52be-89bb-4867-9854-97f215ecd3a2', volUUID=u'426bd122-eb9c-4c2f-ac1f-229a0e207aec', newChain=[u'11c58d32-be15-42aa-b782-657ca1510ccc']) from=internal, task_id=70a9f22b-0f6f-4855-804e-2fb2912d8436 (api:46)
2019-02-27 14:06:56,554-0800 INFO (merge/6cc68abc) [storage.Image] Current chain=11c58d32-be15-42aa-b782-657ca1510ccc < 426bd122-eb9c-4c2f-ac1f-229a0e207aec (top) (image:1266)
2019-02-27 14:06:56,554-0800 INFO (merge/6cc68abc) [storage.Image] Unlinking subchain: [u'426bd122-eb9c-4c2f-ac1f-229a0e207aec'] (image:1276)
2019-02-27 14:06:56,570-0800 INFO (merge/6cc68abc) [storage.Image] Leaf volume 426bd122-eb9c-4c2f-ac1f-229a0e207aec is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:1284)
2019-02-27 14:06:56,570-0800 INFO (merge/6cc68abc) [storage.VolumeManifest] sdUUID=22df0943-c131-4ed8-ba9c-05923afcf8e3 imgUUID=37db52be-89bb-4867-9854-97f215ecd3a2 volUUID = 426bd122-eb9c-4c2f-ac1f-229a0e207aec legality = ILLEGAL (volume:398)

I'll try again later tonight with the VM shut down.
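For the manual cleanup experiment above, this is roughly how one might inspect (read-only) what was left behind before removing anything. A minimal sketch, assuming the destination volume is mounted the way described above and that the domain uses the usual <sd_uuid>/images/<image_uuid>/ layout of file-based (Gluster/NFS) storage domains; the mount point is a hypothetical placeholder, and the UUIDs are the ones from the log above.

#!/usr/bin/env python3
# Hedged sketch: list the contents of a leftover image directory on a mounted
# file-based storage domain. This only lists files; it does not delete anything.
import os

MOUNT = '/mnt/dest-gluster'                          # hypothetical mount point
SD_UUID = '22df0943-c131-4ed8-ba9c-05923afcf8e3'     # storage domain UUID from the log
IMG_UUID = '37db52be-89bb-4867-9854-97f215ecd3a2'    # failed image (awsnms) from the log

image_dir = os.path.join(MOUNT, SD_UUID, 'images', IMG_UUID)
if not os.path.isdir(image_dir):
    print('no leftover image directory:', image_dir)
else:
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        print('%12d  %s' % (os.path.getsize(path), name))

Removing volumes by hand while the engine still has records for the image can leave the database and the storage out of sync, which is why the earlier comment asks for the supported cleanup procedure.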
Moving the disk with the VM shut down worked fine. Is there some kind of limitation where oVirt can't move a disk image over, say, 15G while the VM is live, or over a 1G network connection? I tried moving a few more disks on live systems; all the ones over 20G failed after trying for between 40 and 70 minutes.
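To put the 40-70 minute failures in perspective, here is a rough back-of-the-envelope sketch of the raw copy time over a 1 Gbit/s link. The sizes mirror the ones mentioned above, the assumed link efficiency is illustrative, and the estimate ignores replication of in-flight guest writes and Gluster replica traffic.

# Hedged estimate of raw copy time over a 1 Gbit/s link; numbers are
# illustrative, not measured on this cluster.
GIB = 2**30
LINK_BPS = 1_000_000_000          # 1 Gbit/s
EFFICIENCY = 0.7                  # assumed usable fraction of the link

for size_gib in (15, 20, 50):
    seconds = size_gib * GIB * 8 / (LINK_BPS * EFFICIENCY)
    print(f'{size_gib:>3} GiB: ~{seconds / 60:.1f} min at {EFFICIENCY:.0%} of 1 Gbit/s')

Even a 20G image should copy in a handful of minutes on an otherwise idle 1 Gbit/s link, so failures after 40-70 minutes point to the link being shared or saturated rather than to a hard size limit, which matches the next two comments.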
Is your network saturated during the move? Tal, do you have anything to add?
(In reply to Sahina Bose from comment #4)
> Is your network saturated during the move?
> Tal, do you have anything to add?

No, most likely it's exactly that.
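Since network saturation is the suspected cause, one quick way to check is to sample the host's interface byte counters while a live move is running. A minimal sketch reading /proc/net/dev; the interface name is a hypothetical placeholder for the host's storage/migration NIC.

# Hedged sketch: sample /proc/net/dev twice and report throughput on one interface.
import time

IFACE = 'em1'          # hypothetical: the host's storage/migration interface
INTERVAL = 5.0         # seconds between samples

def rx_tx_bytes(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open('/proc/net/dev') as f:
        for line in f:
            name, _, rest = line.partition(':')
            if name.strip() == iface:
                fields = rest.split()
                return int(fields[0]), int(fields[8])
    raise ValueError('interface not found: %s' % iface)

rx1, tx1 = rx_tx_bytes(IFACE)
time.sleep(INTERVAL)
rx2, tx2 = rx_tx_bytes(IFACE)
print('rx %.0f Mbit/s, tx %.0f Mbit/s' % (
    (rx2 - rx1) * 8 / INTERVAL / 1e6,
    (tx2 - tx1) * 8 / INTERVAL / 1e6,
))

If tx or rx sits near line rate on a 1 Gbit/s interface for the duration of the move, the live replication is competing with regular Gluster traffic, consistent with the explanation above.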
A similar bug exists also on non-Gluster storage - bug 1520546
*** This bug has been marked as a duplicate of bug 1520546 ***