Created attachment 1536879 [details]
engine.log and vdsm from each gluster node.

Description of problem:
I tried moving some disks from one gluster volume to another. 8 worked, 6 failed. I can't retry the move because ovirt says:

2019-02-14 21:36:49,450-08 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler4) [2d9789d1] BaseAsyncTask::logEndTaskFailure: Task '2a0e703b-0239-41f8-a920-50c1ae096590' (Parent Command 'CreateImagePlaceholder', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists: ('d33e8048-a4b4-4b85-bf44-20be65b854f2',)',

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.8.2-1.el7.centos.noarch
glusterfs-3.8.15-2.el7.x86_64
vdsm-4.19.43-1.el7.centos.x86_64

How reproducible:
Not sure; 8 worked, 6 failed, and it is not clear why some failed.

Steps to Reproduce:
1. Create a VM with a disk volume in a Gluster storage domain.
2. Move the disk to a different Gluster storage domain while the VM is running.

Actual results:
2019-02-14 21:34:03,079-08 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [7adfa09d-d0d6-4478-9a6a-c505535e325b] Command 'LiveMigrateVmDisks' id: '4e1628ad-3396-486d-ae69-5702a0173e6f' child commands '[6e15830b-3fb6-42a6-bb36-3251f8fd8c25, 2984f152-666f-4862-bbf2-a36ce7bd1985, 15606702-7708-4bac-8039-6c63005518e4]' executions were completed, status 'FAILED'

Expected results:
Successful move of the disk.

Additional info:
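For reference, the move in step 2 can also be driven programmatically instead of through the web UI. The following is a minimal sketch assuming the oVirt Python SDK v4 (ovirtsdk4) and its disk "move" action; the engine URL, credentials, disk name, and target domain name are hypothetical placeholders, not values from this report.

#!/usr/bin/env python
# Minimal sketch, assuming ovirtsdk4 is installed and the v4 API exposes a
# "move" action on the disk service. All names below are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical engine URL
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
try:
    disks_service = connection.system_service().disks_service()
    # Look up the disk by name (assumes the name is unique).
    disk = disks_service.list(search='name=mydisk_Disk1')[0]
    disk_service = disks_service.disk_service(disk.id)
    # Request the move to the other Gluster storage domain; with the VM
    # running, the engine is expected to perform a live storage migration.
    disk_service.move(storage_domain=types.StorageDomain(name='gluster-domain-2'))
finally:
    connection.close()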
Moved some more disks tonight: 15 worked fine, 1 failed. The main thing I need to know is how to clean up the failed disks so I can try the move again.
Since I haven't heard anything, I did some experimenting. I mounted the destination gluster volume via NFS and removed the previously failed move (37db52be-89bb-4867-9854-97f215ecd3a2, awsnms). Then I tried the disk move again. It failed:

2019-02-27 14:06:40,304-08 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler10) [a26c9ab5-425c-4d0d-8cfb-af5be48cb877] Command 'VmReplicateDiskFinishVDSCommand(HostName = ovirt9.j2noc.com, VmReplicateDiskParameters:{runAsync='true', hostId='ad1b0f7f-99b1-48eb-b113-acbc57ec280b', vmId='6cc68abc-4263-4b5f-81ab-c967fd4169e2', storagePoolId='00000001-0001-0001-0001-0000000002c5', srcStorageDomainId='22df0943-c131-4ed8-ba9c-05923afcf8e3', targetStorageDomainId='22df0943-c131-4ed8-ba9c-05923afcf8e3', imageGroupId='37db52be-89bb-4867-9854-97f215ecd3a2', imageId='426bd122-eb9c-4c2f-ac1f-229a0e207aec'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

vdsm.log:

2019-02-27 14:06:39,116-0800 ERROR (jsonrpc/6) [virt.vm] (vmId='6cc68abc-4263-4b5f-81ab-c967fd4169e2') Replication job not found (drive: 'vda', srcDisk: {u'device': u'disk', u'poolID': u'00000001-0001-0001-0001-0000000002c5', u'volumeID': u'426bd122-eb9c-4c2f-ac1f-229a0e207aec', u'domainID': u'22df0943-c131-4ed8-ba9c-05923afcf8e3', u'imageID': u'37db52be-89bb-4867-9854-97f215ecd3a2'}, job: {}) (vm:3828)
2019-02-27 14:06:39,128-0800 INFO (jsonrpc/6) [vdsm.api] FINISH diskReplicateFinish return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.144.110.101,52116, flow_id=a26c9ab5-425c-4d0d-8cfb-af5be48cb877 (api:52)

I don't know why it's marking the volume ILLEGAL:

2019-02-27 14:06:56,473-0800 INFO (merge/6cc68abc) [vdsm.api] START imageSyncVolumeChain(sdUUID=u'22df0943-c131-4ed8-ba9c-05923afcf8e3', imgUUID=u'37db52be-89bb-4867-9854-97f215ecd3a2', volUUID=u'426bd122-eb9c-4c2f-ac1f-229a0e207aec', newChain=[u'11c58d32-be15-42aa-b782-657ca1510ccc']) from=internal, task_id=70a9f22b-0f6f-4855-804e-2fb2912d8436 (api:46)
2019-02-27 14:06:56,554-0800 INFO (merge/6cc68abc) [storage.Image] Current chain=11c58d32-be15-42aa-b782-657ca1510ccc < 426bd122-eb9c-4c2f-ac1f-229a0e207aec (top) (image:1266)
2019-02-27 14:06:56,554-0800 INFO (merge/6cc68abc) [storage.Image] Unlinking subchain: [u'426bd122-eb9c-4c2f-ac1f-229a0e207aec'] (image:1276)
2019-02-27 14:06:56,570-0800 INFO (merge/6cc68abc) [storage.Image] Leaf volume 426bd122-eb9c-4c2f-ac1f-229a0e207aec is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:1284)
2019-02-27 14:06:56,570-0800 INFO (merge/6cc68abc) [storage.VolumeManifest] sdUUID=22df0943-c131-4ed8-ba9c-05923afcf8e3 imgUUID=37db52be-89bb-4867-9854-97f215ecd3a2 volUUID = 426bd122-eb9c-4c2f-ac1f-229a0e207aec legality = ILLEGAL (volume:398)

I'll try again later tonight with the VM shut down.
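For the manual cleanup experiment above, this is roughly how one might inspect (read-only) what was left behind before removing anything. A minimal sketch, assuming the destination volume is mounted the way described above and that the domain uses the usual <sd_uuid>/images/<image_uuid>/ layout of file-based (Gluster/NFS) storage domains; the mount point is a hypothetical placeholder, and the UUIDs are the ones from the log above.

#!/usr/bin/env python3
# Hedged sketch: list the contents of a leftover image directory on a mounted
# file-based storage domain. This only lists files; it does not delete anything.
import os

MOUNT = '/mnt/dest-gluster'                          # hypothetical mount point
SD_UUID = '22df0943-c131-4ed8-ba9c-05923afcf8e3'     # storage domain UUID from the log
IMG_UUID = '37db52be-89bb-4867-9854-97f215ecd3a2'    # failed image (awsnms) from the log

image_dir = os.path.join(MOUNT, SD_UUID, 'images', IMG_UUID)
if not os.path.isdir(image_dir):
    print('no leftover image directory:', image_dir)
else:
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        print('%12d  %s' % (os.path.getsize(path), name))

Removing volumes by hand while the engine still has records for the image can leave the database and the storage out of sync, which is why the earlier comment asks for the supported cleanup procedure.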
Moving the disk with the VM shut down worked fine. Is there some kind of limitation where oVirt can't move a disk image over, say, 15G while the VM is live, or over a 1G network connection? I tried moving a few more disks on live systems; all the ones over 20G failed after trying for between 40 and 70 minutes.
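To put the 40-70 minute failures in perspective, here is a rough back-of-the-envelope sketch of the raw copy time over a 1 Gbit/s link. The sizes mirror the ones mentioned above, the assumed link efficiency is illustrative, and the estimate ignores replication of in-flight guest writes and Gluster replica traffic.

# Hedged estimate of raw copy time over a 1 Gbit/s link; numbers are
# illustrative, not measured on this cluster.
GIB = 2**30
LINK_BPS = 1_000_000_000          # 1 Gbit/s
EFFICIENCY = 0.7                  # assumed usable fraction of the link

for size_gib in (15, 20, 50):
    seconds = size_gib * GIB * 8 / (LINK_BPS * EFFICIENCY)
    print(f'{size_gib:>3} GiB: ~{seconds / 60:.1f} min at {EFFICIENCY:.0%} of 1 Gbit/s')

Even a 20G image should copy in a handful of minutes on an otherwise idle 1 Gbit/s link, so failures after 40-70 minutes point to the link being shared or saturated rather than to a hard size limit, which matches the next two comments.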
Is your network saturated during the move? Tal, do you have anything to add?
(In reply to Sahina Bose from comment #4)
> Is your network saturated during the move?
> Tal, do you have anything to add?

No, most likely it's exactly that.
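Since network saturation is the suspected cause, one quick way to check is to sample the host's interface byte counters while a live move is running. A minimal sketch reading /proc/net/dev; the interface name is a hypothetical placeholder for the host's storage/migration NIC.

# Hedged sketch: sample /proc/net/dev twice and report throughput on one interface.
import time

IFACE = 'em1'          # hypothetical: the host's storage/migration interface
INTERVAL = 5.0         # seconds between samples

def rx_tx_bytes(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open('/proc/net/dev') as f:
        for line in f:
            name, _, rest = line.partition(':')
            if name.strip() == iface:
                fields = rest.split()
                return int(fields[0]), int(fields[8])
    raise ValueError('interface not found: %s' % iface)

rx1, tx1 = rx_tx_bytes(IFACE)
time.sleep(INTERVAL)
rx2, tx2 = rx_tx_bytes(IFACE)
print('rx %.0f Mbit/s, tx %.0f Mbit/s' % (
    (rx2 - rx1) * 8 / INTERVAL / 1e6,
    (tx2 - tx1) * 8 / INTERVAL / 1e6,
))

If tx or rx sits near line rate on a 1 Gbit/s interface for the duration of the move, the live replication is competing with regular Gluster traffic, consistent with the explanation above.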
A similar bug exists also on non-Gluster storage - bug 1520546
*** This bug has been marked as a duplicate of bug 1520546 ***