Bug 1497931 - [RACE] Storage migration fails on Cannot create logical volume if a disk is moved from x to y and then back again before the original disk is cleared from the original source
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Nobody
QA Contact: Avihai
URL:
Whiteboard:
Duplicates: 1559929
Depends On:
Blocks:
 
Reported: 2017-10-03 07:20 UTC by Lilach Zitnitski
Modified: 2022-01-18 11:14 UTC
CC List: 10 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-09-29 11:33:06 UTC
oVirt Team: Storage
Embargoed:


Attachments
logs (720.32 KB, application/zip), 2017-10-03 07:21 UTC, Lilach Zitnitski
new-logs (401.17 KB, application/zip), 2017-10-18 08:48 UTC, Lilach Zitnitski
new-logs2 (3.18 MB, application/zip), 2018-07-01 11:50 UTC, Elad
logs3 (3.52 MB, application/x-gzip), 2018-09-04 11:42 UTC, Elad


Links
oVirt gerrit 82515: lvm: Add stderr to CannotCreateLogicalVolume (MERGED, last updated 2020-12-16 00:27:19 UTC)

Description Lilach Zitnitski 2017-10-03 07:20:34 UTC
Description of problem:
Sometimes storage migration (so far I have seen it only for cold migration) fails to complete, and the logs show the error "Cannot create Logical Volume".

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171001151854.git75e5c70.el7.centos.noarch
vdsm-4.20.3-126.git48cdf1b.el7.centos.x86_64

How reproducible:
~50%

Steps to Reproduce:
1. Create a VM with disks
2. Migrate the disks to a different storage domain

Actual results:
In some cases, the migration fails 

Expected results:
All disks should move to the new storage domain

Additional info:

relevant disk migration correlationID - disks_syncAction_f463ead4-87ac-44c8

engine.log

2017-10-02 23:26:20,640+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] BaseAsyncTask::logEndTaskFailure: Task '4978ff6e-cd20-4c88-9659-67d9fa0e0b5e' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Cannot create Logical Volume: ('1fcd4803-1fe6-4526-a778-aecec4f3f235', 'd50cc533-3398-4b2e-8ef0-93a55e372fc2'), code = 550',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = Cannot create Logical Volume: ('1fcd4803-1fe6-4526-a778-aecec4f3f235', 'd50cc533-3398-4b2e-8ef0-93a55e372fc2'), code = 550'

2017-10-02 23:26:31,055+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk disk_virtio_scsiraw_0223223354 to domain iscsi_0.


vdsm.log

2017-10-02 23:26:11,940+0300 ERROR (tasks/4) [storage.Volume] Failed to create volume /rhev/data-center/mnt/blockSD/1fcd4803-1fe6-4526-a778-aecec4f3f235/images/78292e21-b44c-4be2-9879-33636057c9c3/d50cc533-3398-4b2e-8ef0-93a55e372fc2: Cannot create Logical Volume: ('1fcd4803-1fe6-4526-a778-aecec4f3f235', 'd50cc533-3398-4b2e-8ef0-93a55e372fc2') (volume:1159)
2017-10-02 23:26:11,941+0300 ERROR (tasks/4) [storage.Volume] Unexpected error (volume:1196)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1156, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 501, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1161, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName)
CannotCreateLogicalVolume: Cannot create Logical Volume: ('1fcd4803-1fe6-4526-a778-aecec4f3f235', 'd50cc533-3398-4b2e-8ef0-93a55e372fc2')

Comment 1 Lilach Zitnitski 2017-10-03 07:21:19 UTC
Created attachment 1333528 [details]
logs

Comment 2 Allon Mureinik 2017-10-03 12:44:06 UTC
Unfortunately, the log isn't too useful, as the error in VDSM just indicates that the LV could not be created but doesn't indicate WHY this happened.

The attached patch, or some variant thereof, should improve the log to carry this information and help debug this issue.
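
For illustration only (this is not the merged vdsm patch, just the idea behind it): carrying the lvcreate stderr inside the exception is what makes the later logs in this bug show the err=[...] part, so the reason for the failure becomes visible. A minimal sketch, assuming a hypothetical helper run_lvcreate that returns (rc, out, err):

# Sketch only -- not the actual vdsm code from the attached patch.
# The point is to attach the LVM stderr to CannotCreateLogicalVolume so
# the log explains *why* lvcreate failed (e.g. "Logical Volume ... already
# exists in volume group ...").
class CannotCreateLogicalVolume(Exception):
    def __init__(self, vgName, lvName, err=None):
        self.vgName = vgName
        self.lvName = lvName
        self.err = err  # stderr lines captured from the failed lvcreate run
        msg = "Cannot create Logical Volume: vgname=%s lvname=%s err=%r" % (
            vgName, lvName, err)
        super(CannotCreateLogicalVolume, self).__init__(msg)


def createLV(vgName, lvName, size, run_lvcreate):
    # run_lvcreate is a placeholder for whatever runs the lvcreate command
    # and returns (rc, out, err); the real vdsm helper differs.
    rc, out, err = run_lvcreate(vgName, lvName, size)
    if rc != 0:
        raise CannotCreateLogicalVolume(vgName, lvName, err)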

Comment 3 Allon Mureinik 2017-10-16 13:26:31 UTC
Lilach - now that the attached patch is merged, we should have some better logs if such a failure occurs.
Could you please rerun your scenario with a build that includes commit 00c30e7be (If you wait for **today**'s nightly, you could probably use it tomorrow morning) and attach the logs here?

Thanks!

Comment 4 Lilach Zitnitski 2017-10-18 08:48:14 UTC
Created attachment 1340046 [details]
new-logs

Adding new engine and vdsm logs

relevant migration correlation ID is: disks_syncAction_f30c8e95-2106-458c

Comment 5 Allon Mureinik 2017-10-18 08:51:45 UTC
From VDSM log:

2017-10-18 11:32:38,509+0300 DEBUG (tasks/4) [storage.Misc.excCmd] FAILED: <err> = '  Logical Volume "791f95fd-8965-432c-b197-23554c19dbdd" already exists in volume group "73461dc7-d144-4cb2-a016-e8c633b13630"\n'; <rc> = 5 (commands:94)

Benny, does this make any sense to you?

Comment 6 Benny Zlotnik 2017-11-02 16:04:43 UTC
2017-10-18 11:32:39,331+0300 ERROR (tasks/4) [storage.TaskManager.Task] (Task='128df0df-af53-45de-a3c8-35262b83eaf6') Unexpected error (task:872)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 879, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1936, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 770, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1201, in create
    (volUUID, e))
VolumeCreationError: Error creating a new volume: ('Volume creation 791f95fd-8965-432c-b197-23554c19dbdd failed: [Errno 2] No such file or directory',)

Looks like 791f95fd-8965-432c-b197-23554c19dbdd was removed earlier:
2017-10-18 11:32:39,324+0300 WARN  (tasks/4) [root] File: /rhev/data-center/mnt/blockSD/73461dc7-d144-4cb2-a016-e8c633b13630/images/490e84a1-3348-4095-b4e3-d2d201f8ed93/791f95fd-8965-432c-b197-23554c19dbdd already removed (fileutils:51)

2017-10-18 11:32:39,119+0300 DEBUG (tasks/5) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/blockSD/73461dc7-d144-4cb2-a016-e8c633b13630/images/490e84a1-3348-4095-b4e3-d2d201f8ed93 (blockSD:702)

I'll have a deeper look at this

Comment 7 Benny Zlotnik 2017-12-04 15:24:59 UTC
Update:
This looks like a race between volume removal and volume creation when the same disk is moved back to the source relatively quickly.
Because the disk is moved twice between the same two SDs, the cold move operation, which consists of a copy followed by a remove, may not have finished the remove part by the time the second move starts, as the log snippets below show.


----- volume delete issued ---------
2017-10-18 11:32:29,347+0300 INFO  (jsonrpc/0) [vdsm.api] START deleteImage(sdUUID='73461dc7-d144-4cb2-a016-e8c633b13630', spUUID='7241d37d-72b2-4857-b522-0162a0872fd8', imgUUID='490e84a1-3348-4095-b4e3-d2d201f8ed93', postZero='true', force='false', discard=True) from=::ffff:10.35.69.27,58374, flow_id=disks_syncAction_7be3cc5b-2c0b-4fa0, task_id=e0c7c0c8-0579-4330-90b9-4a330c6cf574 (api:46)

---- volume create issued ---------
2017-10-18 11:32:37,431+0300 INFO  (jsonrpc/6) [vdsm.api] START createVolume(sdUUID='73461dc7-d144-4cb2-a016-e8c633b13630', spUUID='7241d37d-72b2-4857-b522-0162a0872fd8', imgUUID='490e84a1-3348-4095-b4e3-d2d201f8ed93', size='1073741824', volFormat=5, preallocate=1, diskType=2, volUUID='791f95fd-8965-432c-b197-23554c19dbdd', desc=None, srcImgUUID='00000000-0000-0000-0000-000000000000', srcVolUUID='00000000-0000-0000-0000-000000000000', initialSize=None) from=::ffff:10.35.69.27,58374, flow_id=disks_syncAction_f30c8e95-2106-458c, task_id=128df0df-af53-45de-a3c8-35262b83eaf6 (api:46)


2017-10-18 11:32:39,119+0300 DEBUG (tasks/5) [storage.StorageDomainManifest] removed: ['/rhev/data-center/mnt/blockSD/73461dc7-d144-4cb2-a016-e8c633b13630/images/490e84a1-3348-4095-b4e3-d2d201f8ed93/791f95fd-8965-432c-b197-23554c19dbdd'] (blockSD:691)
2017-10-18 11:32:39,119+0300 DEBUG (tasks/5) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/blockSD/73461dc7-d144-4cb2-a016-e8c633b13630/images/490e84a1-3348-4095-b4e3-d2d201f8ed93 (blockSD:702)

---- volume delete finished ----

2017-10-18 11:32:39,148+0300 INFO  (tasks/5) [storage.ThreadPool.WorkerThread] FINISH task e0c7c0c8-0579-4330-90b9-4a330c6cf574 (threadPool:210)

This leaves us without the 490e84a1-3348-4095-b4e3-d2d201f8ed93 image directory in which 791f95fd-8965-432c-b197-23554c19dbdd is about to be created.


So the steps to reproduce can be updated to (a toy sketch of the timing follows below):
1. Move a disk and make sure the removal is delayed
2. Move the same disk again before the removal is complete
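
To make the timing easier to see, here is a toy, self-contained model of the race: plain Python threads and a fake image directory, none of it vdsm code, with sleeps standing in for the slow post-zero removal.

# Toy model of the race described above -- plain Python, no vdsm involved.
# One thread plays the delayed deleteImage task left over from the first
# move back; the other plays the createVolume task of the second move.
import threading
import time

# Fake stand-in for the block SD image directory and the LV inside it.
image_dir = {"791f95fd": "lv"}

def delete_image_task():
    # The remove phase of the first move; postZero/discard makes it slow.
    time.sleep(0.1)
    image_dir.clear()
    print("delete task: removed image dir")

def create_volume_task():
    # createVolume of the second move, issued before the remove finished.
    if "791f95fd" in image_dir:
        print('create task: Logical Volume "791f95fd" already exists')
    time.sleep(0.2)   # the remove completes inside this window
    if not image_dir:
        print("create task: image dir is gone, volume creation fails (ENOENT)")

b = threading.Thread(target=create_volume_task)
a = threading.Thread(target=delete_image_task)
b.start()
a.start()
a.join()
b.join()

Depending on where the remove lands, the create either trips over the leftover LV ("already exists", as in comments 5, 10 and 11) or over the missing image directory ("No such file or directory", as in comment 6).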

Comment 9 Benny Zlotnik 2018-04-11 11:07:38 UTC
*** Bug 1559929 has been marked as a duplicate of this bug. ***

Comment 10 Elad 2018-07-01 11:50:43 UTC
Created attachment 1455748 [details]
new-logs2

Encountered again in our automation, logs attached.


2018-06-29 15:49:19,728+0300 ERROR (tasks/4) [storage.TaskManager.Task] (Task='2195d959-9472-4a32-b4d7-b06a4b9b578e') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1943, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 801, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1203, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 505, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1153, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: 'vgname=f54afc10-74ad-4ae5-99b5-32f1ac874c44 lvname=f28ca5d7-61ed-4337-b265-f03c3751296c err=[\'  Logical Volume "f28ca5d7-61ed-4337-b265-f03c3751296c" already exists in volume group "f54afc10-74ad-4ae5-99b5-32f1ac874c44"\']'

Comment 11 Elad 2018-09-04 11:42:39 UTC
Created attachment 1480751 [details]
logs3

This occurs every time for me with a specific disk.
Is there any progress here?


2018-09-04 13:49:50,105+0300 INFO  (tasks/9) [storage.ThreadPool.WorkerThread] START task de164555-6635-4f37-b431-c452f5c94057 (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f7508782ef0>>, args=None) (threadPool:208)
2018-09-04 13:49:50,561+0300 INFO  (tasks/9) [storage.Volume] Creating volume b35ae5df-994f-47a2-8bc0-06a883f273ab (volume:1185)
2018-09-04 13:49:50,632+0300 INFO  (tasks/9) [storage.LVM] Creating LV (vg=320dcacd-8372-4a4d-84dd-09b2fea91536, lv=b35ae5df-994f-47a2-8bc0-06a883f273ab, size=6784m, activate=True, contiguous=False, initialTags=('OVIRT_VOL_INITIALIZING',)) (lvm:1139)
2018-09-04 13:49:51,006+0300 ERROR (tasks/9) [storage.Volume] Failed to create volume /rhev/data-center/mnt/blockSD/320dcacd-8372-4a4d-84dd-09b2fea91536/images/4998ee5e-a5db-42e0-bda7-4a5f9dcfbb1c/b35ae5df-994f-47a2-8bc0-06a883f273ab: Cannot create Logical Volume: u'vgname=320dcacd-8372-4a4d-84dd-09b2fea91536 lvname=b35ae5df-994f-47a2-8bc0-06a883f273ab err=[\'  Logical Volume "b35ae5df-994f-47a2-8bc0-06a883f273ab" already exists in volume group "320dcacd-8372-4a4d-84dd-09b2fea91536"\']' (volume:1209)
2018-09-04 13:49:51,007+0300 ERROR (tasks/9) [storage.Volume] Unexpected error (volume:1246)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1206, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 505, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1153, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: u'vgname=320dcacd-8372-4a4d-84dd-09b2fea91536 lvname=b35ae5df-994f-47a2-8bc0-06a883f273ab err=[\'  Logical Volume "b35ae5df-994f-47a2-8bc0-06a883f273ab" already exists in volume group "320dcacd-8372-4a4d-84dd-09b2fea91536"\']'
2018-09-04 13:49:51,008+0300 ERROR (tasks/9) [storage.TaskManager.Task] (Task='de164555-6635-4f37-b431-c452f5c94057') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1943, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 799, in createVolume
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1206, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 505, in _create
    initialTags=(sc.TAG_VOL_UNINIT,))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1153, in createLV
    raise se.CannotCreateLogicalVolume(vgName, lvName, err)
CannotCreateLogicalVolume: Cannot create Logical Volume: u'vgname=320dcacd-8372-4a4d-84dd-09b2fea91536 lvname=b35ae5df-994f-47a2-8bc0-06a883f273ab err=[\'  Logical Volume "b35ae5df-994f-47a2-8bc0-06a883f273ab" already exists in volume group "320dcacd-8372-4a4d-84dd-09b2fea91536"\']'
2018-09-04 13:49:51,115+0300 INFO  (tasks/9) [storage.Volume] startCreateVolumeRollback: sdUUID=320dcacd-8372-4a4d-84dd-09b2fea91536 imgUUID=4998ee5e-a5db-42e0-bda7-4a5f9dcfbb1c volUUID=b35ae5df-994f-47a2-8bc0-06a883f273ab  (volume:1094

Comment 12 Elad 2018-09-04 11:46:01 UTC
LV b35ae5df-994f-47a2-8bc0-06a883f273ab does not exist on VG 320dcacd-8372-4a4d-84dd-09b2fea91536:


[root@tigris04 ~]# lvs 320dcacd-8372-4a4d-84dd-09b2fea91536 |grep b35ae5df-994f-47a2-8bc0-06a883f273ab
[root@tigris04 ~]#

Comment 13 Sandro Bonazzola 2019-01-28 09:34:18 UTC
This bug has not been marked as a blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 14 Nir Soffer 2019-02-02 01:20:59 UTC
The issue is the way deleted volumes are renamed.

On block storage, we change the image id to
__remove_<image-id> or _remove_me_ZERO_<image-id>.

This removes the volume from the namespace, but creating the same
volume again will fail, since we don't rename the removed volume's LV.

To fix this we must change how deleted volumes are managed, so that both
the image-id (stored in the LVM tag IU_<image-id>) and the volume name
are renamed.

The challenge is doing both operations in a single LVM command, so that
the change is atomic.

David, do we have a way to perform both:

    lvrename vg/lv vg/lv_new
    lvchange --addtag IU__removing__ --deltag IU_image-id vg/lv_new

in a single atomic operation?

This should be possible, since both operations on inactive LVs only change
the VG metadata, which can be fully written to the metadata area as one
transaction.

If not, we need to do two operations and handle a new failure mode where
the first operation succeeds but the second fails (see the sketch below).
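
A minimal sketch of that two-step fallback, assuming plain lvrename/lvchange invocations and that a later cleanup pass can recognize a half-renamed LV; this is not vdsm code, and the IU__removing__ tag name is taken from the snippet above:

# Sketch only -- not vdsm code. Two-step rename of a deleted volume's LV:
#   1. lvrename vg/lv vg/lv_new
#   2. lvchange --addtag IU__removing__ --deltag IU_<image-id> vg/lv_new
# If step 2 fails we are left with a renamed LV that still carries the old
# IU_<image-id> tag, so a later cleanup pass must be able to detect that.
import subprocess

def _run(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, out, err

def rename_removed_lv(vg, lv, new_lv, image_id):
    rc, out, err = _run(["lvrename", "%s/%s" % (vg, lv), "%s/%s" % (vg, new_lv)])
    if rc != 0:
        raise RuntimeError("lvrename failed: %r" % err)

    rc, out, err = _run(["lvchange",
                         "--addtag", "IU__removing__",
                         "--deltag", "IU_%s" % image_id,
                         "%s/%s" % (vg, new_lv)])
    if rc != 0:
        # New failure mode: the LV was renamed but its tags were not updated.
        # Leave it for a cleanup pass that looks for renamed LVs whose
        # IU_<image-id> tag does not match their name.
        raise RuntimeError("lvchange failed after rename of %s/%s: %r"
                           % (vg, new_lv, err))

If LVM cannot do the rename and the tag change in one command, this is the failure mode the flow would have to tolerate: an LV already renamed but still tagged with the old IU_<image-id>.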

However we implement it, we need to change the flow so that the engine sets
the new image id and volume id and keeps tracking deleted volumes. In the
current code the engine cannot track the image after calling Volume.delete()
or Image.delete(), since vdsm controls the deleted image id.

This sounds like an RFE for 4.4.

Comment 16 Eyal Shenitzky 2021-08-30 14:00:29 UTC
This bug/RFE is more than 2 years old, has not received enough attention so far, and is now flagged as pending close.
Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention in the next oVirt release.

Comment 17 Michal Skrivanek 2021-09-29 11:33:06 UTC
This bug has not received any attention in a long time and is not planned for the foreseeable future; the oVirt development team has no plans to work on it.
Please feel free to reopen it if you have a plan for contributing this feature/bug fix.

