Created attachment 1233179 [details]
engine and vdsm logs

Description of problem:
When trying to remove a snapshot of a powered-off VM, the operation fails and the snapshot remains locked.

From engine.log:

2016-12-18 17:57:43,743+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] Command 'GetVolumeInfoVDSCommand(HostName = host_mixed_3, GetVolumeInfoVDSCommandParameters:{runAsync='true', hostId='999cf93f-73bb-4372-94df-16a140930f96', storagePoolId='c3966125-fbdf-4d79-b567-233e8a4a5c15', storageDomainId='7e99c7d5-ce3c-4e3d-b966-629799302168', imageGroupId='546f49c5-2aa1-41c6-92a5-51a17c0ce162', imageId='45b8b47c-bf6b-494b-8553-7a0c35feed22'})' execution failed: VDSGenericException: VDSErrorException: Failed to GetVolumeInfoVDS, error = (-227, 'Unable to read resource owners', 'Sanlock exception'), code = 100
2016-12-18 17:57:43,743+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] FINISH, GetVolumeInfoVDSCommand, log id: 2545368e
2016-12-18 17:57:43,743+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] Failed to get the volume information, marking as FAILED
2016-12-18 17:57:43,743+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] FINISH, GetImageInfoVDSCommand, log id: d01f919
2016-12-18 17:57:43,768+02 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] transaction rolled back
2016-12-18 17:57:43,819+02 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-6-thread-44) [] transaction rolled back
2016-12-18 17:57:43,834+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-44) [] Correlation ID: 4da4b15f-23b7-485d-ae64-6d46c463c7e2, Job ID: 1a5ff5d4-83bc-402e-a7c9-9503206d5562, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'sn1' for VM 'test'.
2016-12-18 17:57:43,835+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-44) [] [within thread]: endAction for action type RemoveSnapshot threw an exception.: javax.ejb.EJBTransactionRolledbackException

From vdsm.log:

2016-12-18 17:57:41,445 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='d0bac70c-2441-4650-9807-9feb89f54a06') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3047, in getVolumeInfo
    volUUID=volUUID).getInfo()
  File "/usr/share/vdsm/storage/volume.py", line 1177, in getInfo
    return self._manifest.getInfo()
  File "/usr/share/vdsm/storage/volume.py", line 236, in getInfo
    info['lease'] = self.getLeaseStatus()
  File "/usr/share/vdsm/storage/volume.py", line 193, in getLeaseStatus
    self.volUUID)
  File "/usr/share/vdsm/storage/sd.py", line 457, in inquireVolumeLease
    return self._domainLock.inquire(lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 348, in inquire
    [(lease.path, lease.offset)])
SanlockException: (-227, 'Unable to read resource owners', 'Sanlock exception')
2016-12-18 17:57:41,470 INFO (jsonrpc/6) [storage.TaskManager.Task] (Task='d0bac70c-2441-4650-9807-9feb89f54a06') aborting: Task is aborted: u"(-227, 'Unable to read resource owners', 'Sanlock exception')" - code 100 (task:1175)
2016-12-18 17:57:41,471 ERROR (jsonrpc/6) [storage.Dispatcher] (-227, 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 72, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 105, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1183, in prepare
    raise self.error
SanlockException: (-227, 'Unable to read resource owners', 'Sanlock exception')

Version-Release number of selected component (if applicable):
vdsm-4.18.999-1162.gite95442e.el7.centos.x86_64
ovirt-engine-4.1.0-0.2.master.20161215123221.git751a3b6.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
Environment with a VM + disk and a snapshot
1. Remove the snapshot when the VM is powered off

Actual results:
The operation fails and the snapshot remains locked.

Expected results:
Live merge is working as expected

Additional info:
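For reference, the inquire call that fails in the traceback above can be exercised directly with the sanlock python bindings on the host. This is only an illustrative sketch, not part of the vdsm flow: lease_path and lease_offset are hypothetical placeholders that must be taken from the affected volume, and the read_resource_owners() signature is assumed from the clusterlock.inquire() frame in the traceback.

    import sanlock

    # IDs taken from the engine log above; lease location is hypothetical
    # and must be looked up for the affected volume.
    sd_id = "7e99c7d5-ce3c-4e3d-b966-629799302168"   # storage domain (lockspace)
    vol_id = "45b8b47c-bf6b-494b-8553-7a0c35feed22"  # volume (resource)
    lease_path = "/path/to/volume.lease"             # placeholder
    lease_offset = 0                                 # placeholder

    try:
        # Same kind of call vdsm's clusterlock.inquire() performs for a
        # volume lease.
        owners = sanlock.read_resource_owners(sd_id, vol_id,
                                              [(lease_path, lease_offset)])
        print("lease owners: %s" % owners)
    except sanlock.SanlockException as e:
        # On the affected volume this fails with (-227, 'Unable to read
        # resource owners', 'Sanlock exception'), matching the vdsm log.
        print("inquire failed: %s" % e)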
The sanlock log is probably relevant here as well.
Created attachment 1233351 [details]
sanlock.log
Hi Raz,

From the logs I see that the "old" cold merge functionality is used. Could you please confirm the DC level?

In addition, please send me the details of the deployment you are using for tests, i.e. how many hosts, storage types, etc.

Please note that it will be helpful if you mention the DC level when opening bugs, so we can focus on the new or the old implementation.

Thanks.
Hi Ala,

The DC level is 4.1 and the environment includes 3 hosts and 9 SDs (3 iSCSI, 3 NFS, 3 Gluster).
After some investigation with Raz, we found out that the current build somehow got the wrong configuration regarding cold merge, so the old flow was actually executed. We had an issue with the old flow, where a task remained stuck, and that issue is being fixed now. On a different environment with a different build, the new cold merge worked.

This build has the wrong config:
4.1.0-0.2.master.20161215123221.git751a3b6.el7.centos

This build has the correct config:
4.1.0-0.2.master.20161216212250.gitc040969.el7.centos
Raz, can you verify it works on the newer build so I can close the bug?
The sanlock issue occurs when running the old cold merge flow on Vdsm 4.1. A fix will be in soon.
Raz, can you test the fix?
Verified on ovirt-engine-4.2.0-0.0.master.20161219225535.git893d571.el7.centos.noarch

Should I move to verified?
Raz,

The issue here is only about the old cold merge flow in a 4.1 environment that uses DC version 4.0. So to verify the fix you have to use Engine 4.1 and Vdsm 4.1, but on DC version 4.0.

However, due to a different issue (fixed here: https://gerrit.ovirt.org/#/c/68757/), you currently cannot verify the old cold merge flow because the new cold merge is always executed. So basically you verified that the new cold merge actually works.
The root cause of this bug is a very old bug in legacy cold merge. During that flow, we create a temporary volume for the merge, named "<vol_id>_MERGE". When we create a volume, we also create a volume lease, and the volume lease is named after the volume (same name: "<vol_id>_MERGE"). At the end of the legacy cold merge flow, we rename the volume from "<vol_id>_MERGE" to "<vol_id>", but this rename does not update the lease name. Practically, this volume lost its lease after the rename.

So we have 3 problems:
1. Getting volume info on a volume that was merged using legacy cold merge fails. This problem is fixed by the attached patch.
2. Legacy cold merge creates volumes without a lease.
3. Volumes that were merged using legacy cold merge will fail to run new storage jobs (e.g. amend_volume, sparsify).

This bug is tracking only the first problem. We will create separate bugs for the other problems.

For testing this patch, legacy cold merge must succeed although the merged volume will not have a volume lease. You can check if the volume has a volume lease using the new vdsm client:

    vdsm-client Volume getInfo \
        storagepoolID=sp_id \
        storagedomainID=sd_id \
        imageID=img_id \
        volumeID=vol_id

If the volume has a lease, it will be returned. If the volume does not have a volume lease, or the lease is invalid, the lease is not returned in the response and a warning is logged in the vdsm log:

    Cannot get lease status: Sanlock resource name vol_id_MERGE does not match lease name vol_id, this lease must be repaired.
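As a side note, a minimal sketch of how the name mismatch described above could be inspected directly with the sanlock bindings (this is not part of the fix; lease_path and lease_offset are hypothetical placeholders, and I am assuming sanlock.read_resource() returns the on-disk resource name under a 'resource' key):

    import sanlock

    vol_id = "45b8b47c-bf6b-494b-8553-7a0c35feed22"  # volume UUID from this bug
    lease_path = "/path/to/volume.lease"             # hypothetical path
    lease_offset = 0                                 # hypothetical offset

    # Read the sanlock resource stored at the volume's lease area and compare
    # its name to the volume UUID vdsm expects.
    res = sanlock.read_resource(lease_path, lease_offset)
    if res["resource"] != vol_id:
        # A volume renamed by legacy cold merge will still carry
        # "<vol_id>_MERGE" here, matching the "this lease must be repaired"
        # warning quoted above.
        print("stale lease name %r, expected %r" % (res["resource"], vol_id))
    else:
        print("lease name matches the volume UUID")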
Verified on build oVirt Engine Version: 4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos