Bug 1405817 - Remove snapshot fail to GetVolumeInfoVDS, error = (-227, 'Unable to read resource owners', 'Sanlock exception'), code = 100
Summary: Remove snapshot fail to GetVolumeInfoVDS, error = (-227, 'Unable to read reso...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.19.2
Assignee: Ala Hino
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-18 16:18 UTC by Raz Tamir
Modified: 2017-02-01 14:52 UTC
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-01 14:52:02 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+
rule-engine: planning_ack+
amureini: devel_ack+
ratamir: testing_ack+


Attachments
engine and vdsm logs (426.53 KB, application/x-gzip)
2016-12-18 16:18 UTC, Raz Tamir
sanlock.log (6.18 KB, text/plain)
2016-12-19 09:46 UTC, Raz Tamir


Links
oVirt gerrit 68768 (master, MERGED): clusterlock: Handle invalid lease name, last updated 2020-04-15 13:47:13 UTC
oVirt gerrit 68841 (ovirt-engine-4.1, MERGED): Cold Merge: Fix action type, last updated 2020-04-15 13:47:13 UTC

Description Raz Tamir 2016-12-18 16:18:40 UTC
Created attachment 1233179 [details]
engine and vdsm logs

Description of problem:
When trying to remove a snapshot of a powered-off VM, the operation fails and the snapshot remains locked.
From engine log:
2016-12-18 17:57:43,743+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] Command 'GetVolumeInfoVDSCommand(HostName = host_mixed_3, GetVolumeInfoVDSCommandParameters:{runAsync='true', hostId='999cf93f-73bb-4372-94df-16a140930f96', storagePoolId='c3966125-fbdf-4d79-b567-233e8a4a5c15', storageDomainId='7e99c7d5-ce3c-4e3d-b966-629799302168', imageGroupId='546f49c5-2aa1-41c6-92a5-51a17c0ce162', imageId='45b8b47c-bf6b-494b-8553-7a0c35feed22'})' execution failed: VDSGenericException: VDSErrorException: Failed to GetVolumeInfoVDS, error = (-227, 'Unable to read resource owners', 'Sanlock exception'), code = 100
2016-12-18 17:57:43,743+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] FINISH, GetVolumeInfoVDSCommand, log id: 2545368e
2016-12-18 17:57:43,743+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] Failed to get the volume information, marking as FAILED
2016-12-18 17:57:43,743+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] FINISH, GetImageInfoVDSCommand, log id: d01f919
2016-12-18 17:57:43,768+02 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-6-thread-44) [56a5dcbd] transaction rolled back
2016-12-18 17:57:43,819+02 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-6-thread-44) [] transaction rolled back
2016-12-18 17:57:43,834+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-44) [] Correlation ID: 4da4b15f-23b7-485d-ae64-6d46c463c7e2, Job ID: 1a5ff5d4-83bc-402e-a7c9-9503206d5562, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'sn1' for VM 'test'.
2016-12-18 17:57:43,835+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-44) [] [within thread]: endAction for action type RemoveSnapshot threw an exception.: javax.ejb.EJBTransactionRolledbackException

from vdsm log:
2016-12-18 17:57:41,445 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='d0bac70c-2441-4650-9807-9feb89f54a06') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3047, in getVolumeInfo
    volUUID=volUUID).getInfo()
  File "/usr/share/vdsm/storage/volume.py", line 1177, in getInfo
    return self._manifest.getInfo()
  File "/usr/share/vdsm/storage/volume.py", line 236, in getInfo
    info['lease'] = self.getLeaseStatus()
  File "/usr/share/vdsm/storage/volume.py", line 193, in getLeaseStatus
    self.volUUID)
  File "/usr/share/vdsm/storage/sd.py", line 457, in inquireVolumeLease
    return self._domainLock.inquire(lease)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 348, in inquire
    [(lease.path, lease.offset)])
SanlockException: (-227, 'Unable to read resource owners', 'Sanlock exception')
2016-12-18 17:57:41,470 INFO  (jsonrpc/6) [storage.TaskManager.Task] (Task='d0bac70c-2441-4650-9807-9feb89f54a06') aborting: Task is aborted: u"(-227, 'Unable to read resource owners', 'Sanlock exception')" - code 100 (task:1175)
2016-12-18 17:57:41,471 ERROR (jsonrpc/6) [storage.Dispatcher] (-227, 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 72, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 105, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1183, in prepare
    raise self.error
SanlockException: (-227, 'Unable to read resource owners', 'Sanlock exception')



Version-Release number of selected component (if applicable):
vdsm-4.18.999-1162.gite95442e.el7.centos.x86_64
ovirt-engine-4.1.0-0.2.master.20161215123221.git751a3b6.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
Environment with a VM + disk and a snapshot
1. Remove the snapshot when the VM is powered off

Actual results:
Snapshot removal fails with a Sanlock exception and the snapshot remains locked.

Expected results:
Snapshot removal (cold merge) completes successfully.

Additional info:

Comment 1 Yaniv Kaul 2016-12-19 09:33:31 UTC
The sanlock log is probably relevant here as well.

Comment 2 Raz Tamir 2016-12-19 09:46:21 UTC
Created attachment 1233351 [details]
sanlock.log

Comment 3 Ala Hino 2016-12-19 11:42:30 UTC
Hi Raz,

From the logs I see that the "old" cold merge functionality is used.
Could you please confirm the DC level?
In addition, please send me the details of the deployment you are using for testing, i.e. how many hosts, storage types, etc.

Please note that it would be helpful if you mentioned the DC level when opening bugs, so we can tell whether the new or the old implementation is involved.

Thanks.

Comment 4 Raz Tamir 2016-12-19 11:46:23 UTC
Hi Ala,
The DC level is 4.1 and the environment includes 3 hosts and 9 SDs (3 iSCSI, 3 NFS, 3 GlusterFS)

Comment 5 Ala Hino 2016-12-19 15:33:19 UTC
After some investigation with Raz, we found out that this build somehow got the wrong configuration regarding cold merge, so the old flow was actually executed. We had an issue with the old flow where a task remained stuck; that issue is being fixed now.

On a different environment with different build, new cold merge worked.

This build has the wrong config: 
build: 4.1.0-0.2.master.20161215123221.git751a3b6.el7.centos

This build has the correct config:
build: 4.1.0-0.2.master.20161216212250.gitc040969.el7.centos

Comment 6 Tal Nisan 2016-12-19 17:03:54 UTC
Raz, can you verify it works on the newer build so I can close the bug?

Comment 7 Ala Hino 2016-12-19 21:49:55 UTC
The Sanlock issue occurs when running the old cold merge flow on Vdsm 4.1.
A fix will be posted soon.

Comment 8 Nir Soffer 2016-12-19 22:57:15 UTC
Raz, can you test the fix?

Comment 10 Raz Tamir 2016-12-20 10:20:24 UTC
Verified on - ovirt-engine-4.2.0-0.0.master.20161219225535.git893d571.el7.centos.noarch

Should I move to verified?

Comment 11 Ala Hino 2016-12-20 10:41:32 UTC
Raz,

The issue here is only about the old cold merge flow in a 4.1 environment using DC version 4.0.

So to verify the fix you have to use Engine 4.1 and Vdsm 4.1, but on DC version 4.0.
However, due to a different issue (fixed here: https://gerrit.ovirt.org/#/c/68757/), you cannot verify the old cold merge flow at the moment because the new cold merge is always executed. So basically you verified that the new cold merge actually works.

Comment 12 Nir Soffer 2016-12-24 20:50:59 UTC
The root cause of this bug is a very old bug in legacy cold merge. During that
flow, we create a temporary volume for the merge, named "<vol_id>_MERGE". When
we create a volume, we also create a volume lease, and the volume lease is
named after the volume (same name: "<vol_id>_MERGE").

At the end of the legacy cold merge flow, we rename the volume from "<vol_id>_MERGE"
to "<vol_id>", but this rename does not update the lease name. Practically, the
volume loses its lease after the rename.
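
To make the missing step concrete, here is a hypothetical sketch (not the
actual vdsm code; the function name, its arguments, and the use of
sanlock.write_resource to rewrite the lease are assumptions):

    # Hypothetical sketch of the legacy cold merge rename step (not vdsm code).
    # After the merge, the volume is renamed from "<vol_id>_MERGE" to
    # "<vol_id>", but the sanlock resource on the lease area still carries
    # the old name.
    import sanlock

    def finish_legacy_cold_merge(sd_uuid, lease_path, lease_offset,
                                 tmp_name, final_name):
        # ... rename the volume (metadata / LV) from tmp_name to final_name ...

        # The step the legacy flow is missing: rewrite the lease so that the
        # sanlock resource name matches the new volume name. Without this,
        # inquiring the renamed volume's lease fails with
        # (-227, 'Unable to read resource owners', 'Sanlock exception').
        sanlock.write_resource(sd_uuid, final_name,
                               [(lease_path, lease_offset)])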

So we have 3 problems:

1. Getting volume info on a volume that was merged using legacy cold merge
   fails.
   This problem is fixed by the attached patch.

2. Legacy cold merge creates volumes without a lease.

3. Volumes that were merged using legacy cold merge will fail to run new storage
   jobs (e.g. amend_volume, sparsify).

This bug is tracking only the first problem. We will create separate bugs for
the other problems.

For testing this patch, legacy cold merge must succeed, although the merged
volume will not have a volume lease.

You can check whether the volume has a volume lease using the new vdsm client:

    vdsm-client Volume getInfo \
        storagepoolID=sp_id \
        storagedomainID=sd_id \
        imageID=img_id \
        volumeID=vol_id

If the volume has a lease, it will be returned. If the volume does not have
a volume lease, or the lease is invalid, the lease is not returned in the
response, and a warning is logged in the vdsm log:

    Cannot get lease status: Sanlock resource name vol_id_MERGE does not match lease name vol_id, this lease must be repaired.
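
For reference, here is a minimal sketch of the kind of check that produces
this warning (not the actual patch; the function name and the return
convention are assumptions, while sanlock.read_resource and
sanlock.read_resource_owners are real sanlock module calls):

    # Hypothetical sketch (not the actual vdsm patch): read the resource
    # record first and compare its name with the expected lease name, so a
    # stale "<vol_id>_MERGE" lease is reported instead of failing with
    # (-227, 'Unable to read resource owners', 'Sanlock exception').
    import logging
    import sanlock

    log = logging.getLogger("storage.clusterlock")

    def inquire_volume_lease(lockspace, lease_name, lease_path, lease_offset):
        # read_resource() returns the on-disk record, including the resource
        # (lease) name sanlock actually stored at that offset.
        resource = sanlock.read_resource(lease_path, lease_offset)
        if resource["resource"] != lease_name:
            log.warning("Cannot get lease status: Sanlock resource name %s "
                        "does not match lease name %s, this lease must be "
                        "repaired.", resource["resource"], lease_name)
            return None  # caller omits 'lease' from the getInfo() response
        # Names match, so asking sanlock for the owners is safe.
        return sanlock.read_resource_owners(
            lockspace, lease_name, [(lease_path, lease_offset)])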

Comment 13 Avihai 2017-01-09 11:15:41 UTC
Verified on build oVirt Engine Version:
4.1.0-0.4.master.20170105161132.gitf4e2c11.el7.centos

