Created attachment 1246194 [details]
logs: engine, proxy, daemon, vdsm and the script used in tests.

Description of problem:
Trying to download an image while its storage domain is in maintenance locks the image - and later, when the storage domain is active again, the image is still reported as locked.

Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7.centos.x86_64
ovirt-imageio-common-1.0.0-0.201701021040.git022783e.el7.centos.noarch
ovirt-imageio-daemon-1.0.0-0.201701021040.git022783e.el7.centos.noarch
ovirt-engine-4.1.0-0.4.master.20170111000229.git9ce0636.el7.centos.noarch
ovirt-imageio-proxy-1.0.0-0.201701021040.git022783e.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Put the storage domain into maintenance.
2. While the storage domain is in maintenance, try to download an image (using the Python SDK).
3. Activate the domain.
4. When it shows up as active, try to download the same image (using the Python SDK).

Actual results:
While the storage domain is in maintenance, the download is blocked - which is the expected result.
When the storage domain is active again, the download operation fails saying the image is locked (the image never changes status). Waiting a few minutes and trying again still fails - the image remains locked, and the UI shows "Paused by the System".

Script output:

# ./download.py
Traceback (most recent call last):
  File "./download.py", line 85, in <module>
    proxy_url = urlparse.urlparse(transfer.proxy_url)
  File "/usr/lib64/python2.7/urlparse.py", line 142, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib64/python2.7/urlparse.py", line 181, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'

# ./download.py
Traceback (most recent call last):
  File "./download.py", line 68, in <module>
    direction=types.ImageTransferDirection.DOWNLOAD
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 10701, in add
    self._check_fault(response)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 98, in _check_fault
    Service._raise_error(response, fault)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 71, in _raise_error
    raise Error(msg)
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot ${action} Storage: The following disks are locked: 4-gb-upload. Please try again in a few minutes.]". HTTP response code is 409.

# ./download.py
Traceback (most recent call last):
  File "./download.py", line 68, in <module>
    direction=types.ImageTransferDirection.DOWNLOAD
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 10701, in add
    self._check_fault(response)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 98, in _check_fault
    Service._raise_error(response, fault)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 71, in _raise_error
    raise Error(msg)
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot ${action} Storage: The following disks are locked: 4-gb-upload. Please try again in a few minutes.]". HTTP response code is 409.
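For context, a minimal sketch of what download.py does around the two failing lines (the attached script is the authoritative version; the engine URL, credentials and disk_id below are placeholders), with an added guard for the None proxy_url that causes the first traceback:

import time
import urlparse  # Python 2, as in the traceback above

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    insecure=True,
)

disk_id = 'REPLACE-WITH-DISK-UUID'  # placeholder

transfers_service = connection.system_service().image_transfers_service()

# download.py line 68: adding the transfer is what fails with the
# 409 "disks are locked" fault once the disk is stuck in LOCKED.
transfer = transfers_service.add(
    types.ImageTransfer(
        image=types.Image(id=disk_id),
        direction=types.ImageTransferDirection.DOWNLOAD,
    )
)

transfer_service = transfers_service.image_transfer_service(transfer.id)
while transfer.phase == types.ImageTransferPhase.INITIALIZING:
    time.sleep(1)
    transfer = transfer_service.get()

# download.py line 85: if initialization failed, transfer.proxy_url is
# None and urlparse raises the AttributeError seen above. Checking it
# first turns the failure into a clear error message.
if transfer.proxy_url is None:
    raise RuntimeError("transfer %s has no proxy_url (phase: %s)"
                       % (transfer.id, transfer.phase))
proxy_url = urlparse.urlparse(transfer.proxy_url)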
engine log:

[root@ngavrilo1 ~]# tail -f /var/log/ovirt-engine/engine.log | grep ERROR
2017-01-31 11:53:44,071+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Failed in 'PrepareImageVDS' method, for vds: 'green-vdsb.qa.lab.tlv.redhat.com'; host: 'green-vdsb.qa.lab.tlv.redhat.com': null
2017-01-31 11:53:44,072+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Command 'PrepareImageVDSCommand(HostName = green-vdsb.qa.lab.tlv.redhat.com, PrepareImageVDSCommandParameters:{runAsync='true', hostId='225bee74-e4c3-44ec-9d71-e539cf5901d8'})' execution failed: null
2017-01-31 11:53:44,072+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Failed to prepare image for transfer session: {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2017-01-31 11:53:44,176+02 ERROR [org.ovirt.engine.api.restapi.util.LinkHelper] (default task-6) [] Can't find relative path for class "org.ovirt.engine.api.resource.ImagesResource", will return null
2017-01-31 11:54:15,148+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-16) [] Operation Failed: [Cannot ${action} Storage: The following disks are locked: test-upload-17-01-v2. Please try again in a few minutes.]
2017-01-31 12:08:17,942+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-8) [] Operation Failed: [Cannot ${action} Storage: The following disks are locked: test-upload-17-01-v2. Please try again in a few minutes.]

Expected results:
The second and third download attempts should succeed.
@Nir - according to the vdsm logs [1], there was a Sanlock exception on prepare. Should we catch it on engine, or could vdsm handle it?

[1]
2017-01-31 12:13:03,905 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='96f15e89-7404-4736-9069-5b3ea259d4ed') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 620, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 614, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 141, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 433, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 816, in inquireClusterLock
    return self._manifest.inquireDomainLock()
  File "/usr/share/vdsm/storage/sd.py", line 521, in inquireDomainLock
    return self._domainLock.inquire(self.getDomainLease())
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 377, in inquire
    [(lease.path, lease.offset)])
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')
2017-01-31 12:13:03,909 INFO  (jsonrpc/4) [storage.TaskManager.Task] (Task='96f15e89-7404-4736-9069-5b3ea259d4ed') aborting: Task is aborted: u"(-202, 'Unable to read resource owners', 'Sanlock exception')" - code 100 (task:1175)
2017-01-31 12:13:03,910 ERROR (jsonrpc/4) [storage.Dispatcher] (-202, 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 72, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 105, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1183, in prepare
    raise self.error
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')
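To illustrate the vdsm-side option (the names below are hypothetical - this is a sketch of the idea, not actual vdsm code or the eventual patch): the clusterlock layer could catch the raw SanlockException and re-raise a typed storage error, so the engine receives a stable error code instead of the untranslated exception seen above:

import sanlock


class InquireLockFailure(RuntimeError):
    """Hypothetical typed error, standing in for a proper vdsm storage
    error code that the engine could map to a user-visible message."""


def inquire_domain_lock(domain_lock, lease):
    # Wraps the same call that raises in the traceback above
    # (clusterlock.py line 377).
    try:
        return domain_lock.inquire(lease)
    except sanlock.SanlockException as e:
        # -202: sanlock was unable to read the resource owners, e.g.
        # because the domain's storage is currently unreachable.
        raise InquireLockFailure(
            "Cannot inquire lease %s: %s" % (lease.path, e))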
@Natalie - can you please attach the sanlock log?
The suggested patch ensures a graceful failure of PrepareImage (i.e. the disk should get unlocked). The sanlock error is a separate, unrelated issue.

@Natalie - can you please attach the sanlock logs?
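To make "graceful failure" concrete - the actual fix is in the engine's Java code; this is only an illustration of the pattern, with all names as hypothetical stand-ins for the engine's disk locking and PrepareImage flow:

def lock_disk(disk):
    disk["status"] = "LOCKED"


def unlock_disk(disk):
    disk["status"] = "OK"


def prepare_image(disk):
    # Simulates the PrepareImageVDS failure (NPE) from the engine log.
    raise RuntimeError("PrepareImageVDS failed")


def start_transfer(disk):
    lock_disk(disk)
    try:
        prepare_image(disk)
    except Exception:
        # Graceful failure: always release the lock before propagating,
        # so the disk is not left LOCKED and later download attempts
        # are not rejected with "The following disks are locked".
        unlock_disk(disk)
        raise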
(In reply to Daniel Erez from comment #1)
> @Nir - according to the vdsm logs [1], there was a Sanlock exception on
> prepare.

The exception in comment 1 is during getSpmStatus - how is this related to prepare?

> Should we catch it on engine, or could vdsm handle it?

Engine does not get a proper error from getSpmStatus?
(In reply to Nir Soffer from comment #4)
> (In reply to Daniel Erez from comment #1)
> > @Nir - according to the vdsm logs [1], there was a Sanlock exception on
> > prepare.
>
> The exception in comment 1 is during getSpmStatus - how is this related to
> prepare?

According to the engine log, the failure originated in PrepareImageVDSCommand:

2017-01-31 11:53:44,071+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Failed in 'PrepareImageVDS' method, for vds: 'green-vdsb.qa.lab.tlv.redhat.com'; host: 'green-vdsb.qa.lab.tlv.redhat.com': null
2017-01-31 11:53:44,072+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Command 'PrepareImageVDSCommand(HostName = green-vdsb.qa.lab.tlv.redhat.com, PrepareImageVDSCommandParameters:{runAsync='true', hostId='225bee74-e4c3-44ec-9d71-e539cf5901d8'})' execution failed: null

> > Should we catch it on engine, or could vdsm handle it?
>
> Engine does not get a proper error from getSpmStatus?

Seems like it does.
Verified using:
rhevm-4.1.1.4-0.1.el7.noarch

Now the disk switches state to "Paused by the System".