Bug 1417903 - Trying to download an image when its storage domain is in maintenance locks the image for good
Summary: Trying to download an image when its storage domain is in maintenance locks the image for good
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.1.1.2
Assignee: Daniel Erez
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-31 11:46 UTC by Natalie Gavrielov
Modified: 2017-04-21 09:32 UTC
CC: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:32:51 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+


Attachments
logs: engine, proxy, daemon, vdsm and the script used in tests. (2.02 MB, application/x-gzip)
2017-01-31 11:46 UTC, Natalie Gavrielov
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 71894 0 'None' MERGED core: PrepareImageVDSCommand - handle return value failure 2020-10-22 14:24:08 UTC
oVirt gerrit 72034 0 'None' MERGED core: PrepareImageVDSCommand - handle return value failure 2020-10-22 14:24:08 UTC

Description Natalie Gavrielov 2017-01-31 11:46:51 UTC
Created attachment 1246194 [details]
logs: engine, proxy, daemon, vdsm and the script used in tests.

Description of problem:

Trying to download an image while its storage domain is in maintenance locks the image - and later on, when the storage domain is active again, the image is still reported as locked.

Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7.centos.x86_64
ovirt-imageio-common-1.0.0-0.201701021040.git022783e.el7.centos.noarch
ovirt-imageio-daemon-1.0.0-0.201701021040.git022783e.el7.centos.noarch
ovirt-engine-4.1.0-0.4.master.20170111000229.git9ce0636.el7.centos.noarch
ovirt-imageio-proxy-1.0.0-0.201701021040.git022783e.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Put storage domain to maintenance
2. When storage domain is in maintenance, try to download an image (using python sdk).
2. Activate domain.
4. When it shows up as active - try to download the same image (using python sdk)
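
For context, a minimal sketch of the kind of SDK call the attached download.py presumably makes to start the transfer, assuming the 4.1-era ovirt-sdk4 API; the connection details and disk ID below are placeholders, not values taken from the actual script:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details - adjust for the actual environment.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

# Ask the engine to start a download transfer for the disk. This is the
# add() call that fails with HTTP 409 once the disk is stuck in a locked
# state (see the second and third runs below).
transfers_service = connection.system_service().image_transfers_service()
transfer = transfers_service.add(
    types.ImageTransfer(
        image=types.Image(id='<disk-id>'),  # placeholder disk ID
        direction=types.ImageTransferDirection.DOWNLOAD,
    )
)

# The first run below crashes here: transfer.proxy_url is None because the
# transfer never became ready while the storage domain was in maintenance.
print(transfer.proxy_url)

connection.close()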

Actual results:

When the storage domain is in maintenance, the download is blocked - which is the expected result.

When the storage domain is active again, the download operation fails, saying the image is locked (the disk never changes status).

Waiting a few minutes and trying again still fails - the image is still locked, and the UI shows the transfer as paused by the system.

Script output:

# ./download.py 
Traceback (most recent call last):
  File "./download.py", line 85, in <module>
    proxy_url = urlparse.urlparse(transfer.proxy_url)
  File "/usr/lib64/python2.7/urlparse.py", line 142, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib64/python2.7/urlparse.py", line 181, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'
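
The AttributeError in the run above is a secondary symptom: transfers_service.add() returned a transfer whose proxy_url is None, apparently because image preparation failed (see the engine log below) and the transfer never became ready. Continuing from the sketch under the reproduction steps, a hedged sketch of a defensive check the script could perform before parsing proxy_url, assuming the sdk4 ImageTransferPhase enum and the image_transfer_service() locator:

# Re-fetch the transfer and verify it is actually transferring before
# touching proxy_url; otherwise fail with a clear error instead of the
# AttributeError above.
transfer_service = transfers_service.image_transfer_service(transfer.id)
transfer = transfer_service.get()
if transfer.phase != types.ImageTransferPhase.TRANSFERRING or transfer.proxy_url is None:
    raise RuntimeError('Transfer is not ready (phase: %s)' % transfer.phase)
proxy_url = urlparse.urlparse(transfer.proxy_url)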

# ./download.py 
Traceback (most recent call last):
  File "./download.py", line 68, in <module>
    direction=types.ImageTransferDirection.DOWNLOAD
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 10701, in add
    self._check_fault(response)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 98, in _check_fault
    Service._raise_error(response, fault)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 71, in _raise_error
    raise Error(msg)
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot ${action} Storage: The following disks are locked: 4-gb-upload. Please try again in a few minutes.]". HTTP response code is 409.

# ./download.py 
Traceback (most recent call last):
  File "./download.py", line 68, in <module>
    direction=types.ImageTransferDirection.DOWNLOAD
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 10701, in add
    self._check_fault(response)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 98, in _check_fault
    Service._raise_error(response, fault)
  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 71, in _raise_error
    raise Error(msg)
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot ${action} Storage: The following disks are locked: 4-gb-upload. Please try again in a few minutes.]". HTTP response code is 409.
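
The second and third runs fail because the disk '4-gb-upload' is still reported as locked. For reference, the disk status can be inspected directly through the SDK; a minimal sketch, assuming the sdk4 disks_service() search API and reusing the connection object from the sketch above:

# Look up the disk by name and print its status; with this bug the status
# stays LOCKED even after the storage domain is active again.
disks_service = connection.system_service().disks_service()
for disk in disks_service.list(search='name=4-gb-upload'):
    print('%s: %s' % (disk.name, disk.status))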

engine log:

[root@ngavrilo1 ~]# tail -f /var/log/ovirt-engine/engine.log| grep ERROR
2017-01-31 11:53:44,071+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Failed in 'PrepareImageVDS' method, for vds: 'green-vdsb.qa.lab.tlv.redhat.com'; host: 'green-vdsb.qa.lab.tlv.redhat.com': null
2017-01-31 11:53:44,072+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Command 'PrepareImageVDSCommand(HostName = green-vdsb.qa.lab.tlv.redhat.com, PrepareImageVDSCommandParameters:{runAsync='true', hostId='225bee74-e4c3-44ec-9d71-e539cf5901d8'})' execution failed: null
2017-01-31 11:53:44,072+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Failed to prepare image for transfer session: {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2017-01-31 11:53:44,176+02 ERROR [org.ovirt.engine.api.restapi.util.LinkHelper] (default task-6) [] Can't find relative path for class "org.ovirt.engine.api.resource.ImagesResource", will return null
2017-01-31 11:54:15,148+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-16) [] Operation Failed: [Cannot ${action} Storage: The following disks are locked: test-upload-17-01-v2. Please try again in a few minutes.]
2017-01-31 12:08:17,942+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-8) [] Operation Failed: [Cannot ${action} Storage: The following disks are locked: test-upload-17-01-v2. Please try again in a few minutes.]

Expected results:
The second and third download attempts should succeed.

Comment 1 Daniel Erez 2017-02-05 09:32:20 UTC
@Nir - according to the vdsm logs [1], there was a Sanlock exception on prepare.
Should we catch it on engine or vdsm could handle it?

[1]
2017-01-31 12:13:03,905 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='96f15e89-7404-4736-9069-5b3ea259d4ed') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 620, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 614, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 141, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 433, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 816, in inquireClusterLock
    return self._manifest.inquireDomainLock()
  File "/usr/share/vdsm/storage/sd.py", line 521, in inquireDomainLock
    return self._domainLock.inquire(self.getDomainLease())
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 377, in inquire
    [(lease.path, lease.offset)])
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')
2017-01-31 12:13:03,909 INFO  (jsonrpc/4) [storage.TaskManager.Task] (Task='96f15e89-7404-4736-9069-5b3ea259d4ed') aborting: Task is aborted: u"(-202, 'Unable to read resource owners', 'Sanlock exception')" - code 100 (task:1175)
2017-01-31 12:13:03,910 ERROR (jsonrpc/4) [storage.Dispatcher] (-202, 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 72, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 105, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1183, in prepare
    raise self.error
SanlockException: (-202, 'Unable to read resource owners', 'Sanlock exception')

Comment 2 Daniel Erez 2017-02-07 10:43:02 UTC
@Natalie - can you please attach the sanlock log?

Comment 3 Daniel Erez 2017-02-09 15:42:09 UTC
The suggested patch ensures a graceful failure of PrepareImage (i.e. the disk should get unlocked). The sanlock error is a separate, unrelated issue.
@Natalie - can you please attach sanlock logs?

Comment 4 Nir Soffer 2017-03-06 22:20:53 UTC
(In reply to Daniel Erez from comment #1)
> @Nir - according to the vdsm logs [1], there was a Sanlock exception on
> prepare.

The exception in comment 1 is during getSpmStatus - how is this related to prepare?

> Should we catch it on engine or vdsm could handle it?

Engine does not get proper error from getSpmStatus?

Comment 5 Daniel Erez 2017-03-07 07:20:33 UTC
(In reply to Nir Soffer from comment #4)
> (In reply to Daniel Erez from comment #1)
> > @Nir - according to the vdsm logs [1], there was a Sanlock exception on
> > prepare.
> 
> The exception in comment 1 is during getSpmStatus - how is this related to
> prepare?

According to the engine log, the failure originated in PrepareImageVDSCommand:

2017-01-31 11:53:44,071+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Failed in 'PrepareImageVDS' method, for vds: 'green-vdsb.qa.lab.tlv.redhat.com'; host: 'green-vdsb.qa.lab.tlv.redhat.com': null
2017-01-31 11:53:44,072+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-6) [76a5c223-b1cf-4225-985a-38b50b9b69d6] Command 'PrepareImageVDSCommand(HostName = green-vdsb.qa.lab.tlv.redhat.com, PrepareImageVDSCommandParameters:{runAsync='true', hostId='225bee74-e4c3-44ec-9d71-e539cf5901d8'})' execution failed: null

> 
> > Should we catch it on engine or vdsm could handle it?
> 
> Engine does not get proper error from getSpmStatus?

Seems like it does.

Comment 6 Natalie Gavrielov 2017-03-14 13:14:18 UTC
Verified using: rhevm-4.1.1.4-0.1.el7.noarch
Now the disk switches state to "Paused by the system".

