
Bug 1724009

Summary: Storage Pool lock could not be acquired until vdsm restarted (ResourceManager)
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED WORKSFORME
QA Contact: Avihai <aefrat>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: ahadas, lsurette, mkalinin, sfishbai, srevivo, tnisan, ycui
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-24 19:55:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2019-06-26 02:46:33 UTC
Description of problem:

1. attachStorageDomain is constantly blocked in the ResourceManager trying to acquire the Storage Pool lock exclusively:

2019-06-24 10:49:07,825-0400 INFO  (jsonrpc/3) [vdsm.api] START attachStorageDomain(sdUUID=u'bad6fe7c-c890-4aa6-82fd-52eaaf9cc3ba', spUUID=u'7c2cab4a-f859-4649-b25a-23fcd0d001af', options=None) from=::ffff:10.52.223.168,49512, flow_id=654698b5-247e-435e-ba15-0732e402d643, task_id=287d9048-b232-4be9-933b-04b78f952f49 (api:48)
2019-06-24 10:51:07,826-0400 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='287d9048-b232-4be9-933b-04b78f952f49') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in attachStorageDomain
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1200, in attachStorageDomain
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1373, in getExclusiveLock
    timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 793, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
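
For illustration only, here is a minimal sketch of the pattern the traceback shows.
This is NOT vdsm's real ResourceManager; the class, names and timeout are assumptions
that just mirror the log (~120s between START and ERROR above): an exclusive request
for the pool resource waits on a condition and raises a ResourceTimeout-style error
if it is not granted in time.

import threading
import time


class ResourceTimeout(Exception):
    pass


class PoolLock(object):
    """Toy shared/exclusive lock guarding a single storage-pool resource."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of shared holders
        self._exclusive = False   # True while an exclusive holder is active

    def acquire_exclusive(self, timeout):
        deadline = time.monotonic() + timeout
        with self._cond:
            while self._shared or self._exclusive:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    # Roughly the point the attachStorageDomain task reached
                    # after ~2 minutes in the log above.
                    raise ResourceTimeout("pool lock not acquired in time")
                self._cond.wait(remaining)
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()

    def acquire_shared(self, timeout):
        deadline = time.monotonic() + timeout
        with self._cond:
            while self._exclusive:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise ResourceTimeout("pool lock not acquired in time")
                self._cond.wait(remaining)
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            if self._shared == 0:
                self._cond.notify_all()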

2. Just two seconds earlier, a getInfo should have released a shared lock for the same resource:
2019-06-24 10:49:05,039-0400 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call StoragePool.getInfo succeeded in 0.01 seconds (__init__:312)

3. The subsequent getInfo is now also blocked acquiring the lock (it only takes it SHARED, so presumably something is still holding it exclusively?):

2019-06-24 10:51:07,855-0400 INFO  (jsonrpc/0) [vdsm.api] START getStoragePoolInfo(spUUID=u'7c2cab4a-f859-4649-b25a-23fcd0d001af', options=None) from=::ffff:10.52.223.168,49512, task_id=a42a4b1d-c92e-4e96-a17e-cd437a4ba567 (api:48)
2019-06-24 10:52:07,850-0400 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'storagepoolID': u'7c2cab4a-f859-4649-b25a-23fcd0d001af'}, 'jsonrpc': '2.0', 'method': u'StoragePool.getInfo', 'id': u'9d177363-b78f-4418-b0e5-488ef0547745'} at 0x7ff1542fe610> timeout=60, duration=60.00 at 0x7ff1542fe410> task#=16162 at 0x7ff1f83f1610>, traceback:
  ...
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2543, in getStoragePoolInfo
  vars.task.getSharedLock(STORAGE, spUUID)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1383, in getSharedLock
  timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 784, in acquire
  timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
  return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 465, in acquireResource
  request.wait(timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 235, in wait
  return self._doneEvent.wait(timeout)
File: "/usr/lib64/python2.7/threading.py", line 622, in wait
  self.__cond.wait(timeout, balancing)
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 125, in wait
  return self.__cond.timedwait(abstime)
File: "/usr/lib/python2.7/site-packages/pthread.py", line 136, in timedwait
  C.pointer(abstime)) (executor:363)
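
Using the toy PoolLock sketch from item 1 (again an illustration, not vdsm code): if
something never released an exclusive hold, even a SHARED request like
getStoragePoolInfo just sits in wait() until its own timeout, which matches the
blocked-worker warning above. If the real ResourceManager also queues requests in
order (as the request.wait() frames suggest), a stuck earlier exclusive request
could likewise hold up later shared ones.

import threading

pool_lock = PoolLock()
pool_lock.acquire_exclusive(timeout=5)   # simulate a leaked exclusive hold


def get_info():
    try:
        # Normally returns immediately; here it waits the full 60 seconds.
        pool_lock.acquire_shared(timeout=60)
        pool_lock.release_shared()
    except ResourceTimeout:
        print("getStoragePoolInfo-style call timed out waiting for SHARED")


t = threading.Thread(target=get_info)
t.start()
t.join()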

4. The last command that acquired the SP lock exclusively was a detachStorageDomain, 9 hours earlier:
2019-06-24 01:40:14,349-0400 INFO  (jsonrpc/0) [vdsm.api] FINISH detachStorageDomain return=None from=::ffff:A.B.C.D,E, flow_id=18e80479-d750-4ffe-9383-0fa62648f5ed, task_id=a13ee5cc-c329-47f8-b34a-dd13506ed7c3 (api:54)

In the logs, there are no obvious hung tasks printing warnings that could be holding this SP lock. For reference:

SP LOCK EXCLUSIVE:
spmStart
spmStop
forcedDetachStorageDomain
detachStorageDomain
createStoragePool
disconnectStoragePool
destroyStoragePool
attachStorageDomain
deactivateStorageDomain
activateStorageDomain
reconstructMaster
fenceSpmStorage
upgradeStoragePool

SP LOCK SHARED:
updateVM
removeVM
getStoragePoolInfo
getImageDomainsList

Version-Release number of selected component (if applicable):
vdsm-4.30.17-1.el7ev.x86_64

How reproducible:
Unknown; a simple detach + attach loop does not seem to trigger it for me.

Additional info:
There was some storage slowness/outage at the time, which perhaps could have triggered this?

Comment 9 Daniel Gur 2019-08-28 13:11:34 UTC
sync2jira

Comment 10 Daniel Gur 2019-08-28 13:15:46 UTC
sync2jira

Comment 11 Nir Soffer 2020-06-17 12:54:40 UTC
Inaccessible storage can block operations that take the exclusive lock.

With block storage and a good configuration, operations will fail after 20-30
seconds, when multipath fails the I/O. But if a user is using a private
multipath configuration with a high value of no_path_retry (or worse, "queue"),
this can take much more time.
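
For context, this is the kind of private override being described. The values are
illustrative only; this is not vdsm's shipped /etc/multipath.conf, whose exact
defaults vary by version.

# Illustrative multipath.conf override. With "no_path_retry N", queued I/O is
# failed after roughly N * polling_interval seconds once all paths are down;
# with "queue" it is retried forever, so any thread doing I/O on that device
# can hang indefinitely.
defaults {
    polling_interval  5
    no_path_retry     queue    # or a large number -- both delay I/O errors
}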

With NFS, inaccessible storage should not block operations for more than 60
seconds, since we do all I/O via ioprocess. If storage blocks for 20-30
minutes (typical pre-4.4), ioprocess may be blocked, but vdsm will time out,
fail the operation and continue normally.
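
The general pattern being described, as a hedged sketch (this is not the real
ioprocess API): blocking storage I/O runs in a helper process and the daemon
bounds how long it waits, so a hung mount strands the helper rather than a vdsm
worker.

import multiprocessing
import os


def _stat_worker(path, queue):
    # This call may hang forever on unreachable storage; only the helper hangs.
    st = os.stat(path)
    queue.put((st.st_size, st.st_mtime))


def stat_with_timeout(path, timeout=60):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_stat_worker, args=(path, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # fail fast in the daemon instead of blocking a worker
        raise RuntimeError("storage did not respond within %s seconds" % timeout)
    if queue.empty():
        raise RuntimeError("helper process failed before returning a result")
    return queue.get()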

So this may be an issue with some code not using ioprocess, or another 
issue that we don't know about.

Can we have more info on this system? Which storage was used at the time
of the issue?

In 4.4 we stopped using pthreading, which replaced Python locking with ctypes
code calling the underlying pthread_mutex_xxx and pthread_cond_xxx functions.
This code is not entirely correct and quite complex, so it is possible that
this issue is solved in 4.4.

I think the best way to avoid such issues is to use a systemd watchdog to
kill a stuck vdsm. Vdsm has very complex locking and we cannot assume that we
will never have bugs in this code.
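
A hedged sketch of that idea (it assumes WatchdogSec= is set for the vdsmd unit
and that the python-systemd bindings are available; the health check below is a
hypothetical placeholder, not existing vdsm code):

import threading
import time

from systemd import daemon  # python-systemd bindings


def locking_machinery_is_healthy():
    # Hypothetical placeholder: a real check might verify that no worker has
    # been blocked on a storage lock for longer than some threshold.
    return True


def watchdog_loop(interval=15):
    while True:
        # If the health check stops passing, the pings stop and systemd
        # kills and restarts the stuck daemon after WatchdogSec expires.
        if locking_machinery_is_healthy():
            daemon.notify("WATCHDOG=1")
        time.sleep(interval)


threading.Thread(target=watchdog_loop, daemon=True).start()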

Do we have any way to reproduce this?

Comment 12 Germano Veit Michel 2020-07-21 03:43:38 UTC
(In reply to Nir Soffer from comment #11)
> Inaccessible storage can block operations that take the exclusive lock.
> 
> With block storage and a good configuration, operations will fail after 20-30
> seconds, when multipath fails the I/O. But if a user is using a private
> multipath configuration with a high value of no_path_retry (or worse, "queue"),
> this can take much more time.
> 
> With NFS, inaccessible storage should not block operations for more than 60
> seconds, since we do all I/O via ioprocess. If storage blocks for 20-30
> minutes (typical pre-4.4), ioprocess may be blocked, but vdsm will time out,
> fail the operation and continue normally.
> 
> So this may be an issue with some code not using ioprocess, or another 
> issue that we don't know about.
> 
> Can we have more info on this system? Which storage was used at the time
> of the issue?

It happened during an activity of copying VMs from one FC SD to another FC SD
(as a backup). All we saw was some mild slowness due to the heavy copying.

Restarting vdsm fixed the issue; likely something didn't release the SP lock.

> 
> In 4.4 we stopped using pthreading, which replaced Python locking with ctypes
> code calling the underlying pthread_mutex_xxx and pthread_cond_xxx functions.
> This code is not entirely correct and quite complex, so it is possible that
> this issue is solved in 4.4.
> 
> I think the best way to avoid such issues is to use a systemd watchdog to
> kill a stuck vdsm. Vdsm has very complex locking and we cannot assume that we
> will never have bugs in this code.
> 
> Do we have any way to reproduce this?

No, I tried at the time with no success, and we have not seen such a problem
again in support, as far as I am aware.

I think the bug can be closed, as we do not have enough data to determine the
problem; a new bug can be raised if this is seen again in 4.4+.

Comment 13 Marina Kalinin 2021-05-24 19:55:12 UTC
If we cannot reproduce this, I suggest we CLOSE with WORKSFORME and reopen once we have a reproducer, or logs from a reproduction.