Bug 1724009
| Summary: | Storage Pool lock could not be acquired until vdsm restarted (ResourceManager) | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED WORKSFORME | QA Contact: | Avihai <aefrat> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.3.0 | CC: | ahadas, lsurette, mkalinin, sfishbai, srevivo, tnisan, ycui |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-24 19:55:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Having inaccessible storage can block operations that take an exclusive lock.

With block storage and a good configuration, operations will fail after 20-30 seconds when multipath fails the I/O. But if a user is using a private multipath configuration with a high value of no_path_retry (or worse, "queue"), this can take much more time.

With NFS storage, inaccessible storage should not block for more than 60 seconds, since we do all I/O via ioprocess. If storage blocks for 20-30 minutes (typical pre-4.4), ioprocess may be blocked, but vdsm will time out, fail the operation and continue normally.

So this may be an issue with some code not using ioprocess, or another issue that we don't know about.

Can we have more info on this system? Which storage was used at the time of the issue?

In 4.4 we stopped using pthreading, which replaced Python locking with ctypes code calling the underlying pthread_mutex_xxx and pthread_cond_xxx functions. That code is not entirely correct and quite complex, so it is possible that this issue is solved in 4.4.

I think the best way to avoid such issues is to use the systemd watchdog to kill a stuck vdsm. Vdsm has very complex locking and we cannot assume that we will never have bugs in this code.

Do we have any way to reproduce this?

(In reply to Nir Soffer from comment #11)

> Can we have more info on this system? Which storage was used at the time
> of the issue?

It happened during an activity of copying VMs from one FC SD to another FC SD (as a backup). All we saw was some mild slowness due to the heavy copying. Restarting VDSM fixed the issue. Likely something didn't release the SP lock.

> Do we have any way to reproduce this?

No, I tried at the time with no success, and we have not seen such a problem again in support as far as I am aware. I think the bug can be closed, as we do not have enough data to determine the problem; a new bug can be raised if this is seen again in 4.4+.

If we cannot reproduce this, I suggest closing as WORKSFORME and reopening once we have a reproducer or logs from one.
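As a side note on the systemd watchdog suggestion above, the keepalive pattern looks roughly like the following minimal sketch. It assumes the service unit runs with Type=notify and WatchdogSec= set, so systemd kills and restarts the daemon when pings stop arriving; the helper names (sd_notify, watchdog_loop) are illustrative, not vdsm's actual code.

```python
# Minimal sketch of the systemd watchdog keepalive pattern (illustrative only).
# Assumes the unit file sets Type=notify and WatchdogSec= (e.g. WatchdogSec=60).
import os
import socket
import time


def sd_notify(message: bytes) -> None:
    """Send a notification to systemd via the socket passed in NOTIFY_SOCKET."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not running under systemd with Type=notify
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.send(message)


def watchdog_loop() -> None:
    """Ping systemd at half the watchdog period; stop pinging and systemd kills us."""
    usec = int(os.environ.get("WATCHDOG_USEC", "0"))  # exported when WatchdogSec= is set
    if not usec:
        return
    interval = usec / 2_000_000  # half the period, in seconds
    while True:
        sd_notify(b"WATCHDOG=1")
        time.sleep(interval)
```

For the watchdog to catch a hang like this one, the ping would have to depend on the health of the subsystem that can get stuck (for example, only ping while the lock manager and executor workers respond); an independent ping thread would keep a deadlocked daemon alive.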
Description of problem:

1. attachStorageDomain is constantly blocked on the ResourceManager trying to acquire the Storage Pool lock exclusively:

        2019-06-24 10:49:07,825-0400 INFO (jsonrpc/3) [vdsm.api] START attachStorageDomain(sdUUID=u'bad6fe7c-c890-4aa6-82fd-52eaaf9cc3ba', spUUID=u'7c2cab4a-f859-4649-b25a-23fcd0d001af', options=None) from=::ffff:10.52.223.168,49512, flow_id=654698b5-247e-435e-ba15-0732e402d643, task_id=287d9048-b232-4be9-933b-04b78f952f49 (api:48)
        2019-06-24 10:51:07,826-0400 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='287d9048-b232-4be9-933b-04b78f952f49') Unexpected error (task:875)
        Traceback (most recent call last):
          File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
            return fn(*args, **kargs)
          File "<string>", line 2, in attachStorageDomain
          File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
            ret = func(*args, **kwargs)
          File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1200, in attachStorageDomain
            vars.task.getExclusiveLock(STORAGE, spUUID)
          File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1373, in getExclusiveLock
            timeout)
          File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 793, in acquire
            raise se.ResourceTimeout()
        ResourceTimeout: Resource timeout: ()

2. Just two seconds earlier, a getInfo should have released a shared lock for the same resource:

        2019-06-24 10:49:05,039-0400 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call StoragePool.getInfo succeeded in 0.01 seconds (__init__:312)

3. The subsequent getInfo is now blocked on acquiring the lock (this acquires as SHARED, so presumably something is still holding it exclusively?):

        2019-06-24 10:51:07,855-0400 INFO (jsonrpc/0) [vdsm.api] START getStoragePoolInfo(spUUID=u'7c2cab4a-f859-4649-b25a-23fcd0d001af', options=None) from=::ffff:10.52.223.168,49512, task_id=a42a4b1d-c92e-4e96-a17e-cd437a4ba567 (api:48)
        2019-06-24 10:52:07,850-0400 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'storagepoolID': u'7c2cab4a-f859-4649-b25a-23fcd0d001af'}, 'jsonrpc': '2.0', 'method': u'StoragePool.getInfo', 'id': u'9d177363-b78f-4418-b0e5-488ef0547745'} at 0x7ff1542fe610> timeout=60, duration=60.00 at 0x7ff1542fe410> task#=16162 at 0x7ff1f83f1610>, traceback:
        ...
        File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2543, in getStoragePoolInfo
          vars.task.getSharedLock(STORAGE, spUUID)
        File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1383, in getSharedLock
          timeout)
        File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 784, in acquire
          timeout)
        File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
          return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
        File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 465, in acquireResource
          request.wait(timeout)
        File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 235, in wait
          return self._doneEvent.wait(timeout)
        File: "/usr/lib64/python2.7/threading.py", line 622, in wait
          self.__cond.wait(timeout, balancing)
        File: "/usr/lib/python2.7/site-packages/pthreading.py", line 125, in wait
          return self.__cond.timedwait(abstime)
        File: "/usr/lib/python2.7/site-packages/pthread.py", line 136, in timedwait
          C.pointer(abstime)) (executor:363)

4. The last command that acquired the SP lock exclusively was a detachStorageDomain 9 hours before:

        2019-06-24 01:40:14,349-0400 INFO (jsonrpc/0) [vdsm.api] FINISH detachStorageDomain return=None from=::ffff:A.B.C.D,E, flow_id=18e80479-d750-4ffe-9383-0fa62648f5ed, task_id=a13ee5cc-c329-47f8-b34a-dd13506ed7c3 (api:54)

In the logs, there are no obvious hung tasks printing warnings that could be holding this SP lock.

For reference, verbs that take the SP lock EXCLUSIVE:

- spmStart
- spmStop
- forcedDetachStorageDomain
- detachStorageDomain
- createStoragePool
- disconnectStoragePool
- destroyStoragePool
- attachStorageDomain
- deactivateStorageDomain
- activateStorageDomain
- reconstructMaster
- fenceSpmStorage
- upgradeStoragePool

Verbs that take the SP lock SHARED:

- updateVM
- removeVM
- getStoragePoolInfo
- getImageDomainsList

Version-Release number of selected component (if applicable):
vdsm-4.30.17-1.el7ev.x86_64

How reproducible:
Unknown; a simple detach + attach loop does not seem to trigger it for me.

Additional info:
There was some storage slowness/outage that perhaps could have triggered this?
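To make the failure mode concrete, below is a toy shared/exclusive lock with timed acquisition, in the spirit of the ResourceManager behavior described above but not vdsm's actual code (it uses Python 3's threading rather than the Python 2.7 pthreading stack in the traceback). Once an exclusive holder is never released, for example because an error path skips the release, every later acquire, shared or exclusive, simply waits out its timeout, matching the ResourceTimeout and "Worker blocked" messages; only a process restart clears it. All class and function names here are hypothetical.

```python
# Toy shared/exclusive lock with timed acquisition (illustration only, not vdsm code).
import threading


class ResourceTimeout(Exception):
    """Stand-in for vdsm's se.ResourceTimeout."""


class SharedExclusiveLock:
    """Readers-writer lock where acquisition gives up after a timeout."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of shared (reader) holders
        self._exclusive = False   # exclusive (writer) holder present

    def acquire_shared(self, timeout):
        with self._cond:
            if not self._cond.wait_for(lambda: not self._exclusive, timeout):
                raise ResourceTimeout()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout):
        with self._cond:
            free = lambda: not self._exclusive and self._shared == 0
            if not self._cond.wait_for(free, timeout):
                raise ResourceTimeout()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


if __name__ == "__main__":
    lock = SharedExclusiveLock()

    # Simulate a flow that takes the lock exclusively and never releases it,
    # e.g. an error path that skips the release.
    lock.acquire_exclusive(timeout=1)

    # Every later caller now times out, shared or exclusive, until the
    # process is restarted: the behavior seen in the vdsm logs.
    try:
        lock.acquire_shared(timeout=2)      # StoragePool.getInfo analogue
    except ResourceTimeout:
        print("shared acquire timed out")

    try:
        lock.acquire_exclusive(timeout=2)   # attachStorageDomain analogue
    except ResourceTimeout:
        print("exclusive acquire timed out")
```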