Created attachment 598374 [details]
Logs vdsm, rhevm, screen-shots

Description of problem:
The DC enters "Contend" mode and is stuck there forever.

Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI10
RHEV-M: rhevm-3.1.0-6.el6ev.noarch
VDSM: vdsm-4.9.6-21.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.1.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6.x86_64

How reproducible:
Not always.

Steps to Reproduce:
1. Create a DC with 2 hosts and 2 iSCSI storage domains.
2. Put the first storage domain (not the master) into Maintenance mode.
3. Put the master storage domain into Maintenance mode.

Actual results:
1. The DC enters "Contend" mode (see attached screenshot).
2. In the "Storage" tab, when you select the master storage domain, the "Data Center" sub-tab shows the DC with an "Unknown" domain status (see attached screenshot).
3. In the "Host" tab, one of the hosts shows a "Contending" SPM status, and every 3 minutes the other host enters the "Contending" SPM status (see attached screenshot).

Expected results:
After all storage domains in the relevant DC are put into Maintenance mode, the DC should also enter Maintenance mode.

Additional info:
Tested on RHEVM 3.1 – SI15, and could not reproduce it.
It looks like the disconnectStoragePool command never ended:

Thread-1550::INFO::2012-07-15 16:02:49,662::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', hostID=1, scsiKey='613967bc-6e1e-4ada-b187-3aa0dad2f83d', remove=False, options=None)
Thread-1550::DEBUG::2012-07-15 16:02:49,663::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`c3785dfc-c755-4767-9e26-b57a40ccb911`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1550::DEBUG::2012-07-15 16:02:49,663::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'exclusive'
Thread-1550::DEBUG::2012-07-15 16:02:49,664::resourceManager::528::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is free. Now locking as 'exclusive' (1 active user)
Thread-1550::DEBUG::2012-07-15 16:02:49,664::resourceManager::212::ResourceManager.Request::(grant) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`c3785dfc-c755-4767-9e26-b57a40ccb911`::Granted request
Thread-1550::DEBUG::2012-07-15 16:02:49,664::task::817::TaskManager.Task::(resourceAcquired) Task=`b0346a4d-4963-4491-9fb1-d9c43c2a3f7f`::_resourcesAcquired: Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d (exclusive)
Thread-1550::DEBUG::2012-07-15 16:02:49,665::task::978::TaskManager.Task::(_decref) Task=`b0346a4d-4963-4491-9fb1-d9c43c2a3f7f`::ref 1 aborting False
Thread-1550::INFO::2012-07-15 16:02:49,665::sp::668::Storage.StoragePool::(disconnect) Disconnect from the storage pool 613967bc-6e1e-4ada-b187-3aa0dad2f83d

(There are no other messages referring to Thread-1550 and/or task b0346a4d-4963-4491-9fb1-d9c43c2a3f7f.)
After this the resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' remained locked (exclusive) and the subsequent actions failed on timeout. E.g.:

Thread-1556::INFO::2012-07-15 16:10:04,621::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', hostID=1, scsiKey='613967bc-6e1e-4ada-b187-3aa0dad2f83d', msdUUID='fafd1ee5-74d7-4adb-bed7-941955fda4e9', masterVersion=1, options=None)
Thread-1556::DEBUG::2012-07-15 16:10:04,621::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`36d87005-1742-404e-9cfa-a6259c04b1c7`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1556::DEBUG::2012-07-15 16:10:04,622::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'shared'
Thread-1556::DEBUG::2012-07-15 16:10:04,622::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is currently locked, Entering queue (1 in queue)
[...]
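The failure mode above (an exclusive lock acquired and never released, so later requests queue and eventually time out) can be reproduced in miniature. This is a generic sketch using Python's threading primitives, not VDSM's actual ResourceManager:

```python
import threading

class SimpleResource:
    """Minimal exclusive-lock resource: callers wait with a bounded timeout."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire_exclusive(self, timeout):
        # Returns True if acquired, False if the wait timed out
        # (the analogue of ResourceTimeout in the log above).
        return self._lock.acquire(timeout=timeout)

res = SimpleResource()
# First caller (the stuck disconnectStoragePool) gets the lock...
assert res.acquire_exclusive(timeout=1)
# ...and never releases it, so every later request times out,
# mirroring the queued connectStoragePool/spmStart requests.
assert not res.acquire_exclusive(timeout=0.1)
```

The real ResourceManager additionally keeps a FIFO queue of pending requests (the "Entering queue (N in queue)" messages), but the end result is the same: a 120-second wait followed by a timeout error.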
Thread-1678::INFO::2012-07-15 16:23:16,537::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', prevID=2, prevLVER='1', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='2', options=None)
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`17bd3abd-b75e-402f-b7f4-d03e0068f4c9`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'exclusive'
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is currently locked, Entering queue (3 in queue)
[...]
Thread-1678::DEBUG::2012-07-15 16:25:16,540::resourceManager::705::ResourceManager.Owner::(acquire) 07a6f833-b77d-4509-aa88-71571777c06e: request for 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' timed out after '120.000000' seconds
Thread-1678::ERROR::2012-07-15 16:25:16,540::task::853::TaskManager.Task::(_setError) Task=`07a6f833-b77d-4509-aa88-71571777c06e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 493, in spmStart
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1301, in getExclusiveLock
    self.resOwner.acquire(namespace, resName, resourceManager.LockType.exclusive, timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 706, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()

I already checked that if VDSM is
stuck in the HSM.spmStart() method (the same effect as getExclusiveLock on an already acquired resource), the engine reports the host as being in the "Contending" state (consistent with the bug report). Maybe we should also investigate here whether the engine was supposed to take a different action (e.g. switch to a different host).

After the logging message:

Thread-1550::INFO::2012-07-15 16:02:49,665::sp::668::Storage.StoragePool::(disconnect) Disconnect from the storage pool 613967bc-6e1e-4ada-b187-3aa0dad2f83d

there's no other sign of advancement in the task, and I suppose that we're stuck trying to stop the hsmMailer:

659     @unsecured
660     def disconnect(self):
[...]
674         if self.hsmMailer:
675             self.hsmMailer.stop()
676             self.hsmMailer = None

In fact the message "HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail" is never found in the logs.
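One plausible way for a stop() call like the one above to block forever is an unbounded join() on a monitor thread whose loop never observes the stop request. The following is a minimal sketch of the safer pattern with a stop flag and a bounded join; the class and method names are illustrative, not the real HSM_MailboxMonitor API:

```python
import threading
import time

class MailMonitor:
    """Toy mailbox-monitor thread; stop() uses a bounded join so callers
    cannot block forever even if the worker misbehaves."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Poll the "mailbox" until asked to stop.
        while not self._stop.is_set():
            time.sleep(0.01)

    def stop(self, timeout=1.0):
        self._stop.set()
        self._thread.join(timeout)      # bounded wait, unlike a bare join()
        return not self._thread.is_alive()

mon = MailMonitor()
assert mon.stop()  # the thread observed the flag and exited within the timeout
```

If the worker thread is blocked on I/O and never re-checks the flag, a bare join() in stop() hangs the caller indefinitely, which matches the symptom here: disconnect() never logs any further progress.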
Responding on that bug, because of for some reason I am in cc on it. From the description of bug: u send almost at the same time (or same time) deactivate of first domain, and after that deactivate of master domain? I am correct?
...
> I already checked that if VDSM is stuck in the HSM.spmStart() method (same
> effect of the getExclusiveLock on an already acquired resource) the engine
> reports the host as in the "Contending" state (consistent with the bug
> report). Maybe we can also investigate here if the engine was supposed to
> take any different action (eg: switch to a different host).

Does getSpmStatus report contending in this state, or does the engine just rely on the task and not run getSpmStatus? In any event, to rely on getSpmStatus we need to move parts of the code to the synchronous stage in hsm.py [1]. Otherwise, the only safe thing to do is to fence the host in this case after a certain timeout. This also means that we'd need to allow the user to mark the host as manually fenced in the GUI in this scenario.

> there's no other sign of advancement in the task and I suppose that we're
> stuck trying to stop the hsmMailer:
...
> In fact the message "HSM_MailboxMonitor - Incoming mail monitoring thread
> stopped, clearing outgoing mail" is never found in the logs.

[1] Code to move:

    oldlver = self.getSpmLver()
    oldid = self.getSpmId()
    masterDomVersion = self.getVersion()
    # If no specific domain version was specified use current master domain version
    if expectedDomVersion is None:
        expectedDomVersion = masterDomVersion

    if masterDomVersion > expectedDomVersion:
        raise se.CurrentVersionTooAdvancedError(self.masterDomain.sdUUID,
                curVer=masterDomVersion, expVer=expectedDomVersion)

    if int(oldlver) != int(prevLVER) or int(oldid) != int(prevID):
        self.log.info("expected previd:%s lver:%s got request for previd:%s lver:%s"
                      % (oldid, oldlver, prevID, prevLVER))
        self.spmRole = SPM_CONTEND
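The proposal in [1] amounts to validating the previous lease version and SPM id synchronously, before the task queues on the pool lock. A standalone sketch of that precondition check follows; the function name and the ValueError are illustrative stand-ins for VDSM's internals (getSpmLver, getSpmId, se.CurrentVersionTooAdvancedError):

```python
def validate_spm_start(old_id, old_lver, prev_id, prev_lver,
                       master_ver, expected_ver=None):
    """Synchronous precondition check for spmStart.

    Raises if the master domain version is newer than expected;
    returns False when the previous id/lver do not match, i.e.
    the host must contend for the SPM role (SPM_CONTEND).
    """
    if expected_ver is None:
        # No specific domain version requested: use the current one.
        expected_ver = master_ver
    if master_ver > expected_ver:
        raise ValueError("current version too advanced: %s > %s"
                         % (master_ver, expected_ver))
    # A mismatch in the previous lease version or SPM id means contend.
    return int(old_lver) == int(prev_lver) and int(old_id) == int(prev_id)

# Matching id/lver: no contention needed.
assert validate_spm_start(2, '1', 2, '1', master_ver=2)
# Mismatched previous SPM id: the host would enter SPM_CONTEND.
assert not validate_spm_start(2, '1', 3, '1', master_ver=2)
```

Running this check before taking the exclusive pool lock would let getSpmStatus report a meaningful state even when a later stage is stuck, which is the motivation for moving the code.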
(In reply to comment #4)
> Test on RHEVM 3.1 – SI15, and not succeed reproduce it.

Closing due to the complexity of the change above and the fact that this doesn't reproduce. Please reopen if you manage to reproduce it.