Bug 840357 - [rhevm] [engine-core] DC enters "Contending" mode and is stuck there forever
Summary: [rhevm] [engine-core] DC enters "Contending" mode and is stuck there forever
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.1.0
Assignee: Federico Simoncelli
QA Contact: Haim
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2012-07-16 06:29 UTC by vvyazmin@redhat.com
Modified: 2016-02-10 16:54 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-02 09:24:26 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Logs vdsm, rhevm, screen-shots (1.14 MB, application/x-gzip)
2012-07-16 06:29 UTC, vvyazmin@redhat.com

Description vvyazmin@redhat.com 2012-07-16 06:29:34 UTC
Created attachment 598374 [details]
Logs vdsm, rhevm, screen-shots

Description of problem:
DC enters "Contending" mode and is stuck there forever

Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI10

RHEV-M: rhevm-3.1.0-6.el6ev.noarch
VDSM: vdsm-4.9.6-21.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.1.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6.x86_64

How reproducible:
Not always. 

Steps to Reproduce:
1. Create a DC with 2 hosts and 2 iSCSI storage domains
2. Put the first (non-master) storage domain into Maintenance mode
3. Put the master storage domain into Maintenance mode
(a scripted sketch of steps 2-3 is shown below)
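
For illustration, a minimal Python sketch of scripting steps 2-3 against the RHEV-M REST API (the endpoint layout, credentials and the non-master domain UUID are assumptions; the DC and master-domain UUIDs are the ones that appear in the attached logs):

import requests

BASE = "https://rhevm.example.com/api"              # assumed engine address
AUTH = ("admin@internal", "password")               # assumed credentials
DC_ID = "613967bc-6e1e-4ada-b187-3aa0dad2f83d"      # data center (pool) UUID from the logs
NON_MASTER_SD = "<non-master-sd-uuid>"              # placeholder
MASTER_SD = "fafd1ee5-74d7-4adb-bed7-941955fda4e9"  # master domain UUID from the logs

def deactivate(sd_id):
    # Move an attached storage domain to Maintenance via the DC sub-collection.
    url = "%s/datacenters/%s/storagedomains/%s/deactivate" % (BASE, DC_ID, sd_id)
    r = requests.post(url, auth=AUTH, verify=False,
                      headers={"Content-Type": "application/xml"},
                      data="<action/>")
    r.raise_for_status()

deactivate(NON_MASTER_SD)  # step 2: the non-master domain first
deactivate(MASTER_SD)      # step 3: then the master domain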

  
Actual results:
1. The DC enters "Contending" mode (see attached screenshot).
2. In the "Storage" tab, when you select the master storage domain, the "Data Center" sub-tab shows the DC with an "Unknown" domain status (see attached screenshot).
3. In the "Hosts" tab, one of the hosts has SPM status "Contending", and every 3 minutes the other host switches to "Contending" (see attached screenshot).

Expected results:
After all storage domains in the relevant DC are put into "Maintenance" mode, the DC should also enter "Maintenance" mode.

Additional info:

Comment 4 vvyazmin@redhat.com 2012-08-26 10:55:12 UTC
Tested on RHEVM 3.1 – SI15 and did not succeed in reproducing it.

Comment 5 Federico Simoncelli 2012-08-31 15:16:04 UTC
It looks like the disconnectStoragePool command never ended:

Thread-1550::INFO::2012-07-15 16:02:49,662::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', hostID=1, scsiKey='613967bc-6e1e-4ada-b187-3aa0dad2f83d', remove=False, options=None)
Thread-1550::DEBUG::2012-07-15 16:02:49,663::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`c3785dfc-c755-4767-9e26-b57a40ccb911`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1550::DEBUG::2012-07-15 16:02:49,663::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'exclusive'
Thread-1550::DEBUG::2012-07-15 16:02:49,664::resourceManager::528::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is free. Now locking as 'exclusive' (1 active user)
Thread-1550::DEBUG::2012-07-15 16:02:49,664::resourceManager::212::ResourceManager.Request::(grant) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`c3785dfc-c755-4767-9e26-b57a40ccb911`::Granted request
Thread-1550::DEBUG::2012-07-15 16:02:49,664::task::817::TaskManager.Task::(resourceAcquired) Task=`b0346a4d-4963-4491-9fb1-d9c43c2a3f7f`::_resourcesAcquired: Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d (exclusive)
Thread-1550::DEBUG::2012-07-15 16:02:49,665::task::978::TaskManager.Task::(_decref) Task=`b0346a4d-4963-4491-9fb1-d9c43c2a3f7f`::ref 1 aborting False
Thread-1550::INFO::2012-07-15 16:02:49,665::sp::668::Storage.StoragePool::(disconnect) Disconnect from the storage pool 613967bc-6e1e-4ada-b187-3aa0dad2f83d

(There are no other messages referring to Thread-1550 and/or task b0346a4d-4963-4491-9fb1-d9c43c2a3f7f).

After this the resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' remained locked (exclusive) and the subsequent actions failed on timeout:

E.g.:

Thread-1556::INFO::2012-07-15 16:10:04,621::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', hostID=1, scsiKey='613967bc-6e1e-4ada-b187-3aa0dad2f83d', msdUUID='fafd1ee5-74d7-4adb-bed7-941955fda4e9', masterVersion=1, options=None)
Thread-1556::DEBUG::2012-07-15 16:10:04,621::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`36d87005-1742-404e-9cfa-a6259c04b1c7`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1556::DEBUG::2012-07-15 16:10:04,622::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'shared'
Thread-1556::DEBUG::2012-07-15 16:10:04,622::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is currently locked, Entering queue (1 in queue)

[...]

Thread-1678::INFO::2012-07-15 16:23:16,537::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', prevID=2, prevLVER='1', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='2', options=None)
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`17bd3abd-b75e-402f-b7f4-d03e0068f4c9`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'exclusive'
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is currently locked, Entering queue (3 in queue)

[...]

Thread-1678::DEBUG::2012-07-15 16:25:16,540::resourceManager::705::ResourceManager.Owner::(acquire) 07a6f833-b77d-4509-aa88-71571777c06e: request for 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' timed out after '120.000000' seconds
Thread-1678::ERROR::2012-07-15 16:25:16,540::task::853::TaskManager.Task::(_setError) Task=`07a6f833-b77d-4509-aa88-71571777c06e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 493, in spmStart
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1301, in getExclusiveLock
    self.resOwner.acquire(namespace, resName, resourceManager.LockType.exclusive, timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 706, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
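
For illustration only (toy names, not vdsm's actual ResourceManager): the pattern above boils down to an exclusive lock that is acquired and never released, so every request queued behind it eventually raises a timeout, just as connectStoragePool and spmStart do here.

import threading
import time

class ResourceTimeout(Exception):
    pass  # stand-in for vdsm's se.ResourceTimeout

class ToyResource:
    def __init__(self):
        self._lock = threading.Lock()

    def acquire_exclusive(self, timeout):
        if not self._lock.acquire(timeout=timeout):
            raise ResourceTimeout("request timed out after %s seconds" % timeout)

    def release(self):
        self._lock.release()

resource = ToyResource()

def stuck_disconnect():
    # disconnectStoragePool analogue: grabs the exclusive lock, then hangs
    # (in this bug, apparently inside hsmMailer.stop()) and never releases it.
    resource.acquire_exclusive(timeout=120)
    threading.Event().wait()

threading.Thread(target=stuck_disconnect, daemon=True).start()
time.sleep(0.5)  # let the disconnect analogue take the lock first

# connectStoragePool / spmStart analogue: queued behind the stuck holder, it
# fails with ResourceTimeout (vdsm waits 120 s; 2 s keeps this demo short).
try:
    resource.acquire_exclusive(timeout=2)
except ResourceTimeout as exc:
    print("spmStart analogue failed:", exc)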

I already checked that if VDSM is stuck in the HSM.spmStart() method (the same effect as getExclusiveLock blocking on an already acquired resource), the engine reports the host as being in the "Contending" state (consistent with the bug report). Maybe we should also investigate whether the engine was supposed to take any different action here (e.g. switch to a different host).

After the logging message:

Thread-1550::INFO::2012-07-15 16:02:49,665::sp::668::Storage.StoragePool::(disconnect) Disconnect from the storage pool 613967bc-6e1e-4ada-b187-3aa0dad2f83d

there is no other sign of progress in the task, and I suppose that we are stuck trying to stop the hsmMailer:

    @unsecured
    def disconnect(self):
[...]
        if self.hsmMailer:
            self.hsmMailer.stop()
            self.hsmMailer = None

In fact the message "HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail" is never found in the logs.
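
A hedged sketch of that suspicion (toy code with invented names, not the actual HSM_MailboxMonitor): if stop() joins the monitoring thread without a timeout and that thread is blocked on storage I/O, disconnect() never returns and the "thread stopped" message is never logged.

import threading

class ToyMailboxMonitor:
    # Toy stand-in for the incoming-mail monitor; the real vdsm code differs.
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._read_mailbox()  # assume this blocks on storage I/O and never returns
        print("Incoming mail monitoring thread stopped, clearing outgoing mail")

    def _read_mailbox(self):
        threading.Event().wait()  # simulates the blocked read

    def stop(self):
        self._stop.set()
        self._thread.join()       # no timeout: the caller (disconnect) hangs here

# Uncomment to reproduce the hang; like Thread-1550's disconnect, it never returns:
# ToyMailboxMonitor().stop()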

Comment 6 mkublin 2012-09-02 06:48:37 UTC
Responding to this bug because, for some reason, I am on CC for it.
From the bug description: you sent the deactivation of the first domain and then the deactivation of the master domain almost at the same time (or at the same time)? Is that correct?

Comment 7 Ayal Baron 2012-09-02 09:14:40 UTC
...

> I already checked that if VDSM is stuck in the HSM.spmStart() method (same
> effect of the getExclusiveLock on an already acquired resource) the engine
> reports the host as in the "Contending" state (consistent with the bug
> report). Maybe we can also investigate here if the engine was supposed to
> take any different action (eg: switch to a different host).

Does getSpmStatus report "Contending" in this state, or does the engine just rely on the task and not run getSpmStatus?
In any event, to rely on getSpmStatus we would need to move parts of the code to the synchronous stage in hsm.py [1].
Otherwise, the only safe thing to do in this case is to fence the host after a certain timeout.
This also means that we would need to allow the user to mark the host as manually fenced in the GUI in this scenario.
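
For illustration, a sketch of that fallback (invented names; the real engine is Java and its SPM election code is not quoted in this bug): bound the time a host may stay "Contending", then either move on to another candidate or require a manual-fence confirmation from the GUI.

import time

CONTEND_TIMEOUT = 180   # assumed per-host budget in seconds
POLL_INTERVAL = 10

def elect_spm(hosts, get_spm_status, confirm_manual_fence):
    for host in hosts:
        deadline = time.time() + CONTEND_TIMEOUT
        while time.time() < deadline:
            status = get_spm_status(host)   # e.g. a wrapper around the getSpmStatus verb
            if status == "SPM":
                return host                  # election succeeded
            if status == "Free":
                break                        # this host gave up; try the next candidate
            time.sleep(POLL_INTERVAL)        # still "Contending": keep polling
        else:
            # Budget exhausted while still "Contending": without working power
            # management the admin must confirm the host was manually fenced
            # before it is safe to let another host become SPM.
            if not confirm_manual_fence(host):
                return None
    return None                              # no SPM elected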

> there's no other sign of advancement in the task and I suppose that we're
> stuck trying to stop the hsmMailer:
...
> In fact the message "HSM_MailboxMonitor - Incoming mail monitoring thread
> stopped, clearing outgoing mail" is never found in the logs.

[1] Code to move
            oldlver = self.getSpmLver()
            oldid = self.getSpmId()
            masterDomVersion = self.getVersion()
            # If no specific domain version was specified use current master domain version
            if expectedDomVersion is None:
                expectedDomVersion = masterDomVersion

            if masterDomVersion > expectedDomVersion:
                raise se.CurrentVersionTooAdvancedError(self.masterDomain.sdUUID,
                        curVer=masterDomVersion, expVer=expectedDomVersion)

            if int(oldlver) != int(prevLVER) or int(oldid) != int(prevID):
                self.log.info("expected previd:%s lver:%s got request for previd:%s lver:%s" % (oldid, oldlver, prevID, prevLVER))

            self.spmRole = SPM_CONTEND
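
A hedged sketch of what moving those checks to the synchronous stage could look like (invented names and a plain dict in place of the pool object, not the real hsm.py/sp.py): the validations run in the calling thread and fail immediately, while only the slow SPM acquisition is handed to a background task whose state a getSpmStatus-style query can report.

import threading

class CurrentVersionTooAdvanced(Exception):
    pass

def spm_start(pool, prev_id, prev_lver, expected_ver=None):
    # --- synchronous stage: the checks quoted above run in the calling
    # thread, so a bad request fails right away instead of inside a task ---
    cur_ver = pool["master_version"]
    if expected_ver is None:
        expected_ver = cur_ver
    if cur_ver > expected_ver:
        raise CurrentVersionTooAdvanced(cur_ver, expected_ver)
    if (pool["spm_id"], pool["lver"]) != (prev_id, prev_lver):
        print("expected previd:%s lver:%s got request for previd:%s lver:%s"
              % (pool["spm_id"], pool["lver"], prev_id, prev_lver))
    pool["spm_role"] = "CONTEND"   # visible to status queries from this point on

    # --- asynchronous stage: only the slow, lock-taking work runs in a task ---
    task = threading.Thread(target=_acquire_spm, args=(pool,), daemon=True)
    task.start()
    return task

def _acquire_spm(pool):
    # placeholder for the exclusive-lock / lease acquisition that can block
    pool["spm_role"] = "SPM"

# usage sketch:
# spm_start({"master_version": 1, "spm_id": 2, "lver": 1, "spm_role": "Free"}, 2, 1)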

Comment 8 Ayal Baron 2012-09-02 09:24:26 UTC
(In reply to comment #4)
> Tested on RHEVM 3.1 – SI15 and did not succeed in reproducing it.

Closing due to the complexity of the change above and the fact that this does not reproduce.
Please reopen if you manage to reproduce it.

