Bug 840357 - [rhevm] [engine-core] DC enters "Contending" mode and is stuck there forever
Summary: [rhevm] [engine-core] DC enters "Contending" mode and is stuck there forever
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.1.0
Assignee: Federico Simoncelli
QA Contact: Haim
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2012-07-16 06:29 UTC by vvyazmin@redhat.com
Modified: 2016-02-10 16:54 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-02 09:24:26 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Logs vdsm, rhevm, screen-shots (1.14 MB, application/x-gzip)
2012-07-16 06:29 UTC, vvyazmin@redhat.com

Description vvyazmin@redhat.com 2012-07-16 06:29:34 UTC
Created attachment 598374 [details]
Logs vdsm, rhevm, screen-shots

Description of problem:
DC enters "Contending" mode and is stuck there forever

Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI10

RHEV-M: rhevm-3.1.0-6.el6ev.noarch
VDSM: vdsm-4.9.6-21.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.1.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6.x86_64

How reproducible:
Not always. 

Steps to Reproduce:
1. Create a DC with 2 hosts and 2 iSCSI storage domains
2. Put the first (non-master) storage domain into Maintenance mode
3. Put the master storage domain into Maintenance mode
(a scripted sketch of steps 2-3 is shown below)
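
For illustration, a minimal Python sketch of scripting steps 2-3 against the RHEV-M REST API (the endpoint layout, credentials and the non-master domain UUID are assumptions; the DC and master-domain UUIDs are the ones that appear in the attached logs):

import requests

BASE = "https://rhevm.example.com/api"              # assumed engine address
AUTH = ("admin@internal", "password")               # assumed credentials
DC_ID = "613967bc-6e1e-4ada-b187-3aa0dad2f83d"      # data center (pool) UUID from the logs
NON_MASTER_SD = "<non-master-sd-uuid>"              # placeholder
MASTER_SD = "fafd1ee5-74d7-4adb-bed7-941955fda4e9"  # master domain UUID from the logs

def deactivate(sd_id):
    # Move an attached storage domain to Maintenance via the DC sub-collection.
    url = "%s/datacenters/%s/storagedomains/%s/deactivate" % (BASE, DC_ID, sd_id)
    r = requests.post(url, auth=AUTH, verify=False,
                      headers={"Content-Type": "application/xml"},
                      data="<action/>")
    r.raise_for_status()

deactivate(NON_MASTER_SD)  # step 2: the non-master domain first
deactivate(MASTER_SD)      # step 3: then the master domain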

  
Actual results:
1. The DC enters "Contending" mode (see attached screenshot).
2. In the "Storage" tab, when you select the master storage domain, the "Data Center" sub-tab shows the DC with an "Unknown" domain status (see attached screenshot).
3. In the "Hosts" tab, one of the hosts has SPM status "Contending", and every 3 minutes the other host switches to "Contending" (see attached screenshot).

Expected results:
After all storage domains in the relevant DC are put into "Maintenance" mode, the DC should also enter "Maintenance" mode.

Additional info:

Comment 4 vvyazmin@redhat.com 2012-08-26 10:55:12 UTC
Tested on RHEVM 3.1 – SI15 and did not succeed in reproducing it.

Comment 5 Federico Simoncelli 2012-08-31 15:16:04 UTC
It looks like the disconnectStoragePool command never ended:

Thread-1550::INFO::2012-07-15 16:02:49,662::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', hostID=1, scsiKey='613967bc-6e1e-4ada-b187-3aa0dad2f83d', remove=False, options=None)
Thread-1550::DEBUG::2012-07-15 16:02:49,663::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`c3785dfc-c755-4767-9e26-b57a40ccb911`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1550::DEBUG::2012-07-15 16:02:49,663::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'exclusive'
Thread-1550::DEBUG::2012-07-15 16:02:49,664::resourceManager::528::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is free. Now locking as 'exclusive' (1 active user)
Thread-1550::DEBUG::2012-07-15 16:02:49,664::resourceManager::212::ResourceManager.Request::(grant) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`c3785dfc-c755-4767-9e26-b57a40ccb911`::Granted request
Thread-1550::DEBUG::2012-07-15 16:02:49,664::task::817::TaskManager.Task::(resourceAcquired) Task=`b0346a4d-4963-4491-9fb1-d9c43c2a3f7f`::_resourcesAcquired: Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d (exclusive)
Thread-1550::DEBUG::2012-07-15 16:02:49,665::task::978::TaskManager.Task::(_decref) Task=`b0346a4d-4963-4491-9fb1-d9c43c2a3f7f`::ref 1 aborting False
Thread-1550::INFO::2012-07-15 16:02:49,665::sp::668::Storage.StoragePool::(disconnect) Disconnect from the storage pool 613967bc-6e1e-4ada-b187-3aa0dad2f83d

(There are no other messages referring to Thread-1550 and/or task b0346a4d-4963-4491-9fb1-d9c43c2a3f7f).

After this the resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' remained locked (exclusive) and the subsequent actions failed on timeout:

E.g.:

Thread-1556::INFO::2012-07-15 16:10:04,621::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', hostID=1, scsiKey='613967bc-6e1e-4ada-b187-3aa0dad2f83d', msdUUID='fafd1ee5-74d7-4adb-bed7-941955fda4e9', masterVersion=1, options=None)
Thread-1556::DEBUG::2012-07-15 16:10:04,621::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`36d87005-1742-404e-9cfa-a6259c04b1c7`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1556::DEBUG::2012-07-15 16:10:04,622::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'shared'
Thread-1556::DEBUG::2012-07-15 16:10:04,622::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is currently locked, Entering queue (1 in queue)

[...]

Thread-1678::INFO::2012-07-15 16:23:16,537::logUtils::37::dispatcher::(wrapper) Run and protect: spmStart(spUUID='613967bc-6e1e-4ada-b187-3aa0dad2f83d', prevID=2, prevLVER='1', recoveryMode=None, scsiFencing='false', maxHostID=250, domVersion='2', options=None)
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d`ReqID=`17bd3abd-b75e-402f-b7f4-d03e0068f4c9`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' for lock type 'exclusive'
Thread-1678::DEBUG::2012-07-15 16:23:16,538::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' is currently locked, Entering queue (3 in queue)

[...]

Thread-1678::DEBUG::2012-07-15 16:25:16,540::resourceManager::705::ResourceManager.Owner::(acquire) 07a6f833-b77d-4509-aa88-71571777c06e: request for 'Storage.613967bc-6e1e-4ada-b187-3aa0dad2f83d' timed out after '120.000000' seconds
Thread-1678::ERROR::2012-07-15 16:25:16,540::task::853::TaskManager.Task::(_setError) Task=`07a6f833-b77d-4509-aa88-71571777c06e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 493, in spmStart
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1301, in getExclusiveLock
    self.resOwner.acquire(namespace, resName, resourceManager.LockType.exclusive, timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 706, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
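
For illustration only (toy names, not vdsm's actual ResourceManager): the pattern above boils down to an exclusive lock that is acquired and never released, so every request queued behind it eventually raises a timeout, just as connectStoragePool and spmStart do here.

import threading
import time

class ResourceTimeout(Exception):
    pass  # stand-in for vdsm's se.ResourceTimeout

class ToyResource:
    def __init__(self):
        self._lock = threading.Lock()

    def acquire_exclusive(self, timeout):
        if not self._lock.acquire(timeout=timeout):
            raise ResourceTimeout("request timed out after %s seconds" % timeout)

    def release(self):
        self._lock.release()

resource = ToyResource()

def stuck_disconnect():
    # disconnectStoragePool analogue: grabs the exclusive lock, then hangs
    # (in this bug, apparently inside hsmMailer.stop()) and never releases it.
    resource.acquire_exclusive(timeout=120)
    threading.Event().wait()

threading.Thread(target=stuck_disconnect, daemon=True).start()
time.sleep(0.5)  # let the disconnect analogue take the lock first

# connectStoragePool / spmStart analogue: queued behind the stuck holder, it
# fails with ResourceTimeout (vdsm waits 120 s; 2 s keeps this demo short).
try:
    resource.acquire_exclusive(timeout=2)
except ResourceTimeout as exc:
    print("spmStart analogue failed:", exc)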

I already checked that if VDSM is stuck in the HSM.spmStart() method (the same effect as getExclusiveLock blocking on an already acquired resource), the engine reports the host as being in the "Contending" state (consistent with the bug report). Maybe we should also investigate whether the engine was supposed to take any different action here (e.g. switch to a different host).

After the logging message:

Thread-1550::INFO::2012-07-15 16:02:49,665::sp::668::Storage.StoragePool::(disconnect) Disconnect from the storage pool 613967bc-6e1e-4ada-b187-3aa0dad2f83d

there is no other sign of progress in the task, and I suppose that we are stuck trying to stop the hsmMailer:

    @unsecured
    def disconnect(self):
[...]
        if self.hsmMailer:
            self.hsmMailer.stop()
            self.hsmMailer = None

In fact the message "HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail" is never found in the logs.
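
A hedged sketch of that suspicion (toy code with invented names, not the actual HSM_MailboxMonitor): if stop() joins the monitoring thread without a timeout and that thread is blocked on storage I/O, disconnect() never returns and the "thread stopped" message is never logged.

import threading

class ToyMailboxMonitor:
    # Toy stand-in for the incoming-mail monitor; the real vdsm code differs.
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._read_mailbox()  # assume this blocks on storage I/O and never returns
        print("Incoming mail monitoring thread stopped, clearing outgoing mail")

    def _read_mailbox(self):
        threading.Event().wait()  # simulates the blocked read

    def stop(self):
        self._stop.set()
        self._thread.join()       # no timeout: the caller (disconnect) hangs here

# Uncomment to reproduce the hang; like Thread-1550's disconnect, it never returns:
# ToyMailboxMonitor().stop()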

Comment 6 mkublin 2012-09-02 06:48:37 UTC
Responding to this bug because, for some reason, I am on CC for it.
From the bug description: you sent the deactivation of the first domain and then the deactivation of the master domain almost at the same time (or at the same time)? Is that correct?

Comment 7 Ayal Baron 2012-09-02 09:14:40 UTC
...

> I already checked that if VDSM is stuck in the HSM.spmStart() method (same
> effect of the getExclusiveLock on an already acquired resource) the engine
> reports the host as in the "Contending" state (consistent with the bug
> report). Maybe we can also investigate here if the engine was supposed to
> take any different action (eg: switch to a different host).

Does getSpmStatus report "Contending" in this state, or does the engine just rely on the task and not run getSpmStatus?
In any event, to rely on getSpmStatus we would need to move parts of the code to the synchronous stage in hsm.py [1].
Otherwise, the only safe thing to do in this case is to fence the host after a certain timeout.
This also means that we would need to allow the user to mark the host as manually fenced in the GUI in this scenario.
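
For illustration, a sketch of that fallback (invented names; the real engine is Java and its SPM election code is not quoted in this bug): bound the time a host may stay "Contending", then either move on to another candidate or require a manual-fence confirmation from the GUI.

import time

CONTEND_TIMEOUT = 180   # assumed per-host budget in seconds
POLL_INTERVAL = 10

def elect_spm(hosts, get_spm_status, confirm_manual_fence):
    for host in hosts:
        deadline = time.time() + CONTEND_TIMEOUT
        while time.time() < deadline:
            status = get_spm_status(host)   # e.g. a wrapper around the getSpmStatus verb
            if status == "SPM":
                return host                  # election succeeded
            if status == "Free":
                break                        # this host gave up; try the next candidate
            time.sleep(POLL_INTERVAL)        # still "Contending": keep polling
        else:
            # Budget exhausted while still "Contending": without working power
            # management the admin must confirm the host was manually fenced
            # before it is safe to let another host become SPM.
            if not confirm_manual_fence(host):
                return None
    return None                              # no SPM elected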

> there's no other sign of advancement in the task and I suppose that we're
> stuck trying to stop the hsmMailer:
...
> In fact the message "HSM_MailboxMonitor - Incoming mail monitoring thread
> stopped, clearing outgoing mail" is never found in the logs.

[1] Code to move
            oldlver = self.getSpmLver()
            oldid = self.getSpmId()
            masterDomVersion = self.getVersion()
            # If no specific domain version was specified use current master domain version
            if expectedDomVersion is None:
                expectedDomVersion = masterDomVersion

            if masterDomVersion > expectedDomVersion:
                raise se.CurrentVersionTooAdvancedError(self.masterDomain.sdUUID,
                        curVer=masterDomVersion, expVer=expectedDomVersion)

            if int(oldlver) != int(prevLVER) or int(oldid) != int(prevID):
                self.log.info("expected previd:%s lver:%s got request for previd:%s lver:%s" % (oldid, oldlver, prevID, prevLVER))

            self.spmRole = SPM_CONTEND
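
A hedged sketch of what moving those checks to the synchronous stage could look like (invented names and a plain dict in place of the pool object, not the real hsm.py/sp.py): the validations run in the calling thread and fail immediately, while only the slow SPM acquisition is handed to a background task whose state a getSpmStatus-style query can report.

import threading

class CurrentVersionTooAdvanced(Exception):
    pass

def spm_start(pool, prev_id, prev_lver, expected_ver=None):
    # --- synchronous stage: the checks quoted above run in the calling
    # thread, so a bad request fails right away instead of inside a task ---
    cur_ver = pool["master_version"]
    if expected_ver is None:
        expected_ver = cur_ver
    if cur_ver > expected_ver:
        raise CurrentVersionTooAdvanced(cur_ver, expected_ver)
    if (pool["spm_id"], pool["lver"]) != (prev_id, prev_lver):
        print("expected previd:%s lver:%s got request for previd:%s lver:%s"
              % (pool["spm_id"], pool["lver"], prev_id, prev_lver))
    pool["spm_role"] = "CONTEND"   # visible to status queries from this point on

    # --- asynchronous stage: only the slow, lock-taking work runs in a task ---
    task = threading.Thread(target=_acquire_spm, args=(pool,), daemon=True)
    task.start()
    return task

def _acquire_spm(pool):
    # placeholder for the exclusive-lock / lease acquisition that can block
    pool["spm_role"] = "SPM"

# usage sketch:
# spm_start({"master_version": 1, "spm_id": 2, "lver": 1, "spm_role": "Free"}, 2, 1)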

Comment 8 Ayal Baron 2012-09-02 09:24:26 UTC
(In reply to comment #4)
> Tested on RHEVM 3.1 – SI15 and did not succeed in reproducing it.

Closing due to the complexity of the change above and the fact that this does not reproduce.
Please reopen if you manage to reproduce it.

