Bug 1098769 - [vdsm] Reconstruct master fails with ResourceTimeout on an EMC-VNX machine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Federico Simoncelli
QA Contact: lkuchlan
URL:
Whiteboard: storage
Depends On:
Blocks:
Reported: 2014-05-18 10:22 UTC by Elad
Modified: 2016-02-10 18:44 UTC
CC: 13 users

Fixed In Version: vdsm-4.16.7.6-1.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, the default multipath configuration for EMC-VNX storage conflicted with the configuration required by VDSM. This caused recovery operations after a failure to be very slow or, in certain cases, to never finish. Now, VDSM ships a multipath configuration that overrides the defaults from the multipath package. As a result, recovery operations are on par with all the other storage solutions.
Clone Of:
Environment:
Last Closed: 2015-02-11 21:11:05 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
Logs from engine and vdsm (1.11 MB, application/x-gzip)
2014-05-18 10:22 UTC, Elad
engine vdsm and server logs (769.95 KB, application/x-gzip)
2014-10-06 08:17 UTC, Kevin Alon Goldblatt


Links
Red Hat Product Errata RHBA-2015:0159 (SHIPPED_LIVE): vdsm 3.5.0 - bug fix and enhancement update. Last updated: 2015-02-12 01:35:58 UTC
oVirt gerrit 35072 (master, MERGED): multipath: add specific fail entry for emc-vnx. Last updated: 2020-10-01 19:54:11 UTC
oVirt gerrit 35179 (ovirt-3.5, MERGED): multipath: add specific fail entry for emc-vnx. Last updated: 2020-10-01 19:54:18 UTC
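
The gerrit patches above add a vendor-specific "fail" entry to the multipath configuration shipped by VDSM. A minimal sketch of what such an override could look like in multipath.conf follows; the exact stanza is an assumption, not a copy of the merged patch ("DGC" is the SCSI vendor ID commonly reported by EMC VNX/CLARiiON arrays):

device {
    # Illustrative entry for EMC-VNX arrays; vendor/product values are assumed,
    # not taken from the merged patch.
    vendor        "DGC"
    product       ".*"
    # Fail I/O immediately when all paths are down instead of queueing it,
    # so lvm commands return errors quickly rather than hanging for minutes.
    no_path_retry fail
}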

Description Elad 2014-05-18 10:22:28 UTC
Created attachment 896765 [details]
Logs from engine and vdsm

Description of problem:
While the master domain in the pool was a block-type domain (iSCSI), I blocked connectivity to its storage server. The other active domain in the DC was a file-type domain (NFS). Reconstruct master failed with a ResourceTimeout in VDSM.
I tested the reverse scenario (from file to block) with the same storage domains, and reconstruct succeeded; the DC recovered.
File to file also succeeded. These 3 scenarios were tested in 2 different RHEV environments.


Version-Release number of selected component (if applicable):
av9.1
rhevm-3.4.0-0.20.el6ev.noarch
vdsm-4.14.7-2.el6ev.x86_64


How reproducible:
Always

Steps to Reproduce: 
On a shared DC with 2 storage domains of different types (NFS and iSCSI), make sure the iSCSI domain is the master:
1. Block connectivity from the SPM host to the master domain (iSCSI); one way to do this is sketched below
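
A hypothetical way to block that connectivity on the SPM host is an iptables rule against the storage server's address (10.0.0.1 stands in for the real iSCSI portal address, which is not given in this report):

# Hypothetical: drop all outgoing traffic from the SPM host to the iSCSI portal
iptables -A OUTPUT -d 10.0.0.1 -j DROP
# Restore connectivity afterwards by deleting the rule
iptables -D OUTPUT -d 10.0.0.1 -j DROP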


Actual results:

Reconstruct master from domain af045372-b42f-4866-afee-7c4f118393f6 begins:

From engine.log:

2014-05-18 12:13:03,813 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-4-thread-41) [59860773] START, ReconstructMasterVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, vdsSpmId = 1, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storagePoolName = dc2, masterDomainId = af045372-b42f-4866-afee-7c4f118393f6, masterVersion = 4, domainsList = [{ domainId: 5588721a-9d48-4525-b632-14d794114605, status: Maintenance };{ domainId: 89353e13-07bf-4f15-9787-3d7828d395aa, status: Maintenance };{ domainId: af045372-b42f-4866-afee-7c4f118393f6, status: Unknown };{ domainId: 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14, status: InActive };]), log id: 47b66656


It fails with an error from VDSM:

2014-05-18 10:51:27,631 ERROR [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-4-thread-44) [58b391d5] Command org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to ReconstructMasterVDS, error = Resource timeout: () (Failed with error ResourceTimeout and code 851)


The error in vdsm.log:

Thread-25::ERROR::2014-05-18 12:15:00,382::task::866::TaskManager.Task::(_setError) Task=`e3d7d49d-9108-455a-8d37-5c83ee54e7c9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1878, in reconstructMaster
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()


=============

Tried the same scenario, only reversed (while NFS domain is the master and the target domain to take the master is iSCSI):

2014-05-18 11:42:16,644 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-4-thread-44) [fa1a415] START, ReconstructMasterVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, vdsSpmId = 1, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storagePoolName = dc2, masterDomainId = 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14, masterVersion = 3, domainsList = [{ domainId: 89353e13-07bf-4f15-9787-3d7828d395aa, status: Maintenance };{ domainId: 5588721a-9d48-4525-b632-14d794114605, status: Maintenance };{ domainId: af045372-b42f-4866-afee-7c4f118393f6, status: InActive };{ domainId: 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14, status: Unknown };]), log id: 47125889




2014-05-18 11:46:19,074 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-44) [fa1a415] Correlation ID: fa1a415, Job ID: 54d8160f-94e7-492f-aa56-5764a559479a, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center dc2 completed.


Expected results:
Reconstruct master, triggered because the iSCSI domain is unreachable, should succeed.

Additional info: Logs from engine and vdsm

Comment 2 Federico Simoncelli 2014-05-19 10:12:23 UTC
The reconstructMaster command failed to acquire the resource because a previous connectStoragePool command was stuck for 10 minutes waiting for an lvm command to complete:

Thread-17::INFO::2014-05-18 12:04:59,986::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='2afb1571-1fc2-4906-b9ea-33984997ecc2', ...

Thread-17::DEBUG::2014-05-18 12:05:00,358::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs ...

(10 minutes, and reconstructMaster call)

Thread-17::DEBUG::2014-05-18 12:15:02,000::lvm::296::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n ...


I don't think this is a mixed-pool regression; it's rather an lvm regression.
If it's not a very rare lvm race, then it should also be reproducible with two iSCSI domains (on different servers).
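
To make the interaction concrete, here is a minimal Python sketch (not vdsm's actual code; names and timeouts are invented for illustration) of how a timed exclusive-lock acquisition fails while another thread holds the lock across a slow command:

import threading
import time

class ResourceTimeout(Exception):
    """Stands in for vdsm's se.ResourceTimeout."""

_pool_lock = threading.Lock()  # stands in for the pool's exclusive resource

def connect_storage_pool():
    # Holds the exclusive resource for the entire call, including the
    # stuck lvm command (simulated here with a long sleep).
    with _pool_lock:
        time.sleep(600)  # stands in for the lvm vgs call blocked on dead paths

def reconstruct_master():
    # Timed acquisition of the same resource: it gives up after the
    # timeout while connect_storage_pool still holds the lock.
    if not _pool_lock.acquire(timeout=120):
        raise ResourceTimeout("Resource timeout: ()")
    try:
        pass  # reconstruction work would happen here
    finally:
        _pool_lock.release()

threading.Thread(target=connect_storage_pool).start()
time.sleep(1)          # let the first thread grab the lock
reconstruct_master()   # raises ResourceTimeout after ~120 seconds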


All relevant logs:

Thread-17::INFO::2014-05-18 12:04:59,986::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='2afb1571-1fc2-4906-b9ea-33984997ecc2', hostID=1, msdUUID='7dd40ede-91e0-4fe6-b05f-29d6e22a7b14', masterVersion=3, domainsMap=None, options=None)

Thread-17::DEBUG::2014-05-18 12:05:00,358::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/3514f0c546260002e|/dev/mapper/3514f0c546260002f|/dev/mapper/3514f0c5462600030|/dev/mapper/3514f0c5462600031|/dev/mapper/3514f0c5462600039|/dev/mapper/3514f0c546260003a|/dev/mapper/3514f0c546260003b|/dev/mapper/3514f0c546260003c|/dev/mapper/360060160f4a0300018e9960191dbe311|/dev/mapper/360060160f4a03000db85fcfa90dbe311|/dev/mapper/360060160f4a03000dc85fcfa90dbe311|/dev/mapper/360060160f4a03000dd85fcfa90dbe311|/dev/mapper/360060160f4a03000de85fcfa90dbe311|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14' (cwd None)


Thread-25::INFO::2014-05-18 12:13:00,347::logUtils::44::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='2afb1571-1fc2-4906-b9ea-33984997ecc2', poolName='dc2', masterDom='af045372-b42f-4866-afee-7c4f118393f6', domDict={'89353e13-07bf-4f15-9787-3d7828d395aa': 'attached', '7dd40ede-91e0-4fe6-b05f-29d6e22a7b14': 'active', 'af045372-b42f-4866-afee-7c4f118393f6': 'active', '5588721a-9d48-4525-b632-14d794114605': 'attached'}, masterVersion=4, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, hostId=1, options=None)

Thread-25::ERROR::2014-05-18 12:15:00,382::task::866::TaskManager.Task::(_setError) Task=`e3d7d49d-9108-455a-8d37-5c83ee54e7c9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1878, in reconstructMaster
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()


Thread-17::DEBUG::2014-05-18 12:15:02,000::lvm::296::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000de85fcfa90dbe311 was disabled\n  /dev/mapper/360060160f4a03000dc85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000dc85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000dc85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000dc85fcfa90dbe311 was disabled\n  /dev/mapper/360060160f4a03000db85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000db85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000db85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000db85fcfa90dbe311 was disabled\n  /dev/mapper/360060160f4a03000dd85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000dd85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000dd85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000dd85fcfa90dbe311 was disabled\n  /dev/mapper/360060160f4a0300018e9960191dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a0300018e9960191dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a0300018e9960191dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a0300018e9960191dbe311 was disabled\n  Volume group "7dd40ede-91e0-4fe6-b05f-29d6e22a7b14" not found\n  Skipping volume group 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14\n'; <rc> = 5

Thread-17::ERROR::2014-05-18 12:15:02,058::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=2afb1571-1fc2-4906-b9ea-33984997ecc2, msdUUID=7dd40ede-91e0-4fe6-b05f-29d6e22a7b14'", 'code': 304}}

Comment 3 Allon Mureinik 2014-05-19 12:04:11 UTC
(In reply to Federico Simoncelli from comment #2)
> I don't think this is a mixed-pool regression; it's rather an lvm regression.
> If it's not a very rare lvm race, then it should also be reproducible with
> two iSCSI domains (on different servers).
In that case, this should also reproduce from one iSCSI to another.
Elad - can you verify this?

In any event, this cannot block 3.4.0 (since it's already out there).
Moving to 3.4.1 to continue research.

Comment 5 Allon Mureinik 2014-05-29 08:31:25 UTC
Elad, please see question in comment 3.

Comment 6 Elad 2014-05-29 08:59:48 UTC
(In reply to Allon Mureinik from comment #5)
> Elad, please see question in comment 3.

Yes, it also reproduces from block to block. It seems that the issue occurs when the source domain is located on a specific storage server (EMC-VNX).
When I tested it with a different storage server hosting the source storage domain, reconstruct succeeded.

Comment 7 Allon Mureinik 2014-05-29 10:54:17 UTC
Reducing severity based on the fact that it's specific to one storage machine.

Comment 9 Kevin Alon Goldblatt 2014-10-06 08:17:23 UTC
Created attachment 944167 [details]
engine vdsm and server logs

I have reproduced this scenario using V3.5 vt4:
rhevm-3.5.0-0.13.beta.el6ev.noarch
vdsm-4.16.5-2.el6ev.x86_64


This is a serious issue and this defect was detected in June. I had to delete my setup as a result. Adding new logs.

Comment 13 Aharon Canan 2014-10-29 08:05:49 UTC
This needinfo is for Kevin.

Comment 15 Kevin Alon Goldblatt 2014-11-12 15:07:37 UTC
I helped Federico reproduce this problem on our systems connected to the VNX server. Closing the needinfo.

Comment 16 lkuchlan 2014-11-26 13:35:05 UTC
The NFS domain has been activated and set as the new master domain.
Verified on RHEV-M 3.5 vt11.

Comment 18 errata-xmlrpc 2015-02-11 21:11:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html

