Created attachment 896765 [details]
Logs from engine and vdsm

Description of problem:
While the master domain in the pool was of a block type (iSCSI), I blocked connectivity to its storage server. The other active domain in the DC was of a file type (NFS). Reconstruct failed with a ResourceTimeout on VDSM.

I tested the reverse scenario (from file to block) with the same storage domains, and reconstruct succeeded: the DC was recovered. Reconstruct from file to file also succeeded. These 3 scenarios were tested in 2 different RHEV environments.

Version-Release number of selected component (if applicable):
av9.1
rhevm-3.4.0-0.20.el6ev.noarch
vdsm-4.14.7-2.el6ev.x86_64

How reproducible:
Always

Steps to Reproduce:
On a shared DC with 2 storage domains of different types (NFS and iSCSI), make sure the iSCSI domain is the master.
1. Block connectivity from the SPM host to the master domain (iSCSI)

Actual results:
Reconstruct master from domain af045372-b42f-4866-afee-7c4f118393f6 begins. From engine.log:

2014-05-18 12:13:03,813 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-4-thread-41) [59860773] START, ReconstructMasterVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, vdsSpmId = 1, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storagePoolName = dc2, masterDomainId = af045372-b42f-4866-afee-7c4f118393f6, masterVersion = 4, domainsList = [{ domainId: 5588721a-9d48-4525-b632-14d794114605, status: Maintenance };{ domainId: 89353e13-07bf-4f15-9787-3d7828d395aa, status: Maintenance };{ domainId: af045372-b42f-4866-afee-7c4f118393f6, status: Unknown };{ domainId: 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14, status: InActive };]), log id: 47b66656

It fails with an error from VDSM:

2014-05-18 10:51:27,631 ERROR [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-4-thread-44) [58b391d5] Command org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand throw Vdc Bll exception.
With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to ReconstructMasterVDS, error = Resource timeout: () (Failed with error ResourceTimeout and code 851)

The error in vdsm.log:

Thread-25::ERROR::2014-05-18 12:15:00,382::task::866::TaskManager.Task::(_setError) Task=`e3d7d49d-9108-455a-8d37-5c83ee54e7c9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1878, in reconstructMaster
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()

=============

Tried the same scenario, only reversed (the NFS domain is the master and the domain that takes over as master is iSCSI):

2014-05-18 11:42:16,644 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-4-thread-44) [fa1a415] START, ReconstructMasterVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, vdsSpmId = 1, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storagePoolName = dc2, masterDomainId = 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14, masterVersion = 3, domainsList = [{ domainId: 89353e13-07bf-4f15-9787-3d7828d395aa, status: Maintenance };{ domainId: 5588721a-9d48-4525-b632-14d794114605, status: Maintenance };{ domainId: af045372-b42f-4866-afee-7c4f118393f6, status: InActive };{ domainId: 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14, status: Unknown };]), log id: 47125889

2014-05-18 11:46:19,074 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-44) [fa1a415] Correlation ID: fa1a415, Job ID: 54d8160f-94e7-492f-aa56-5764a559479a, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center dc2 completed.

Expected results:
Reconstruct master, triggered because the iSCSI domain is unreachable, should succeed.

Additional info:
Logs from engine and vdsm
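For completeness, step 1 of the reproduction (blocking connectivity from the SPM host to the iSCSI storage server) can be scripted. A minimal sketch, assuming a hypothetical portal address (substitute the storage server that backs the master domain), to be run as root on the SPM host:

# Minimal reproduction helper. The portal address is hypothetical;
# substitute the storage server that backs the iSCSI master domain.
# Run as root on the SPM host.
import subprocess

PORTAL_IP = "10.35.64.10"  # hypothetical iSCSI portal address

RULE = ["OUTPUT", "-d", PORTAL_IP, "-j", "DROP"]


def block_portal():
    # Drop all outgoing traffic to the portal so the master domain
    # becomes unreachable and the engine triggers a reconstruct.
    subprocess.check_call(["iptables", "-A"] + RULE)


def restore_portal():
    # Remove the rule to restore connectivity after the test.
    subprocess.check_call(["iptables", "-D"] + RULE)


if __name__ == "__main__":
    block_portal()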
The reconstructMaster command failed to acquire the resource because a previous connectStoragePool command was stuck for 10 minutes waiting for an lvm command to complete:

Thread-17::INFO::2014-05-18 12:04:59,986::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='2afb1571-1fc2-4906-b9ea-33984997ecc2', ...
Thread-17::DEBUG::2014-05-18 12:05:00,358::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs ...

(10 minutes pass, during which the reconstructMaster call arrives and times out)

Thread-17::DEBUG::2014-05-18 12:15:02,000::lvm::296::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n ...

I don't think this is a mixed pool regression; it's rather an lvm regression. If it's not a very rare lvm race, it should also be reproducible with two iSCSI domains (on different servers).
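To make the locking interaction concrete, here is an illustrative Python sketch. This is not vdsm's actual code (names and numbers are simplified, though the roughly two-minute wait matches the 12:13:00 -> 12:15:00 timestamps in the traceback): one thread holds the storage pool's exclusive lock while a slow external command runs, so the second caller times out.

# Illustrative sketch only -- not vdsm's implementation. It mimics the
# pattern seen in the logs: connectStoragePool holds the pool lock while
# an lvm command hangs, and reconstructMaster times out waiting for it.
import threading
import time


class ResourceTimeout(Exception):
    pass


class ResourceManager(object):
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def acquire_exclusive(self, name, timeout):
        # Roughly the behavior of resourceManager.acquire() before it
        # raises se.ResourceTimeout (line 820 in the traceback above).
        with self._guard:
            lock = self._locks.setdefault(name, threading.Lock())
        deadline = time.time() + timeout
        while time.time() < deadline:
            if lock.acquire(False):
                return lock
            time.sleep(0.1)
        raise ResourceTimeout("Resource timeout: %s" % name)


rm = ResourceManager()
SP_UUID = "2afb1571-1fc2-4906-b9ea-33984997ecc2"


def connect_storage_pool():
    lock = rm.acquire_exclusive(SP_UUID, timeout=120)
    try:
        time.sleep(600)  # stands in for the lvm vgs call hung on I/O errors
    finally:
        lock.release()


def reconstruct_master():
    # Raises ResourceTimeout after ~120s: connectStoragePool still holds
    # the lock because its lvm child process has not returned.
    lock = rm.acquire_exclusive(SP_UUID, timeout=120)
    lock.release()


threading.Thread(target=connect_storage_pool).start()
time.sleep(1)
reconstruct_master()  # -> ResourceTimeout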
All relevant logs:

Thread-17::INFO::2014-05-18 12:04:59,986::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='2afb1571-1fc2-4906-b9ea-33984997ecc2', hostID=1, msdUUID='7dd40ede-91e0-4fe6-b05f-29d6e22a7b14', masterVersion=3, domainsMap=None, options=None)

Thread-17::DEBUG::2014-05-18 12:05:00,358::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/3514f0c546260002e|/dev/mapper/3514f0c546260002f|/dev/mapper/3514f0c5462600030|/dev/mapper/3514f0c5462600031|/dev/mapper/3514f0c5462600039|/dev/mapper/3514f0c546260003a|/dev/mapper/3514f0c546260003b|/dev/mapper/3514f0c546260003c|/dev/mapper/360060160f4a0300018e9960191dbe311|/dev/mapper/360060160f4a03000db85fcfa90dbe311|/dev/mapper/360060160f4a03000dc85fcfa90dbe311|/dev/mapper/360060160f4a03000dd85fcfa90dbe311|/dev/mapper/360060160f4a03000de85fcfa90dbe311|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14' (cwd None)

Thread-25::INFO::2014-05-18 12:13:00,347::logUtils::44::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='2afb1571-1fc2-4906-b9ea-33984997ecc2', poolName='dc2', masterDom='af045372-b42f-4866-afee-7c4f118393f6', domDict={'89353e13-07bf-4f15-9787-3d7828d395aa': 'attached', '7dd40ede-91e0-4fe6-b05f-29d6e22a7b14': 'active', 'af045372-b42f-4866-afee-7c4f118393f6': 'active', '5588721a-9d48-4525-b632-14d794114605': 'attached'}, masterVersion=4, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, hostId=1, options=None)

Thread-25::ERROR::2014-05-18 12:15:00,382::task::866::TaskManager.Task::(_setError) Task=`e3d7d49d-9108-455a-8d37-5c83ee54e7c9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1878, in reconstructMaster
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()

Thread-17::DEBUG::2014-05-18 12:15:02,000::lvm::296::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000de85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000de85fcfa90dbe311 was disabled\n /dev/mapper/360060160f4a03000dc85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000dc85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000dc85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000dc85fcfa90dbe311 was disabled\n /dev/mapper/360060160f4a03000db85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000db85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000db85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000db85fcfa90dbe311 was disabled\n /dev/mapper/360060160f4a03000dd85fcfa90dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000dd85fcfa90dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000dd85fcfa90dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000dd85fcfa90dbe311 was disabled\n /dev/mapper/360060160f4a0300018e9960191dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a0300018e9960191dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a0300018e9960191dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a0300018e9960191dbe311 was disabled\n Volume group "7dd40ede-91e0-4fe6-b05f-29d6e22a7b14" not found\n Skipping volume group 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14\n'; <rc> = 5

Thread-17::ERROR::2014-05-18 12:15:02,058::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=2afb1571-1fc2-4906-b9ea-33984997ecc2, msdUUID=7dd40ede-91e0-4fe6-b05f-29d6e22a7b14'", 'code': 304}}
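The ten-minute window exists because the external lvm child process does not return while the blocked device produces I/O errors, and the pool lock is held for that entire time. As a hedged illustration only (not the fix that was actually shipped), a caller could cap such external commands with a deadline; run_with_deadline below is a hypothetical helper, not a vdsm API:

# Hedged sketch of one possible mitigation, not the shipped fix: run the
# external lvm command with a deadline so a device hanging on I/O errors
# cannot keep the pool lock held for ten minutes.
import subprocess
import threading


def run_with_deadline(cmd, deadline):
    """Run cmd, killing it after `deadline` seconds.

    Returns (rc, out, err); rc is negative (the signal number) if the
    process had to be killed.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    timer = threading.Timer(deadline, proc.kill)
    timer.start()
    try:
        out, err = proc.communicate()
    finally:
        timer.cancel()
    return proc.returncode, out, err


# Example: the vgs query from the log above, capped at 60 seconds.
rc, out, err = run_with_deadline(
    ["/usr/bin/sudo", "-n", "/sbin/lvm", "vgs", "--noheadings",
     "7dd40ede-91e0-4fe6-b05f-29d6e22a7b14"], 60)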
(In reply to Federico Simoncelli from comment #2)
> I don't think this is a mixed pool regression, it's rather an lvm
> regression. If it's not a very rare lvm race then it should be
> reproducible also with two iscsi domains (on different server).

In that case, this should also reproduce from one iSCSI domain to another. Elad, can you verify this?

In any event, this cannot block 3.4.0 (since it's already out there). Moving to 3.4.1 to continue the research.
Elad, please see the question in comment 3.
(In reply to Allon Mureinik from comment #5)
> Elad, please see question in comment 3.

Yes, it also reproduces from block to block. The issue seems to occur when the source domain is located on a specific storage server (EMC-VNX); when I tested with a different storage server hosting the source storage domain, reconstruct succeeded.
Reducing severity based on the fact that the issue is specific to a particular storage machine.
Created attachment 944167 [details]
engine, vdsm, and server logs

I have reproduced this scenario using v3.5 vt4:
rhevm-3.5.0-0.13.beta.el6ev.noarch
vdsm-4.16.5-2.el6ev.x86_64

This is a serious issue, and the defect was first detected in June. I had to delete my setup as a result. Adding new logs.
This needinfo is for Kevin
I helped Federico reproduce this problem on our systems connected to the VNX server. Clearing the needinfo.
The NFS domain has been activated and set as the new master domain.

Verified on RHEVM 3.5 vt11.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html