| Summary: | Failed reconstruct master storage domain | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | vvyazmin <vvyazmin> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Aharon Canan <acanan> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3.0 | CC: | acanan, amureini, bazulay, fsimonce, laravot, lpeer, nsoffer, scohen, sgotliv, tnisan, yeylon, ylavi |
| Target Milestone: | ovirt-3.6.0-rc | | |
| Target Release: | 3.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
```
# multipath -ll
36006016012823000121584ca540ae311 dm-3 DGC,VRAID
size=30G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 3:0:0:0 sda 8:0  failed faulty running
3514f0c54d5c00bac dm-11 XtremIO,XtremApp
size=31G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 4:0:0:1 sdc 8:32 active ready running
3514f0c54d5c00bab dm-10 XtremIO,XtremApp
size=31G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 4:0:0:0 sdb 8:16 active ready running
```

Referring to the timeout exception only: it seems that the reconstruct on the vdsm side executes a command that hangs, leading to the timeout exception in the engine on the different operations.
1. reconstruct execution start
Thread-1850::INFO::2013-08-21 13:11:11,794::logUtils::44::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='9000d9d2-26b9-4d0f-
--- when executing:
Thread-1850::DEBUG::2013-08-21 13:11:42,990::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"
] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/3514f0c54d5c00bab|/dev/mapper/3514f0c54d5c00bac|/dev/mapper/360060160
12823000121584ca540ae311|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadin
gs --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 04421dfa-a44d-49
dc-a4c5-702e49361264' (cwd None)
-- (5 minutes later)
Thread-1850::DEBUG::2013-08-21 13:16:42,947::lvm::314::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' /dev/mapper/36006016012823000121584ca540ae311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/36006016012823000121584ca540ae311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/36006016012823000121584ca540ae311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/36006016012823000121584ca540ae311 was disabled\n'; <rc> = 0
The same goes for connectStoragePool:
Thread-37::DEBUG::2013-08-21 13:26:28,729::lvm::314::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"]
ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/3514f0c54d5c00bab|/dev/mapper/3514f0c54d5c00bac|/dev/mapper/36006016012
823000121584ca540ae311|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings
--units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 245825c2-ed4b-45c9
-98d6-18475ba761e4' (cwd None)
Thread-37::DEBUG::2013-08-21 13:31:28,458::lvm::314::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' /dev/mapper/36006016012823000121584ca540ae311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/36006016012823000121584ca540ae311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/36006016012823000121584ca540ae311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/36006016012823000121584ca540ae311 was disabled\n Volume group "245825c2-ed4b-45c9-98d6-18475ba761e4" not found\n'; <rc> = 5
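The hangs above are the core of the problem: the LVM child process blocks on I/O to the dead device for five minutes while the engine's RPC timeout fires much earlier. As a generic illustration of bounding a child process with a timeout (a minimal sketch only, not vdsm's actual command wrapper):

```python
import subprocess

def run_bounded(cmd, timeout):
    """Run cmd, killing it if it exceeds `timeout` seconds.

    Returns (rc, stdout, stderr) on completion; raises
    subprocess.TimeoutExpired after killing the child. Illustration
    only -- vdsm's real execCmd machinery is more involved.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed child
        raise
    return proc.returncode, out, err

# A command that finishes in time returns normally.
rc, out, err = run_bounded(["echo", "ok"], timeout=5)
```

With such a bound, a vgs invocation stuck on a dead multipath device would fail fast on the host instead of letting the engine-side RPC time out first.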
Fede, can you take a look?
According to Sergey and Ayal, this fix won't be part of RHEL 6.5 - nothing to do then.

Sergey, following our discussion, it shouldn't be related to the multipath. Please check, as it is critical. Suspected to be a duplicate of bug https://bugzilla.redhat.com/865034.

Nir - worth considering along the other multipath issues you're looking into.

Liron - now that we don't store the pool metadata on the storage, what's left to do from our side?

Allon, the issue isn't related to the fact that the pool metadata was on storage (having it in memory obviously shortens the whole operation), but to the different storage refreshes we perform in that flow (_reloadvg() in that case). Unless some of the relevant configurations were changed, we can still run into the same issue, at least until reconstruct is deprecated.

Can we reproduce this with 3.5 on rhel 7.1?

(In reply to Nir Soffer from comment #14)
> Can we reproduce this with 3.5 on rhel 7.1?

Do we consider fixing it in 3.5.x? Why do you need reproduction on 3.5 and 7.1?

(In reply to Aharon Canan from comment #15)
> Do we consider fixing it in 3.5.x ? why do you need reproduction on 3.5 and
> 7.1 ?

3.5 and 7.1 are the stable versions. If we cannot reproduce it there, this is not interesting. 3.6 is also ok for testing, but 7.2 is not ready yet; see bug 1253756.

I think that this patch should solve this issue:

```
commit 50fc8a08ea558a19b8fc4eecfaad59c874225ef9
Author: Federico Simoncelli <fsimonce>
Date:   Wed Nov 12 10:22:29 2014 +0000

    multipath: add specific fail entry for emc-vnx

    As with a few other storage arrays, the emc-vnx forces the
    queue_if_no_path option on the multipath device. This patch adds a
    specific section for the device, maintaining the emc defaults but
    forcing the io to fail instead of queuing.

    In case of queuing we observed that the processes were blocked on io
    and, among other things, it prevented the storage pool from being
    reconstructed.

    Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1098769
    Change-Id: Icf5f4c89eac1273bf3a8e1afa300fec7b5b68eae
    Signed-off-by: Federico Simoncelli <fsimonce>
    Reviewed-on: http://gerrit.ovirt.org/35072
    Reviewed-by: Nir Soffer <nsoffer>
    Reviewed-by: Dan Kenigsberg <danken>
    (cherry picked from commit 9880fdf9989718a097bb05571a6344e28c2f258b)
    Reviewed-on: http://gerrit.ovirt.org/35179
    Reviewed-by: Yaniv Bronhaim <ybronhei>
```

It is available in ovirt/rhev 3.5.

Created attachment 1072037 [details]
Logs01

(In reply to Nir Soffer from comment #14)
> Can we reproduce this with 3.5 on rhel 7.1?

I retested as described below:
1. set up one rhel7.1 hypervisor
2. set up 2 storage domains, on different arrays
3. block connectivity (iptables) to the master domain

Results - it took some time, but the reconstruct finally worked. However:
1. DC is up (ok)
2. Host is non-operational (why??) - the host can access the new master but can't access the second domain
3. The blocked SD appears as active in the GUI but its sizes are N/A (it should be marked as inactive)
4. After a few more minutes, both SDs went to UNKNOWN

Anyway, everything looks very strange here.
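For reference, the patch discussed above works by giving the VNX (vendor "DGC", as shown in the multipath -ll output) a device section that keeps the EMC defaults but fails I/O instead of queuing. A minimal sketch of such a multipath.conf stanza (illustrative only, not the exact hunk from the gerrit change):

```
devices {
    device {
        vendor          "DGC"
        product         ".*"
        # Fail I/O immediately when no path is available instead of
        # queuing forever; unbounded queuing left vgs/lvm blocked on
        # the dead LUN and stalled the reconstruct flow.
        no_path_retry   fail
    }
}
```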
```
[root@camel-vdsc ~]# rpm -qa | grep vdsm
vdsm-python-zombiereaper-4.16.26-1.el7ev.noarch
vdsm-yajsonrpc-4.16.26-1.el7ev.noarch
vdsm-xmlrpc-4.16.26-1.el7ev.noarch
vdsm-python-4.16.26-1.el7ev.noarch
vdsm-cli-4.16.26-1.el7ev.noarch
vdsm-4.16.26-1.el7ev.x86_64
vdsm-jsonrpc-4.16.26-1.el7ev.noarch

[root@camel-vdsc ~]# rpm -qa | grep multi
device-mapper-multipath-0.4.9-77.el7_1.1.x86_64
device-mapper-multipath-libs-0.4.9-77.el7_1.1.x86_64

[root@camel-vdsc ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:54321
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:sunrpc
ACCEPT     udp  --  anywhere             anywhere             udp dpt:sunrpc
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:ssh
ACCEPT     udp  --  anywhere             anywhere             udp dpt:snmp
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:16514
ACCEPT     tcp  --  anywhere             anywhere             multiport dports rfb:6923
ACCEPT     tcp  --  anywhere             anywhere             multiport dports 49152:49216
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited
DROP       all  --  monster-spa-p0.scl.lab.tlv.redhat.com  anywhere
DROP       all  --  monster-spa-p1.scl.lab.tlv.redhat.com  anywhere
DROP       all  --  monster-spb-p0.scl.lab.tlv.redhat.com  anywhere
DROP       all  --  monster-spb-p1.scl.lab.tlv.redhat.com  anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere             PHYSDEV match ! --physdev-is-bridged reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DROP       all  --  anywhere             monster-spb-p1.scl.lab.tlv.redhat.com
DROP       all  --  anywhere             monster-spb-p0.scl.lab.tlv.redhat.com
DROP       all  --  anywhere             monster-spa-p1.scl.lab.tlv.redhat.com
DROP       all  --  anywhere             monster-spa-p0.scl.lab.tlv.redhat.com
```

More info - I waited a few more minutes, and now everything looks ok (DC, host, master domain and blocked domain). I think we are good - it is probably a matter of time until everything finally happens. I will keep the setup for one more hour to check if something changes.

Reconstruction does indeed take a while. Setting to ON_QA based on the last comment. If you decide it works acceptably, set it to VERIFIED. If not, reopen it and specify the exact issue that should be fixed.

(In reply to Aharon Canan from comment #18)
> I will keep the setup for one more hour to check if something change.

I guess the hour has passed :-) Can you update us on the result? If you are not sure, you can test this with 3.5. If there is a significant difference in behavior between 3.5 and 3.6, we can investigate.

(In reply to Allon Mureinik from comment #19)
> Reconstruction does indeed take a while.
>
> Setting to ON_QA based on the last comment. If you decide it works
> acceptably, set it to VERIFIED. If not, reopen it and specify the exact
> issue that should be fixed.

Verified (versions in comment #17)

RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE
Created attachment 788836 [details]
Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Failed reconstruct master storage domain

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS10 environment:
RHEVM: rhevm-3.3.0-0.15.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.10-1.el6ev.noarch
VDSM: vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI Data Center (DC) with 5 hosts connected to 3 Storage Domains (SD):
   * On the first storage server (EMC) - one SD
   * On the second storage server (XIO) - two SDs

   See diagram below:

```
[V] Host_01 (SPM) _____ connected _________ SD_01 (EMC) - Master
[V] Host_02 _________|              |_______ SD_02 (XIO)
[V] Host_03 _________|              |_______ SD_03 (XIO)
[V] Host_04 _________|
[V] Host_05 _________|
```

2. From all hosts, block connectivity to the master SD via iptables.

Actual results:
Failed to reconstruct the master Storage Domain

Expected results:
Reconstruct of the master Storage Domain succeeds

Impact on user: can't work with the DC

Workaround: none

Additional info:

/var/log/ovirt-engine/engine.log

```
2013-08-21 13:14:09,913 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-5-thread-8147) Command ReconstructMasterVDS execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2013-08-21 13:14:09,913 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (pool-5-thread-8147) FINISH, ReconstructMasterVDSCommand, log id: 2c6c5ef9
2013-08-21 13:14:09,913 ERROR [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-5-thread-8147) Command org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)
2013-08-21 13:14:09,915 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-5-thread-8147) Command [id=44719e85-bb25-4d81-870d-7eac2d2be1f9]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot [id=storagePoolId = 9000d9d2-26b9-4d0f-89dc-c0d34ef25d25, storageId = 245825c2-ed4b-45c9-98d6-18475ba761e4, status=Unknown].
2013-08-21 13:14:09,916 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-5-thread-8147) Command [id=44719e85-bb25-4d81-870d-7eac2d2be1f9]: Compensating CHANGED_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainStatic; snapshot: id=245825c2-ed4b-45c9-98d6-18475ba761e4.
2013-08-21 13:14:09,918 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-5-thread-8147) Command [id=44719e85-bb25-4d81-870d-7eac2d2be1f9]: Compensating CHANGED_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainStatic; snapshot: id=04421dfa-a44d-49dc-a4c5-702e49361264.
```

/var/log/vdsm/vdsm.log