Bug 1001750

Summary: Wrong behavior of Storage Domains during blocked connectivity to one of the Storage Servers

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Status: CLOSED WORKSFORME
Severity: urgent
Priority: unspecified
Version: 3.3.0
Target Milestone: ---
Target Release: 3.3.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Reporter: vvyazmin <vvyazmin>
Assignee: Liron Aravot <laravot>
QA Contact: Aharon Canan <acanan>
Docs Contact:
CC: abaron, acanan, acathrow, amureini, fsimonce, hateya, iheim, laravot, lpeer, Rhev-m-bugs, scohen, tnisan, yeylon
Keywords: Regression, Reopened, Triaged
Flags: abaron: Triaged+
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-05 16:33:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
IMHO, the root cause of this behavior is the influence of the Reconstruct Storage Domain feature on the whole environment.

vvyazmin, please provide more info here so it'll be easier for people to follow when inspecting the logs:
1. Which one of the domains was the master when the scenario started?
2. The vdsm logs are incomplete (they only go up to 17:13 for tigris01).
3. If you are going to reproduce, please do it with one DC - several DCs add a lot of noise to the logs.

*** Bug 1002049 has been marked as a duplicate of this bug. ***

vvyazmin, what about the additional logs Liron asked for?

[V] Host_01 (SPM) ----- connected --------- SD_01 (EMC)
                                   |_______ SD_02 (XIO)
                                   |_______ SD_03 (XIO) - Master

SD_03 - Master
Block connectivity via iptables
2-Oct-2013 13:28 - Logs Attached

Created attachment 806377 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Failed. Tested on RHEVM 3.3 - IS16 environment:
Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.22.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.14-1.el6ev.noarch
VDSM: vdsm-4.12.0-156.git6e499d6.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-27.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.407.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

The blocked domain moves to Inactive as expected after 5 minutes:

2013-10-02 13:31:00,325 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-50) starting ProcessDomainRecovery for domain c8212f07-ef0a-4e09-b69e-3339cab7554a:SD-iSCSI-A-XIO
2013-10-02 13:31:00,988 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-50) Domain c8212f07-ef0a-4e09-b69e-3339cab7554a:SD-iSCSI-A-XIO was reported by all hosts in status UP as problematic. Moving the domain to NonOperational.
2013-10-02 13:31:01,006 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (pool-5-thread-50) Lock Acquired to object EngineLock [exclusiveLocks= key: c8212f07-ef0a-4e09-b69e-3339cab7554a value: STORAGE , sharedLocks= key: 1fc64d56-bead-4c29-8cee-a8cedaf581a7 value: POOL ]
2013-10-02 13:31:01,056 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (pool-5-thread-50) Running command: DeactivateStorageDomainCommand internal: true. Entities affected : ID: c8212f07-ef0a-4e09-b69e-3339cab7554a Type: Storage
2013-10-02 13:31:01,132 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (pool-5-thread-50) Lock freed to object EngineLock [exclusiveLocks= key: c8212f07-ef0a-4e09-b69e-3339cab7554a value: STORAGE , sharedLocks= key: 1fc64d56-bead-4c29-8cee-a8cedaf581a7 value: POOL ]

Later on, GetDeviceList starts to fail with TimeoutException (vdsm side):

Thread-74566::DEBUG::2013-10-02 11:28:26,521::BindingXMLRPC::177::vds::(wrapper) client [10.35.161.29]
Thread-74566::DEBUG::2013-10-02 11:28:26,522::task::579::TaskManager.Task::(_updateState) Task=`60fbbf8c-6eb8-4988-8062-d28c2d6ca4ff`::moving from state init -> state preparing
Thread-74566::INFO::2013-10-02 11:28:26,523::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=3, options={})
Thread-74566::DEBUG::2013-10-02 11:28:56,514::lvm::509::OperationMutex::(_invalidateAllPvs) Operation 'lvm reload operation' is holding the operation mutex, waiting...

* We now wait for about 5 minutes, which obviously causes the timeout:

Thread-74566::DEBUG::2013-10-02 11:33:58,992::lvm::509::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
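To make the failure mode above easier to follow: getDeviceList queues behind the same LVM operation mutex that the domain-refresh thread is holding while its vgs call crawls against the unreachable paths, so the engine-side request times out. Below is a minimal, illustrative Python sketch of that serialization; it is not vdsm code, and the local acquire timeout only stands in for the engine-side RPC timeout (in vdsm the wait itself is unbounded).

```python
import threading
import time

# Stand-in for vdsm's LVM "operation mutex": cache reloads/invalidations and
# getDeviceList-style queries all serialize on it.
lvm_operation_mutex = threading.Lock()

def reload_vgs_slow():
    # Domain-monitor thread reloading VG metadata; with blocked paths the
    # underlying `lvm vgs` call hangs on I/O errors for ~5 minutes.
    with lvm_operation_mutex:
        time.sleep(300)

def get_device_list(rpc_timeout=180):
    # Hypothetical caller acting on behalf of the engine's GetDeviceList verb.
    if not lvm_operation_mutex.acquire(timeout=rpc_timeout):
        raise TimeoutError("getDeviceList still queued behind the LVM reload")
    try:
        return []  # device scan would happen here
    finally:
        lvm_operation_mutex.release()

threading.Thread(target=reload_vgs_slow, daemon=True).start()
time.sleep(1)
# get_device_list()  # would raise TimeoutError after rpc_timeout seconds
```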
The two other domains are reported as being in problem because we wait for the refresh to be executed (the thread hangs).

vdsm attempts to refresh the domains at:

Thread-74025::DEBUG::2013-10-02 11:34:08,901::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain c486bf3b-d60a-45a0-b635-835b9825865a
Thread-74024::DEBUG::2013-10-02 11:34:08,902::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 98401f2a-29a2-4016-9323-424dca956e6d
Thread-74025::DEBUG::2013-10-02 11:34:08,904::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-74024::DEBUG::2013-10-02 11:34:08,905::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)

-- 10 minutes pass --

Thread-74025::DEBUG::2013-10-02 11:44:27,698::misc::817::SamplingMethod::(__call__) Some one got in for me

Looking at the vdsm-side reports during the wait, we see the following:

Thread-74578::INFO::2013-10-02 11:37:27,168::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'98401f2a-29a2-4016-9323-424dca956e6d': {'delay': '0.000539994', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c486bf3b-d60a-45a0-b635-835b9825865a': {'delay': '0.000534723', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c8212f07-ef0a-4e09-b69e-3339cab7554a': {'delay': '0', 'lastCheck': '210.0', 'code': 2001, 'valid': False, 'version': -1}}
Thread-74578::DEBUG::2013-10-02 11:37:27,169::task::1168::TaskManager.Task::(prepare) Task=`2e2f1b4f-a619-42ce-940c-fe132a49dbb4`::finished: {u'98401f2a-29a2-4016-9323-424dca956e6d': {'delay': '0.000539994', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c486bf3b-d60a-45a0-b635-835b9825865a': {'delay': '0.000534723', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c8212f07-ef0a-4e09-b69e-3339cab7554a': {'delay': '0', 'lastCheck': '210.0', 'code': 2001, 'valid': False, 'version': -1}}
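The "Trying to enter sampling method" / "Some one got in for me" lines come from vdsm coalescing concurrent refreshStorage calls: only one run executes at a time, and a caller that queued up while a run was already in flight returns without re-running once it finally gets in. While the in-flight refresh is stuck behind the slow LVM commands, both domain-monitor threads simply wait, which is why lastCheck keeps climbing (208+ seconds in the repoStats output above). A simplified sketch of that coalescing idea follows; it is an approximation, not vdsm's actual misc.SamplingMethod implementation.

```python
import functools
import threading

def sampling_method(func):
    # Coalesce concurrent callers: one thread runs func, latecomers wait for it
    # and then skip their own run because a fresh result already exists.
    lock = threading.Lock()
    state = {"generation": 0}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        queued_at = state["generation"]          # "Trying to enter sampling method"
        with lock:
            if state["generation"] > queued_at:  # "Some one got in for me"
                return
            try:
                return func(*args, **kwargs)
            finally:
                state["generation"] += 1
    return wrapper

@sampling_method
def refresh_storage():
    # Rescans iSCSI/multipath/LVM; with a blocked storage server this can hang
    # for many minutes, and every waiting domain monitor hangs with it.
    pass
```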
In the engine, the domains are considered problematic and eventually change status to Inactive, as lastCheck is greater than the threshold defined in vdc_options:

2013-10-02 13:37:24,280 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-49) domain 98401f2a-29a2-4016-9323-424dca956e6d:SD-iSCSI-C-XIO in problem. vds: tigris01.scl.lab.tlv.redhat.com
2013-10-02 13:37:24,282 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-49) domain c486bf3b-d60a-45a0-b635-835b9825865a:SD-iSCSI-B-XIO in problem. vds: tigris01.scl.lab.tlv.redhat.com

After the 10 minutes, the domains are reported as active and move back to Active status:

2013-10-02 13:44:40,140 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-22) Storage Domain c486bf3b-d60a-45a0-b635-835b9825865a:SD-iSCSI-B-XIO was reported by Host tigris01.scl.lab.tlv.redhat.com as Active in Pool 1fc64d56-bead-4c29-8cee-a8cedaf581a7, moving to active status
2013-10-02 13:44:40,148 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-22) Storage Domain 98401f2a-29a2-4016-9323-424dca956e6d:SD-iSCSI-C-XIO was reported by Host tigris01.scl.lab.tlv.redhat.com as Active in Pool 1fc64d56-bead-4c29-8cee-a8cedaf581a7, moving to active status

But the scenario seems to repeat itself. This looks like a vdsm issue to me. Fede, can you take a look please? Can we somehow avoid that scenario?

We have to investigate what's causing _reloadvgs to be so slow (~5 minutes execution time):

Thread-74024::DEBUG::2013-10-02 11:28:42,148::lvm::385::OperationMutex::(_reloadvgs) Got the operational mutex
Thread-74024::DEBUG::2013-10-02 11:28:42,150::lvm::308::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/3514f0c54d5c00010|/dev/mapper/3514f0c54d5c00011|/dev/mapper/3514f0c54d5c00012|/dev/mapper/3514f0c54d5c00013|/dev/mapper/3514f0c54d5c001fc|/dev/mapper/3514f0c54d5c001fd|/dev/mapper/3514f0c54d5c001fe|/dev/mapper/3514f0c54d5c001ff|/dev/mapper/3514f0c54d5c00928|/dev/mapper/3514f0c54d5c0092a|/dev/mapper/3514f0c54d5c00937|/dev/mapper/3514f0c54d5c00938|/dev/mapper/3514f0c54d5c00939|/dev/mapper/3514f0c54d5c0093a|/dev/mapper/3514f0c54d5c00b0d|/dev/mapper/3514f0c54d5c00b0e|/dev/mapper/3514f0c54d5c00b0f|/dev/mapper/3514f0c54d5c00c01|/dev/mapper/3514f0c54d5c00c02|/dev/mapper/3514f0c54d5c00c03|/dev/mapper/3514f0c54d5c00c04|/dev/mapper/3514f0c54d5c00c05|/dev/mapper/3514f0c54d5c01679|/dev/mapper/3514f0c54d5c01866|/dev/mapper/3514f0c54d5c01875|/dev/mapper/3514f0c54d5c01876|/dev/mapper/3514f0c54d5c01877|/dev/mapper/3514f0c54d5c01878|/dev/mapper/3514f0c54d5c01879|/dev/mapper/3600508b1001c19126249676fbb7165e5|/dev/mapper/360060160128230001a60db85320fe311|/dev/mapper/360060160128230001c60db85320fe311|/dev/mapper/360060160128230001e60db85320fe311|/dev/mapper/360060160128230002060db85320fe311|/dev/mapper/360060160128230002260db85320fe311|/dev/mapper/360060160128230002460db85320fe311|/dev/mapper/360060160128230002660db85320fe311|/dev/mapper/360060160128230008e76a68d5d2ae311|/dev/mapper/360060160128230009076a68d5d2ae311|/dev/mapper/360060160128230009276a68d5d2ae311|/dev/mapper/360060160128230009476a68d5d2ae311|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 98401f2a-29a2-4016-9323-424dca956e6d' (cwd None)
...
Thread-74024::DEBUG::2013-10-02 11:33:58,794::lvm::308::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' /dev/mapper/360060160128230002060db85320fe311: read failed after 0 of 4096 at 35433414656: Input/output error\n /dev/mapper/360060160128230002060db85320fe311: read failed after 0 of 4096 at 35433472000: Input/output error\n /dev/mapper/360060160128230002060db85320fe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230002060db85320fe311 was disabled\n /dev/mapper/360060160128230008e76a68d5d2ae311: read failed after 0 of 4096 at 91267989504: Input/output error\n /dev/mapper/360060160128230008e76a68d5d2ae311: read failed after 0 of 4096 at 91268046848: Input/output error\n /dev/mapper/360060160128230008e76a68d5d2ae311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230008e76a68d5d2ae311 was disabled\n /dev/mapper/360060160128230009276a68d5d2ae311: read failed after 0 of 4096 at 91267989504: Input/output error\n /dev/mapper/360060160128230009276a68d5d2ae311: read failed after 0 of 4096 at 91268046848: Input/output error\n /dev/mapper/360060160128230009276a68d5d2ae311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230009276a68d5d2ae311 was disabled\n /dev/mapper/360060160128230001c60db85320fe311: read failed after 0 of 4096 at 35433414656: Input/output error\n /dev/mapper/360060160128230001c60db85320fe311: read failed after 0 of 4096 at 35433472000: Input/output error\n /dev/mapper/360060160128230001c60db85320fe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230001c60db85320fe311 was disabled\n /dev/mapper/360060160128230002460db85320fe311: read failed after 0 of 4096 at 35433414656: Input/output error\n /dev/mapper/360060160128230002460db85320fe311: read failed after 0 of 4096 at 35433472000: Input/output error\n /dev/mapper/360060160128230002460db85320fe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230002460db85320fe311 was disabled\n'; <rc> = 0

This was before the short filters patch was introduced - can you try to reproduce on is20?

Using is21 and following the above scenario, it works: the blocked SD is Inactive, the DC is still up and functioning, and the Master SD is up.

Ayal - if you think we should mark this as VERIFIED instead, let me know.
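As a reading aid for the engine log excerpts above: a host reports a domain as problematic when repoStats returns a non-zero code or a lastCheck above the configured threshold, and once every host in status UP has reported the domain problematic for the whole grace period (~5 minutes here) the engine deactivates it; a later healthy report flips it back to Active, which is the flapping described in this bug for the two unblocked XIO domains. The sketch below only illustrates that cycle with assumed threshold values; it is not the engine's actual Java code and the constant names are not real vdc_options keys.

```python
# Assumed values for illustration; the real thresholds live in vdc_options.
PROBLEM_GRACE_SECONDS = 300   # matches the ~5 minutes observed in the engine log
LAST_CHECK_THRESHOLD = 30     # seconds before a stale lastCheck counts as a problem

def domain_in_problem(repo_stat):
    # repo_stat is one entry of the repoStats response shown above.
    return (not repo_stat["valid"]
            or repo_stat["code"] != 0
            or float(repo_stat["lastCheck"]) > LAST_CHECK_THRESHOLD)

def next_domain_status(reports_from_up_hosts, seconds_in_problem):
    """reports_from_up_hosts: repoStats entries for this domain, one per UP host."""
    if all(domain_in_problem(r) for r in reports_from_up_hosts):
        if seconds_in_problem >= PROBLEM_GRACE_SECONDS:
            return "Inactive"   # DeactivateStorageDomainCommand in the log above
        return "Unknown"        # still within the grace period
    return "Active"             # any healthy report moves the domain back up
```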
Created attachment 791101 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Wrong behavior of Storage Domains during blocked connectivity to one of the Storage Servers.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS11 environment:
RHEVM: rhevm-3.3.0-0.16.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.11-1.el6ev.noarch
VDSM: vdsm-4.12.0-72.git287bb7e.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI Data Center (DC) with one host connected to 3 Storage Domains (SD):
   * On the first Storage Server (EMC) - one SD
   * On the second Storage Server (XIO) - two SDs
   See diagram below:
------------------------------------------------------------------------------------------
[V] Host_01 (SPM) ----- connected --------- SD_01 (EMC)
                                   |_______ SD_02 (XIO)
                                   |_______ SD_03 (XIO) - Master
------------------------------------------------------------------------------------------
2. From the host, block connectivity to SD_01 via iptables (see the sketch at the end of this report).

Actual results:
The SD statuses cycle through the following steps:
1. SD_01 - Unknown,  SD_02 - Unknown,  SD_03 - Unknown
2. SD_01 - Inactive, SD_02 - Inactive, SD_03 - Inactive
3. SD_01 - Inactive, SD_02 - UP,       SD_03 - UP
4. SD_01 - Inactive, SD_02 - Unknown,  SD_03 - Unknown
5. SD_01 - Inactive, SD_02 - Inactive, SD_03 - Inactive
6. Steps 3, 4, 5 then repeat indefinitely.

Expected results:
The blocked Storage Domain should become "Inactive".
The rest of the (unblocked) Storage Domains should be "UP".

Impact on user:
Cannot work with this DC.

Workaround:
None.

Additional info:
2013-Aug-27, 19:07 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 18:37 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: done).
2013-Aug-27, 18:31 Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 18:29 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 18:09 Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 18:07 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 17:37 Failed to Reconstruct Master Domain for Data Center DC-g-01.
2013-Aug-27, 17:36 Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 17:34 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 17:29 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 17:26 Storage Domain SD-g-03 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 17:26 Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 17:21 Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).

/var/log/ovirt-engine/engine.log
/var/log/vdsm/vdsm.log
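For completeness, here is a hedged sketch of step 2 in "Steps to Reproduce": dropping outgoing iSCSI traffic from the host to SD_01's storage server with iptables and restoring it afterwards. The portal address below is hypothetical and must be replaced with the real target IP; 3260 is the standard iSCSI port.

```python
import subprocess

STORAGE_PORTAL_IP = "10.35.160.10"  # hypothetical portal of the EMC server backing SD_01

def block_storage_server(ip=STORAGE_PORTAL_IP):
    # Drop all outgoing iSCSI traffic from this host to the storage server.
    subprocess.check_call(["iptables", "-I", "OUTPUT", "-d", ip,
                           "-p", "tcp", "--dport", "3260", "-j", "DROP"])

def unblock_storage_server(ip=STORAGE_PORTAL_IP):
    # Remove the rule again to restore connectivity.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", ip,
                           "-p", "tcp", "--dport", "3260", "-j", "DROP"])
```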