Bug 1001750

Summary: Wrong behavior of Storage Domains when connectivity to one of the Storage Servers is blocked
Product: Red Hat Enterprise Virtualization Manager
Reporter: vvyazmin <vvyazmin>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED WORKSFORME
QA Contact: Aharon Canan <acanan>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: abaron, acanan, acathrow, amureini, fsimonce, hateya, iheim, laravot, lpeer, Rhev-m-bugs, scohen, tnisan, yeylon
Target Milestone: ---
Keywords: Regression, Reopened, Triaged
Target Release: 3.3.0
Flags: abaron: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-05 16:33:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (flags: none)
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (flags: none)

Description vvyazmin@redhat.com 2013-08-27 16:18:21 UTC
Created attachment 791101 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Storage Domains behave incorrectly when connectivity to one of the Storage Servers is blocked.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS11 environment:

RHEVM:  rhevm-3.3.0-0.16.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.11-1.el6ev.noarch
VDSM:  vdsm-4.12.0-72.git287bb7e.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI Data Center (DC) with one Host connected to 3 Storage Domains (SDs):
* On the first Storage Server (EMC): one SD
* On the second Storage Server (XIO): two SDs
See the diagram below:
------------------------------------------------------------------------------------------
[V] Host_01 (SPM) _____ connected _________ SD_01 (EMC)
                                    |_______ SD_02 (XIO)
                                    |_______ SD_03 (XIO) - Master
------------------------------------------------------------------------------------------
2. From the Host, block connectivity to SD_01's storage server via iptables (a hedged example is sketched below).
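
For illustration, a minimal sketch of step 2, assuming a hypothetical portal address (the actual storage server address is not recorded in this report) and the default iSCSI port 3260:

# Hedged sketch only: block/unblock iSCSI traffic from the host to the
# first storage server. EMC_PORTAL is hypothetical. Run as root on the host.
import subprocess

EMC_PORTAL = "10.35.160.10"  # hypothetical address of SD_01's storage server

def block_storage_server(portal=EMC_PORTAL):
    """Drop outgoing iSCSI packets to the storage server."""
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", portal,
                           "-p", "tcp", "--dport", "3260", "-j", "DROP"])

def unblock_storage_server(portal=EMC_PORTAL):
    """Remove the blocking rule added by block_storage_server()."""
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", portal,
                           "-p", "tcp", "--dport", "3260", "-j", "DROP"])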

Actual results:
The SDs go through the following status snapshots ("snap steps"):
1. 
SD_01 - Unknown
SD_02 - Unknown
SD_03 - Unknown
2. 
SD_01 - Inactive
SD_02 - Inactive
SD_03 - Inactive
3. 
SD_01 - Inactive
SD_02 - UP
SD_03 - UP
4. 
SD_01 - Inactive
SD_02 - Unknown
SD_03 - Unknown
5.
SD_01 - Inactive
SD_02 - Inactive
SD_03 - Inactive

6. Steps 3, 4, 5 then repeat indefinitely.

Expected results:
The blocked Storage Domain should become "Inactive".
The remaining (unblocked) Storage Domains should stay "UP".
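
For illustration, a rough polling sketch for verifying the expected end state; get_status is a hypothetical hook (in real use it would query the engine via the REST API or Python SDK):

# Hedged sketch: poll SD statuses until they match the expected end state.
import time

EXPECTED = {"SD_01": "Inactive", "SD_02": "UP", "SD_03": "UP"}

def wait_for_expected(get_status, timeout=600, interval=10):
    """Poll until all SDs reach EXPECTED or the timeout expires.
    get_status(sd_name) -> status string is supplied by the caller."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        snapshot = {sd: get_status(sd) for sd in EXPECTED}
        print(snapshot)            # each printed line is one "snap step"
        if snapshot == EXPECTED:
            return True
        time.sleep(interval)
    return False                   # statuses never settled (bug reproduced)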

Impact on user:
The DC cannot be used while this occurs.

Workaround:
none

Additional info:

2013-Aug-27, 19:07 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 18:37 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: done).
2013-Aug-27, 18:31 - Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 18:29 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 18:09 - Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 18:07 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 17:37 - Failed to Reconstruct Master Domain for Data Center DC-g-01.
2013-Aug-27, 17:36 - Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 17:34 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 17:29 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
2013-Aug-27, 17:26 - Storage Domain SD-g-03 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 17:26 - Storage Domain SD-g-02 (Data Center DC-g-01) was deactivated by system because it's not visible by any of the hosts.
2013-Aug-27, 17:21 - Invalid status on Data Center DC-g-01. Setting Data Center status to Non Responsive (On host tigris01.scl.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
/var/log/ovirt-engine/engine.log

/var/log/vdsm/vdsm.log

Comment 1 vvyazmin@redhat.com 2013-08-27 16:20:58 UTC
IMHO, the root cause of this behavior is the effect that the Reconstruct Storage Domain feature has on the whole environment.

Comment 3 Liron Aravot 2013-08-28 12:57:50 UTC
vvyazmin, please provide more info here so it'll be easier for people to follow when inspecting the logs:
1. Which of the domains was the master when the scenario started?
2. vdsm logs are missing (the attached ones only go up to 17:13 for tigris01).
3. If you are going to reproduce, please do it with a single DC; several DCs add a lot of noise to the logs.

Comment 4 Allon Mureinik 2013-08-29 07:32:43 UTC
*** Bug 1002049 has been marked as a duplicate of this bug. ***

Comment 6 Tal Nisan 2013-09-15 10:08:18 UTC
vvyazmin, what about the additional logs Liron asked for?

Comment 7 vvyazmin@redhat.com 2013-10-02 10:57:11 UTC
[V] Host_01 (SPM) __ connected ____________ SD_01 (EMC)
                                   |_______ SD_02 (XIO)
                                   |_______ SD_03 (XIO) - Master

SD_03 - Master

Blocked connectivity via iptables on 2-Oct-2013 at 13:28.

Logs Attached

Comment 8 vvyazmin@redhat.com 2013-10-02 10:57:56 UTC
Created attachment 806377 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Comment 9 vvyazmin@redhat.com 2013-10-02 11:00:26 UTC
Failed; tested on RHEVM 3.3 - IS16 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.22.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.14-1.el6ev.noarch
VDSM:  vdsm-4.12.0-156.git6e499d6.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-27.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.407.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

Comment 10 Liron Aravot 2013-10-06 08:53:30 UTC
The blocked domain moves to Inactive as expected after 5 minutes:
2013-10-02 13:31:00,325 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-50) starting ProcessDomainRecovery for domain c8212f07-ef0a-4e09-b69e-3339cab7554a:SD-iSCSI-A-XIO
2013-10-02 13:31:00,988 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-50) Domain c8212f07-ef0a-4e09-b69e-3339cab7554a:SD-iSCSI-A-XIO was reported by all hosts in status UP as problematic. Moving the domain to NonOperational.
2013-10-02 13:31:01,006 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (pool-5-thread-50) Lock Acquired to object EngineLock [exclusiveLocks= key: c8212f07-ef0a-4e09-b69e-3339cab7554a value: STORAGE
, sharedLocks= key: 1fc64d56-bead-4c29-8cee-a8cedaf581a7 value: POOL
]
2013-10-02 13:31:01,056 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (pool-5-thread-50) Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: c8212f07-ef0a-4e09-b69e-3339cab7554a Type: Storage
2013-10-02 13:31:01,132 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (pool-5-thread-50) Lock freed to object EngineLock [exclusiveLocks= key: c8212f07-ef0a-4e09-b69e-3339cab7554a value: STORAGE
, sharedLocks= key: 1fc64d56-bead-4c29-8cee-a8cedaf581a7 value: POOL
]

----------------------------------------------------------------------------
Later on, GetDeviceList starts to fail with a TimeoutException (vdsm side):
--------------------------------------------------------------
Thread-74566::DEBUG::2013-10-02 11:28:26,521::BindingXMLRPC::177::vds::(wrapper) client [10.35.161.29]
Thread-74566::DEBUG::2013-10-02 11:28:26,522::task::579::TaskManager.Task::(_updateState) Task=`60fbbf8c-6eb8-4988-8062-d28c2d6ca4ff`::moving from state init -> state preparing
Thread-74566::INFO::2013-10-02 11:28:26,523::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=3, options={})
Thread-74566::DEBUG::2013-10-02 11:28:56,514::lvm::509::OperationMutex::(_invalidateAllPvs) Operation 'lvm reload operation' is holding the operation mutex, waiting...


* we now wait here for 5 minutes, which obviously causes the timeout

Thread-74566::DEBUG::2013-10-02 11:33:58,992::lvm::509::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
--------------------------------------------------------------------------------


The 2 other domains are reported as problematic because their monitors wait for the refresh to be executed (the monitoring threads hang); a simplified sketch of this pattern follows the log excerpt below.

* vdsm attempts to refresh the domains at:
Thread-74025::DEBUG::2013-10-02 11:34:08,901::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain c486bf3b-d60a-45a0-b635-835b9825865a
Thread-74024::DEBUG::2013-10-02 11:34:08,902::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 98401f2a-29a2-4016-9323-424dca956e6d
Thread-74025::DEBUG::2013-10-02 11:34:08,904::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-74024::DEBUG::2013-10-02 11:34:08,905::misc::809::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)

-- 10 minutes passed ---
Thread-74025::DEBUG::2013-10-02 11:44:27,698::misc::817::SamplingMethod::(__call__) Some one got in for me
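
The "Some one got in for me" line comes from vdsm's sampling-method wrapper around storage.sdc.refreshStorage. Roughly (a simplified model, not vdsm's actual code): a caller that arrives while a sampling call is in flight blocks until it finishes and then skips re-running it, so every domain monitor ends up waiting behind the one slow refresh:

import threading

class SamplingMethod:
    """Simplified model of the sampling-method behaviour (an assumption,
    not the real vdsm implementation)."""

    def __init__(self, func):
        self._func = func
        self._cond = threading.Condition()
        self._running = False
        self._generation = 0

    def __call__(self, *args, **kwargs):
        with self._cond:
            entered_at = self._generation
            while self._running:
                self._cond.wait()          # "Trying to enter sampling method"
            if self._generation != entered_at:
                return                     # "Some one got in for me" -> skip
            self._running = True           # got in, run the real sampling
        try:
            return self._func(*args, **kwargs)
        finally:
            with self._cond:
                self._running = False
                self._generation += 1
                self._cond.notify_all()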


Looking at the vdsm repoStats reports during the wait, we see the following:
Thread-74578::INFO::2013-10-02 11:37:27,168::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'98401f2a-29a2-4016-9323-424dca956e6d': {'delay': '0.000539994', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c486bf3b-d60a-45a0-b635-835b9825865a': {'delay': '0.000534723', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c8212f07-ef0a-4e09-b69e-3339cab7554a': {'delay': '0', 'lastCheck': '210.0', 'code': 2001, 'valid': False, 'version': -1}}
Thread-74578::DEBUG::2013-10-02 11:37:27,169::task::1168::TaskManager.Task::(prepare) Task=`2e2f1b4f-a619-42ce-940c-fe132a49dbb4`::finished: {u'98401f2a-29a2-4016-9323-424dca956e6d': {'delay': '0.000539994', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c486bf3b-d60a-45a0-b635-835b9825865a': {'delay': '0.000534723', 'lastCheck': '208.3', 'code': 0, 'valid': True, 'version': 3}, u'c8212f07-ef0a-4e09-b69e-3339cab7554a': {'delay': '0', 'lastCheck': '210.0', 'code': 2001, 'valid': False, 'version': 

and in the engine the domains are considered problematic and eventually change status to Inactive, as the lastCheck value is greater than the limit defined in vdc_options:
---------------------------------------------
2013-10-02 13:37:24,280 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-49) domain 98401f2a-29a2-4016-9323-424dca956e6d:SD-iSCSI-C-XIO in problem. vds: tigris01.scl.lab.tlv.redhat.com
2013-10-02 13:37:24,282 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-5-thread-49) domain c486bf3b-d60a-45a0-b635-835b9825865a:SD-iSCSI-B-XIO in problem. vds: tigris01.scl.lab.tlv.redhat.com
---------------------------------------------
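
Roughly, the per-domain check on the engine side boils down to something like the sketch below; the threshold name and value are assumptions (the real limit is read from vdc_options):

# Hedged sketch of how a repoStats entry may be classified as problematic.
LAST_CHECK_THRESHOLD_SEC = 30.0  # assumed value; the engine reads it from vdc_options

def domain_in_problem(stat):
    """A domain is problematic if its monitor failed or its last check is stale."""
    return (not stat.get('valid', False)
            or stat.get('code', 0) != 0
            or float(stat.get('lastCheck', 0)) > LAST_CHECK_THRESHOLD_SEC)

# With the values reported above, a lastCheck of 208.3 seconds alone is enough
# to flag the two XIO domains even though code == 0 and valid == True:
xio_stat = {'delay': '0.000539994', 'lastCheck': '208.3', 'code': 0, 'valid': True}
print(domain_in_problem(xio_stat))  # True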



After the 10 minutes the domains are reported as active and move back to Active status:

2013-10-02 13:44:40,140 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-22) Storage Domain c486bf3b-d60a-45a0-b635-835b9825865a:SD-iSCSI-B-XIO was reported by Host tigris01.scl.lab.tlv.redhat.com as Active in Pool 1fc64d56-bead-4c29-8cee-a8cedaf581a7, moving to active status
2013-10-02 13:44:40,148 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-22) Storage Domain 98401f2a-29a2-4016-9323-424dca956e6d:SD-iSCSI-C-XIO was reported by Host tigris01.scl.lab.tlv.redhat.com as Active in Pool 1fc64d56-bead-4c29-8cee-a8cedaf581a7, moving to active status

but the scenario seems to repeat itself.

This seems like a vdsm issue to me. Fede, can you take a look please? Can we somehow avoid that scenario?

Comment 13 Federico Simoncelli 2013-10-08 14:50:59 UTC
We have to investigate what's causing _reloadvgs to be so slow (~5 minutes execution time):

Thread-74024::DEBUG::2013-10-02 11:28:42,148::lvm::385::OperationMutex::(_reloadvgs) Got the operational mutex
Thread-74024::DEBUG::2013-10-02 11:28:42,150::lvm::308::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/3514f0c54d5c00010|/dev/mapper/3514f0c54d5c00011|/dev/mapper/3514f0c54d5c00012|/dev/mapper/3514f0c54d5c00013|/dev/mapper/3514f0c54d5c001fc|/dev/mapper/3514f0c54d5c001fd|/dev/mapper/3514f0c54d5c001fe|/dev/mapper/3514f0c54d5c001ff|/dev/mapper/3514f0c54d5c00928|/dev/mapper/3514f0c54d5c0092a|/dev/mapper/3514f0c54d5c00937|/dev/mapper/3514f0c54d5c00938|/dev/mapper/3514f0c54d5c00939|/dev/mapper/3514f0c54d5c0093a|/dev/mapper/3514f0c54d5c00b0d|/dev/mapper/3514f0c54d5c00b0e|/dev/mapper/3514f0c54d5c00b0f|/dev/mapper/3514f0c54d5c00c01|/dev/mapper/3514f0c54d5c00c02|/dev/mapper/3514f0c54d5c00c03|/dev/mapper/3514f0c54d5c00c04|/dev/mapper/3514f0c54d5c00c05|/dev/mapper/3514f0c54d5c01679|/dev/mapper/3514f0c54d5c01866|/dev/mapper/3514f0c54d5c01875|/dev/mapper/3514f0c54d5c01876|/dev/mapper/3514f0c54d5c01877|/dev/mapper/3514f0c54d5c01878|/dev/mapper/3514f0c54d5c01879|/dev/mapper/3600508b1001c19126249676fbb7165e5|/dev/mapper/360060160128230001a60db85320fe311|/dev/mapper/360060160128230001c60db85320fe311|/dev/mapper/360060160128230001e60db85320fe311|/dev/mapper/360060160128230002060db85320fe311|/dev/mapper/360060160128230002260db85320fe311|/dev/mapper/360060160128230002460db85320fe311|/dev/mapper/360060160128230002660db85320fe311|/dev/mapper/360060160128230008e76a68d5d2ae311|/dev/mapper/360060160128230009076a68d5d2ae311|/dev/mapper/360060160128230009276a68d5d2ae311|/dev/mapper/360060160128230009476a68d5d2ae311|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 98401f2a-29a2-4016-9323-424dca956e6d' (cwd None)

...

Thread-74024::DEBUG::2013-10-02 11:33:58,794::lvm::308::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  /dev/mapper/360060160128230002060db85320fe311: read failed after 0 of 4096 at 35433414656: Input/output error\n  /dev/mapper/360060160128230002060db85320fe311: read failed after 0 of 4096 at 35433472000: Input/output error\n  /dev/mapper/360060160128230002060db85320fe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230002060db85320fe311 was disabled\n  /dev/mapper/360060160128230008e76a68d5d2ae311: read failed after 0 of 4096 at 91267989504: Input/output error\n  /dev/mapper/360060160128230008e76a68d5d2ae311: read failed after 0 of 4096 at 91268046848: Input/output error\n  /dev/mapper/360060160128230008e76a68d5d2ae311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230008e76a68d5d2ae311 was disabled\n  /dev/mapper/360060160128230009276a68d5d2ae311: read failed after 0 of 4096 at 91267989504: Input/output error\n  /dev/mapper/360060160128230009276a68d5d2ae311: read failed after 0 of 4096 at 91268046848: Input/output error\n  /dev/mapper/360060160128230009276a68d5d2ae311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230009276a68d5d2ae311 was disabled\n  /dev/mapper/360060160128230001c60db85320fe311: read failed after 0 of 4096 at 35433414656: Input/output error\n  /dev/mapper/360060160128230001c60db85320fe311: read failed after 0 of 4096 at 35433472000: Input/output error\n  /dev/mapper/360060160128230001c60db85320fe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230001c60db85320fe311 was disabled\n  /dev/mapper/360060160128230002460db85320fe311: read failed after 0 of 4096 at 35433414656: Input/output error\n  /dev/mapper/360060160128230002460db85320fe311: read failed after 0 of 4096 at 35433472000: Input/output error\n  /dev/mapper/360060160128230002460db85320fe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160128230002460db85320fe311 was disabled\n'; <rc> = 0

Comment 14 Ayal Baron 2013-10-28 13:43:14 UTC
This was before the short-filters patch was introduced; can you try to reproduce on is20?
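
For context, the short-filters change refers, roughly, to building the LVM device filter from only the PVs of the VG being queried instead of every multipath device known to the host. A toy illustration with hypothetical WWIDs (not vdsm's code):

# Hedged illustration of the long vs. short LVM filter idea.
def build_lvm_filter(wwids):
    """Build the filter clause passed to lvm via --config."""
    accept = "|".join("/dev/mapper/%s" % w for w in wwids)
    return "filter = [ 'a|%s|', 'r|.*|' ]" % accept

all_host_devices = ["3514f0c54d5c00010", "3514f0c54d5c00011",
                    "360060160128230002060db85320fe311"]    # ...and dozens more
vg_pvs_only = ["3514f0c54d5c00010"]                          # only the queried VG's PVs

print(build_lvm_filter(all_host_devices))  # long filter: lvm opens every device,
                                           # including the unreachable blocked ones
print(build_lvm_filter(vg_pvs_only))       # short filter: only relevant PVs are scanned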

Comment 17 Aharon Canan 2013-11-05 16:33:26 UTC
Using is21, following the above scenario, it works:
blocked SD is Inactive
DC is still up and functioning
Master SD is up

Ayal - if you think we should mark as verified instead, let me know.