Red Hat Bugzilla – Bug 967604
engine: AutoRecovery of host fails and host is set as NonOperational when export domain continues to be reported with error code 358
Last modified: 2016-02-10 15:26:37 EST
Created attachment 753646 [details]
logs from AutoRecovery

Description of problem:
I blocked connectivity to all my domains from the HSM host only, during a live storage migration (LSM). When I restored connectivity, the export domain was still reported with error code 358 and the host was set as NonOperational. It takes about 10 minutes for AutoRecovery to start the host.

Version-Release number of selected component (if applicable):
sf17.1

How reproducible:
Not sure. From what I see it might be a cache issue on the NFS domain, so it does not happen every time.

Steps to Reproduce:
1. In a two-host cluster with iSCSI storage and an export domain, create and run a VM from a template (as thin copy).
2. LSM the VM disk, and block connectivity to all domains from the HSM host only when the engine logs "SyncImageGroupDataVDSCommand".
3. When the VM pauses, destroy the VM and remove the iptables block from the HSM host.

Actual results:
AutoRecovery tries to recover the host but keeps getting "StorageDomainDoesNotExist: Storage domain does not exist" for the export domain. The engine sets the host as NonOperational even though only the SPM actually needs the export domain. Since all the domains are located on the same storage server, none of them is blocked any more, yet the NFS domain is still reported as problematic.

Expected results:
We should activate the host.

Additional info:

vdsm:

Thread-25::ERROR::2013-05-27 17:25:51,505::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 72ec1321-a114-451f-bee1-6790cbca1bc6 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 117, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'72ec1321-a114-451f-bee1-6790cbca1bc6',)

Thread-24::DEBUG::2013-05-27 17:25:51,567::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/dev/38755249-4bb3-4841-bf5b-05f4a521514d/metadata bs=4096 count=1' (cwd None)

engine:

2013-05-27 17:26:01,100 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) Domain 72ec1321-a114-451f-bee1-6790cbca1bc6:New_Export was reported with error code 358
2013-05-27 17:26:01,100 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-48) One of the Storage Domains of host cougar01 in pool iSCSI is problematic
2013-05-27 17:26:01,100 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:26:01,101 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:26:01,122 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-4) [480e4b96] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 4497d431-7c5e-4924-96e0-3f9cdbf826e5 Type: VDS
2013-05-27 17:26:01,125 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-4) [480e4b96] START, SetVdsStatusVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 6ea9d5a

[root@cougar02 ~]# vdsClient -s 0 getStorageDomainInfo 72ec1321-a114-451f-bee1-6790cbca1bc6
    uuid = 72ec1321-a114-451f-bee1-6790cbca1bc6
    pool = ['7fd33b43-a9f4-4eb7-a885-e9583a929ceb']
    lver = -1
    version = 0
    role = Regular
    remotePath = orion.qa.lab.tlv.redhat.com:/export/Dafna/Dafna_New_Export_0_nfs_71122241851338
    spm_id = -1
    type = NFS
    class = Backup
    master_ver = 0
    name = New_Export
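For context on the traceback: the domain monitor goes through a lazy caching proxy that re-resolves the real NFS domain on every access, and if that resolution keeps failing (for example because the NFS mount did not come back after the iptables block was removed), every monitoring cycle raises StorageDomainDoesNotExist even though the storage server itself is reachable again. The following is only a minimal illustrative sketch of that lookup-through-cache pattern, with hypothetical class and function names; it is not the actual vdsm sdc/nfsSD code.

# Illustrative sketch only -- hypothetical names, not the real vdsm code.
import glob
import os


class StorageDomainDoesNotExist(Exception):
    """Raised when no domain directory matches the requested UUID."""


def find_nfs_domain_path(sd_uuid, mount_root="/rhev/data-center/mnt"):
    # Scan the NFS mount root for a directory named after the domain UUID.
    # If the mount never came back after connectivity was restored, nothing
    # matches and the lookup fails, as in the traceback above.
    matches = glob.glob(os.path.join(mount_root, "*", sd_uuid))
    if not matches:
        raise StorageDomainDoesNotExist(sd_uuid)
    return matches[0]


class CachedDomainProxy(object):
    """Lazy proxy: every attribute access re-resolves the real domain."""

    def __init__(self, sd_uuid):
        self._sd_uuid = sd_uuid

    def __getattr__(self, name):
        # Re-resolving on each access means a missing/stale mount keeps
        # surfacing as StorageDomainDoesNotExist on every monitor cycle,
        # so the domain is repeatedly reported as problematic.
        real_path = find_nfs_domain_path(self._sd_uuid)
        return getattr(RealNfsDomain(real_path), name)


class RealNfsDomain(object):
    def __init__(self, path):
        self.path = path

    def selftest(self):
        # Minimal health check: the domain metadata file must be reachable.
        os.stat(os.path.join(self.path, "dom_md", "metadata"))


if __name__ == "__main__":
    domain = CachedDomainProxy("72ec1321-a114-451f-bee1-6790cbca1bc6")
    try:
        domain.selftest()
    except StorageDomainDoesNotExist as exc:
        print("monitor would report the domain as problematic: %s" % exc)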
Created attachment 753647 [details]
logs from failure

logs from the iptables block
The vdsm logs from the problematic host at the time of the issue are missing - please add them so we can fully see what happens on the vdsm side. Regardless, when a host reports the EXPORT/ISO domain as problematic while the other hosts don't, the host remains UP and doesn't move to NonOperational; but if we attempt to add another host that doesn't see the domain, it will move to NonOperational. That behaviour might need to be unified (a small sketch of the two paths follows below).

* Please attach the full logs to confirm what has happened here.
* Allon, this seems infra related (host initialization/domain failover) - let me know how you want to proceed with it.
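To make the inconsistency concrete, here is a purely illustrative sketch of the two decision paths described above, written in Python with hypothetical names (the real logic lives in the engine's Java code, e.g. InitVdsOnUpCommand): the monitoring path for a host that is already UP ignores export/ISO domain problems, while the initialization path for a host coming up treats any problematic domain, the export domain included, as a reason to set it NonOperational.

# Illustrative sketch only -- hypothetical names, not the actual ovirt-engine code.

DATA = "data"
EXPORT = "export"
ISO = "iso"


def problematic_domains(host):
    """Domains this host currently reports with a monitoring error."""
    return [d for d in host["domains"] if d["error_code"] is not None]


def stays_up_when_already_up(host):
    # Monitoring path: export/ISO problems alone do not take a running host down.
    blocking = [d for d in problematic_domains(host) if d["type"] == DATA]
    return len(blocking) == 0


def becomes_operational_on_init(host):
    # Initialization path: any problematic domain (export included) blocks the
    # host from becoming operational -- this is the behaviour hit in this bug.
    return len(problematic_domains(host)) == 0


if __name__ == "__main__":
    host = {
        "name": "cougar01",
        "domains": [
            {"name": "iscsi_data", "type": DATA, "error_code": None},
            {"name": "New_Export", "type": EXPORT, "error_code": 358},
        ],
    }
    print("already-UP host stays up:", stays_up_when_already_up(host))          # True
    print("initializing host goes operational:", becomes_operational_on_init(host))  # False

Unifying the two paths would mean the initialization check also ignores non-data domains, which matches the reporter's expectation that only the SPM really needs the export domain.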
*** Bug 1008990 has been marked as a duplicate of this bug. ***
Verified, tested on RHEVM 3.3 - IS18 environment:

Host OS: RHEL 6.5
RHEVM: rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM: vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-27.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64
*** Bug 1030136 has been marked as a duplicate of this bug. ***
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:
https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html