Bug 967604 - engine: AutoRecovery of host fails and host is set as NonOperational when export domain continues to be reported with error code 358
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Liron Aravot
QA Contact: Aharon Canan
Whiteboard: storage
Keywords: ZStream
Duplicates: 1008990, 1030136
Depends On:
Blocks: 1022352 1031634
 
Reported: 2013-05-27 10:45 EDT by Dafna Ron
Modified: 2016-02-10 15:26 EST
CC List: 16 users

See Also:
Fixed In Version: is18
Doc Type: Bug Fix
Doc Text:
Previously, when a host reported the ISO or Export domain as problematic during the InitVdsOnUp flow, it did not move to status Up, whereas hosts that were already Up and reported an ISO or Export domain as problematic remained Up and did not move to NonOperational. The behaviour of these two flows has been unified: a problematic ISO or Export domain no longer prevents a host from moving to Up.
Story Points: ---
Clone Of:
Clones: 1031634
Environment:
Last Closed: 2014-01-21 12:24:01 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amureini: Triaged+


Attachments
logs from Autorecovery (1.64 MB, application/x-gzip) - 2013-05-27 10:45 EDT, Dafna Ron
logs from failure (2.00 MB, application/x-gzip) - 2013-05-27 10:46 EDT, Dafna Ron


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 17986 None None None Never
oVirt gerrit 19527 None None None Never
Red Hat Product Errata RHSA-2014:0038 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.3.0 update 2014-01-21 17:03:06 EST

Description Dafna Ron 2013-05-27 10:45:41 EDT
Created attachment 753646 [details]
logs from Autorecovery

Description of problem:

I blocked connectivity to all my domains from the HSM host only, during an LSM (live storage migration).
When I restored the storage, the export domain was still reported with error code 358 and the host was set to NonOperational.

It takes about 10 minutes for AutoRecovery to bring the host back up.

Version-Release number of selected component (if applicable):

sf17.1

How reproducible:

Not sure; from what I can see it might be a cache issue on the NFS domain, so it does not happen every time.

Steps to Reproduce:
1. In a two-host cluster with iSCSI storage and an export domain, create and run a VM from a template (as thin copy).
2. Live-storage-migrate (LSM) the VM disk and block connectivity to all domains from the HSM host only, when the engine logs "SyncImageGroupDataVDSCommand" (see the iptables sketch after these steps).
3. When the VM pauses, destroy the VM and remove the iptables block from the HSM host.
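
For reference, a minimal sketch (an assumption, not part of the original report) of how the connectivity block in step 2 can be applied and lifted on the HSM host; the storage server address and helper names are hypothetical, and the commands must run as root:

import subprocess

STORAGE_SERVER = "10.35.0.1"  # hypothetical address of the NFS/iSCSI storage server

def block_storage():
    # Drop all outgoing traffic from this host to the storage server.
    subprocess.check_call(
        ["iptables", "-A", "OUTPUT", "-d", STORAGE_SERVER, "-j", "DROP"])

def unblock_storage():
    # Remove the blocking rule added above to restore connectivity.
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", STORAGE_SERVER, "-j", "DROP"])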

Actual results:

AutoRecovery tries to recover the host but keeps getting StorageDomainDoesNotExist ("Storage domain does not exist") for the export domain.
The engine sets the host as NonOperational even though only the SPM actually needs the export domain.
Since all the domains are located on the same storage server, none of the domains is blocked any longer, yet the NFS (export) domain is still reported as problematic.

Expected results:

The host should be activated.

Additional info:


vdsm:

Thread-25::ERROR::2013-05-27 17:25:51,505::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 72ec1321-a114-451f-bee1-6790cbca1bc6 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 117, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'72ec1321-a114-451f-bee1-6790cbca1bc6',)
Thread-24::DEBUG::2013-05-27 17:25:51,567::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/dev/38755249-4bb3-4841-bf5b-05f4a521514d/metadata bs=4096 count=1' (cwd None)
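
For context, a simplified reconstruction (based only on the traceback above, not the actual vdsm code) of what the findDomainPath lookup does: it searches for the domain UUID under the NFS mount tree, so a missing or stale mount after the iptables block is lifted keeps raising StorageDomainDoesNotExist until the mount and cache recover. The mount root constant below is an assumption.

import glob
import os

MNT_ROOT = "/rhev/data-center/mnt"  # assumed vdsm NFS mount root

class StorageDomainDoesNotExist(Exception):
    pass

def find_nfs_domain_path(sd_uuid):
    # Look for <mount point>/<sdUUID> under every NFS mount; if the export is
    # not mounted (or the mount went stale), nothing matches and the domain is
    # reported as missing even though the storage server is reachable again.
    matches = glob.glob(os.path.join(MNT_ROOT, "*", sd_uuid))
    if not matches:
        raise StorageDomainDoesNotExist(sd_uuid)
    return matches[0]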


engine: 

2013-05-27 17:26:01,100 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) Domain 72ec1321-a114-451f-bee1-6790cbca1bc6:New_Export was reported with error code 358
2013-05-27 17:26:01,100 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-48) One of the Storage Domains of host cougar01 in pool iSCSI is problematic
2013-05-27 17:26:01,100 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:26:01,101 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:26:01,122 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-4) [480e4b96] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 4497d431-7c5e-4924-96e0-3f9cdbf826e5 Type: VDS
2013-05-27 17:26:01,125 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-4) [480e4b96] START, SetVdsStatusVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 6ea9d5a


[root@cougar02 ~]# vdsClient -s 0 getStorageDomainInfo 72ec1321-a114-451f-bee1-6790cbca1bc6
	uuid = 72ec1321-a114-451f-bee1-6790cbca1bc6
	pool = ['7fd33b43-a9f4-4eb7-a885-e9583a929ceb']
	lver = -1
	version = 0
	role = Regular
	remotePath = orion.qa.lab.tlv.redhat.com:/export/Dafna/Dafna_New_Export_0_nfs_71122241851338
	spm_id = -1
	type = NFS
	class = Backup
	master_ver = 0
	name = New_Export
Comment 1 Dafna Ron 2013-05-27 10:46:54 EDT
Created attachment 753647 [details]
logs from failure

logs from the iptables block
Comment 2 Liron Aravot 2013-07-08 10:19:52 EDT
The vdsm logs from the problematic host at the time of the issue are missing - please add them so we can fully see what happens on the vdsm side.

Regardless, when a host that is already Up reports the Export/ISO domain as problematic while other hosts do not, it remains Up and does not move to NonOperational, whereas if we attempt to activate another host that does not see the domain, it moves to NonOperational - that behaviour might need to be unified.

* Please attach the full logs to confirm what happened here.
* Allon, this seems to be infra related (host initialization/domain failover) - let me know how we want to proceed with it.
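
To illustrate the unification discussed above, here is a minimal sketch of the intended policy (written in Python for brevity as an assumption; the actual ovirt-engine code is Java and the names here are hypothetical): only problematic data domains should affect host status, while ISO/Export domain errors are reported but neither block activation nor move the host to NonOperational.

DATA, ISO, EXPORT = "Data", "ISO", "Export"

def blocking_domains(problem_domains):
    # problem_domains: iterable of (domain_name, domain_type) reported as problematic.
    return [name for name, dom_type in problem_domains if dom_type == DATA]

def should_set_non_operational(problem_domains):
    # The host goes NonOperational only if at least one data domain is unreachable.
    return bool(blocking_domains(problem_domains))

With this policy, should_set_non_operational([("New_Export", EXPORT)]) returns False, matching the fixed behaviour where error 358 on the export domain no longer keeps the host from moving to Up.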
Comment 7 Ayal Baron 2013-09-23 03:49:25 EDT
*** Bug 1008990 has been marked as a duplicate of this bug. ***
Comment 8 vvyazmin@redhat.com 2013-10-14 11:07:43 EDT
Verified, tested on RHEVM 3.3 - IS18 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-27.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64
Comment 9 David Gibson 2013-11-13 22:19:43 EST
*** Bug 1030136 has been marked as a duplicate of this bug. ***
Comment 12 Charlie 2013-11-27 19:17:33 EST
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
Comment 13 errata-xmlrpc 2014-01-21 12:24:01 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html
