Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 967604

Summary: engine: AutoRecovery of host fails and host is set as NonOperational when export domain continues to be reported with error code 358
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engine Assignee: Liron Aravot <laravot>
Status: CLOSED ERRATA QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: abaron, acanan, acathrow, akotov, amureini, bazulay, dgibson, iheim, jkt, laravot, lpeer, lyarwood, Rhev-m-bugs, scohen, tnisan, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 3.3.0 Flags: amureini: Triaged+
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: is18 Doc Type: Bug Fix
Doc Text:
Previously, when a host reported the ISO or export domain as problematic during the InitVdsOnUp flow, the host did not move to status Up, whereas hosts that were already Up remained Up and did not move to NonOperational when they reported an ISO or export domain as problematic. The behaviour of these two flows has been unified: a host reporting a problem with an ISO or export domain is no longer prevented from moving to Up.
Story Points: ---
Clone Of:
Cloned As: 1031634 (view as bug list) Environment:
Last Closed: 2014-01-21 17:24:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1022352, 1031634    
Attachments:
Description Flags
logs from Autorecovery
none
logs from failure none

Description Dafna Ron 2013-05-27 14:45:41 UTC
Created attachment 753646 [details]
logs from Autorecovery

Description of problem:

I blocked connectivity to all of my domains from the HSM host only, during a live storage migration (LSM).
When I restored connectivity to the storage, the export domain was still reported with error code 358 and the host was set to NonOperational.

It takes about 10 minutes for AutoRecovery to bring the host back up.

Version-Release number of selected component (if applicable):

sf17.1

How reproducible:

Not sure; from what I see it might be a cache issue on the NFS domain, so it does not happen every time.

Steps to Reproduce:
1. In a two-host cluster with iSCSI storage and an export domain, create and run a VM from a template (as a thin copy).
2. Live-storage-migrate the VM disk, and block connectivity to all domains from the HSM host only when the engine logs "SyncImageGroupDataVDSCommand".
3. When the VM pauses, destroy the VM and remove the iptables block from the HSM host.
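
For steps 2-3, the connectivity block can be sketched as below. This is a hypothetical helper (not part of the reproduction logs); the storage server address is taken from the vdsClient output further down, and the actual rules would be applied as root on the HSM host:

```python
import subprocess

# Hypothetical: the NFS/iSCSI storage server serving all domains in this setup.
STORAGE_SERVER = "orion.qa.lab.tlv.redhat.com"


def iptables_cmd(action, server):
    """Build the iptables command that drops (or stops dropping) all
    outbound traffic to the storage server.

    action is "-A" to append the DROP rule (block connectivity) or
    "-D" to delete it (restore connectivity).
    """
    return ["iptables", action, "OUTPUT", "-d", server, "-j", "DROP"]


def block(server, dry_run=True):
    """Block connectivity to the storage server; returns the command used."""
    cmd = iptables_cmd("-A", server)
    if not dry_run:
        subprocess.check_call(cmd)  # requires root on the HSM host
    return " ".join(cmd)


def unblock(server, dry_run=True):
    """Remove the block; returns the command used."""
    cmd = iptables_cmd("-D", server)
    if not dry_run:
        subprocess.check_call(cmd)
    return " ".join(cmd)
```

With dry_run=True the helpers only report what would be run, which is enough to verify the rule before applying it on a live host.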

Actual results:

AutoRecovery tries to recover the host but keeps getting StorageDomainDoesNotExist: Storage domain does not exist for the export domain.
The engine sets the host to NonOperational even though only the SPM actually needs the export domain.
Since all the domains are located on the same storage server, none of the domains is blocked any longer, yet the NFS domain is still reported as problematic.

Expected results:

We should activate the host.

Additional info:


vdsm:

Thread-25::ERROR::2013-05-27 17:25:51,505::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 72ec1321-a114-451f-bee1-6790cbca1bc6 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 117, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'72ec1321-a114-451f-bee1-6790cbca1bc6',)
Thread-24::DEBUG::2013-05-27 17:25:51,567::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/dev/38755249-4bb3-4841-bf5b-05f4a521514d/metadata bs=4096 count=1' (cwd None)
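
The traceback above goes through vdsm's storage domain cache (sdc.py): produced domains are lazy proxies whose attribute access triggers a real lookup, so a domain that disappeared keeps raising StorageDomainDoesNotExist on every selftest() until the lookup succeeds again. A simplified, hypothetical sketch of that proxy pattern (not vdsm's actual code):

```python
class StorageDomainDoesNotExist(Exception):
    pass


class DomainProxy:
    """Lazy proxy: the real domain is resolved on every attribute access,
    mirroring sdc.py's __getattr__ -> getRealDomain -> _realProduce chain
    seen in the traceback."""

    def __init__(self, cache, sd_uuid):
        self._cache = cache
        self._sdUUID = sd_uuid

    def __getattr__(self, name):
        return getattr(self._cache._realProduce(self._sdUUID), name)


class DomainCache:
    def __init__(self, backend):
        # backend maps uuid -> real domain object; a missing entry stands in
        # for a domain whose NFS path cannot be found.
        self._backend = backend

    def produce(self, sd_uuid):
        return DomainProxy(self, sd_uuid)

    def _realProduce(self, sd_uuid):
        # While the lookup fails (e.g. the NFS mount is gone), every call
        # through the proxy keeps raising - even after connectivity is
        # restored, until a lookup actually succeeds again.
        domain = self._backend.get(sd_uuid)
        if domain is None:
            raise StorageDomainDoesNotExist(sd_uuid)
        return domain
```

Under this model, the reporter's "cache issue" suspicion corresponds to the window between restoring connectivity and the first successful lookup.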


engine: 

2013-05-27 17:26:01,100 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-48) Domain 72ec1321-a114-451f-bee1-6790cbca1bc6:New_Export was reported with error code 358
2013-05-27 17:26:01,100 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-48) One of the Storage Domains of host cougar01 in pool iSCSI is problematic
2013-05-27 17:26:01,100 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:26:01,101 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-48) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:26:01,122 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-4) [480e4b96] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 4497d431-7c5e-4924-96e0-3f9cdbf826e5 Type: VDS
2013-05-27 17:26:01,125 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-4) [480e4b96] START, SetVdsStatusVDSCommand(HostName = cougar01, HostId = 4497d431-7c5e-4924-96e0-3f9cdbf826e5, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 6ea9d5a


[root@cougar02 ~]# vdsClient -s 0 getStorageDomainInfo 72ec1321-a114-451f-bee1-6790cbca1bc6
	uuid = 72ec1321-a114-451f-bee1-6790cbca1bc6
	pool = ['7fd33b43-a9f4-4eb7-a885-e9583a929ceb']
	lver = -1
	version = 0
	role = Regular
	remotePath = orion.qa.lab.tlv.redhat.com:/export/Dafna/Dafna_New_Export_0_nfs_71122241851338
	spm_id = -1
	type = NFS
	class = Backup
	master_ver = 0
	name = New_Export

Comment 1 Dafna Ron 2013-05-27 14:46:54 UTC
Created attachment 753647 [details]
logs from failure

logs from the iptables block

Comment 2 Liron Aravot 2013-07-08 14:19:52 UTC
The vdsm logs from the problematic host at the time of the issue are missing - please add them so we can fully see what happens on the vdsm side.

Regardless, when a host that is already Up reports an export/ISO domain as problematic while other hosts don't, the host remains Up and doesn't move to NonOperational; whereas if we attempt to add another host that doesn't see the domain, it will move to NonOperational. That behaviour might need to be unified.
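
The unified rule being proposed can be sketched as a small decision function. This is a hypothetical illustration, not the engine's actual code: only data-domain problems make a host NonOperational, while ISO and export domains never block a host, whether it is coming up (InitVdsOnUp) or already running:

```python
def host_status_on_domain_problem(domain_type):
    """Decide host status when a storage domain is reported problematic.

    Hypothetical sketch of the unified behaviour: ISO/export domain
    problems never take a host down, regardless of which monitoring
    flow reported them; only data-domain problems do.
    """
    if domain_type in ("iso", "export"):
        return "Up"
    return "NonOperational"
```

Applied to this bug, the host reporting error 358 on the export domain would stay Up instead of being set NonOperational.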

*Please attach the full logs to confirm what has happened here.
*Allon, it seems like this is infra related (host initialization/domain failover) - let me know how we want to proceed with it.

Comment 7 Ayal Baron 2013-09-23 07:49:25 UTC
*** Bug 1008990 has been marked as a duplicate of this bug. ***

Comment 8 vvyazmin@redhat.com 2013-10-14 15:07:43 UTC
Verified, tested on RHEVM 3.3 - IS18 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-27.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

Comment 9 David Gibson 2013-11-14 03:19:43 UTC
*** Bug 1030136 has been marked as a duplicate of this bug. ***

Comment 12 Charlie 2013-11-28 00:17:33 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 13 errata-xmlrpc 2014-01-21 17:24:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html