Bug 967749

Summary: [Backend] Host is put to non-operational when metadata are corrupted [NFS only]
Product: Red Hat Enterprise Virtualization Manager
Reporter: Jakub Libosvar <jlibosva>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Aharon Canan <acanan>
Severity: high
Priority: unspecified
Version: 3.2.0
CC: abaron, acanan, acathrow, amureini, eedri, iheim, jkt, laravot, lpeer, Rhev-m-bugs, scohen, yeylon
Keywords: Triaged
Target Release: 3.3.0
Flags: abaron: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: is24
Doc Type: Bug Fix
Last Closed: 2014-01-21 22:15:16 UTC
Type: Bug
oVirt Team: Storage
Attachments: backend vdsm logs

Description Jakub Libosvar 2013-05-28 08:50:03 UTC
Created attachment 753799 [details]
backend vdsm logs

Description of problem:
With two storage domains in an NFS data center, rewriting MASTER_VERSION in the metadata file of the master domain puts the single host in the cluster into the Non Operational state instead of triggering a reconstruct of the master on the second storage domain. The master storage domain holds the disk of a running VM.

Version-Release number of selected component (if applicable):
rhevm-backend-3.2.0-11.28.el6ev.noarch
vdsm-4.10.2-21.0.el6ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have an NFS DC with two storage domains
2. Have a disk of a running VM on the master SD
3. Change MASTER_VERSION in the metadata file of the master domain (a minimal sketch of this edit appears at the end of this description)

Actual results:
Host is put to non-operational

Expected results:
Second domain becomes master domain

Additional info:
Works OK on iSCSI.
Logs are attached.
The host cannot be activated until the metadata file of the master domain is fixed.
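
For illustration, a minimal sketch of the edit described in step 3, assuming the domain metadata is a plain key=value file under dom_md/metadata on the NFS export, sealed by a _SHA_CKSUM line (the path and seal layout here are assumptions for this sketch, not details taken from the report):

import hashlib

# Hypothetical location of the master domain's metadata file; the real path
# depends on the NFS server, export and storage domain UUID.
METADATA_PATH = "/rhev/data-center/mnt/<server:_export>/<sdUUID>/dom_md/metadata"

def bump_master_version(path):
    # Rewrite MASTER_VERSION by hand without recomputing the seal, which is
    # exactly the corruption this bug describes.
    with open(path) as f:
        lines = f.read().splitlines()
    edited = []
    for line in lines:
        if line.startswith("MASTER_VERSION="):
            line = "MASTER_VERSION=%d" % (int(line.split("=", 1)[1]) + 1)
        edited.append(line)
    # The _SHA_CKSUM line is deliberately left untouched, so the stored seal
    # no longer matches the content and vdsm will reject the metadata.
    with open(path, "w") as f:
        f.write("\n".join(edited) + "\n")

bump_master_version(METADATA_PATH)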

Comment 1 Haim 2013-05-29 06:58:17 UTC
Not sure it's a bug.
I think it's the "correct" behavior; there is no point in failing the action when a user wants to deactivate their domain.
The deactivation process is only relevant for the master domain, so if the master domain is fine, there should be no problem.

Comment 2 Liron Aravot 2013-07-08 12:09:16 UTC
It seems that upon host activation, the host fails to connect to the storage pool because of a checksum mismatch.

  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 940, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 987, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1179, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1526, in getMasterDomain
    domain = sdCache.produce(msdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 142, in __init__
    sdUUID = metadata[sd.DMDK_SDUUID]
  File "/usr/share/vdsm/storage/persistentDict.py", line 89, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 199, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 276, in refresh
    computedChecksum)
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = 8ee7ee56a6c995fdce68e3acf32a7823783c402c, computed_cksum = 6982f234d1825cc81a7a7c6d8226cd741b14a769'
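
For context, a minimal sketch of the kind of seal check that fails in the traceback above, assuming the metadata lines are protected by a SHA-1 digest stored under a _SHA_CKSUM key (the actual algorithm in vdsm's persistentDict.py may differ; this is only an illustration of why a hand-edited MASTER_VERSION produces a checksum mismatch):

import hashlib

class MetaDataSealIsBroken(Exception):
    pass

CKSUM_KEY = "_SHA_CKSUM"

def verify_seal(metadata_lines):
    # Split the stored seal from the payload lines it is supposed to cover.
    stored, payload = None, []
    for line in metadata_lines:
        if line.startswith(CKSUM_KEY + "="):
            stored = line.split("=", 1)[1]
        else:
            payload.append(line)
    # Recompute the digest over the payload and compare with the stored seal.
    computed = hashlib.sha1("".join(payload).encode("utf-8")).hexdigest()
    if stored != computed:
        raise MetaDataSealIsBroken(
            "cksum = %s, computed_cksum = %s" % (stored, computed))

Any metadata block whose MASTER_VERSION was changed after sealing would raise MetaDataSealIsBroken here, which matches the error reported by the host.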

This error is ignored in the engine and the flow continues as if the ConnectStoragePool operation hadn't failed. The engine then checks the repoStats of the domains on the host; since the host doesn't monitor any domains, it moves to Non Operational.

2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain 57f25104-911b-4507-bebd-4d75e4489b28 is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain d2bf9ac2-0c68-4884-816a-a2e4e430667c is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-44) One of the Storage Domains of host 10.34.63.135 in pool datacenter_storage_spm_negative is problematic
2013-05-27 17:16:50,657 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,658 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,696 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-8) [7cbb579f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: c97f34aa-80af-46f8-adfa-63a156da7395 Type: VDS
2013-05-27 17:16:50,697 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-8) [7cbb579f] START, SetVdsStatusVDSCommand(HostName = 10.34.63.135, HostId = c97f34aa-80af-46f8-adfa-63a156da7395, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 3413254e


IMO -
1. ConnectStoragePool should not be considered successful when there was an error; that would fix the scenario in this log. Other calling flows should be inspected carefully.

2. Regardless, the general fix in the engine should be that if the master domain is unknown/inactive, the host's domain monitoring data shouldn't be checked during the InitVdsOnUp flow. That would fix this scenario and others, and reduces the need for #1, at least for this flow (see the sketch below).
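
To make #2 concrete, a minimal sketch of the proposed guard. The engine code is Java; this is only an illustrative Python sketch of the decision, and the function and field names are hypothetical rather than the actual InitVdsOnUpCommand API:

# Hypothetical statuses; in the engine these come from the pool's view of the
# master storage domain.
MASTER_NOT_USABLE = ("Unknown", "Inactive")

def should_check_domain_monitoring(master_domain_status):
    # Suggestion #2: when the master domain itself is unknown or inactive,
    # skip the per-domain monitoring check during host activation so that
    # reconstruct-master can run instead of moving the host to Non Operational.
    return master_domain_status not in MASTER_NOT_USABLE

# Hypothetical usage in the InitVdsOnUp flow:
# if should_check_domain_monitoring(pool.master_domain_status):
#     fail_host_if_domains_not_seen(host)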

Comment 3 Ayal Baron 2013-07-10 09:45:39 UTC
Acking fixing #1 in comment 2. Regarding the 'general fix', we need to discuss it to understand the implications for other flows.

Comment 5 Ayal Baron 2013-08-05 11:20:47 UTC
*** Bug 987874 has been marked as a duplicate of this bug. ***

Comment 6 Liron Aravot 2013-10-07 08:49:16 UTC
Moving to MODIFIED, as it should be solved by #2 from https://bugzilla.redhat.com/show_bug.cgi?id=967749#c2
 
Implementing #1 should be done regardless.

Comment 9 Aharon Canan 2013-12-15 14:09:46 UTC
Following my discussion with Allon and Liron - 

Manually changing the metadata file causes a checksum problem.
The way to reproduce/verify is below -

1. Create an NFS DC with 1 host and 2 SDs
2. Put the host into maintenance
3. Block the master SD (via iptables, by removing the NFS share, ...)
4. Bring the host back up

The host becomes active and reconstruct master takes place.

Verified using 3.3 is27.

Comment 10 Itamar Heim 2014-01-21 22:15:16 UTC
Closing - RHEV 3.3 Released

Comment 11 Itamar Heim 2014-01-21 22:22:35 UTC
Closing - RHEV 3.3 Released