Bug 967749 - [Backend] Host is put to non-operational when metadata are corrupted [NFS only]
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
x86_64 Linux
Priority: unspecified  Severity: high
: ---
: 3.3.0
Assigned To: Liron Aravot
Aharon Canan
: Triaged
: 987874
Depends On:
Reported: 2013-05-28 04:50 EDT by Jakub Libosvar
Modified: 2016-02-10 11:38 EST
12 users

See Also:
Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-01-21 17:15:16 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
abaron: Triaged+

Attachments
backend vdsm logs (366.08 KB, application/gzip)
2013-05-28 04:50 EDT, Jakub Libosvar

External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 18165 None None None Never
oVirt gerrit 21368 None None None Never

Description Jakub Libosvar 2013-05-28 04:50:03 EDT
Created attachment 753799
backend vdsm logs

Description of problem:
When I have two storage domains in an NFS data center and rewrite MASTER_VERSION in the metadata file, the single host in the cluster is put into the Non-Operational state instead of a reconstruct master taking place on the second storage domain. The master storage domain holds a disk of a running VM.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Have an NFS DC with two storage domains
2. Have a disk of a running VM on the master SD
3. Change MASTER_VERSION in the metadata file of the master domain

Actual results:
Host is put to non-operational

Expected results:
The second domain becomes the master domain

Additional info:
Works OK on iSCSI.
Logs are attached.
The host cannot be activated until the metadata file of the master domain is fixed (see the sketch below for why a hand edit breaks the metadata seal).
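
For context, here is a minimal illustrative sketch of why a hand edit of MASTER_VERSION breaks the metadata seal. It assumes the NFS domain metadata under dom_md/metadata is a set of KEY=VALUE lines sealed by a _SHA_CKSUM line holding a SHA-1 digest over the other lines (the 40-hex-digit digests in the MetaDataSealIsBroken error in comment 2 look like SHA-1); the exact concatenation rules in VDSM may differ, so this is not VDSM code.

import hashlib

def compute_seal(lines):
    """SHA-1 hex digest over every non-checksum metadata line (simplified)."""
    digest = hashlib.sha1()
    for line in lines:
        if not line.startswith("_SHA_CKSUM="):
            digest.update(line.encode())
    return digest.hexdigest()

metadata = [
    "CLASS=Data",
    "MASTER_VERSION=1",   # the field edited in this reproduction
    "ROLE=Master",
    "VERSION=3",
]
stored_seal = compute_seal(metadata)

# Editing MASTER_VERSION by hand without recomputing the seal makes the
# stored cksum and the computed_cksum diverge -- the MetaDataSealIsBroken
# condition seen on the host.
metadata[1] = "MASTER_VERSION=2"
assert compute_seal(metadata) != stored_seal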
Comment 1 Haim 2013-05-29 02:58:17 EDT
Not sure it's a bug.
I think it's the "correct" behavior: there is no point in failing the action when the user wants to deactivate their domain.
The deactivation process is only relevant for the master domain, so if the master domain is fine, there should be no problem.
Comment 2 Liron Aravot 2013-07-08 08:09:16 EDT
It seems that upon host activation, the host fails to connect to the storage pool because of a checksum mismatch.

  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 940, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 987, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1179, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1526, in getMasterDomain
    domain = sdCache.produce(msdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 142, in __init__
    sdUUID = metadata[sd.DMDK_SDUUID]
  File "/usr/share/vdsm/storage/persistentDict.py", line 89, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 199, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
  File "/usr/share/vdsm/storage/persistentDict.py", line 276, in refresh
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = 8ee7ee56a6c995fdce68e3acf32a7823783c402c, computed_cksum = 6982f234d1825cc81a7a7c6d8226cd741b14a769'

This error is ignored in the engine and the flow continues, since the ConnectStoragePool operation doesn't fail. The repoStats of the domains are then checked on the host; because the host doesn't monitor any domains, it moves to Non-Operational (an illustrative sketch of this decision follows the log lines below).

2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain 57f25104-911b-4507-bebd-4d75e4489b28 is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain d2bf9ac2-0c68-4884-816a-a2e4e430667c is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-44) One of the Storage Domains of host in pool datacenter_storage_spm_negative is problematic
2013-05-27 17:16:50,657 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,658 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,696 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-8) [7cbb579f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: c97f34aa-80af-46f8-adfa-63a156da7395 Type: VDS
2013-05-27 17:16:50,697 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-8) [7cbb579f] START, SetVdsStatusVDSCommand(HostName =, HostId = c97f34aa-80af-46f8-adfa-63a156da7395, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 3413254e
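
An illustrative sketch of that decision (hypothetical names, not the actual engine code): because the host reports repoStats for no domains, every domain of the pool looks unseen and the host is moved to Non-Operational.

def init_vds_on_up(pool_domains, repo_stats):
    # repo_stats maps monitored domain UUIDs to their stats; with a broken
    # master domain the host monitors nothing, so the dict is empty.
    unseen = [sd for sd in pool_domains if sd not in repo_stats]
    if unseen:
        return "NonOperational", "STORAGE_DOMAIN_UNREACHABLE"
    return "Up", None

print(init_vds_on_up(
    ["57f25104-911b-4507-bebd-4d75e4489b28",
     "d2bf9ac2-0c68-4884-816a-a2e4e430667c"],
    {}))  # -> ('NonOperational', 'STORAGE_DOMAIN_UNREACHABLE')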

IMO -
1. ConnectStoragePool should not be considered successful when there was an error. That would fix the scenario in this log; other calling flows should be inspected carefully.

2. Regardless, the general fix in the engine should be that if the master domain is unknown/inactive, the domain monitoring data of the host shouldn't be checked during the InitVdsOnUp flow. That would fix this scenario and others, and reduces the need for #1, at least for this flow (a sketch of the guard follows).
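
Extending the sketch above with the guard proposed in #2 (again hypothetical names; the engine itself is Java, Python is used here only for illustration):

def init_vds_on_up_with_guard(pool_domains, repo_stats, master_status):
    # Proposed guard (#2): when the master domain is unknown or inactive,
    # skip the monitoring-based check entirely and let reconstruct handle it.
    if master_status != "Active":
        return "Up", None
    unseen = [sd for sd in pool_domains if sd not in repo_stats]
    if unseen:
        return "NonOperational", "STORAGE_DOMAIN_UNREACHABLE"
    return "Up", None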
Comment 3 Ayal Baron 2013-07-10 05:45:39 EDT
Acking the fix for #1 in comment 2. Regarding the 'general fix', we need to discuss it to understand the implications for other flows.
Comment 5 Ayal Baron 2013-08-05 07:20:47 EDT
*** Bug 987874 has been marked as a duplicate of this bug. ***
Comment 6 Liron Aravot 2013-10-07 04:49:16 EDT
Moving to MODIFIED, as it should be solved by #2 from https://bugzilla.redhat.com/show_bug.cgi?id=967749#c2.
Implementing #1 should be done regardless.
Comment 9 Aharon Canan 2013-12-15 09:09:46 EST
Following my discussion with Allon and Liron - 

Manually changing the metadata file causes a checksum problem; the way to reproduce/verify is below:

1. create NFS DC with 1 host and 2 SDs
2. change host to maintenance
3. block the master SD (via iptables, removing the NFS share, etc.; one way is sketched at the end of this comment)
4. change host state to up

The host becomes active and a reconstruct master takes place.

verified using 3.3 is27
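
One way to script step 3 of the procedure above (a sketch only; the NFS server address is a placeholder, and blocking by hand with iptables or removing the export works just as well):

import subprocess

NFS_SERVER = "10.0.0.1"  # placeholder for the master SD's NFS server

def block_master_sd():
    # Drop outgoing NFS traffic to the master domain's server (step 3).
    subprocess.check_call(["iptables", "-I", "OUTPUT", "-d", NFS_SERVER,
                           "-p", "tcp", "--dport", "2049", "-j", "DROP"])

def unblock_master_sd():
    # Remove the rule again once the reconstruct has been observed.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", NFS_SERVER,
                           "-p", "tcp", "--dport", "2049", "-j", "DROP"])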
Comment 10 Itamar Heim 2014-01-21 17:15:16 EST
Closing - RHEV 3.3 Released
Comment 11 Itamar Heim 2014-01-21 17:22:35 EST
Closing - RHEV 3.3 Released
