Bug 967749 - [Backend] Host is put to non-operational when metadata are corrupted [NFS only]
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
x86_64 Linux
Priority: unspecified  Severity: high
: ---
: 3.3.0
Assigned To: Liron Aravot
Aharon Canan
: Triaged
: 987874
Depends On:
Reported: 2013-05-28 04:50 EDT by Jakub Libosvar
Modified: 2016-02-10 11:38 EST
12 users

See Also:
Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-01-21 17:15:16 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
abaron: Triaged+

Attachments
backend vdsm logs (366.08 KB, application/gzip)
2013-05-28 04:50 EDT, Jakub Libosvar

External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 18165 None None None Never
oVirt gerrit 21368 None None None Never

Description Jakub Libosvar 2013-05-28 04:50:03 EDT
Created attachment 753799
backend vdsm logs

Description of problem:
When I have two storage domains in an NFS data center and rewrite MASTER_VERSION in the metadata file, the single host in the cluster is put into the Non-Operational state instead of a reconstruct master taking place on the second storage domain. The master storage domain holds a disk of a running VM.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Have an NFS DC with two storage domains
2. Have a disk of a running VM on the master SD
3. Change MASTER_VERSION in the metadata file of the master domain

Actual results:
Host is put to non-operational

Expected results:
The second domain becomes the master domain

Additional info:
Works OK on iSCSI.
Logs are attached.
The host cannot be activated until the metadata file of the master domain is fixed (see the sketch below for why a hand edit breaks the metadata seal).
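
For context, here is a minimal illustrative sketch of why a hand edit of MASTER_VERSION breaks the metadata seal. It assumes the NFS domain metadata under dom_md/metadata is a set of KEY=VALUE lines sealed by a _SHA_CKSUM line holding a SHA-1 digest over the other lines (the 40-hex-digit digests in the MetaDataSealIsBroken error in comment 2 look like SHA-1); the exact concatenation rules in VDSM may differ, so this is not VDSM code.

import hashlib

def compute_seal(lines):
    """SHA-1 hex digest over every non-checksum metadata line (simplified)."""
    digest = hashlib.sha1()
    for line in lines:
        if not line.startswith("_SHA_CKSUM="):
            digest.update(line.encode())
    return digest.hexdigest()

metadata = [
    "CLASS=Data",
    "MASTER_VERSION=1",   # the field edited in this reproduction
    "ROLE=Master",
    "VERSION=3",
]
stored_seal = compute_seal(metadata)

# Editing MASTER_VERSION by hand without recomputing the seal makes the
# stored cksum and the computed_cksum diverge -- the MetaDataSealIsBroken
# condition seen on the host.
metadata[1] = "MASTER_VERSION=2"
assert compute_seal(metadata) != stored_seal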
Comment 1 Haim 2013-05-29 02:58:17 EDT
Not sure it's a bug.
I think it's the "correct" behavior: there is no point in failing the action when the user wants to deactivate their domain.
The deactivation process is only relevant for the master domain, so if the master domain is fine, there should be no problem.
Comment 2 Liron Aravot 2013-07-08 08:09:16 EDT
It seems that upon host activation, the host fails to connect to the storage pool because of a checksum mismatch.

  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 940, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 987, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1179, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1526, in getMasterDomain
    domain = sdCache.produce(msdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 142, in __init__
    sdUUID = metadata[sd.DMDK_SDUUID]
  File "/usr/share/vdsm/storage/persistentDict.py", line 89, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 199, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
  File "/usr/share/vdsm/storage/persistentDict.py", line 276, in refresh
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = 8ee7ee56a6c995fdce68e3acf32a7823783c402c, computed_cksum = 6982f234d1825cc81a7a7c6d8226cd741b14a769'

This error is ignored in the engine and the flow continues, since the ConnectStoragePool operation doesn't fail. The repoStats of the domains are then checked on the host; because the host doesn't monitor any domains, it moves to Non-Operational (an illustrative sketch of this decision follows the log lines below).

2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain 57f25104-911b-4507-bebd-4d75e4489b28 is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain d2bf9ac2-0c68-4884-816a-a2e4e430667c is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-44) One of the Storage Domains of host in pool datacenter_storage_spm_negative is problematic
2013-05-27 17:16:50,657 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,658 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,696 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-8) [7cbb579f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: c97f34aa-80af-46f8-adfa-63a156da7395 Type: VDS
2013-05-27 17:16:50,697 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-8) [7cbb579f] START, SetVdsStatusVDSCommand(HostName =, HostId = c97f34aa-80af-46f8-adfa-63a156da7395, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 3413254e
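
An illustrative sketch of that decision (hypothetical names, not the actual engine code): because the host reports repoStats for no domains, every domain of the pool looks unseen and the host is moved to Non-Operational.

def init_vds_on_up(pool_domains, repo_stats):
    # repo_stats maps monitored domain UUIDs to their stats; with a broken
    # master domain the host monitors nothing, so the dict is empty.
    unseen = [sd for sd in pool_domains if sd not in repo_stats]
    if unseen:
        return "NonOperational", "STORAGE_DOMAIN_UNREACHABLE"
    return "Up", None

print(init_vds_on_up(
    ["57f25104-911b-4507-bebd-4d75e4489b28",
     "d2bf9ac2-0c68-4884-816a-a2e4e430667c"],
    {}))  # -> ('NonOperational', 'STORAGE_DOMAIN_UNREACHABLE')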

IMO -
1. ConnectStoragePool should not be considered successful when there was an error. That would fix the scenario in this log; other calling flows should be inspected carefully.

2. Regardless, the general fix in the engine should be that if the master domain is unknown/inactive, the domain monitoring data of the host shouldn't be checked during the InitVdsOnUp flow. That would fix this scenario and others, and reduces the need for #1, at least for this flow (a sketch of the guard follows).
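
Extending the sketch above with the guard proposed in #2 (again hypothetical names; the engine itself is Java, Python is used here only for illustration):

def init_vds_on_up_with_guard(pool_domains, repo_stats, master_status):
    # Proposed guard (#2): when the master domain is unknown or inactive,
    # skip the monitoring-based check entirely and let reconstruct handle it.
    if master_status != "Active":
        return "Up", None
    unseen = [sd for sd in pool_domains if sd not in repo_stats]
    if unseen:
        return "NonOperational", "STORAGE_DOMAIN_UNREACHABLE"
    return "Up", None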
Comment 3 Ayal Baron 2013-07-10 05:45:39 EDT
Acking the fix for #1 in comment 2. Regarding the 'general fix', we need to discuss it to understand the implications for other flows.
Comment 5 Ayal Baron 2013-08-05 07:20:47 EDT
*** Bug 987874 has been marked as a duplicate of this bug. ***
Comment 6 Liron Aravot 2013-10-07 04:49:16 EDT
Moving to MODIFIED, as it should be solved by #2 from https://bugzilla.redhat.com/show_bug.cgi?id=967749#c2.
Implementing #1 should be done regardless.
Comment 9 Aharon Canan 2013-12-15 09:09:46 EST
Following my discussion with Allon and Liron - 

Manually changing the metadata file causes a checksum problem; the way to reproduce/verify is below:

1. create NFS DC with 1 host and 2 SDs
2. change host to maintenance
3. block the master SD (via iptables, removing the NFS share, etc.; one way is sketched at the end of this comment)
4. change host state to up

The host becomes active and a reconstruct master takes place.

verified using 3.3 is27
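
One way to script step 3 of the procedure above (a sketch only; the NFS server address is a placeholder, and blocking by hand with iptables or removing the export works just as well):

import subprocess

NFS_SERVER = "10.0.0.1"  # placeholder for the master SD's NFS server

def block_master_sd():
    # Drop outgoing NFS traffic to the master domain's server (step 3).
    subprocess.check_call(["iptables", "-I", "OUTPUT", "-d", NFS_SERVER,
                           "-p", "tcp", "--dport", "2049", "-j", "DROP"])

def unblock_master_sd():
    # Remove the rule again once the reconstruct has been observed.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", NFS_SERVER,
                           "-p", "tcp", "--dport", "2049", "-j", "DROP"])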
Comment 10 Itamar Heim 2014-01-21 17:15:16 EST
Closing - RHEV 3.3 Released
Comment 11 Itamar Heim 2014-01-21 17:22:35 EST
Closing - RHEV 3.3 Released
