Bug 967749
Summary: [Backend] Host is put to non-operational when metadata are corrupted [NFS only]

| | | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jakub Libosvar <jlibosva> |
| Component: | ovirt-engine | Assignee: | Liron Aravot <laravot> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Aharon Canan <acanan> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.2.0 | CC: | abaron, acanan, acathrow, amureini, eedri, iheim, jkt, laravot, lpeer, Rhev-m-bugs, scohen, yeylon |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.3.0 | Flags: | abaron: Triaged+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | is24 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-01-21 22:15:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | backend vdsm logs (attachment 753799) | | |
Not sure it's a bug; I think it's the "correct" behavior. There is no point in failing the action when the user wants to deactivate their domain: the deactivation process is only relevant for the master domain, so if the master domain is fine there should be no problem.

It seems that upon host activation, the host fails to connect to the storage pool because of a checksum mismatch:

```
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 940, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 987, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1179, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1526, in getMasterDomain
    domain = sdCache.produce(msdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 142, in __init__
    sdUUID = metadata[sd.DMDK_SDUUID]
  File "/usr/share/vdsm/storage/persistentDict.py", line 89, in __getitem__
    return dec(self._dict[key])
  File "/usr/share/vdsm/storage/persistentDict.py", line 199, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 276, in refresh
    computedChecksum)
MetaDataSealIsBroken: Meta Data seal is broken (checksum mismatch): 'cksum = 8ee7ee56a6c995fdce68e3acf32a7823783c402c, computed_cksum = 6982f234d1825cc81a7a7c6d8226cd741b14a769'
```

This error is ignored in the engine and the flow continues: since the ConnectStoragePool operation doesn't fail, the repoStats of the domains is checked on the host, and because the host doesn't monitor any domains, it moves to Non Operational.
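For context, the "seal" in this exception is a digest stored inside the domain metadata file itself. Below is a minimal sketch of the verification, assuming the digest is SHA-1 over the non-checksum lines and is stored under a `_SHA_CKSUM` key; `verify_seal` and the exact line normalization are illustrative, not VDSM's actual code:

```python
import hashlib

CKSUM_KEY = "_SHA_CKSUM"  # assumed name of the key holding the seal

def verify_seal(metadata_text):
    """Recompute the checksum over the metadata payload and compare it
    to the stored one, roughly what persistentDict.refresh() does
    before trusting the file (illustrative sketch)."""
    stored = None
    payload = []
    for line in metadata_text.splitlines(True):  # keep line endings
        if line.startswith(CKSUM_KEY + "="):
            stored = line.split("=", 1)[1].strip()
        else:
            payload.append(line)
    computed = hashlib.sha1("".join(payload).encode("utf-8")).hexdigest()
    if stored != computed:
        # VDSM raises MetaDataSealIsBroken here with both digests
        raise RuntimeError(
            "Meta Data seal is broken (checksum mismatch): "
            "cksum = %s, computed_cksum = %s" % (stored, computed))
```

Editing any value by hand (such as MASTER_VERSION) changes the payload but not the stored digest, which is why the next refresh raises MetaDataSealIsBroken, as in the traceback above.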
```
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain 57f25104-911b-4507-bebd-4d75e4489b28 is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-4-thread-44) Domain d2bf9ac2-0c68-4884-816a-a2e4e430667c is not seen by Host
2013-05-27 17:16:50,657 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-4-thread-44) One of the Storage Domains of host 10.34.63.135 in pool datacenter_storage_spm_negative is problematic
2013-05-27 17:16:50,657 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,658 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-44) Unable to get value of property: vds for class org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase
2013-05-27 17:16:50,696 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-8) [7cbb579f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: c97f34aa-80af-46f8-adfa-63a156da7395 Type: VDS
2013-05-27 17:16:50,697 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-8) [7cbb579f] START, SetVdsStatusVDSCommand(HostName = 10.34.63.135, HostId = c97f34aa-80af-46f8-adfa-63a156da7395, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 3413254e
```

IMO:

1. ConnectStoragePool should not be considered successful when there was an error. That would fix the scenario in this log, but other calling flows should be inspected carefully.
2. Regardless, the general fix in the engine should be that if the master domain is unknown/inactive, the host's domain monitoring data shouldn't be checked during the InitVdsOnUp flow. That would fix this scenario and others, and reduces the need for #1, at least for this flow (see the sketch after this comment thread).

Acking fixing #1 in comment 2. Regarding the "general fix", we need to discuss it to understand the implications for other flows.

*** Bug 987874 has been marked as a duplicate of this bug. ***

Moving to MODIFIED, as it should be solved by #2 from https://bugzilla.redhat.com/show_bug.cgi?id=967749#c2. Implementing #1 should be done regardless.

Following my discussion with Allon and Liron: manually changing the metadata file causes the checksum problem. The way to reproduce/verify:

1. Create an NFS DC with 1 host and 2 SDs
2. Move the host to maintenance
3. Block the master SD (via iptables, removing the NFS share, ...)
4. Change the host state to up

The host became active and reconstruct master takes place.

Verified using 3.3 is27.

Closing - RHEV 3.3 Released
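To make suggestion #2 above concrete: the real fix lives in the engine's Java InitVdsOnUpCommand, but a sketch of the proposed control flow in Python might look as follows. `pool`, `host`, and all method names here are hypothetical stand-ins, not engine APIs:

```python
def check_host_storage_on_up(pool, host):
    """Sketch of the proposed InitVdsOnUp guard: only judge a host by
    its domain monitoring data when the master domain is usable."""
    # If the master domain is unknown/inactive, the host's repoStats is
    # empty or meaningless, so checking it would wrongly mark the host
    # Non Operational (the behavior seen in the log above).
    if pool.master_domain_status() in ("unknown", "inactive"):
        return
    unseen = [sd for sd in pool.active_domains()
              if sd.id not in host.repo_stats()]
    if unseen:
        host.set_non_operational(reason="STORAGE_DOMAIN_UNREACHABLE")
```

With a guard like this, a broken master metadata file can lead to reconstruct master instead of the host being fenced off for domains it never got a chance to monitor.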
Created attachment 753799 [details]: backend, vdsm logs

Description of problem:
When I have two storage domains in an NFS datacenter and rewrite MASTER_VERSION in the metadata file, the single host in the cluster is put into the Non Operational state instead of a reconstruct on the second storage domain. The master storage domain has a disk of a running VM.

Version-Release number of selected component (if applicable):
rhevm-backend-3.2.0-11.28.el6ev.noarch
vdsm-4.10.2-21.0.el6ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have an NFS DC with two storage domains
2. Have a disk of a running VM on the master SD
3. Change MASTER_VERSION in the metadata file of the master domain (a sketch of such an edit follows below)

Actual results:
Host is put to non-operational

Expected results:
The second domain becomes the master domain

Additional info:
Works OK on iSCSI. Logs are attached. The host cannot be activated until the metadata file of the master is fixed.
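For step 3, here is a sketch of the kind of manual edit that triggers the bug, assuming the NFS domain metadata is a plain key=value file under `<mount>/<sdUUID>/dom_md/metadata`; the path and the helper name are illustrative:

```python
import re

# Matches a line like "MASTER_VERSION=2" (precompiled pattern keeps this
# compatible with the python2.6 seen in the traceback)
_MASTER_VER = re.compile(r"^MASTER_VERSION=(\d+)$", re.MULTILINE)

def bump_master_version(metadata_path):
    """Rewrite MASTER_VERSION in place WITHOUT updating the checksum
    seal; the next metadata read then fails with MetaDataSealIsBroken."""
    with open(metadata_path) as f:
        text = f.read()
    text = _MASTER_VER.sub(
        lambda m: "MASTER_VERSION=%d" % (int(m.group(1)) + 1), text)
    with open(metadata_path, "w") as f:
        f.write(text)

# e.g. (sdUUID placeholder kept; do not run against a healthy domain):
# bump_master_version("/path/to/nfs/mount/<sdUUID>/dom_md/metadata")
```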