Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 949454

Summary: [engine-backend] In a case of corruption in master LV file system, engine doesn't perform reconstruct and continues to fail in spm start action
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: acanan, acathrow, amureini, ebenahar, iheim, jkt, laravot, lpeer, Rhev-m-bugs, scohen, tnisan, yeylon
Target Milestone: ---
Flags: amureini: Triaged+
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-29 06:41:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description      Flags
  logs             none
  logs-31.12.13    none

Description Elad 2013-04-08 08:31:48 UTC
Created attachment 732568 [details]
logs

Description of problem:

Reconstruction is not performed when the master LV file system is corrupted. This happens after an spmStart action. Relates to BZ#949310.

Version-Release number of selected component (if applicable):

RHEVM - rhevm-backend-3.2.0-10.18.beta2.el6ev.noarch
VDSM - vdsm-4.10.2-14.0.el6ev.x86_64
Libvirt - libvirt-0.10.2-18.el6_4.2.x86_64
Qemu-KVM - qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64
Sanlock - sanlock-2.6-2.el6.x86_64

How reproducible:

50%

Steps to Reproduce: 
1. Put the SPM host into maintenance and activate it again right after.

  
Actual results:
The engine identifies corruption in the master LV file system using fsck but does not perform reconstruction.

Expected results:
The engine should perform reconstruction when it identifies corruption in the master LV file system.


Additional info: see logs attached

Comment 1 Itamar Heim 2013-12-01 19:59:22 UTC
still relevant to fix?

Comment 2 Ayal Baron 2013-12-02 07:40:16 UTC
(In reply to Itamar Heim from comment #1)
> still relevant to fix?

Although we're working on getting rid of the master fs entirely, the sequence of operations is too simple to ignore.  I'm also not sure whether this happens for other types of problems (not related to master fs).
So for now I'm keeping this open.

Comment 4 Liron Aravot 2013-12-31 06:42:53 UTC
Elad, if it's still relevant (as 949310 was closed), please include the full scenario (number of hosts/domains and their statuses) and the time when the error occurs, for a full RCA.

thanks.

Comment 5 Liron Aravot 2013-12-31 07:02:27 UTC
Sorry, got confused with another bug. 949310 on vdsm side isn't related to the engine in this case.

Fixing my comment - Elad, please include the full scenario (number of hosts/domains and their statuses) and the time when the error occurs/a log snippet for a full RCA.

Comment 6 Elad 2013-12-31 10:08:21 UTC
Created attachment 843718 [details]
logs-31.12.13

(In reply to Liron Aravot from comment #5)
> Sorry, got confused with another bug. 949310 on vdsm side isn't related to
> the engine in this case.
> 
> Fixing my comment - Elad, please include the full scenario (number of
> hosts/domains and their statuses) and the time when the error occurs/a log
> snippet for a full RCA.

Managed to reproduce:
1 host in cluster, iSCSI pool, 2 active storage domains, 1 master:

- writing zeros into the master LV of the master domain:
dd if=/dev/zero of=/dev/1547072e-8b25-422e-8309-1701f6141782/master bs=1M

- restart vdsm service

- spmStart fails with fsck error:

ed91c011-6e55-4759-ad77-f584b20098c6::DEBUG::2013-12-31 11:50:13,747::blockSD::1114::Storage.Misc.excCmd::(mountMaster) FAILED: <err> = 'fsck.ext2: Bad magic number in super-block while trying to open /dev/mapper/1547072e--8b25--422e--8309--1701f6141782-master\n'; <rc> = 8
ed91c011-6e55-4759-ad77-f584b20098c6::ERROR::2013-12-31 11:50:13,747::sp::336::Storage.StoragePool::(startSpm) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 286, in startSpm
    self.masterDomain.mountMaster()
  File "/usr/share/vdsm/storage/blockSD.py", line 1129, in mountMaster
    raise se.BlockStorageDomainMasterFSCKError(masterfsdev, rc)
BlockStorageDomainMasterFSCKError: BlockSD master file system FSCK error: 'masterfsdev=/dev/1547072e-8b25-422e-8309-1701f6141782/master, rc=8'


Reconstruction doesn't take place.

Attaching the logs (logs-31.12.13)
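The corruption in this reproduction can be re-created safely against a scratch file instead of a live master LV (a sketch assuming GNU coreutils and e2fsprogs are available; the real scenario targets /dev/1547072e-8b25-422e-8309-1701f6141782/master as shown above):

```shell
# Recreate the fsck failure from comment 6 against a file-backed image
# rather than a real master LV.
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=8 status=none    # blank 8 MiB "device"
mkfs.ext2 -q -F "$IMG"                                 # healthy master-style FS
fsck.ext2 -n -f "$IMG" >/dev/null || true
echo "clean rc=$?"                                     # rc=0: file system is clean

# Zero the image, as the dd into the master LV does in the reproduction:
dd if=/dev/zero of="$IMG" bs=1M count=8 conv=notrunc status=none
fsck.ext2 -n "$IMG" || echo "corrupt rc=$?"            # rc=8: operational error,
                                                       # "Bad magic number in super-block"
rm -f "$IMG"
```

On a real setup the equivalent steps are the dd into the pool's master LV followed by a vdsmd restart, after which spmStart fails with BlockStorageDomainMasterFSCKError (rc=8) as in the attached logs.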

Comment 7 Allon Mureinik 2014-04-29 06:41:55 UTC
(In reply to Ayal Baron from comment #2)
> Although we're working on getting rid of the master fs entirely, the
> sequence of operations is too simple to ignore.  I'm also not sure whether
> this happens for other types of problems (not related to master fs).
> So for now I'm keeping this open.

This bug does not easily reproduce. The steps described in comment 0 (putting SPM to maintenance and re-activating it) are taken from the closed bug 949310. According to comment 6, the only way we were able to reproduce this is to manually corrupt the FS:

> Managed to reproduce:
> 1 host in cluster, iSCSI pool, 2 active storage domains, 1 master:
> 
> - writing zeros into the master LV of the master domain:
> dd if=/dev/zero of=/dev/1547072e-8b25-422e-8309-1701f6141782/master bs=1M
> 
> - restart vdsm service
> 
> - spmStart fails with fsck error:

Note that when SPM starts, it mounts the master FS, and fsck's it. If it's corrupted beyond repair (an issue we never encountered in the field, and were only able to manually reproduce), I'm fine with handling it manually, especially in light of the changes performed in 3.5 to remove all the OVF from there (i.e., 99% of the I/O), and the future work to completely decommission it.
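The mount-and-fsck step described above can be sketched in Python. This is an illustrative approximation, not the actual VDSM blockSD.py code; the function name, the injectable `run` parameter, and the rc threshold are assumptions based on the traceback in comment 6:

```python
# Hedged sketch of the SPM-start mountMaster flow: fsck the master LV
# device first, and abort with the error seen in the traceback when the
# file system cannot be repaired. Not the real VDSM implementation.
import subprocess


class BlockStorageDomainMasterFSCKError(Exception):
    """Raised when fsck of the master file system fails (cf. comment 6)."""


def mount_master(masterfsdev, run=subprocess.run):
    """Fsck the master LV before mounting it, as SPM start does.

    e2fsck exit codes: 0 = clean, 1 = errors corrected, 2 = corrected but
    reboot needed, 4 = errors left unfixed, 8 = operational error (the
    zeroed LV in comment 6 yields rc=8, "Bad magic number in super-block").
    """
    rc = run(["fsck.ext2", "-p", masterfsdev]).returncode
    if rc >= 2:  # assumed threshold: anything beyond "errors corrected" is fatal
        raise BlockStorageDomainMasterFSCKError(
            "masterfsdev=%s, rc=%s" % (masterfsdev, rc))
    # ... the real code path would mount the file system here ...
```

Injecting a stub `run` that returns rc=8 reproduces the BlockStorageDomainMasterFSCKError from the logs without touching a real device.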

Comment 8 Allon Mureinik 2014-04-29 06:42:53 UTC
Fixing flags changed by mistake, despite closing the bug. For bookkeeping reasons, they should be correct.