Bug 949454
| Summary: | [engine-backend] In a case of corruption in master LV file system, engine doesn't perform reconstruct and continues to fail in spm start action | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Elad <ebenahar> |
| Component: | ovirt-engine | Assignee: | Liron Aravot <laravot> |
| Status: | CLOSED WONTFIX | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.2.0 | CC: | acanan, acathrow, amureini, ebenahar, iheim, jkt, laravot, lpeer, Rhev-m-bugs, scohen, tnisan, yeylon |
| Target Milestone: | --- | Flags: | amureini: Triaged+ |
| Target Release: | 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-04-29 06:41:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | logs-31.12.13 (attachment 843718) | | |
Description
Elad 2013-04-08 08:31:48 UTC

Comment 1 (Itamar Heim):
still relevant to fix?

Comment 2 (Ayal Baron):
(In reply to Itamar Heim from comment #1)
> still relevant to fix?

Although we're working on getting rid of the master fs entirely, the sequence of operations is too simple to ignore. I'm also not sure whether this happens for other types of problems (not related to the master fs). So for now I'm keeping this open.

Comment (Liron Aravot):
Elad, if it's still relevant (as bug 949310 was closed), please include the full scenario (number of hosts/domains and their statuses) and the time when the error occurs, for a full RCA. Thanks.

Comment 5 (Liron Aravot):
Sorry, got confused with another bug. Bug 949310 on the vdsm side isn't related to the engine in this case.

Fixing my comment: Elad, please include the full scenario (number of hosts/domains and their statuses) and the time when the error occurs, plus a log snippet, for a full RCA.

Comment 6 (Elad):
Created attachment 843718 [details]: logs-31.12.13

(In reply to Liron Aravot from comment #5)
> Elad, please include the full scenario (number of hosts/domains and their
> statuses) and the time when the error occurs, plus a log snippet, for a full RCA.
Managed to reproduce. Setup: 1 host in the cluster, iSCSI pool, 2 active storage domains, 1 master.

- Write zeros into the master LV of the master domain:

```
dd if=/dev/zero of=/dev/1547072e-8b25-422e-8309-1701f6141782/master bs=1M
```

- Restart the vdsm service.

- spmStart fails with an fsck error:

```
ed91c011-6e55-4759-ad77-f584b20098c6::DEBUG::2013-12-31 11:50:13,747::blockSD::1114::Storage.Misc.excCmd::(mountMaster) FAILED: <err> = 'fsck.ext2: Bad magic number in super-block while trying to open /dev/mapper/1547072e--8b25--422e--8309--1701f6141782-master\n'; <rc> = 8
ed91c011-6e55-4759-ad77-f584b20098c6::ERROR::2013-12-31 11:50:13,747::sp::336::Storage.StoragePool::(startSpm) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 286, in startSpm
    self.masterDomain.mountMaster()
  File "/usr/share/vdsm/storage/blockSD.py", line 1129, in mountMaster
    raise se.BlockStorageDomainMasterFSCKError(masterfsdev, rc)
BlockStorageDomainMasterFSCKError: BlockSD master file system FSCK error: 'masterfsdev=/dev/1547072e-8b25-422e-8309-1701f6141782/master, rc=8'
```

Reconstruct doesn't take place. Attaching the logs (logs-31.12.13).

Comment:
(In reply to Ayal Baron from comment #2)
> Although we're working on getting rid of the master fs entirely, the
> sequence of operations is too simple to ignore. I'm also not sure whether
> this happens for other types of problems (not related to the master fs).
> So for now I'm keeping this open.

This bug does not reproduce easily. The steps described in comment 0 (putting the SPM host into maintenance and re-activating it) are taken from the closed bug 949310.
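The failure path in the traceback above can be sketched roughly as follows. This is a simplified, hypothetical reconstruction of vdsm's `mountMaster` flow, not the actual vdsm code; `run_fsck` is a stand-in for the real fsck invocation, hard-wired here to the rc=8 seen in the log:

```python
class BlockStorageDomainMasterFSCKError(Exception):
    """Mirrors the vdsm error name from the traceback above."""

    def __init__(self, masterfsdev, rc):
        super().__init__(
            "BlockSD master file system FSCK error: "
            "'masterfsdev=%s, rc=%s'" % (masterfsdev, rc))
        self.rc = rc


def run_fsck(masterfsdev):
    """Stand-in for running fsck.ext2 on the master LV; returns its exit code.

    In the reproduction above, zeroing the LV destroys the superblock, so
    fsck.ext2 fails with 'Bad magic number in super-block' and rc=8.
    """
    return 8  # simulate the corrupted-superblock case from the log


def mount_master(masterfsdev):
    """Sketch of the SPM-start step: fsck the master FS before mounting it.

    A non-zero fsck return code aborts spmStart with an exception; nothing
    in this path triggers a master-domain reconstruct, which is exactly
    the behavior this bug reports.
    """
    rc = run_fsck(masterfsdev)
    if rc != 0:
        raise BlockStorageDomainMasterFSCKError(masterfsdev, rc)
    # ...otherwise mount the filesystem and let spmStart continue...
```

Since the exception propagates out of `startSpm`, each retried spmStart hits the same fsck failure, matching the repeated failures described in the summary.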
Comment:
According to comment 6, the only way we were able to reproduce this is to manually corrupt the FS:

> Managed to reproduce. Setup: 1 host in the cluster, iSCSI pool, 2 active
> storage domains, 1 master.
>
> - Write zeros into the master LV of the master domain:
>   dd if=/dev/zero of=/dev/1547072e-8b25-422e-8309-1701f6141782/master bs=1M
>
> - Restart the vdsm service.
>
> - spmStart fails with an fsck error.

Note that when the SPM starts, it mounts the master FS and fscks it. If the FS is corrupted beyond repair (an issue we never encountered in the field, and were only able to reproduce manually), I'm fine with handling it manually, especially in light of the changes made in 3.5 to remove all the OVFs from there (i.e., 99% of the I/O), and the future work to decommission it completely.

Comment:
Fixing flags that were changed by mistake. Even though the bug is closed, they should be correct for bookkeeping reasons.
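To make the report's core complaint concrete, the engine-side handling the reporter expected might be sketched as follows. This is entirely illustrative Python (the real engine is Java); `reconstruct_master`, `retry_spm_start`, and the error-name dispatch are assumptions, not actual engine code:

```python
def handle_spm_start_failure(error_name, pool):
    """Illustrative dispatch on spmStart failure.

    Expected behavior per this bug: on master-FS corruption, pick a new
    master domain and rebuild the master metadata (reconstruct) instead of
    retrying spmStart forever. The reported behavior is that the engine
    only retries, so every attempt hits the same fsck failure.
    """
    if error_name == "BlockStorageDomainMasterFSCKError":
        # Master FS is unusable; a retry cannot succeed, so reconstruct.
        return pool.reconstruct_master()
    # Other failures may be transient; retrying spmStart is reasonable.
    return pool.retry_spm_start()


class FakePool:
    """Minimal stand-in for a storage pool, used only for this sketch."""

    def reconstruct_master(self):
        return "reconstruct"

    def retry_spm_start(self):
        return "retry"
```

Under this sketch, the fsck failure from comment 6 would route to reconstruct, while other spmStart errors would still be retried; the WONTFIX resolution above reflects the decision to leave such repair manual instead.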