| Summary: | LVM RAID: dev failure during first sync of upconvert can lose data | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jonathan Earl Brassow <jbrassow> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 7.2 | ||
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-06-14 14:01:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Jonathan Earl Brassow
2016-11-17 19:21:22 UTC
The MD raid1 personality behaves that way in the case of multi-legged mirrors (i.e. it selects a new primary when the current one dies).
The behavioural change you're requesting ("...to continue syncing from the original LV.") implies first updating any dirty regions on the returning initial primary leg before restarting the previously interrupted resynchronization from where it left off; otherwise those updates would be lost.
That wouldn't work, though, because regions aren't mapped 1:1 to I/O payload sizes and offsets, so a dirty region will typically not have been fully written over. Copying whole dirty regions back would therefore replace parts of those regions on the returned primary leg with uninitialized data, causing data corruption.
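A rough illustration of that mismatch (the numbers below are made up for illustration, not taken from this bug): the write-intent bitmap tracks dirtiness per region, so even a small write dirties a whole region, and most of that region was never actually written.

```sh
# Illustrative sizes only (hypothetical, not from this bug): a typical
# filesystem write is far smaller than a RAID region, so one small write
# marks an entire region dirty in the write-intent bitmap.
REGION_SIZE_KIB=512   # example region size
WRITE_SIZE_KIB=4      # example write size
echo "a ${WRITE_SIZE_KIB} KiB write dirties a ${REGION_SIZE_KIB} KiB region;"
echo "$(( REGION_SIZE_KIB - WRITE_SIZE_KIB )) KiB of that region were never actually written"
```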
To compensate for that, we'd need a finer-grained write-intent bitmap (i.e. a tiny region size) to make sure the whole region gets updated in this situation, which imposes overhead and, because of bitmap size limits, scalability issues on large linear LVs being up-converted.
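A minimal sketch of why shrinking the region size doesn't scale: the bitmap holds one bit per region, so the bitmap grows inversely with the region size (the LV size and region sizes below are illustrative only).

```sh
# Rough write-intent bitmap size for a 1 TiB LV at various region sizes
# (one bit per region; purely illustrative numbers).
LV_SIZE_BYTES=$(( 1024 ** 4 ))          # 1 TiB
for REGION_KIB in 2048 512 64 4; do
    REGIONS=$(( LV_SIZE_BYTES / (REGION_KIB * 1024) ))
    echo "region ${REGION_KIB} KiB -> ${REGIONS} bits -> $(( REGIONS / 8 / 1024 )) KiB bitmap"
done
```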
It is important to note that an initially synchronizing, up-converted linear -> raid1 LV is no more resilient right after the conversion than the previous linear LV was; the initial sync merely causes the resilience ratio to grow to 100% over time. We may only be able to work around a transiently failing primary leg (the previous linear LV containing the user data) with a solution along the lines of comment #1.

This bug will be deferred to 7.5, but needs a release note.

Fixed in RHEL 7.4. Fixed by:

- ddb14b6 lvconvert: Disallow removal of primary when up-converting (recovering)
- 4c0e908 RAID (lvconvert/dmeventd): Cleanly handle primary failure during 'recover' op
- d34d206 lvconvert: Don't require a 'force' option during RAID repair.
- c87907d lvconvert: linear -> raid1 upconvert should cause "recover" not "resync"
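For reference, a hedged sketch of the workflow the commits above touch (the VG/LV names are hypothetical): with the fixes, a linear -> raid1 up-convert is reported as a "recover" operation rather than a full "resync", and a leg failure during that window is handled via lvconvert --repair.

```sh
# Hypothetical VG/LV names; a sketch of the scenario the commits address.
# Up-convert a linear LV to raid1; with the fixes this starts a "recover"
# (copying from the original leg) rather than a full "resync".
lvconvert --type raid1 -m 1 vg00/data

# Watch the initial synchronization and the reported sync action.
lvs -a -o name,copy_percent,raid_sync_action,lv_health_status vg00

# If a leg fails while copy_percent < 100, repair it; the fixes make
# dmeventd/lvconvert handle a primary failure during 'recover' cleanly and
# refuse to drop the primary while it holds the only complete copy.
lvconvert --repair vg00/data
```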