Bug 1360372

Summary: [RFE] LVM RAID: Allow metadata for each PV for very large RAID LVs
Product: [Community] LVM and device-mapper
Component: lvm2
lvm2 sub component: Mirroring and RAID
Reporter: Jonathan Earl Brassow <jbrassow>
Assignee: Heinz Mauelshagen <heinzm>
QA Contact: cluster-qe <cluster-qe>
Status: NEW
Severity: unspecified
Priority: unspecified
Version: unspecified
Keywords: FutureFeature
CC: agk, heinzm, jbrassow, mcsontos, msnitzer, pasik, prajnoha, zkabelac
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Flags: rule-engine: lvm-technical-solution?, rule-engine: lvm-test-coverage?

Description Jonathan Earl Brassow 2016-07-26 14:24:17 UTC
Currently, there is only one metadata LV for each rimage sub-LV.  Very large RAID LVs may have multiple PVs per rimage, which means a single failed device could take out a whole rimage, perhaps spanning several disks.  It would be better if only the one failed PV had to be removed and resynced.
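
To make the scale of the problem concrete, here is a minimal sketch in Python (hypothetical sizes and device names, not lvm2's actual metadata model) comparing how much data a repair has to resynchronize today, i.e. the whole rimage, versus with per-PV metadata, i.e. only the extents on the failed PV:

# Hypothetical layout: one rimage sub-LV allocated across several PVs.
# Sizes and device names are illustrative only.
rimage_0 = [
    ("/dev/sdb", 4 * 1024**4),  # 4 TiB allocated on /dev/sdb
    ("/dev/sdc", 4 * 1024**4),  # 4 TiB allocated on /dev/sdc
    ("/dev/sdd", 4 * 1024**4),  # 4 TiB allocated on /dev/sdd
]

failed_pv = "/dev/sdc"

# Today: the single rmeta LV covers the whole rimage, so a failed PV forces
# a resync of every sector in the rimage.
resync_whole_rimage = sum(size for _, size in rimage_0)

# With per-PV granularity, only the extents on the failed PV would need
# to be rewritten.
resync_failed_pv_only = sum(size for pv, size in rimage_0 if pv == failed_pv)

print(f"whole rimage: {resync_whole_rimage / 1024**4:.0f} TiB, "
      f"failed PV only: {resync_failed_pv_only / 1024**4:.0f} TiB")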

Comment 1 Marian Csontos 2016-07-27 16:25:00 UTC
Looks like a multi-segment LV where each segment is a RAID volume.

Other considerations:

Should scrubbing only one "segment" work?
Use case: a disk holding a PV is returning SMART errors.

Should repairing only a specified "segment" work?
Use case: repairing an LV holding a VM image; I do not care much about /boot and /, but I do care about /home.

Should down-/up-converting only a specified segment work?
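
A rough sketch of how userspace could pick the work for such a targeted scrub or repair, assuming each segment records which PV backs it (the segment model and device names below are made up for illustration, not lvm2 structures):

# Illustrative toy model: each segment of a RAID data sub-LV records the PV
# backing it and the sector range it covers.
segments = [
    # (sub_lv,           pv,         start_sector, length_sectors)
    ("vg-lv_rimage_0", "/dev/sdb",   0,            2097152),
    ("vg-lv_rimage_0", "/dev/sdc",   2097152,      2097152),
    ("vg-lv_rimage_1", "/dev/sdd",   0,            2097152),
    ("vg-lv_rimage_1", "/dev/sde",   2097152,      2097152),
]

def segments_on(pv):
    """Return only the segments a targeted scrub/repair would have to touch."""
    return [s for s in segments if s[1] == pv]

# Use case above: /dev/sdc starts reporting SMART errors, so only the one
# segment of rimage_0 that lives on it would be scrubbed or repaired.
for sub_lv, pv, start, length in segments_on("/dev/sdc"):
    print(f"scrub {sub_lv}: sectors {start}..{start + length} (on {pv})")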

Comment 2 Heinz Mauelshagen 2016-08-03 15:06:18 UTC
Yes, resynchronizing just the one area on a particular PV is the better approach.
The proposal to have a MetaLV for each PV that a spread-out DataLV is allocated on would mean mirroring those MetaLVs in a more complex dm stack, because the MD kernel runtime requires _one_ metadata device per data device. Having to keep them in sync would create a two-level RAID consistency problem which I'd rather avoid.

We currently have only the "rebuild #IDX" dm-raid constructor flag, which
resynchronizes the whole indexed DataLV, so it would need to be enhanced to resynchronize only part of the DataLV.

For instance, this could be done via a "rebuild_seg #IDX #StartSector #EndSector" flag replacing the plain "rebuild" flag for the indexed DataLV, thus resynchronizing only the sector range mapped onto the replaced PV.
Obviously, more complex multi-segment configurations complicate this further.
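
As a sketch of how userspace might derive such a tuple, the snippet below maps a replaced PV back to the sector range of the indexed DataLV allocated on it (the layout and device names are made up, and "rebuild_seg" itself is only the proposal above, not an existing dm-raid parameter):

# Hypothetical mapping: raid image index -> (pv, DataLV start sector, length).
data_lvs = {
    0: [("/dev/sdb", 0, 2097152), ("/dev/sdc", 2097152, 2097152)],
    1: [("/dev/sdd", 0, 2097152), ("/dev/sde", 2097152, 2097152)],
}

def rebuild_seg_args(replaced_pv):
    """Yield (idx, start, end) for the DataLV ranges mapped onto replaced_pv."""
    for idx, segs in data_lvs.items():
        for pv, start, length in segs:
            if pv == replaced_pv:
                yield idx, start, start + length

# Replacing /dev/sdc would only resynchronize half of DataLV 0 instead of
# the whole DataLV as "rebuild 0" does today.
for idx, start, end in rebuild_seg_args("/dev/sdc"):
    print(f"rebuild_seg {idx} {start} {end}")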

Comment 3 Heinz Mauelshagen 2016-08-11 22:54:29 UTC
Got a patch for the kernel part implementing "rebuild_seg #IDX #StartSector #EndSector" (~100 LOC).  This flag is mutually exclusive with "rebuild #IDX" and can only be deployed once due to runtime restrictions in the MD kernel
resynchronization code, which manages only one array-global maximum resynchronization sector (i.e. a single global end sector defining where synchronization ends on each of the array's data devices).  Rebuilding segments on multiple devices in parallel would require that to be a per-device rather than a per-array property.

Mind that the MD restriction also implies changing the table N (N > 1) times if N segments of a DataLV mapped to N failed PVs have to be repaired, with each table switch moving on to the next segment to rebuild, in case such a multi-segment repair is driven from userspace (preferred).  The other option would involve further dm-raid kernel enhancements, including adding a segment map to the superblock holding information about the N segments to rebuild, thus allowing multiple rebuild_seg tuples to be passed into the dm-raid constructor (this solution would have scalability issues).
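
A rough sketch of that userspace-driven loop, under the restriction that only one rebuild_seg range can be active per table load (the helpers below merely stand in for the real dmsetup suspend/reload/resume and status polling, and rebuild_seg remains a proposed flag):

def load_table_with_rebuild_seg(dm_name, idx, start, end):
    # Stand-in for dmsetup suspend/reload/resume with a table carrying a
    # single proposed "rebuild_seg idx start end" parameter.
    print(f"reload {dm_name}: rebuild_seg {idx} {start} {end}")

def wait_for_sync(dm_name):
    # Stand-in for polling `dmsetup status` until resynchronization finishes.
    print(f"wait for {dm_name} to reach full sync")

def repair_segments(dm_name, failed_segments):
    """Apply N rebuild_seg ranges serially, one table switch per segment."""
    for idx, start, end in failed_segments:  # e.g. N segments on N failed PVs
        load_table_with_rebuild_seg(dm_name, idx, start, end)
        wait_for_sync(dm_name)               # only one range can rebuild at a time

repair_segments("vg-bigraid", [(0, 0, 2097152), (3, 2097152, 4194304)])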