Description of problem:
When raid repair cannot allocate a new 'leg', it should still (probably in a switched/better order) first drop the 'D'ead leg, so the LV is repaired to e.g. linear. This works for the old 'mirror' type, which can be repaired to 'linear' when there are no resources for a new replacement leg. With raid, however, lvm2 stops on the allocation error before trying to fix anything in the dm table, leaving the 'raid' LV unrepaired ('D' in status).

Version-Release number of selected component (if applicable):
2.02.172

How reproducible:

Steps to Reproduce:
1. create a 2-leg 'raid1' on 2 PVs
2. kill 1 PV
3. lvconvert --repair

Actual results:
lvconvert stops on the allocation error; the raid LV keeps its dead leg ('D' in status).

Expected results:
The dead leg is dropped and the LV is repaired (e.g. to linear) even when no replacement PV is available.

Additional info:
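A sketch of the reproducer (requires root and lvm2; VG/LV/device names and sizes are placeholders, and the method of killing the PV is illustrative):

```shell
# 1. create a 2-leg raid1 across 2 PVs
vgcreate vg /dev/sdX /dev/sdY
lvcreate --type raid1 -m1 -L100M -n lv vg
# 2. kill one PV (e.g. pull the device, or force it offline)
echo offline > /sys/block/sdX/device/state
# 3. attempt repair with no spare PV in the VG
lvconvert --yes --repair vg/lv
# actual: allocation error, 'D' leg remains visible in 'lvs -a' attrs
# expected: dead leg dropped (raid1 -> linear) per policy
```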
It effectively means the user cannot recover if he has no 'extra' disk, as we 'reject' operations like 'lvconvert -m0' when some PV is missing - only --repair has the 'privilege' to work with an incomplete VG.
Also the 'activation/raid_fault_policy = "remove"' doesn't seem to work ATM as some allocation is tried first.
These 2-PV issues can be tackled by "vgreduce --removemissing --force $VG ; lvconvert -m0 $LV". The user can add a new, second PV later and upconvert again.
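Spelled out for the 2-PV raid1 case (VG/LV/device names are placeholders; note the warnings about --force in the following comments):

```shell
vgreduce --removemissing --force vg    # drop the missing PV from the VG
lvconvert -m0 vg/lv                    # raid1 -> linear, cleans the table
# later, once a replacement disk is available:
vgextend vg /dev/sdZ
lvconvert -m1 vg/lv                    # upconvert back to 2-leg raid1
```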
We cannot advise users to use 'vgreduce --force' to repair what is rather a standard scenario, where a 2-legged raid1 lost one device.

The --force may seriously harm other LVs unrelated to the raid LV being repaired.

It also leaves the 2-legged raid1 LV in metadata with 'error' segments, so the invalid raid device and tables are not fixed either.

A following 'lvconvert -m0' in this particular case does clean the table in some way, but this workaround is purely limited to a 2-legged raid1 LV and doesn't apply when e.g. a raid1 has more than 1 lost leg.

'lvconvert --repair' is the thing to focus on here.
(In reply to Zdenek Kabelac from comment #4)
> We cannot advise users to use 'vgreduce --force' to repair what is rather a
> standard scenario where a 2-legged raid1 lost one device.
>
> The --force may seriously harm other LVs unrelated to the raid LV being
> repaired.

What important use cases are you talking about?
For instance, one has to try rescuing any linear LV with parts of its segments on the missing PV before the removal.

> Also it leaves the 2-legged raid1 LV in metadata with 'error' segments, so
> the invalid raid device and tables are not fixed either.

Deactivation/activation works fine, no? Repair as is, presuming enough PVs are available, works, no? What's the problem with the error mappings that replace missing PVs here, when they are changed by repair?

> A following 'lvconvert -m0' in this particular case does clean the table in
> some way, but this workaround is purely limited to a 2-legged raid1 LV and
> doesn't apply when e.g. a raid1 has more than 1 lost leg.

Yes, a direct "lvconvert -m0 Raid1LV" does not work. Reducing legs one by one does work as a workaround, hence a fix in this regard is necessary.

> 'lvconvert --repair' is the thing to focus on here.

The user set up a raid1 LV for the very purpose of gaining resilience. Any repair of degraded RaidLVs has to be about ensuring that resilience again. We should not call a downconvert of a degraded RaidLV a repair, because it does not aim at regaining the resilience the user requested!
Allowing 'lvconvert -m N Raid1LV' to remove raid1 images in degraded VGs is a 2-line code change. This alone, though, would cause data loss on e.g. DDAA with 'lvconvert -m1 Raid1LV', leaving the 2 failed legs behind -> complete data loss.

To enable removal to cope with variations of failed legs, properly selecting intact ones (e.g. in a 4-leg example: DDAA, DADA or DAAD), those failed legs have to be removed, shifting intact ones, before 'lvconvert -mN $Raid1LV' (N > 0); this is an additional 4-line patch. Comments on each add a few lines.
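The "reducing legs one by one" workaround can be sketched as follows (a hedged example; VG/LV/device names are placeholders, and naming which leg to drop relies on 'lvconvert -m' accepting a PV list):

```shell
# 4-leg raid1 in state DDAA: remove the two failed legs first,
# one at a time, naming the PV each dead image sits on
lvconvert --yes -m3 vg/raid_lv /dev/failed_pv1   # 4 legs -> 3
lvconvert --yes -m2 vg/raid_lv /dev/failed_pv2   # 3 legs -> 2 intact legs
# a blind 'lvconvert -m1 vg/raid_lv' on DDAA could instead keep the
# dead legs and drop the intact ones -> complete data loss
```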
The trouble is: the only command eligible to work with a VG with a missing PV is 'lvconvert --repair' - it's the rule of lvm2. There are a couple of buggy pieces of code breaking this rule (e.g. 'lvconvert --uncache' is one such example), but this all needs to be fixed. So please do not commit any more hacks before analyzing the issue in depth. I'm very sure there doesn't exist any 2- or 3-line hack for this issue.

Enabling 'handling_missing_pv' is unfortunately not well handled, and the last thing we want here is to 'extend' this rule-breaking even more.
(In reply to Zdenek Kabelac from comment #7)
>
> Enabling 'handling_missing_pv' is unfortunately not well handled, and the
> last thing we want here is to 'extend' this rule-breaking even more.

Let's redesign and fix it then.

That doesn't change the point that a raid1 downconvert to linear is not a repair of the resilient RaidLV. I don't want a downconvert to be called repair.
(In reply to Heinz Mauelshagen from comment #8)
> (In reply to Zdenek Kabelac from comment #7)
> >
> > Enabling 'handling_missing_pv' is unfortunately not well handled, and the
> > last thing we want here is to 'extend' this rule-breaking even more.
>
> Let's redesign and fix it then.

Well, the current design is in fact reasonably good on most levels - I don't think it needs redesigning in terms of lvm2. If you set new constraint rules it would likely have to be some other volume manager - just not lvm2 - there is no point in breaking lvm2.

>
> That doesn't change the point that a raid1 downconvert to linear is not a
> repair of the resilient RaidLV. I don't want a downconvert to be called
> repair.

I still think you are misinterpreting something here:

The lvm2 user DEFINES the policy - it is not a 'standalone' decision.

When the policy says the raid1 can lose resilience in case the 'resilient' device is lost, it's a perfectly valid repair of such a raid1 to linear. And this works well for mirror.

It also reflects the real world. There is no raid1 when there are no 2 devices - you can't call this a raid1 setup - it has already been degraded by the raid target with a Dead device. There is simply no resilience if the user does not have a new disk. So pretending the user has resilience while in fact he does not is a problem. It really works as a linear device; there is just a 'raid1' target sitting in front of the real 'linear' device with the data. Really, don't try to look here for any magical wording.

Of course we can discuss numerous policies - e.g. we can have leg devices going online/offline - for such a device it would NOT be best to throw it away when it is invisible for a moment - so for this device the user will not set REMOVAL. But all this comes as the 'next' step.

The main thing here is: lvm2 has to be able to fix on demand (with a given policy or option), with the single command 'lvconvert --repair', e.g. raid1 to linear, without any service interruption - and do any further thinking about which way the downconversion needs to happen.
We cannot use other calls, as this needs to be fully automated for dmeventd usage.

The reason is very simple: if you have a 'stacked' raid device being a thin-pool _tdata device - we need to have this device running and OK. It's better (given we provide a REMOVE policy) to lose resilience, if the user does not have new space, and have a fully operational and RESIZABLE thin-pool - than having things stuck.

The only other option, when no automated fix exists in lvm2 here, is to disallow usage of RAID in a stack - no thin-pool, no caching, no snapshot - and revert back to the old mirror for stability reasons.

If the user has the luxury of having a SPARE device - he will not have a problem - new space can be allocated - though we also have to ensure the old space is NOT reused in case of a 2nd corruption.

There is also the option/possibility to 're-implement' raid device tracking - so an unlimited number of tracked images would need to be supported - together with complete support of any raid operation with tracked images. This way you could 'keep' the virtual 'raid' target type and operate with the device just as if the tracked images were missing. This workflow can work - but it's an order of magnitude more work...
(In reply to Zdenek Kabelac from comment #9)
<SNIP>
> I still think you are misinterpreting something here:
>
> The lvm2 user DEFINES the policy - it is not a 'standalone' decision.
>
> When the policy says the raid1 can lose resilience in case the 'resilient'
> device is lost, it's a perfectly valid repair of such a raid1 to linear.
<SNIP>

We seem to be at cross-purposes.

I'm defining this from the use-case point of view, not the policies or internals which you are. The latter have to be derived from the requested use cases.

The 2 use cases to distinguish are:
- repair a degraded raid1 to regain resilience
- downgrade for any reason

The downgrade use case should not be disguised as a repair.

BTW:
a degraded raid1 set with > 2 legs may still be resilient, hence the user may request to just drop the dead legs and keep > 1 intact legs.
(In reply to Heinz Mauelshagen from comment #10)
> (In reply to Zdenek Kabelac from comment #9)
> <SNIP>
> >
> > I still think you are misinterpreting something here:
> >
> > The lvm2 user DEFINES the policy - it is not a 'standalone' decision.
> >
> > When the policy says the raid1 can lose resilience in case the 'resilient'
> > device is lost, it's a perfectly valid repair of such a raid1 to linear.
> <SNIP>
>
> We seem to be at cross-purposes.
>
> I'm defining this from the use-case point of view, not the policies or
> internals which you are. The latter have to be derived from the requested
> use cases.

I'm still unsure why you want to force users to analyze their device stacks and figure out how to 'fix' every individually different state of the dm table. It's the job of lvm2 to fix the device state in the way the user has configured it (set a policy for it). lvm2 is a high-level tool abstraction solving problems here. Those who can live with 'dmsetup' can surely figure things out on their own.

There are many different cases - I hope you do not want me to open a BZ for every case like:

2-leg raid1 - lost PV1, PV2 works
2-leg raid1 - PV1 works, PV2 lost
2-leg raid1 - error sector on PV1, PV2 works
2-leg raid1 - PV1 works, error sector on PV2
2-leg raid1 - error sector on PV1 and PV1 is used by other LVs, PV2 works
2-leg raid1 - PV1 works, error sector on PV2 which is used by other LVs
2-leg raid1 - lost PV1, PV2 works, used in a stack as _tdata
3-leg raid1 - lost PV1, PV2 works, PV3 works...

I could easily list hundreds of scenarios. We are addressing a generic solution here, not one particular sub-case, where you prioritize a 'virtually named state' in lvm2 metadata over the 'real-world' state of the device.

> The 2 use cases to distinguish are:
> - repair a degraded raid1 to regain resilience
> - downgrade for any reason

lvconvert --repair is the tool used by dmeventd to fix a found problem, or by the user to resolve the ugly-looking error state reported by 'lvs'.
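As a side note, that error state surfaces in the 'lvs' reporting fields; a minimal sketch of spotting affected LVs from such output (the two-column sample below is fabricated for illustration - on a live system you would pipe `lvs --noheadings -o lv_name,lv_health_status` instead):

```shell
# two-column sample: LV name, health status (empty when healthy)
sample='lv_home  degraded
lv_root
lv_data  partial'
# keep only LVs whose health field flags trouble
degraded=$(printf '%s\n' "$sample" \
  | awk 'NF > 1 && ($2 == "degraded" || $2 == "partial") { print $1 }')
printf '%s\n' "$degraded"
```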
After 'lvconvert --repair', operations like 'lvchange --refresh' shall pass! This tool fixes the device - otherwise we leave the user with a device unusable for other operations, like 'lvresize' when it is used as a thin-pool.

POLICY is how the user configures the way a problem is solved in today's lvm2. If the user doesn't want to automatically remove a dead leg - and wants to retry e.g. 10 times to 'reload' the target, in case it starts working again, before it is removed - that's simply another policy.

I'm trying to explain here that we cannot 'extend' EVERY lvm2 command with the power of repairing - i.e. you want to resize a thin-pool and you would need to fix the raid first, and so on. There is one defined command, 'lvconvert --repair', that FIXES the dm table & lvm2 metadata to present a consistent state.

> The downgrade use case should not be disguised as a repair.

A downgrade IS a valid repair - it's an option the user can select, and it's the way lvm2 works in a thoughtful way - it's a core design issue how we 'transition' from one table state to another - please remember the discussion about PARTIAL_LV and its misuse...

The user can always opt out, leave the device invalid and deal with the related troubles - but there has to be a defined automatic way to use 'raid1' without ANY interruption - that's its primary purpose - uninterrupted usage of the device when some of the drives have problems.

Please note - I'm !NOT! talking about the fact that the 'raid1' kernel dm device continues to work with 'D' devices as long as there is an 'A' disk that is readable & writable - everyone knows that. What doesn't work is any further manipulation of such an LV (and of the VG, in case a PV was lost). lvconvert --repair fixes it and makes it usable again for other lvm2 commands.

If you still fail to see it, we simply need a conf call with more people here - since I'm running out of examples to describe how important and serious this issue is. And if it's not addressed, as a result any stacked usage of raid volumes will have to be disabled.
>
> BTW:
> a degraded raid1 set with > 2 legs may still be resilient,
> hence the user may request to just drop the dead legs and keep > 1
> intact legs.

Not sure how this relates to anything posted above. A 3-leg raid loses 1 leg and remains resilient, so there is no issue with losing any resilience when the lost leg is automatically dropped to fix the LV state.

So please list the reasons why 'raid1' is such a special target that we should drop the existing rules and policies in lvm2 and force every user to learn a completely new way of using it - so we can have them available for the conf call and weigh them against my reasoning.
(In reply to Zdenek Kabelac from comment #11)
> (In reply to Heinz Mauelshagen from comment #10)
> > (In reply to Zdenek Kabelac from comment #9)
> > <SNIP>
> > >
> > > I still think you are misinterpreting something here:
> > >
> > > The lvm2 user DEFINES the policy - it is not a 'standalone' decision.
> > >
> > > When the policy says the raid1 can lose resilience in case the
> > > 'resilient' device is lost, it's a perfectly valid repair of such a
> > > raid1 to linear.
> > <SNIP>
> >
> > We seem to be at cross-purposes.
> >
> > I'm defining this from the use-case point of view, not the policies or
> > internals which you are. The latter have to be derived from the requested
> > use cases.
>
> I'm still unsure why you want to force users to analyze their device
> stacks and figure out how to 'fix' every individually different state of
> the dm table.
<SNIP>

You're still confusing use cases with internals, the latter being derivatives of the former.

I'm arguing that raid1 repair must not include a downconvert, because a user request to repair a resilient raid1 has to end in a resilient raid1. If that is impossible, for any number of reasons, it shall not be hidden from the user by means of an automated repair. The user may still be forced to address this appropriately, which _may_ include downgrading the LV.

That's not an option for your thin example though, where he requested resilience on hidden LVs and has to make a conscious decision prioritizing keeping resilience. Thus, downconverting the Raid1LV is hardly an automated scenario.

BTW: variations of N-legged raid1 configurations don't add anything to this principle: a(n) (automated) repair shall result in a resilient Raid1LV, and losing resilience has to be a conscious decision.
List of open questions we need to answer, to be completed (my take):

Automated use cases (dmeventd):
1. if Raid1LV leg(s) is/are dead, automatically mark the failing PV(s) causing this as bad persistently (a new PV state) which can be retrieved on reactivation (yes)
2. if any Raid1LV legs fail, automatically try replacing them to regain the full configured resilience (yes)
3. if automatic repair of the Raid1LV is impossible but at least 2 operational legs remain, remove the ones which can't be automatically replaced; this can be a mixture of replacing some legs and removing others (yes)
4. if automatic repair of a Raid1LV with one operational leg is impossible, downconvert (no)

Manual use cases:
1+2: same as automated
3+4: manually possible to remove any failed or operational legs; enforce removing failed legs first (yes)
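The policy knob these questions revolve around lives in lvm.conf; a sketch of the relevant fragment ("warn" and "allocate" are the existing values mentioned in this report; "remove" is the proposed policy under discussion, shown here as an assumption):

```
activation {
    # "warn"     - only log the failure, leave repair to the admin
    # "allocate" - try to allocate a replacement leg on a spare PV
    # "remove"   - (proposed) drop dead legs, e.g. 2-leg raid1 -> linear,
    #              when no replacement space exists
    raid_fault_policy = "remove"
}
```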
While not resolving any of the questions from comment 13:

The typical use-case scenario for 'raid' usage is to protect a service against a given number of device failures (i.e. a 2-leg raid1 protects a service against 1 failing device).

IMHO no user will be surprised by the fact that when one of the 2 devices in a raid1 is lost, resilience is lost - this is given by the definition of raid - so I'm not sure why this is presented as some 'major problem' for a user. Is there any pointer where a user would see this as a problem?

However, as long as the number of failed devices is below the 'resilience quorum', the user expects everything to work in an unrestricted way (i.e. not just read/write to the device, but also ALL operations on the LV) - with the assumption he is fully informed about the problem, i.e. that resilience is LOST.

What I still can't get here is: when the user configured 'resilience' and resilience was lost - we refuse to do any more work. This IMHO degrades the usage of current raids - since the user will be stuck with problems in the very same way as if he hadn't had any resilience at all (possibly with an even bigger set of problems).

---

As mentioned in comment 9, the solution to keep 'raidX' even if there is no real raidX is to heavily extend rimage tracking - but as also said in the mentioned comment - this is a huge amount of work - and ATM I'm not convinced there is a significant number of users ever caring about this, to justify spending resources on the development of this complex extension.

We have 2 categories: enterprise users typically do have a number of 'spare' devices to be used for replacement.

The 2nd category uses a low number of disks and just wants to prevent a single-disk disaster - those users expect that losing 1 disk will NOT stop lvm2 from continued usage of the raid devices.
However, lvm2 with raids is now worse in this area than with mirrors, where this workflow worked in a simpler way - devices were fixed and fully functional - unlike with raids, where the user is 'stuck' with inconsistent dm tables.

----

So what needs to be solved here is:

The user must always have a !simple! and defined way to solve the problem.

Nowhere during the regular workflow does any command use --force (loss/error/unavailability of a 'device' in a raid array, which loses resilience, is the 'standard/expected/assumed' case solved by raid usage) - when the user uses --force he might lose data! Using --force is acceptable only when 'data' are supposed to be lost.

If raid is supposed to be used in a stack (_tdata...), dmeventd needs to know how to resolve expected problems automatically (otherwise stacked usage of the existing raid solution needs to be disabled as unsupportable without spare devices).

---

More technical limits on the lvm2 side:

1. Only a limited number of commands have the privilege to solve complicated errors - i.e. lvresize is not a tool to resolve and repair missing devices by the current design policy of lvm2 (i.e. lvresize cannot influence other LVs whose only relation is sharing a PV).

2. lvm2 metadata reflects reality - tables are activated from the state of the written metadata (PARTIAL_LV got misused over time and will need to be put back into its old bounds, where it should be used purely for explicit activation, e.g. for data recovery).

3. lvm2 table manipulation knows 2 states (committed + pre-committed) on 'suspend' but only 1 state on 'resume' (committed) - this constraint is important for correctness, but also makes it challenging to create correct sequences of table updates.

4. lvm2 does not outsource recovery steps to 3rd-party tools - lvm2 is not a low-level simplistic tool (aka dmsetup) providing low-level bricks for basic recovery - i.e. we do not want tools like 'Cockpit' writing their own 'recovery' sequences...
Also, as the last part - how are other tools solving this?
(In reply to Zdenek Kabelac from comment #14)
> While not resolving any of the questions from comment 13:
>
> The typical use-case scenario for 'raid' usage is to protect a service
> against a given number of device failures (i.e. a 2-leg raid1 protects a
> service against 1 failing device). IMHO no user will be surprised by the
> fact that when one of the 2 devices in a raid1 is lost, resilience is lost -
> this is given by the definition of raid

Nope, he won't; he'll want to repair, regaining resilience ASAP.

> - so I'm not sure why this is presented as some 'major problem' for a
> user. Is there any pointer where a user would see this as a problem?

Not following that kind of question.
Again: raid1 is configured to have resilient storage. This is what it is about. If it turns degraded for some reason, regaining resilience ASAP is required.

> However, as long as the number of failed devices is below the 'resilience
> quorum', the user expects everything to work in an unrestricted way (i.e.
> not just read/write to the device, but also ALL operations on the LV) -
> with the assumption he is fully informed about the problem, i.e. that
> resilience is LOST.
>
> What I still can't get here is: when the user configured 'resilience' and
> resilience was lost - we refuse to do any more work. This IMHO degrades the
> usage of current raids - since the user will be stuck with problems in the
> very same way as if he hadn't had any resilience at all (possibly with an
> even bigger set of problems).

I already addressed in comment 13 that a degraded Raid1LV which lost all resilience needs analysis, to come to terms with whether it's suffering from transient issues which can be solved - hence using MD's capabilities to recover from them and regain full resilience - or whether it's permanent and HW needs to be replaced to regain full resilience.

This is admin analysis, figuring out which HW components have failed transiently or permanently.
In no case is an automatic downgrade an option, because if the former, transient error case results from the analysis, the recoverable leg(s) would be gone.

>
> ---
>
> As mentioned in comment 9, the solution to keep 'raidX' even if there is
> no real raidX is to heavily extend rimage tracking - but as also said in
> the mentioned comment - this is a huge amount of work - and ATM I'm not
> convinced there is a significant number of users ever caring about this,
> to justify spending resources on the development of this complex
> extension.

There's a reason why MD has a write-intent log to cope with transient failures or interrupted updates on legs. No need to reinvent anything for it on the lvm2 side.

>
> We have 2 categories: enterprise users typically do have a number of
> 'spare' devices to be used for replacement.
>
> The 2nd category uses a low number of disks and just wants to prevent a
> single-disk disaster - those users expect that losing 1 disk will NOT stop
> lvm2 from continued usage of the raid devices.

It won't, unless a raid type loses quorum (raid1 with no intact legs, raid0 with any number of legs lost, or the rest of the raid types with more than the number of parity devices lost). I.e. a Raid1LV will work with a minimum of one accessible leg, and any kind of user will still want to check for transient, repairable errors.

> However, lvm2 with raids is now worse in this area than with mirrors, where
> this workflow worked in a simpler way - devices were fixed and fully
> functional - unlike with raids, where the user is 'stuck' with inconsistent
> dm tables.
>
> ----
>
> So what needs to be solved here is:
>
> The user must always have a !simple! and defined way to solve the problem.
>
> Nowhere during the regular workflow does any command use --force
> (loss/error/unavailability of a 'device' in a raid array, which loses
> resilience, is the 'standard/expected/assumed' case solved by raid usage) -
> when the user uses --force he might lose data!
> Using --force is acceptable only when 'data' are supposed to be lost.
>
> If raid is supposed to be used in a stack (_tdata...), dmeventd needs to
> know how to resolve expected problems automatically (otherwise stacked
> usage of the existing raid solution needs to be disabled as unsupportable
> without spare devices).

It's a valid request to allow dmeventd to automate as many sensible transitions as possible, but you did not address the handling of transient failures. dmeventd can't do that, because it can't retrieve information about transient failure reasons; only the admin can analyse those.

Thanks for the following overview. No one requested lvresize to solve errors, so we should limit this to lvconvert where it makes sense. Just leave downconversion alone; it's not a repair operation to automate, because it would prevent any recovery of transient errors which caused a raid leg to be set faulty and which can be recovered from using MD's resynchronization capabilities.

>
> ---
>
> More technical limits on the lvm2 side:
>
> 1. Only a limited number of commands have the privilege to solve
> complicated errors - i.e. lvresize is not a tool to resolve and repair
> missing devices by the current design policy of lvm2 (i.e. lvresize cannot
> influence other LVs whose only relation is sharing a PV).
>
> 2. lvm2 metadata reflects reality - tables are activated from the state of
> the written metadata (PARTIAL_LV got misused over time and will need to be
> put back into its old bounds, where it should be used purely for explicit
> activation, e.g. for data recovery).
>
> 3. lvm2 table manipulation knows 2 states (committed + pre-committed) on
> 'suspend' but only 1 state on 'resume' (committed) - this constraint is
> important for correctness, but also makes it challenging to create correct
> sequences of table updates.
>
> 4. lvm2 does not outsource recovery steps to 3rd-party tools - lvm2 is not
> a low-level simplistic tool (aka dmsetup) providing low-level bricks for
> basic recovery - i.e.
> we do not want tools like 'Cockpit' writing their own 'recovery'
> sequences...
>
> Also, as the last part - how are other tools solving this?
(In reply to Heinz Mauelshagen from comment #15)
> (In reply to Zdenek Kabelac from comment #14)
> > While not resolving any of the questions from comment 13:
> >
> > The typical use-case scenario for 'raid' usage is to protect a service
> > against a given number of device failures (i.e. a 2-leg raid1 protects a
> > service against 1 failing device). IMHO no user will be surprised by the
> > fact that when one of the 2 devices in a raid1 is lost, resilience is
> > lost - this is given by the definition of raid
>
> Nope, he won't; he'll want to repair, regaining resilience ASAP.

It is all about the choice made by the USER. The user defines the reaction upon error:

1. allocate
2. warn
3. remove
4. retry_X_times_before_going_for_1_or_2_or_3
5. ...

The user/admin has made a thoughtful decision - and he may have selected to lose resilience when a drive reports an error - there is no further analysis needed - the drive should be dropped even if you find it a 'bad choice' (but you could bring in some convincing better policies, with some stats proving they are better - once they are implemented and pass testing).

There are a lot of users who never suffer from transient failures, which are mostly the domain of some specific types of storage hw, i.e. network-attached storage.

Obviously, supporting a 'REMOVE' policy is not meant to solve the problem for the 'WARN' case you are focused on, which needs some further care - but by adding 'REMOVE' we quickly resolve a basic usability problem for a lot of users. By directly solving only 'WARN', however, we add a big delay for those users who do not need this complexity (or simply need a usable stack ASAP as a high priority).

The main reason I'm advocating here for a quick & clear 'REMOVE' policy is its simplicity of implementation - while 'WARN' will easily take many months (as we need to resolve incorrect tree loading as well), 'REMOVE' is weeks of work.
> > - so I'm not sure why this is presented as some 'major problem' for a
> > user. Is there any pointer where a user would see this as a problem?
>
> Not following that kind of question.
> Again: raid1 is configured to have resilient storage. This is what it is
> about. If it turns degraded for some reason, regaining resilience ASAP is
> required.

The users I'm talking with do expect the 'D'ead leg to be dropped, and they will 'possibly' restore resilience once they buy a new drive - which might be delayed by weeks...

If users had the resources to maintain a spare drive - they would already be using the more comfortable 'allocate' policy.

For unexperienced users (a large portion of the distro user base), any complexity associated with even understanding all the different sorts of failures is beyond the level of involvement they want - simple hdd replacement driven by proper warning messages is likely the complexity to target here in future.

> > However, as long as the number of failed devices is below the 'resilience
> > quorum', the user expects everything to work in an unrestricted way (i.e.
> > not just read/write to the device, but also ALL operations on the LV) -
> > with the assumption he is fully informed about the problem, i.e. that
> > resilience is LOST.
> >
> > What I still can't get here is: when the user configured 'resilience' and
> > resilience was lost - we refuse to do any more work. This IMHO degrades
> > the usage of current raids - since the user will be stuck with problems
> > in the very same way as if he hadn't had any resilience at all (possibly
> > with an even bigger set of problems).
>
> I already addressed in comment 13 that a degraded Raid1LV which lost all
> resilience needs analysis, to come to terms with whether it's suffering
> from transient

Who will do this analysis? Do you expect every lvm2 user to be an hdd expert?

> issues which can be solved - hence using MD's capabilities to recover from
> them and regain full resilience - or whether it's permanent and HW needs
> to be replaced to regain full resilience.
> This is admin analysis, figuring out which HW components have failed
> transiently or permanently. In no case is an automatic downgrade an option,
> because if the former, transient error case results from the analysis, the
> recoverable leg(s) would be gone.

IMHO you are focusing here too much on expert-only 'transient' failures - which are rather a very specific type of failure, usually seen with network-attached storage, where devices appear & disappear.

However, with SATA-attached storage, where the HDD has already exhausted its 'repair' sectors, any write error is usually 'final' and renders the drive unreliable - so using the "REMOVE" policy in such a setup is usually a good choice.

And I'm not enforcing this "REMOVE" policy here as the only choice (not even as the default) - it's JUST an option for the user. The 'REMOVE' proposal is in absolutely no way taking any recovery option away from the expert user.

> > As mentioned in comment 9, the solution to keep 'raidX' even if there is
> > no real raidX is to heavily extend rimage tracking - but as also said in
> > the mentioned comment - this is a huge amount of work - and ATM I'm not
> > convinced there is a significant number of users ever caring about this,
> > to justify spending resources on the development of this complex
> > extension.
>
> There's a reason why MD has a write-intent log to cope with transient
> failures or interrupted updates on legs. No need to reinvent anything for
> it on the lvm2 side.

It's not related to any MD thingy (aka write-intent bitmap) here at all (and I'm repeating this in mostly every comment in this BZ).

This whole BZ is about the LVM2 design rules all supported targets need to be compliant with. The core design issue is: the activation code shall NOT ask lvm2 to activate a device with a missing PV - this will even get prohibited by more hardening features, and will be allowed only for 'pure' activation (possibly read-only) and deactivation (i.e.
when the user explicitly passes 'lvchange --partial').

So all we are solving here is: how the metadata for raid will look when you have to store 'metadata' that creates valid raid devices - so that in every committed metadata the usage of targets is explicitly defined for every segment (aka no 'spurious' -missing segments).

So what I'm expecting here as an outcome of this BZ is a defined set of metadata sequences. Each committed lvm2 metadata presents a valid and self-consistent LV, which can be activated and will present a consistent transition state. Nowhere in this process are steps like 'vgreduce --removemissing --force' and similar invalid proposals.

1. dmeventd notices the error (leg 'D')
2. it runs 'lvconvert --repair --usepolicies' - this command makes the minimum of suspend/resume steps, after which such a raid LV continues to be usable by any other LV command (as long as the raid LV contains the full data set)
3. the user continues to use such an LV as if it were 'normal' - so commands like 'lvchange --refresh' or 'lvresize' work without any further complexity

As long as these steps work, such a raid segtype can be used in stacking. If we cannot deliver a workable solution with current raids, we will need to disable the existing 'raid' segtype and provide a new segment type able to deliver this functionality for stacked devices with a better design. I'm not proposing any exact type of solution here - I'm expecting a new design for review.

The "REMOVE" policy should be seen as a very simplified 1st step which gains a lot with minimal effort - it fixes the otherwise unfixable state of the LV when there is no free space. Knowing the details of the internal complexity of activating a whole device stack, the solution for the 'WARN' workflow is much harder.
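The three-step flow above, sketched at the command level (VG/LV names are placeholders, root and lvm2 are assumed, and the "remove" fault policy is the proposal under discussion, not current behavior):

```shell
# 1. dmeventd notices the failed leg ('D' in the raid attrs) and runs:
lvconvert --repair --usepolicies vg/raid_lv
# 2. with a hypothetical raid_fault_policy = "remove" and no spare PV,
#    this single step would drop the dead leg (raid1 -> linear)
# 3. the LV is again usable by ordinary commands, e.g.:
lvchange --refresh vg/raid_lv
lvresize -L+1G vg/raid_lv
```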
If you are targeting a different use-case scenario - where devices are stacked as stand-alone, separately maintained devices (I'd call it 'lvsetup', as a next step after dmsetup) - there you can put up different constraints - but then we are not talking about lvm2 anymore...