Bug 1465571 - raid lvconvert --repair is not handling case without extra space
Status: NEW
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.172
Assigned To: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
Reported: 2017-06-27 12:19 EDT by Zdenek Kabelac
Modified: 2017-07-04 15:51 EDT (History)
6 users

Type: Bug
rule-engine: lvm-technical-solution?
rule-engine: lvm-test-coverage?


Attachments: None
Description Zdenek Kabelac 2017-06-27 12:19:28 EDT
Description of problem:

When raid repair cannot allocate a new 'leg', it should still (probably in a switched/better order) first drop the 'D'ead leg, so an LV with a 'D' leg is repaired to e.g. linear.


This works for the old mirror: a 'mirror' can be repaired to 'linear' when there are no resources for a new replacement leg.

With raid, however, lvm2 stops on the allocation error before trying to fix anything in the dm table, leaving the 'raid' LV unrepaired ('D' in status).


Version-Release number of selected component (if applicable):
2.02.172

How reproducible:


Steps to Reproduce:
1. create a 2-leg 'raid1' on 2 PVs
2. kill 1 PV
3. lvconvert --repair

Actual results:


Expected results:


Additional info:
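The steps above can be sketched as a shell session on loopback devices (a reproduction sketch only; device paths, VG/LV names and sizes are illustrative, and it needs root):

```shell
# Two small loop-device PVs (illustrative paths/sizes).
truncate -s 128M /tmp/pv0.img /tmp/pv1.img
PV0=$(losetup -f --show /tmp/pv0.img)
PV1=$(losetup -f --show /tmp/pv1.img)
pvcreate "$PV0" "$PV1"
vgcreate vg "$PV0" "$PV1"

# 1. create a 2-leg 'raid1' on the 2 PVs
lvcreate --type raid1 -m 1 -L 64M -n lv vg

# 2. kill 1 PV
losetup -d "$PV1"

# 3. attempt the repair - with no third PV there is no space for a
#    replacement leg, and the LV is left unrepaired ('D' in the status)
lvconvert --repair vg/lv
lvs -a -o name,attr vg
```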
Comment 1 Zdenek Kabelac 2017-06-27 12:31:35 EDT
It actually means the user cannot recover if he has no 'extra' disk, as we 'reject' operations like 'lvconvert -m0' when some PV is missing - only --repair has the 'privilege' to work with an incomplete VG.
Comment 2 Zdenek Kabelac 2017-06-27 12:41:18 EDT
Also, 'activation/raid_fault_policy = "remove"' doesn't seem to work ATM, as some allocation is tried first.
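For reference, this is the lvm.conf setting being discussed; a fragment selecting that behaviour would look like the following (note: "remove" is the behaviour under discussion in this comment - the usual documented values are "warn" and "allocate"):

```
# lvm.conf fragment - how dmeventd reacts to a failed raid leg
activation {
    raid_fault_policy = "remove"   # drop the dead leg instead of allocating a new one
}
```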
Comment 3 Heinz Mauelshagen 2017-06-28 11:45:52 EDT
This 2-PV issue can be tackled by
"vgreduce --removemissing --force $VG ; lvconvert -m0 $LV".

User can add a new, second PV later and upconvert again.
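Sketched as commands (hypothetical VG/LV/device names; as the following comments note, '--removemissing --force' can affect other LVs in the VG):

```shell
# Drop the missing PV from the VG (may damage other LVs using that PV!)
vgreduce --removemissing --force vg

# Downconvert the degraded 2-leg raid1 to linear
lvconvert -m0 vg/lv

# Later: add a replacement PV and upconvert again
vgextend vg /dev/sdX          # /dev/sdX = the new disk
lvconvert -m1 vg/lv
```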
Comment 4 Zdenek Kabelac 2017-06-28 12:16:15 EDT
We cannot advise users to use 'vgreduce --force' to repair what is rather a standard scenario, where a 2-legged raid1 lost one device.

The --force may seriously harm other LVs unrelated to the raid LV we want to repair.

Also it leaves the 2-legged raid1 LV in metadata with 'error' segments, so the invalid raid device and tables are not fixed either.

A following 'lvconvert -m0' in this particular case does clean the table in some way. But this workaround is limited to a 2-legged raid1 LV and doesn't apply when e.g. a raid1 has more than 1 lost leg.

'lvconvert --repair' is the thing to focus on here.
Comment 5 Heinz Mauelshagen 2017-06-29 07:23:59 EDT
(In reply to Zdenek Kabelac from comment #4)
> We cannot users advice  to use 'vgreduce --force'  to repair rather a
> standard scenario where 2 legged raid1 lost one device.
> 
> The --force may seriously harm other unrelated LVs to the raid LV wanted for
> repair. 
> 

What important use cases are you talking about?
For instance, any linear LV with part of its segments on the missing PV has to be rescued, if possible, before the removal.

> Also it leaves 2-legged raid1 LV in metadata with 'error' segments so
> invalid raid device and tables are not fixed as well.

Deactivation/activation works fine, no?
Repair as is, presuming enough PVs are available, works, no?

What's the problem with the error mappings to replace missing
PVs here when they are changed by repair?

> 
> Following 'lvonvert -m0' in this particular case however cleans the table in
> some way. But this workaround solution is purely limitted to 2legged raid1
> LV and doesn't apply when i.e. raid1 has more then 1 lost leg.

Yes, a direct "lvconvert -m0 Raid1LV" does not work.
Reducing legs one by one does work as a workaround,
hence a fix in this regard is necessary.
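That leg-by-leg workaround might look like this (a sketch with hypothetical names, starting e.g. from a 4-leg raid1 with failed legs):

```shell
lvconvert -y -m 2 vg/lv    # 4 legs -> 3, dropping a failed image first
lvconvert -y -m 1 vg/lv    # 3 legs -> 2
lvconvert -y -m 0 vg/lv    # 2 legs -> 1 (linear)
```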

> 
> 'lvconvert --repair' is the way to focus one here.

The user set up a raid1 LV for the very purpose of gaining resilience.
Any repair of a degraded RaidLV has to be about ensuring that resilience again.

We should not call a downconvert of a degraded RaidLV a repair,
because it does not aim at regaining the resilience the user requested!
Comment 6 Heinz Mauelshagen 2017-06-30 09:04:55 EDT
Allowing 'lvconvert -m N Raid1LV' to remove raid1 images in degraded VGs is a 2-line code change. But this alone would cause data loss on e.g. DDAA with
'lvconvert -m1 Raid1LV', leaving the 2 failed legs behind -> complete data loss.

To let the removal cope with variations of failed legs, properly selecting intact ones (e.g. in a 4-leg example: DDAA, DADA or DAAD), the failed legs have to be removed, shifting intact ones, before 'lvconvert -mN $Raid1LV' (N > 0); this is an additional 4-line patch.

Comments on each add a few lines.
Comment 7 Zdenek Kabelac 2017-06-30 09:21:33 EDT
The trouble is that the only command eligible to work with a VG with a missing PV is 'lvconvert --repair' - that's the rule of lvm2.

There are a couple of buggy pieces of code breaking this rule (e.g. 'lvconvert --uncache' is one such example), but all of that needs to be fixed.

So please do not commit any more hacks before analyzing the issue in depth.

I'm very sure there is no 2- or 3-line hack for this issue.

Enabling 'handling_missing_pv' is unfortunately not well handled, and the last thing we want here is to 'extend' this rule-breaking even more.
Comment 8 Heinz Mauelshagen 2017-06-30 09:40:46 EDT
(In reply to Zdenek Kabelac from comment #7)
> 
> Enabling  'handling_missing_pv'  is unfortunately  not well handled and the
> last thing we want here is to 'extend' this rule breaking even more.

Let's redesign and fix it then.

That doesn't change the point that a raid1 downconvert to linear is not a repair of the resilient RaidLV. I don't want a downconvert to be called a repair.
Comment 9 Zdenek Kabelac 2017-06-30 10:44:44 EDT
(In reply to Heinz Mauelshagen from comment #8)
> (In reply to Zdenek Kabelac from comment #7)
> > 
> > Enabling  'handling_missing_pv'  is unfortunately  not well handled and the
> > last thing we want here is to 'extend' this rule breaking even more.
> 
> Let's redesign and fix it then.

Well, the current design is in fact reasonably good on most levels - I don't think it needs redesigning in terms of lvm2.

If you set new constraint rules, it would likely have to be some other volume manager - just not lvm2 - there is no point in breaking lvm2.

> 
> Doesn't better the point that a raid1 downconvert to linear is not a repair
> of the resilient RaidLV. I don't want a downconvert to be called repair.

I still think you are misinterpreting something here:

The lvm2 user DEFINES the policy - it will not be a 'standalone' decision.

When the policy says the raid1 can lose resilience when the 'resilient' device is lost, it's a perfectly valid repair to turn such a raid1 into linear.
And this works well for mirror.

It also reflects the real world.

There is no raid1 when there are not 2 devices - you can't call this setup raid1 - it has already been degraded by the raid target with a Dead device.

There is simply no resilience if the user does not have a new disk.
So pretending the user has resilience, while in fact he does not, is the problem.
It really works as a linear device; there is just a 'raid1' target in front of the real 'linear' device with the data.

Really, don't try to look for any magical wording here.

Of course we can discuss numerous policies - e.g. we can have leg devices going online/offline - for such a device it would NOT be best to throw it away when it is invisible for a moment - so for this device the user will not set REMOVAL.

But all this comes as the 'next' step.

The main thing here is: lvm2 has to be able to fix, on demand (with a given Policy or Option) and with the single command 'lvconvert --repair', e.g. raid1 to linear, without any service interruption - including any further thinking about which way the downconversion needs to happen.

We cannot use other calls, as this needs to be fully automated for dmeventd usage.

The reason is very simple - if you have a 'stacked' raid device being a thin-pool _tdata device, we need to have this device running and OK.

It's better (given we provide the REMOVE policy) to lose resilience when the user does not have new space, and have a fully operational and RESIZABLE thin-pool, than to have things stuck.

The only other option, when there is no automated fix in lvm2, is to disallow usage of RAID in a stack - no thin-pool, no caching, no snapshot - and revert them back to the old mirror for stability reasons.

If the user has the luxury of having a SPARE device, he will not have a problem - new space can be allocated - but we also have to ensure the old space is NOT reused in case of a 2nd corruption.


There is also the option/possibility of 're-implementing' raid device tracking - so an unlimited number of tracked images would need to be supported, together with complete support of any raid operation with tracked images.

This way you can 'keep' the virtual 'raid' target type and operate with the device just as if the tracked images were missing.

This workflow could work - but it's an order of magnitude more work...
Comment 10 Heinz Mauelshagen 2017-06-30 17:09:14 EDT
(In reply to Zdenek Kabelac from comment #9)
<SNIP>
> 
> I still think you are misinterpreting something here:
> 
> lvm2 user DEFINES policy - it will not be 'standalone' decision.
> 
> When policy tells the  raid1 can lose resilience in case the 'resilient'
> device is lost -  it's perfectly valid  repair  to such raid1 to  linear.
<SNIP>

We seem to be at cross-purposes.

I'm defining this from the use-case point of view, not from the
policies or internals, which you are. The latter have to
be derived from the requested use cases.

The 2 use cases to distinguish are:
- repair degraded raid1 to regain resilience
- downgrade for any reason

The downgrade use case should not be disguised as a repair.

BTW:
a degraded raid1 set with > 2 legs may still be resilient, 
hence the user may request to just drop the dead legs and keep > 1
intact legs.
Comment 11 Zdenek Kabelac 2017-06-30 18:19:41 EDT
(In reply to Heinz Mauelshagen from comment #10)
> (In reply to Zdenek Kabelac from comment #9)
> <SNIP>
> > 
> > I still think you are misinterpreting something here:
> > 
> > lvm2 user DEFINES policy - it will not be 'standalone' decision.
> > 
> > When policy tells the  raid1 can lose resilience in case the 'resilient'
> > device is lost -  it's perfectly valid  repair  to such raid1 to  linear.
> <SNIP>
> 
> We seem to be at cross-purposes.
> 
> I'm defining this from the use case point of view not
> policies or internals which you are. The latter have to
> be derived from requested use cases.
> 

I'm still unsure why you want to force users to analyze their device stacks and figure out how to 'fix' every individually different state of the dm table.

It's the job of lvm2 to fix the device state in the way the user has configured it (set a policy for it).

lvm2 is a high-level tool abstraction solving problems here.

Those who can live with 'dmsetup' can surely figure things out on their own.


There are many different cases - I hope you do not want me to open a BZ for every case like:

2leg Raid1  - lost PV1,  PV2 works
2leg Raid1  - PV1 works,  PV2 lost
2leg Raid1  - error sector PV1,  PV2 works
2leg Raid1  - PV1 works,  PV2 error sector
2leg Raid1  - error sector PV1 and PV1 is used by other LVs,  PV2 works
2leg Raid1  - PV1 works,  PV2 error sector and is used by other LVs 
2leg Raid1  - lost PV1,  PV2 works  used in stack as _tdata
3leg Raid1  - lost PV1,  PV2 works, PV3 works....

I could easily list hundreds of scenarios.

We are addressing a generic solution, not one particular sub-case, where you prioritize a 'virtually named state' in the lvm2 metadata over the 'real-world' state of the device.

> The 2 use cases to distinguish are:
> - repair degraded raid1 to regain resilience
> - downgrade for any reason


lvconvert --repair is the tool used by dmeventd to fix a found problem, or by the user to resolve the ugly-looking error state reported by 'lvs'.

After 'lvconvert --repair', operations like 'lvchange --refresh' shall pass!

This tool fixes the device - otherwise we leave the user with a device unusable for other operations, like 'lvresize' when it is used as a thin-pool.

POLICY is how the user configures how to solve a problem in today's lvm2.

If the user doesn't want to automatically remove a dead leg, and instead wants to retry e.g. 10 times to 'reload' the target, in case it starts to work again before it would be removed - that's simply another policy.

I'm trying to explain here that we cannot 'extend' EVERY lvm2 command with the power of repairing - i.e. you want to resize a thin-pool and first you need to fix the raid, and so on.

There is one defined 'lvconvert --repair' that FIXES the dm table & lvm2 metadata to present a consistent state.


> The downgrade use case should not be disguised as a repair.

Downgrade IS a valid repair - it's an option the user can select, and it's the way lvm2 works, in a thoughtful way - it's a core design issue how we 'transition' from one table state to another - please remember the discussion about PARTIAL_LV and its misuse...

The user can always opt out, leave the device invalid, and deal with the related troubles - but there has to be a defined automatic way to use 'raid1' without ANY interruption - that's its primary purpose: uninterrupted usage of the device when one of the drives has problems.

Please note - I'm !NOT! talking about the fact that the 'raid1' kernel dm device continues to work with 'D' devices as long as there is an 'A' device that is readable & writable - everyone knows that. What doesn't work is any further manipulation with such an LV (and the VG, in case a PV was lost).

lvconvert --repair fixes it and makes it usable again for other lvm2 commands.

If you still fail to see this, we simply need a conf call with more people here - since I'm running out of examples to describe how important and serious this issue is. And if it's not addressed, as a result any stacked usage of raid volumes will have to be disabled.

> 
> BTW:
> a degraded raid1 set with > 2 legs may still be resilient, 
> hence the user may request to just drop the dead legs and keep > 1
> intact legs.

Not sure how that relates to anything posted above.
A 3-leg raid loses 1 leg and remains resilient, thus there is no issue with losing any resilience when the lost leg is automatically dropped to fix the LV state.

So please list the reasoning why 'raid1' is such a special target that we should drop the existing rules and policies in lvm2 and force every user to learn a completely new way of using it.

Then we can have those reasons available for the conf call and weigh them against my reasoning.
Comment 12 Heinz Mauelshagen 2017-07-03 11:27:54 EDT
(In reply to Zdenek Kabelac from comment #11)
> (In reply to Heinz Mauelshagen from comment #10)
> > (In reply to Zdenek Kabelac from comment #9)
> > <SNIP>
> > > 
> > > I still think you are misinterpreting something here:
> > > 
> > > lvm2 user DEFINES policy - it will not be 'standalone' decision.
> > > 
> > > When policy tells the  raid1 can lose resilience in case the 'resilient'
> > > device is lost -  it's perfectly valid  repair  to such raid1 to  linear.
> > <SNIP>
> > 
> > We seem to be at cross-purposes.
> > 
> > I'm defining this from the use case point of view not
> > policies or internals which you are. The latter have to
> > be derived from requested use cases.
> > 
> 
> I'm still unsure why do you want to force users  to analyze theirs device
> stacks and figure them out how to 'fix' given every individual different
> state of dm-table??
<SNIP>

You're still confusing use cases with internals, the latter being derivatives of the former.

I'm arguing that raid1 repair must not include a downconvert, because a user request to repair a resilient raid1 has to end in a resilient raid1.  If that is impossible for any number of reasons, it shall not be hidden from the user by means of an automated repair.  The user may still be forced to address this appropriately, which _may_ include downgrading the LV.  That's not an option for your thin example though, where he requested resilience on hidden LVs and has to make a conscious decision prioritizing keeping resilience.
Thus, downconverting a Raid1LV is hardly an automated scenario.

BTW: variations of N-legged raid1 configurations don't add anything to the principle that a(n) (automated) repair shall result in a resilient Raid1LV, and that losing resilience has to be a conscious decision.
Comment 13 Heinz Mauelshagen 2017-07-03 13:01:36 EDT
The list of open questions we need to answer (my take):

Automated use cases (dmeventd):

1 if Raid1LV leg(s) is/are dead, automatically set the 
  failing PV(s) causing this to be bad persistently (new PV state)
  which can be retrieved on reactivation (yes)

2 if any Raid1LV legs fail, automatically try replacing them
  to regain full configured resilience (yes)

3 if automatic repair of Raid1LV impossible but at least 2 operational legs
  remain, remove the ones which can't be automatically replaced;
  this can be a mixture of replacing some legs and removing others (yes)

4 if automatic repair of Raid1LV with one operational leg impossible,
  downconvert (no)


Manual use cases:

1+2: same as automatic

3+4: manually possible to remove any failed or operational legs;
     enforce removing failed legs first (yes)
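The automated cases above can be condensed into a small decision table; the following is only an illustrative sketch of that policy logic (not lvm2 code - the function name, arguments and the 'spare' flag are invented for the example):

```shell
# Decide what the automated (dmeventd) path should do, per the list above.
#   $1 = number of operational legs, $2 = number of failed legs,
#   $3 = 1 if space for replacement legs exists, else 0
repair_action() {
    if [ "$2" -eq 0 ]; then
        echo "nothing"          # no failed legs: healthy array
    elif [ "$3" -eq 1 ]; then
        echo "replace"          # case 2: regain full configured resilience
    elif [ "$1" -ge 2 ]; then
        echo "remove-failed"    # case 3: drop dead legs, stay resilient
    else
        echo "manual"           # case 4: downconvert only by conscious decision
    fi
}
```

For example, `repair_action 1 1 0` (one operational leg left, no spare space) yields "manual", matching the "(no)" answer for automated case 4.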
Comment 14 Zdenek Kabelac 2017-07-03 16:43:10 EDT
While not resolving any of the questions from comment 13 -

The typical use-case scenario for 'raid' usage is to protect a service against a given number of device failures (i.e. a 2-leg raid1 protects the service against 1 failing device). IMHO no user will be surprised by the fact that when one of the 2 devices in a raid1 is lost, resilience is lost - this is given by the definition of raid - so I'm not sure why this is presented as some 'major problem' for the user. Is there any pointer to a user seeing this as a problem?

However, as long as the number of failed devices is below the 'resilience quorum', the user expects everything to keep working in an unlimited way (i.e. not just read/write to the device, but ALL operations with the LV) - with the assumption that he is fully informed about the problem, i.e. that resilience is LOST.

What I still can't get here is: when the user configured 'resilience' and the resilience was lost, we refuse to do any more work - this IMHO degrades the usage of current raids, since the user will be stuck with problems in the very same way as if he hadn't had any resilience at all (possibly with an even bigger set of problems).

---

As mentioned in comment 9, the solution to keep 'raidX' even when there is no real raidX is to heavily extend rimage tracking - but, as also said in that comment, this is a huge amount of work - and ATM I'm not convinced there is a significant number of users who would ever care about this, enough to justify spending resources on developing such a complex extension.

We have 2 categories - enterprise users typically do have a number of 'spare' devices to be used for replacement.

The 2nd category uses a low number of disks and just wants to prevent a single-disk disaster - those users expect that losing 1 disk will NOT stop lvm2 from continued usage of the raid devices.

However, lvm2 with raids is now worse in this area than with mirrors, where this workflow worked in a simpler way - devices were fixed and fully functional - unlike with raids, where the user is 'stuck' with inconsistent dm tables.
 
----

So what needs to be solved here is -

The user must always have a !simple! and defined way to solve the problem.

Nowhere during the regular workflow shall any command use --force.
(Loss/error/unavailability of a 'device' in a raid array, losing resilience, is the 'standard/expected/assumed' case that raid usage solves. When the user uses --force he might lose data! Using --force is acceptable only when 'data' are supposed to be lost.)

If raid is supposed to be used in a stack (_tdata...), dmeventd needs to know how to resolve the expected problems automatically (otherwise stacked usage of the existing raid solution needs to be disabled as unsupportable (without spare devices)).

---

More technical limits on the lvm2 side:

1. Only a limited number of commands have the privilege to solve complicated errors - i.e. lvresize is not a tool to resolve and repair missing devices, by the current design policy of lvm2 (i.e. lvresize cannot influence other LVs whose only relation is sharing a PV).

2. lvm2 metadata reflect reality - tables are activated from the state of the written metadata (PARTIAL_LV got misused over time and will need to be put back into its old bounds, where it is used purely for explicit activation, i.e. for data recovery).

3. lvm2 table manipulation knows 2 states (committed + pre-committed) on 'suspend' but only 1 state (committed) on 'resume' - this constraint is important for correctness, but it also makes creating correct sequences of table updates challenging.

4. lvm2 does not outsource recovery steps to 3rd-party tools - lvm2 is not a low-level simplistic tool (aka dmsetup) providing low-level bricks for basic recovery - i.e. we do not want tools like 'Cockpit' writing their own 'recovery' sequences...


And as the last part - how do other tools solve this?
Comment 15 Heinz Mauelshagen 2017-07-04 08:30:07 EDT
(In reply to Zdenek Kabelac from comment #14)
> While not resolving any of the question from comment 13 - 
> 
> Typical user-case scenario for  'raid' usage is to protect service against
> given number of resilient devices failure. (i.e. 2leg raid1  protects
> service again 1 failing device) (IMHO no user will be surprised by the fact,
> that when one of 2 devices in raid1 is lost - resilience is lost - this is
> given by definition of raid

Nope, he won't; he'll want to repair it, regaining resilience ASAP.

> - so I'm not sure why this is presented is some
> 'major problem' for a user??? -  is there any pointer where user would be
> seeing this a problem?

I'm not following that kind of question.
Again: raid1 is configured to have resilient storage. That is what it is about. If it turns degraded for some reason, regaining resilience ASAP is required.

> 
> However as long as number of failed device are below 'resilience quorum'
> user expects  all things will works in unlimited way (i.e. not just
> read/write to device, but also ALL the operation with LV volume) - with
> assumption he is fully informed about problem i.e. resilience is LOST.
> 
> What I can still get here is - when user  configured 'resilience' and
> resilience was lost - we refuse to do any more work - this IMHO degrades
> usage of current  raids - since user will be stuck with problems in very
> same if he hadn't had any resilience at all (possibly even bigger set of
> problems).

I already addressed in comment 13 that a degraded Raid1LV which lost all resilience needs analysis, to determine whether it's suffering from transient issues - which can be solved using MD's capabilities to recover and regain full resilience - or whether the failure is permanent and HW needs to be replaced to regain full resilience.

This is admin analysis, figuring out which HW components have failed transiently or permanently.  In no case is an automatic downgrade an option, because if the analysis finds the former, transient error case, the recoverable leg(s) would already be gone.
 
> 
> ---
> 
> As mentioned in 'comment 9' the solution to keep 'raidX'  even if there is
> no real raidX is to heavily  extend  rimage tracking  - but as also said in
> mentioned comment - this is  huuuuuge amount of work - and ATM I'm not
> convinced there is significant number of users ever taking care of this to
> basically justificate spent resources on development of this complex
> extension.

There's a reason why MD has a write-intent log to cope with transient failures or interrupted updates on legs.  No need to reinvent anything for it on the lvm2 side.

> 
> We have 2 categories -  enterprise users typically do have number of 'spare'
> devices to be used for replacement.
> 
> 2nd. category uses low number of disks and just want to prevent single disk
> disaster - those users expect  that losing 1 disk will NOT stop  lvm2 for
> continual usage of raid devices.

It won't, unless the raid type loses quorum (raid1 with no intact legs, raid0 with any number of legs lost, or the remaining raid types with more than the number of parity devices lost).  I.e. a Raid1LV will work with a minimum of one accessible leg, and any kind of user will still want to check for transient, repairable errors.

> 
> However lvm2 now with raids is in this area worse then with mirrors where
> this workflow path worked in simpler way - devices were fixed and were fully
> functional - unlike with raids where user is 'stuck' with inconsistent dm
> tables.
>  
> ----
> 
> So what needs to be solved here is - 
> 
> User must always have !simple! and defined way how to solve problem.
> 
> Nowhere during the process of regular workflow any command uses --force
> (lose/error/unavailability of 'device' in raid array which loses resilience
> is 'standard/expected/assumed' case solved by raid usage - when user uses
> --force  he might lose data!.   Using --force is acceptable only when 'data'
> are supposed to be lost.
> 
> If raid is supposed to be used in stack  (_tdata...)  dmeventd need to know
> how to resolve expected problems automatically (otherwise stacked usage of
> existing raid solution needs to be disabled as unsupportable (without spare
> devices))

It's a valid request to allow dmeventd to automate as many sensible transitions as possible, but you did not address the handling of transient failures.

dmeventd can't do that, because it can't retrieve information about transient failure reasons; only the admin can analyse those.

Thanks for the following overview.  No one requested lvresize to solve errors, so we should limit this to lvconvert, where it makes sense.

Just leave downconversion alone; it's not a repair operation to automate, because it would prevent any recovery of transient errors that caused a raid leg to be set faulty - errors which can be recovered from using MD's resynchronization capabilities.

> 
> ---
> 
> More technical limits on lvm2 sides:
> 
> 1. Only limited number of commands have the privilege to solve complicated
> errors - i.e. lvresize is not a tool to resolve and repair missing devices
> by current design policy of lvm2  (i.e. lvresize cannot influence other LVs
> where it's only relation is sharing PV)
> 
> 2. lvm2 metadata reflects reality - tables are activated by state of written
> metadata (PARTIAL_LV got misused over the time and will need to be again put
> into old bounds where it should be purely used for explicit activation for
> i.e. data-recovery).
> 
> 3. lvm2 table manipulation knows 2 states (committed + pre-commited) on
> 'suspend' but only 1 state on resume (committed) - this constrain is
> important for correctness, but also makes it challenging on creation of
> correct sequences of table updates.
> 
> 4. lvm2 is not outsourcing recovery steps to 3rd. party tools - so lvm2 is
> not a low-level simplistic tool (aka dmsetup)  to provide low-level bricks 
> to do basic recovery - i.e. we do not want tools like 'Cocpit' writing their
> 'recovery' sequences....
> 
> 
> Also as the last part -  how are other tools solving this ?
Comment 16 Zdenek Kabelac 2017-07-04 15:51:37 EDT
(In reply to Heinz Mauelshagen from comment #15)
> (In reply to Zdenek Kabelac from comment #14)
> > While not resolving any of the question from comment 13 - 
> > 
> > Typical user-case scenario for  'raid' usage is to protect service against
> > given number of resilient devices failure. (i.e. 2leg raid1  protects
> > service again 1 failing device) (IMHO no user will be surprised by the fact,
> > that when one of 2 devices in raid1 is lost - resilience is lost - this is
> > given by definition of raid
> 
> Nope, he won't, he'll want to repair regaining reilience ASAP.

It is all about the choice made by USER.

User defines reaction upon error.

1.  allocate
2.  warn
3.  remove
4.  retry_X_times_before_going_for_1_or_2_or_3
5.  ...

The User/Admin has made a thoughtful decision - and he may have selected to lose resilience when a drive reports an error - no further analysis is needed - the drive should be dropped, even if you consider that a 'bad choice' (though you could bring in some convincingly better policies, with stats proving they are better - once they are implemented and pass testing).

There are a lot of users who never suffer from transient failures, which are mostly the domain of specific types of storage HW, i.e. network-connected storage.

Obviously, supporting the 'REMOVE' policy is not meant to solve the 'WARN' problem you are focused on, which needs some further care - but by adding 'REMOVE'
we quickly resolve a basic usability problem for a lot of users.

By directly solving only 'WARN', however, we add a big delay for those users who do not need this complexity (or who simply need a usable stack ASAP as a high priority).

The main reason I'm advocating here for a quick & clear 'REMOVE' policy is its simplicity of implementation - 'WARN' will easily take many months (as we need to resolve incorrect tree loading as well), while 'REMOVE' is weeks of work.

> 
> > - so I'm not sure why this is presented is some
> > 'major problem' for a user??? -  is there any pointer where user would be
> > seeing this a problem?
> 
> Not following thatkind of question.
> Again: raid1 is configured to have reislient storage. This is what it is
> about. Ir it truns degraded for some reason, regain resilience ASAP is
> required.

The users I'm talking with do expect the 'D'ead leg to be dropped, and they will 'possibly' restore resilience once they buy a new drive - which might be delayed by weeks...  If users had the 'resources' to maintain a spare drive, they would already be using that with the more comfortable 'allocate' policy.

For inexperienced users (a large portion of the distro user base), any complexity associated with even understanding all the different sorts of failures is beyond the level they want to be involved at - a simple hdd replacement, driven by proper warning messages, is likely the complexity to target here in the future.

> > However as long as number of failed device are below 'resilience quorum'
> > user expects  all things will works in unlimited way (i.e. not just
> > read/write to device, but also ALL the operation with LV volume) - with
> > assumption he is fully informed about problem i.e. resilience is LOST.
> > 
> > What I can still get here is - when user  configured 'resilience' and
> > resilience was lost - we refuse to do any more work - this IMHO degrades
> > usage of current  raids - since user will be stuck with problems in very
> > same if he hadn't had any resilience at all (possibly even bigger set of
> > problems).
> 
> I already addressed in comment 13, that a degraded Raid1LV which lost all
> resilience needs analysis to come to terms if it's suffering from transient

Who will do this analysis?

Do you expect every lvm2 user to be an hdd expert?

> issues which can be solved hence using MD's capabilities to recover from it
> and regain full resilience -or- if it's permanent and HW needs to be
> replaced to regain full resilience.
> 
> This is admin analysis figuring out which HW components have failed
> transiently or permanently.  In no case an automatic downgrade is an option,
> because if the former, transient error case results from the analysis the
> recoverable leg(s) would be gone.

IMHO you are focusing here too much on expert-only 'transient' failures - which is a rather specific type of failure, usually seen with network-attached storage where devices appear & disappear.  With SATA-attached storage, however, where the HDD has already exhausted its spare sectors, any write error is usually 'final' and renders the drive unreliable - so using the "REMOVE" policy in such a setup is usually a good choice.

And I'm not enforcing this "REMOVE" policy as the only choice here (nor even as the default) - it's JUST an option for the user.

The 'REMOVE' proposal here in absolutely no way takes away any recovery option from the expert user.

> > As mentioned in 'comment 9' the solution to keep 'raidX'  even if there is
> > no real raidX is to heavily  extend  rimage tracking  - but as also said in
> > mentioned comment - this is  huuuuuge amount of work - and ATM I'm not
> > convinced there is significant number of users ever taking care of this to
> > basically justificate spent resources on development of this complex
> > extension.
> 
> There's a reason why MD has a write intent log to cope with transient
> failures or interrupted uptdates on legs.  No need to reinvent anything for
> it on the lvm2 side.

It's not related to any MD thingy (aka the write-intent bitmap) here at all (and I'm repeating this in mostly every comment in this BZ).

This whole BZ is about the lvm2 design rules that all supported targets need to be compliant with.

The core design issue is: the activation code shall NOT ask lvm2 to activate a device with a missing PV - this will even get prohibited by more hardening features, and will be allowed only for 'pure' activation (possibly read-only) and deactivation (i.e. the user explicitly passes 'lvchange --partial').

So all we are solving here is how the metadata for raid will look when you have to store 'metadata' that creates valid raid devices - so that in every committed metadata the usage of targets is explicitly defined for every segment (aka no 'spurious' -missing segments).

So what I'm expecting as an outcome of this BZ is a defined set of metadata sequences.

Each committed lvm2 metadata presents a valid and self-consistent LV which can be activated and presents a consistent transition state.

Nowhere in this process are steps like 'vgreduce --removemissing --force' and similar invalid proposals.

1. dmeventd notices an error (leg 'D')

2. it runs 'lvconvert --repair --usepolicies' - this command makes the minimum suspend/resume steps, after which such a raid LV continues to be usable for any other LV command (as long as the raid LV contains the full data set)

3. the user continues to use such an LV as if it were 'normal' - so commands like 'lvchange --refresh' or 'lvresize' work without any further complexity.
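As a command-level sketch of those three steps (hypothetical VG/LV names; assumes a policy like the REMOVE one discussed above is configured):

```shell
# 1.+2. dmeventd sees the raid event and repairs per the configured policy:
lvconvert --repair --usepolicies vg/pool_tdata

# 3. afterwards the stack must behave like a healthy LV again:
lvchange --refresh vg/pool
lvresize -L +1G vg/pool
```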

As long as these steps work, such a raid segtype can be used in stacking.

If we cannot deliver a workable solution with the current raids, we will need to disable the existing 'raid' segtype and provide a new segment type able to deliver this functionality for stacked devices, with a better design.


I'm not proposing any exact type of solution here - I'm expecting a new design for review.

The "REMOVE" policy should be seen as a very simplified 1st step which gains a lot with minimal effort - it fixes the unfixable state of an LV when there is no free space.

Knowing the details of the internal complexity of activating the whole device stack, the solution for the 'WARN' workflow is much harder.


If you are targeting a different use-case scenario - where the devices in the stack are stand-alone, separately maintained devices (I'd call it 'lvsetup', the next step after dmsetup) - there you can put different constraints - but then we are not talking about lvm2 anymore...
