Bug 1402908 - raidX does not support degraded activation properly
Summary: raidX does not support degraded activation properly
Keywords:
Status: NEW
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.169
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-08 15:38 UTC by Zdenek Kabelac
Modified: 2020-12-02 18:50 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
rule-engine: lvm-technical-solution?
rule-engine: lvm-test-coverage?


Attachments
Trace of lvm2 test suite lvconvert-raid.sh (277.76 KB, text/plain)
2017-02-18 16:24 UTC, Zdenek Kabelac
part of lvconvert-raid.txt with more debug (873.11 KB, text/plain)
2017-02-18 16:31 UTC, Zdenek Kabelac

Description Zdenek Kabelac 2016-12-08 15:38:13 UTC
Description of problem:

When a raid LV (for simplicity, say raid1) has a disk issue with a particular raid image device, the user should still be able to activate
such an LV in 'degraded' mode
(lvm.conf   activation/activation_mode="degraded").

At the moment the internal lvm2 logic detects the missing device for rimageX,
but the activation code then conflates degraded mode with
partial LV activation: the resulting table it tries to load replaces the
faulty device with 'error' segments, instead of activating the raid
device in 'degraded' mode
(i.e. without ANY LV_rimage_X-missing_Y_Z device in the dm table).
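
For reference, the two activation modes the description contrasts, expressed as the lvm.conf / CLI knobs involved (a hedged summary; the VG name 'vg' is illustrative):

# lvm.conf global default:
#   activation {
#       activation_mode = "degraded"    # or "complete" / "partial"
#   }
# Per-command override:
vgchange -ay --activationmode degraded vg   # raid LVs may assemble with missing legs
vgchange -ay --activationmode partial vg    # missing PV areas get mapped to 'error' targets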

How to test -

create raid  & wait for sync
replace one PV with error device
try degraded activation
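
A minimal sketch of these steps, using loop devices (paths, VG and LV names are illustrative; the failed leg is simulated here by detaching its PV, whereas the lvm2 test suite overlays it with an error target via its own aux helpers):

# Hypothetical reproducer sketch
truncate -s 128M /tmp/pv1.img /tmp/pv2.img
PV1=$(losetup -f --show /tmp/pv1.img)
PV2=$(losetup -f --show /tmp/pv2.img)
vgcreate vg "$PV1" "$PV2"

# create raid & wait for sync
lvcreate --type raid1 -m 1 -L 64M -n lv vg
while [ "$(lvs --noheadings -o copy_percent vg/lv | tr -d ' ')" != "100.00" ]; do
    sleep 1
done

# make one leg's PV disappear to simulate the failed device
vgchange -an vg
losetup -d "$PV2"

# try degraded activation
vgchange -ay --activationmode degraded vg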


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

device-mapper: raid: New device injected into existing raid set without 'delta_disks' or 'rebuild' parameter specified
device-mapper: table: 253:9: raid: Unable to assemble array: Invalid superblocks
device-mapper: ioctl: error adding target to table

Expected results:

Correct activation of the 'raid' LV with just one healthy leg.

Additional info:

Comment 1 Pierguido Lambri 2016-12-27 21:09:35 UTC
I've just got a similar issue on Fedora 25 with a RAID1 LV (which is actually the metadata disk of a thin device) but I got it after having repaired it.
Now every time I try to activate the VG, it logs the same error:

Dec 27 20:28:09 plambri-affligem kernel: device-mapper: raid: New device injected into existing raid set without 'delta_disks' or 'rebuild' parameter specified
Dec 27 20:28:09 plambri-affligem kernel: device-mapper: table: 253:7: raid: Unable to assemble array: Invalid superblocks
Dec 27 20:28:09 plambri-affligem kernel: device-mapper: ioctl: error adding target to table

Is there any way to recover from this situation?

Comment 2 Jonathan Earl Brassow 2017-02-01 22:53:59 UTC
(In reply to Pierguido Lambri from comment #1)
> I've just got a similar issue on Fedora 25 with a RAID1 LV (which is
> actually the metadata disk of a thin device) but I got it after having
> repaired it.
> Now every time I try to activate the VG, it logs the same error:
> 
> Dec 27 20:28:09 plambri-affligem kernel: device-mapper: raid: New device
> injected into existing raid set without 'delta_disks' or 'rebuild' parameter
> specified
> Dec 27 20:28:09 plambri-affligem kernel: device-mapper: table: 253:7: raid:
> Unable to assemble array: Invalid superblocks
> Dec 27 20:28:09 plambri-affligem kernel: device-mapper: ioctl: error adding
> target to table
> 
> Is there any way to recover from this situation?

I've never heard of this before... do you have a reproducer?

Comment 3 Zdenek Kabelac 2017-02-18 16:24:17 UTC
Created attachment 1255269 [details]
Trace of lvm2 test suite  lvconvert-raid.sh

This is a sample from our buildbot where this 'injection' is visible.

Comment 4 Zdenek Kabelac 2017-02-18 16:31:22 UTC
Created attachment 1255284 [details]
part of lvconvert-raid.txt with more debug

I'm providing a version with more debugging enabled (only a cut of the trace).


[ 0:01] #libdm-deptree.c:2731     Loading @PREFIX@vg-LV1 table (253:10)
[ 0:01] #libdm-deptree.c:2675         Adding target to (253:10): 0 8192 raid raid1 3 0 region_size 1024 2 253:11 253:12 253:13 253:14
[ 0:01] #ioctl/libdm-iface.c:1838         dm table   (253:10) [ opencount flush ]   [16384] (*1)
[ 0:01] #ioctl/libdm-iface.c:1838         dm reload   (253:10) [ noopencount flush ]   [16384] (*1)
[ 0:01] #activate/activate.c:2132         Requiring flush for LV @PREFIX@vg/LV1.
[ 0:01] #mm/memlock.c:582         Entering critical section (suspending).
[ 0:01] #mm/memlock.c:551         Lock:   Memlock counters: locked:0 critical:1 daemon:0 suspended:0
[ 0:01] #mm/memlock.c:475       Locking memory
[ 0:01] #libdm-config.c:1064       activation/use_mlockall not found in config: defaulting to 0
[ 0:01] 6,10381,158994530818,-;device-mapper: raid: Superblocks created for new raid set
[ 0:01] 6,10382,158994540522,-;md/raid1:mdX: not clean -- starting background reconstruction
[ 0:01] 6,10383,158994540537,-;md/raid1:mdX: active with 2 out of 2 mirrors
[ 0:01] #mm/memlock.c:287         mlock          0KiB 5563adeda000 - 5563ae0df000 r-xp 00000000 08:06 10756824  


This looks like the primary suspect.

The 'raid' table is preloaded, and it does look like the raid target is already starting to take some action based on the characteristics of this new table.

lvm2 expects a 'pre-loaded' table to have no effect until the matching 'resume' operation.
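
For context, the protocol this expectation refers to, shown as a dmsetup sketch (the table line and device name are copied from the trace above; lvm2 drives this through libdm rather than dmsetup):

# Step 1: load the new table into the device's INACTIVE slot.
# By the expectation above this alone must not affect the live mapping,
# yet the md/raid kernel messages in the trace show up right after this reload.
dmsetup load @PREFIX@vg-LV1 --table \
    "0 8192 raid raid1 3 0 region_size 1024 2 253:11 253:12 253:13 253:14"

# Step 2: only here should the new table take effect, when the device is
# suspended and the resume swaps the inactive table in for the live one.
dmsetup suspend @PREFIX@vg-LV1
dmsetup resume @PREFIX@vg-LV1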

Comment 5 Zdenek Kabelac 2017-02-22 14:55:05 UTC
This upstream patch:

https://www.redhat.com/archives/lvm-devel/2017-February/msg00160.html

should minimize the chance of hitting a 'race' with the mdraid core on removal of an active origin with snapshots, where a couple of extra table reloads were executed before the final 'origin' removal.

Note: this purely addresses the issue with 'lvremove -ff' from the test suite.
It does not address the bug from the BZ description; there are possibly a few more cases where lvm2 could be 'smarter' to avoid triggering this racy logic.

