Bug 1533639
Summary: scrubbing raids can lead to resource busy failures

Product: Red Hat Enterprise Linux 7
Component: lvm2
lvm2 sub component: Mirroring and RAID
Version: 7.5
Hardware: x86_64
OS: Linux
Status: CLOSED DEFERRED
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Corey Marthaler <cmarthal>
Assignee: Heinz Mauelshagen <heinzm>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, msnitzer, prajnoha, rhandlin, zkabelac
Type: Bug
Regression: ---
Last Closed: 2018-12-21 16:50:40 UTC
Bug Depends On: 1653744
Description
Corey Marthaler
2018-01-11 20:49:57 UTC
Created attachment 1380222: verbose lvchange attempt
I don't think this is "small raid" dependent, as I just repro'ed this on a 3G raid1 thin pool volume.

SCENARIO - [full_EXT_thin_snap_verification]
Verify full snapshots on top of EXT can be unmounted and removed

Making pool volume
Converting *Raid* volumes to thin pool and thin pool metadata devices
lvcreate --type raid1 -m 1 --zero n -L 4M -n meta snapper_thinp
  WARNING: Logical volume snapper_thinp/meta not zeroed.
lvcreate --type raid1 -m 1 --zero n -L 1G -n POOL snapper_thinp
  WARNING: Logical volume snapper_thinp/POOL not zeroed.

Waiting until all mirror|raid volumes become fully syncd...
   1/2 mirror(s) are fully synced: ( 62.95% 100.00% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )
Sleeping 15 sec
Sleeping 15 sec

lvconvert --zero n --thinpool snapper_thinp/POOL --poolmetadata meta --yes
  WARNING: Converting snapper_thinp/POOL and snapper_thinp/meta to thin pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)

Sanity checking pool device (POOL) metadata
thin_check /dev/mapper/snapper_thinp-meta_swap.503
examining superblock
examining devices tree
examining mapping tree
checking space map counts

Making origin volume
lvcreate --virtualsize 2G -T snapper_thinp/POOL -n origin
  WARNING: Sum of all thin volume sizes (2.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (1.00 GiB).
lvcreate -V 2G -T snapper_thinp/POOL -n other1
  WARNING: Sum of all thin volume sizes (4.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (1.00 GiB).
lvcreate --virtualsize 2G -T snapper_thinp/POOL -n other2
  WARNING: Sum of all thin volume sizes (6.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (1.00 GiB).
lvcreate --virtualsize 2G -T snapper_thinp/POOL -n other3
  WARNING: Sum of all thin volume sizes (8.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (1.00 GiB).
lvcreate -V 2G -T snapper_thinp/POOL -n other4
  WARNING: Sum of all thin volume sizes (10.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (1.00 GiB).
lvcreate --virtualsize 2G -T snapper_thinp/POOL -n other5
  WARNING: Sum of all thin volume sizes (12.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (1.00 GiB).

Placing an EXT filesystem on origin volume
mke2fs 1.42.9 (28-Dec-2013)

Making snapshot of origin volume
lvcreate -y -k n -s /dev/snapper_thinp/origin -n full_snap
Mounting snapshot volume
Writing files to /mnt/full_snap
Checking files on /mnt/full_snap

Filling snapshot filesystem: /mnt/full_snap
dd if=/dev/zero of=/mnt/full_snap/fill_file count=1500 bs=1M oflag=direct
dd: error writing ‘/mnt/full_snap/fill_file’: Read-only file system
777+0 records in
776+0 records out
813694976 bytes (814 MB) copied, 74.567 s, 10.9 MB/s

Attempt to create other virtual volumes while pool is full and in RO mode
lvcreate -V 1G -T snapper_thinp/POOL -n other6
  Cannot create new thin volume, free space in thin pool snapper_thinp/POOL reached threshold.

Extend the pool to kick it back into RW mode
lvextend -L +2G snapper_thinp/POOL
  WARNING: Sum of all thin volume sizes (14.00 GiB) exceeds the size of thin pools (3.00 GiB).

Again, attempt to create other virtual volumes while pool is back in RW mode
lvcreate -V 1G -T snapper_thinp/POOL -n other7
  WARNING: Sum of all thin volume sizes (15.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (3.00 GiB).
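For quick reference, the raid-backed pool setup above condenses to roughly the following sequence. This is a sketch only, using the VG name snapper_thinp and the sizes from the log; the mkfs and mount invocations are assumed, as the log shows only their output.

# Condensed setup sketch (assumes VG snapper_thinp already exists with free space)
lvcreate --type raid1 -m 1 --zero n -L 4M -n meta snapper_thinp
lvcreate --type raid1 -m 1 --zero n -L 1G -n POOL snapper_thinp
# wait here for both raid1 LVs to reach 100% sync before converting
lvconvert --zero n --thinpool snapper_thinp/POOL --poolmetadata meta --yes
lvcreate --virtualsize 2G -T snapper_thinp/POOL -n origin
mkfs.ext4 /dev/snapper_thinp/origin                  # assumed; log shows only the mke2fs banner
lvcreate -y -k n -s /dev/snapper_thinp/origin -n full_snap
mount /dev/snapper_thinp/full_snap /mnt/full_snap    # assumed mount point, taken from the log
dd if=/dev/zero of=/mnt/full_snap/fill_file count=1500 bs=1M oflag=direct   # fill the pool
lvextend -L +2G snapper_thinp/POOL                   # grow the full pool back into RW mode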
Checking files on /mnt/full_snap
Removing snap volume snapper_thinp/full_snap
lvremove -f /dev/snapper_thinp/full_snap

perform raid scrubbing on POOL_tdata (lvchange --syncaction check snapper_thinp/POOL_tdata)
device-mapper: message ioctl on LVM-IQd88oo4U73QeJzlVHkP9S9EADIglPe5brRykOxO8D1TQWA7q5nF16aPSQmNrWxb-tdata failed: Device or resource busy
raid scrubbing check failed

3.10.0-851.el7.x86_64
lvm2-2.02.177-2.el7                          BUILT: Wed Feb 7 17:39:26 CET 2018
lvm2-libs-2.02.177-2.el7                     BUILT: Wed Feb 7 17:39:26 CET 2018
lvm2-cluster-2.02.177-2.el7                  BUILT: Wed Feb 7 17:39:26 CET 2018
lvm2-lockd-2.02.177-2.el7                    BUILT: Wed Feb 7 17:39:26 CET 2018
lvm2-python-boom-0.8.3-2.el7                 BUILT: Wed Feb 7 17:42:09 CET 2018
cmirror-2.02.177-2.el7                       BUILT: Wed Feb 7 17:39:26 CET 2018
device-mapper-1.02.146-2.el7                 BUILT: Wed Feb 7 17:39:26 CET 2018
device-mapper-libs-1.02.146-2.el7            BUILT: Wed Feb 7 17:39:26 CET 2018
device-mapper-event-1.02.146-2.el7           BUILT: Wed Feb 7 17:39:26 CET 2018
device-mapper-event-libs-1.02.146-2.el7      BUILT: Wed Feb 7 17:39:26 CET 2018
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 12:07:18 CET 2017

Corey, is it possible this is a testing bug? i.e. not waiting for a conflicting sync action to finish before starting a scrub?

(In reply to Jonathan Earl Brassow from comment #8)
> Corey, is it possible this is a testing bug? i.e. not waiting for a
> conflicting sync action to finish before starting a scrub?

<corey> The scrubbing tests actually wait for the raid to be sync'ed before attempting the repair scrub. Also, the "state is currently "resync". Unable..." message doesn't result in a failure, and certainly shouldn't produce a d-m ioctl error, so the test just moves on in those cases:

[root@hayes-03 ~]# lvcreate --type raid1 -m 1 -n raid -L 800G test
  Logical volume "raid" created.
[root@hayes-03 ~]# lvs -a -o +devices
  LV              VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  raid            test rwi-a-r--- 800.00g                                    0.07             raid_rimage_0(0),raid_rimage_1(0)
  [raid_rimage_0] test Iwi-aor--- 800.00g                                                     /dev/sdf1(1)
  [raid_rimage_1] test Iwi-aor--- 800.00g                                                     /dev/sdg1(1)
  [raid_rmeta_0]  test ewi-aor---   4.00m                                                     /dev/sdf1(0)
  [raid_rmeta_1]  test ewi-aor---   4.00m                                                     /dev/sdg1(0)
[root@hayes-03 ~]# lvchange --syncaction repair test/raid
  test/raid state is currently "resync". Unable to switch to "repair".
[root@hayes-03 ~]# echo $?
0

(In reply to Jonathan Earl Brassow from comment #9)
> (In reply to Jonathan Earl Brassow from comment #8)
> > Corey, is it possible this is a testing bug? i.e. not waiting for a
> > conflicting sync action to finish before starting a scrub?
>
> <corey> The scrubbing tests actually wait for the raid to be sync'ed before
> attempting the repair scrub. Also, the "state is currently "resync".
> Unable..." message doesn't result in a failure, and certainly shouldn't
> produce a d-m ioctl error, so the test just moves on in those cases:

dm-raid reports -EBUSY when a check/repair is requested (which we do via the target's message interface) while the raid's state is not 'idle'. This looks like a remaining race in the message-interface sync-action checks to me, because there, as opposed to the status interface, I'm not using decipher_sync_action() but instead check the recovery flags directly, as before...

>
> [root@hayes-03 ~]# lvcreate --type raid1 -m 1 -n raid -L 800G test
>   Logical volume "raid" created.
> [root@hayes-03 ~]# lvs -a -o +devices
>   LV              VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
>   raid            test rwi-a-r--- 800.00g                                    0.07             raid_rimage_0(0),raid_rimage_1(0)
>   [raid_rimage_0] test Iwi-aor--- 800.00g                                                     /dev/sdf1(1)
>   [raid_rimage_1] test Iwi-aor--- 800.00g                                                     /dev/sdg1(1)
>   [raid_rmeta_0]  test ewi-aor---   4.00m                                                     /dev/sdf1(0)
>   [raid_rmeta_1]  test ewi-aor---   4.00m                                                     /dev/sdg1(0)
> [root@hayes-03 ~]# lvchange --syncaction repair test/raid
>   test/raid state is currently "resync". Unable to switch to "repair".
> [root@hayes-03 ~]# echo $?
> 0

This is unlikely to get triggered in the field, because the scrub needs to be requested immediately after creation or after a resync request. Closing; please reopen if there is further insight indicating that the fix needs to go into RHEL 7.
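A note on working around this until a fix lands: the -EBUSY window is whenever the raid's sync action is anything other than "idle" (e.g. right after creation or during a resync). A minimal sketch, assuming the LV name from the log above and the standard lvs raid_sync_action reporting field, of polling for idle before requesting the scrub:

# Poll until dm-raid reports an idle sync action, so the check/repair
# message is not rejected with "Device or resource busy".
LV=snapper_thinp/POOL_tdata
while [ "$(lvs --noheadings -o raid_sync_action "$LV" | tr -d '[:space:]')" != "idle" ]; do
    sleep 5
done
lvchange --syncaction check "$LV"

This keys off the reported sync action rather than the sync percentage; given the note above that the message path checks the recovery flags directly, it narrows the window rather than provably closing it.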