Bug 1653744

Summary: scrubbing raids can lead to resource busy failures
Product: Red Hat Enterprise Linux 8
Reporter: Corey Marthaler <cmarthal>
Component: kernel
Assignee: Heinz Mauelshagen <heinzm>
kernel sub component: RAID
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: high
Priority: unspecified
CC: agk, cluster-qe, heinzm, hkrzesin, jbrassow, msnitzer, prajnoha, rhandlin, zkabelac
Version: 8.0
Keywords: Reopened
Target Milestone: rc
Target Release: 8.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-4.18.0-58.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1533639
Environment:
Last Closed: 2019-06-14 01:18:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1533639

Comment 1 Corey Marthaler 2018-11-27 14:35:16 UTC
Reproduced this on RHEL 8.0, so cloning this bug to document turning off raid scrub testing in RHEL 8 as well.

[lvm_single_raid] [lvm_single_raid_sanity_raid1] 20: linear convert (lvconvert --yes --type raid1 -m 0); raid convert (lvconvert --yes --type raid1 -m 1)
[lvm_single_raid] [lvm_single_raid_sanity_raid1] 
[lvm_single_raid] [lvm_single_raid_sanity_raid1] perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/raid_2_linear
[lvm_single_raid] [lvm_single_raid_sanity_raid1]   device-mapper: message ioctl on  LVM-yRukXSJApFKVQPGALU4kh3ncDCC2seaXRdyT3MXFvcVs7oFV6pvtqfdm5sulpoKW failed: Device or resource busy
[lvm_single_raid] [lvm_single_raid_sanity_raid1] raid scrubbing failed (Probably bug 1533639)



kernel-4.18.0-39.el8    BUILT: Wed Nov 14 06:18:11 CST 2018
lvm2-2.03.01-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
lvm2-libs-2.03.01-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
lvm2-dbusd-2.03.01-1.el8    BUILT: Thu Nov  1 04:31:09 CDT 2018
lvm2-lockd-2.03.01-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-libs-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-event-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-event-libs-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-persistent-data-0.7.6-1.el8    BUILT: Sun Aug 12 04:21:55 CDT 2018

Comment 3 Heinz Mauelshagen 2018-12-18 16:33:05 UTC
Sending a check/repair message occasionally fails with -EBUSY instead of
properly identifying an active resync. The cause is raid_message()
testing recovery bits in a racy way.

Fixed by calling decipher_sync_action() from raid_message() to properly
identify the idle state of the RAID device.
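In code terms the change is roughly the following (a sketch based on the
description above, not a verbatim excerpt; the actual change is in the
upstream patch linked in the next comment, drivers/md/dm-raid.c, and the
identifiers shown are the ones used there):

    /* Racy variant: testing individual MD_RECOVERY_* bits races with the
     * md sync thread, so the message path can see a transient "busy"
     * combination even though the set is effectively idle: */
    else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
            return -EBUSY;

    /* Fixed variant: derive the coherent sync state via
     * decipher_sync_action() and only refuse the check/repair message
     * when the set is not idle: */
    else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
            return -EBUSY;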

Comment 4 Heinz Mauelshagen 2018-12-18 16:40:46 UTC
Upstream patch -> https://patchwork.kernel.org/patch/10736023/

Comment 5 Mike Snitzer 2018-12-18 17:09:00 UTC

*** This bug has been marked as a duplicate of bug 1658757 ***

Comment 6 Heinz Mauelshagen 2018-12-18 20:35:08 UTC
Brew test kernel build -> http://brew-task-repos.usersys.redhat.com/repos/scratch/heinzm/kernel/4.18.0/57.el8.bz1653744/

Comment 7 Heinz Mauelshagen 2018-12-19 15:19:29 UTC
Without the patch, I was able to reproduce the "Device or resource busy" /
"raid scrubbing failed" error within a few cycles of:

for i in {0..99}; do lvremove -y $vg/$lv ; lvcreate -n$lv -L8 -m1 -y $vg ; lvchange --syncaction check/repair $vg/$lv ; done

(using either "check" or "repair" as the --syncaction value)

With the patch, the error did not reappear within a thousand cycles.

Comment 14 Corey Marthaler 2019-03-15 15:34:34 UTC
Marking this verified in the latest kernel. 

kernel-4.18.0-75.el8    BUILT: Fri Mar  1 11:37:34 CST 2019
lvm2-2.03.02-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
lvm2-libs-2.03.02-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
lvm2-dbusd-2.03.02-6.el8    BUILT: Fri Feb 22 04:50:28 CST 2019
lvm2-lockd-2.03.02-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-libs-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-event-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-event-libs-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-persistent-data-0.7.6-1.el8    BUILT: Sun Aug 12 04:21:55 CDT 2018


Raid scrubbing has been turned back on in LVM testing and no errors have been seen recently. We'll open a new bug if issues appear again.

============================================================
Iteration 135 of 20000 started at Fri Mar 15 10:19:25 CDT 2019
============================================================
SCENARIO (raid10) - [small_raid_creation]
Make small raids
50 hayes-03: lvcreate  --type raid10_near -i 2 -n small_50M -L 50M raid_sanity
45 hayes-03: lvcreate  --type raid10_near -i 2 -n small_45M -L 45M raid_sanity
40 hayes-03: lvcreate  --type raid10_near -i 2 -n small_40M -L 40M raid_sanity
35 hayes-03: lvcreate  --type raid10 -i 2 -n small_35M -L 35M raid_sanity
30 hayes-03: lvcreate  --type raid10_near -i 2 -n small_30M -L 30M raid_sanity
25 hayes-03: lvcreate  --type raid10 -i 2 -n small_25M -L 25M raid_sanity
20 hayes-03: lvcreate  --type raid10 -i 2 -n small_20M -L 20M raid_sanity
15 hayes-03: lvcreate  --type raid10 -i 2 -n small_15M -L 15M raid_sanity
10 hayes-03: lvcreate  --type raid10_near -i 2 -n small_10M -L 10M raid_sanity
5 hayes-03: lvcreate  --type raid10 -i 2 -n small_5M -L 5M raid_sanity

Removing the small raids
5 
perform raid scrubbing (lvchange --syncaction check) on raid raid_sanity/small_5M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_5M... and removing
10 
perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/small_10M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_10M... and removing
15 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_15M)
rebuild attr:raid_sanity-small_15M: 0 32768 raid raid10 4 AAAA 32768/32768 idle 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_15M... and removing
20 
perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/small_20M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_20M... and removing
25 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_25M)
rebuild attr:raid_sanity-small_25M: 0 65536 raid raid10 4 AAAA 65536/65536 idle 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_25M... and removing
30 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_30M)
rebuild attr:raid_sanity-small_30M: 0 65536 raid raid10 4 AAAA 65536/65536 idle 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_30M... and removing
35 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_35M)
rebuild attr:raid_sanity-small_35M: 0 81920 raid raid10 4 aAAA 40320/81920 recover 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_35M... and removing
40 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_40M)
rebuild attr:raid_sanity-small_40M: 0 81920 raid raid10 4 aAAA 37632/81920 recover 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_40M... and removing
45 
perform raid scrubbing (lvchange --syncaction check) on raid raid_sanity/small_45M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_45M... and removing
50 
perform raid scrubbing (lvchange --syncaction check) on raid raid_sanity/small_50M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_50M... and removing