1653744 – scrubbing raids can lead to resource busy failures

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1653744 - scrubbing raids can lead to resource busy failures

Summary: scrubbing raids can lead to resource busy failures

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	8.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	8.0
Assignee:	Heinz Mauelshagen
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1533639
TreeView+	depends on / blocked

Reported:	2018-11-27 14:31 UTC by Corey Marthaler
Modified:	2021-09-06 16:55 UTC (History)
CC List:	9 users (show)
Fixed In Version:	kernel-4.18.0-58.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1533639
Environment:
Last Closed:	2019-06-14 01:18:01 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-20338	0	None	None	None	2021-09-06 16:55:32 UTC

Comment 1 Corey Marthaler 2018-11-27 14:35:16 UTC

Reproduced this in rhel8.0, so cloning this bug to document turning off raid scrub testing in rhel8 as well.

[lvm_single_raid] [lvm_single_raid_sanity_raid1] 20: linear convert (lvconvert --yes --type raid1 -m 0); raid convert (lvconvert --yes --type raid1 -m 1)
[lvm_single_raid] [lvm_single_raid_sanity_raid1] 
[lvm_single_raid] [lvm_single_raid_sanity_raid1] perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/raid_2_linear
[lvm_single_raid] [lvm_single_raid_sanity_raid1]   device-mapper: message ioctl on  LVM-yRukXSJApFKVQPGALU4kh3ncDCC2seaXRdyT3MXFvcVs7oFV6pvtqfdm5sulpoKW failed: Device or resource busy
[lvm_single_raid] [lvm_single_raid_sanity_raid1] raid scrubbing failed (Probably bug 1533639)



kernel-4.18.0-39.el8    BUILT: Wed Nov 14 06:18:11 CST 2018
lvm2-2.03.01-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
lvm2-libs-2.03.01-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
lvm2-dbusd-2.03.01-1.el8    BUILT: Thu Nov  1 04:31:09 CDT 2018
lvm2-lockd-2.03.01-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-libs-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-event-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-event-libs-1.02.153-1.el8    BUILT: Thu Nov  1 04:29:08 CDT 2018
device-mapper-persistent-data-0.7.6-1.el8    BUILT: Sun Aug 12 04:21:55 CDT 2018

Comment 3 Heinz Mauelshagen 2018-12-18 16:33:05 UTC

Sending a check/repair message infrequently leads to -EBUSY
as oposed to identifying an active resync properly.
Cause is raid_message() testing recovery bits in a racy way.

Fix by calling decipher_sync_action() from raid_message()
to identify the idle state of the RAID device properly.

Comment 4 Heinz Mauelshagen 2018-12-18 16:40:46 UTC

Upstream patch -> https://patchwork.kernel.org/patch/10736023/

Comment 5 Mike Snitzer 2018-12-18 17:09:00 UTC


*** This bug has been marked as a duplicate of bug 1658757 ***

Comment 6 Heinz Mauelshagen 2018-12-18 20:35:08 UTC

Brew test kernel build -> http://brew-task-repos.usersys.redhat.com/repos/scratch/heinzm/kernel/4.18.0/57.el8.bz1653744/

Comment 7 Heinz Mauelshagen 2018-12-19 15:19:29 UTC

I was able to regress "Device or resource busy\nraid scrubbing failed" without the patch with:

for i in {0..99}; do lvremove -y $vg/$lv ; lvcreate -n$lv -L8 -m1 -y $vg ; lvchange --syncaction check/repair $vg/$lv ; done

quickly within a few cycles.

With the patch, it did not regress within a thousand cycles.

Comment 14 Corey Marthaler 2019-03-15 15:34:34 UTC

Marking this verified in the latest kernel. 

kernel-4.18.0-75.el8    BUILT: Fri Mar  1 11:37:34 CST 2019
lvm2-2.03.02-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
lvm2-libs-2.03.02-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
lvm2-dbusd-2.03.02-6.el8    BUILT: Fri Feb 22 04:50:28 CST 2019
lvm2-lockd-2.03.02-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-libs-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-event-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-event-libs-1.02.155-6.el8    BUILT: Fri Feb 22 04:47:54 CST 2019
device-mapper-persistent-data-0.7.6-1.el8    BUILT: Sun Aug 12 04:21:55 CDT 2018


Raid scrubbing has been turned back on in LVM testing and no errors were seen as of late. We'll open a new bug if issues are seen again.

============================================================
Iteration 135 of 20000 started at Fri Mar 15 10:19:25 CDT 2019
============================================================
SCENARIO (raid10) - [small_raid_creation]
Make small raids
50 hayes-03: lvcreate  --type raid10_near -i 2 -n small_50M -L 50M raid_sanity
45 hayes-03: lvcreate  --type raid10_near -i 2 -n small_45M -L 45M raid_sanity
40 hayes-03: lvcreate  --type raid10_near -i 2 -n small_40M -L 40M raid_sanity
35 hayes-03: lvcreate  --type raid10 -i 2 -n small_35M -L 35M raid_sanity
30 hayes-03: lvcreate  --type raid10_near -i 2 -n small_30M -L 30M raid_sanity
25 hayes-03: lvcreate  --type raid10 -i 2 -n small_25M -L 25M raid_sanity
20 hayes-03: lvcreate  --type raid10 -i 2 -n small_20M -L 20M raid_sanity
15 hayes-03: lvcreate  --type raid10 -i 2 -n small_15M -L 15M raid_sanity
10 hayes-03: lvcreate  --type raid10_near -i 2 -n small_10M -L 10M raid_sanity
5 hayes-03: lvcreate  --type raid10 -i 2 -n small_5M -L 5M raid_sanity

Removing the small raids
5 
perform raid scrubbing (lvchange --syncaction check) on raid raid_sanity/small_5M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_5M... and removing
10 
perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/small_10M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_10M... and removing
15 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_15M)
rebuild attr:raid_sanity-small_15M: 0 32768 raid raid10 4 AAAA 32768/32768 idle 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_15M... and removing
20 
perform raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/small_20M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_20M... and removing
25 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_25M)
rebuild attr:raid_sanity-small_25M: 0 65536 raid raid10 4 AAAA 65536/65536 idle 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_25M... and removing
30 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_30M)
rebuild attr:raid_sanity-small_30M: 0 65536 raid raid10 4 AAAA 65536/65536 idle 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_30M... and removing
35 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_35M)
rebuild attr:raid_sanity-small_35M: 0 81920 raid raid10 4 aAAA 40320/81920 recover 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_35M... and removing
40 
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
perform raid rebuilding (lvchange --yes --rebuild /dev/sdd1 raid_sanity/small_40M)
rebuild attr:raid_sanity-small_40M: 0 81920 raid raid10 4 aAAA 37632/81920 recover 0 0 -


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_40M... and removing
45 
perform raid scrubbing (lvchange --syncaction check) on raid raid_sanity/small_45M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_45M... and removing
50 
perform raid scrubbing (lvchange --syncaction check) on raid raid_sanity/small_50M

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Deactivating raid small_50M... and removing

Note You need to log in before you can comment on or make changes to this bug.