Bug 956387 - second raid10 volume creation attempt results in a deadlock
Summary: second raid10 volume creation attempt results in a deadlock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 951721 (view as bug list)
Depends On:
Blocks:
 
Reported: 2013-04-24 21:01 UTC by Corey Marthaler
Modified: 2023-03-08 07:25 UTC
CC List: 12 users

Fixed In Version: lvm2-2.02.103-1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 969171 (view as bug list)
Environment:
Last Closed: 2014-06-13 12:50:13 UTC
Target Upstream Version:
Embargoed:


Attachments
lvcreate verbose output of stuck RAID10 creation (676.99 KB, text/plain)
2013-05-28 03:22 UTC, Jonathan Earl Brassow

Description Corey Marthaler 2013-04-24 21:01:12 UTC
Description of problem:
./snapper_thinp -t raid10 -e fs_io_A

SCENARIO - [fs_io_A]
Create snapshots of origin with fs data, and then verify that data on snapshots
Making origin volume
Converting *Raid* volumes to thin pool and thin pool metadata devices
lvcreate --type raid10 -m 1 -L 4G -n POOL snapper_thinp
lvcreate --type raid10 -m 1 -L 4G -n meta snapper_thinp
[DEADLOCK]

[root@qalvm-02 ~]# ps -ef | grep lvcreate
root      2447  2446  0 14:45 ?        00:00:00 lvcreate --type raid10 -m 1 -L 4G -n meta snapper_thinp

[root@qalvm-02 ~]# strace -p 2447
Process 2447 attached
semop(1605632, {{0, 0, 0}}, 1
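
The truncated semop() above is the udev-sync wait: lvcreate blocks on a SysV semaphore (the udev "cookie") until udev signals completion. A quick way to confirm this, sketched with standard ipcs/dmsetup commands (the semaphore id differs per run):

  ipcs -s                    # list SysV semaphore sets; the id from strace should appear
  dmsetup udevcookies        # list outstanding device-mapper udev cookies
  dmsetup udevcomplete_all   # last resort: complete all cookies and unblock the waiters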


Apr 24 14:45:56 qalvm-02 qarshd[2382]: Running cmdline: lvcreate --type raid10 -m 1 -L 4G -n POOL snapper_thinp
Apr 24 14:45:56 qalvm-02 kernel: [ 1212.910631] bio: create slab <bio-1> at 1
Apr 24 14:45:56 qalvm-02 kernel: [ 1212.967845] bio: create slab <bio-1> at 1
Apr 24 14:45:56 qalvm-02 kernel: [ 1213.013757] bio: create slab <bio-1> at 1
Apr 24 14:45:56 qalvm-02 kernel: [ 1213.082891] bio: create slab <bio-1> at 1
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.414962] bio: create slab <bio-1> at 1
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.431118] device-mapper: raid: Superblocks created for new array
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.441796] md/raid10:mdX: not clean -- starting background reconstruction
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.442649] md/raid10:mdX: active with 4 out of 4 devices
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.443416] Choosing daemon_sleep default (5 sec)
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.443979] created bitmap (4 pages) for device mdX
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.529178] mdX: bitmap file is out of date, doing full recovery
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.557934] mdX: bitmap initialized from disk: read 1 pages, set 8192 of 8192 bits
Apr 24 14:45:57 qalvm-02 lvm[1040]: Monitoring RAID device snapper_thinp-POOL for events.
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.640940] md: resync of RAID array mdX
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.641488] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.642186] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Apr 24 14:45:57 qalvm-02 kernel: [ 1213.643345] md: using 128k window, over a total of 4194304k.
Apr 24 14:45:57 qalvm-02 xinetd[793]: EXIT: qarsh status=0 pid=2382 duration=1(sec)
Apr 24 14:45:57 qalvm-02 xinetd[793]: START: qarsh pid=2446 from=::ffff:10.15.80.47
Apr 24 14:45:57 qalvm-02 qarshd[2446]: Talking to peer 10.15.80.47:47231 (IPv4)
Apr 24 14:45:57 qalvm-02 qarshd[2446]: Running cmdline: lvcreate --type raid10 -m 1 -L 4G -n meta snapper_thinp
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.038051] device-mapper: raid: Superblocks created for new array
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.046853] md/raid10:mdX: not clean -- starting background reconstruction
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.047701] md/raid10:mdX: active with 4 out of 4 devices
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.048335] Choosing daemon_sleep default (5 sec)
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.048907] created bitmap (4 pages) for device mdX
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.120175] mdX: bitmap file is out of date, doing full recovery
Apr 24 14:45:58 qalvm-02 kernel: [ 1215.157407] mdX: bitmap initialized from disk: read 1 pages, set 8192 of 8192 bits
Apr 24 14:45:59 qalvm-02 lvm[1040]: Monitoring RAID device snapper_thinp-meta for events.
Apr 24 14:46:03 qalvm-02 kernel: [ 1219.759886] md: resync of RAID array mdX
Apr 24 14:46:03 qalvm-02 kernel: [ 1219.760424] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Apr 24 14:46:03 qalvm-02 kernel: [ 1219.762622] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Apr 24 14:46:03 qalvm-02 kernel: [ 1219.763714] md: using 128k window, over a total of 4194304k.
Apr 24 14:46:29 qalvm-02 systemd-udevd[395]: worker [2297] /devices/virtual/block/dm-19 timeout; kill it
Apr 24 14:46:29 qalvm-02 systemd-udevd[395]: seq 2691 '/devices/virtual/block/dm-19' killed
Apr 24 14:46:33 qalvm-02 systemd-udevd[395]: worker [2297] terminated by signal 9 (Killed)
Apr 24 14:49:34 qalvm-02 kernel: [ 1430.692062] md: mdX: resync done.
Apr 24 14:49:39 qalvm-02 lvm[1040]: raid10 array, snapper_thinp-POOL, is now in-sync.
Apr 24 14:50:10 qalvm-02 kernel: [ 1466.413811] md: mdX: resync done.
Apr 24 14:50:11 qalvm-02 lvm[1040]: raid10 array, snapper_thinp-meta, is now in-sync.



Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243] lvcreate        S ffff88007fc14180     0  2447   2446 0x00000080
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  ffff880035879cf8 0000000000000086 ffff880079b00000 ffff880035879fd8
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  ffff880035879fd8 ffff880035879fd8 ffff880035934e60 ffff880079b00000
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  ffff880035879ce8 ffff880079b00000 ffffffff819665c8 0000000000000000
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243] Call Trace:
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff8160dba9>] schedule+0x29/0x70
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff812893fd>] sys_semtimedop+0x5fd/0x9c0
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff8101339f>] ? do_signal+0x3f/0x5b0
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff810760f4>] ? ptrace_notify+0x64/0x90
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff810e2d7c>] ? __audit_syscall_exit+0x20c/0x2c0
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff81013990>] ? do_notify_resume+0x80/0xb0
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff812897d0>] sys_semop+0x10/0x20
Apr 24 15:59:25 qalvm-02 kernel: [ 5620.055243]  [<ffffffff81617359>] system_call_fastpath+0x16/0x1b



Version-Release number of selected component (if applicable):
3.8.0-0.40.el7.x86_64

lvm2-2.02.99-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
lvm2-libs-2.02.99-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
lvm2-cluster-2.02.99-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
device-mapper-1.02.78-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
device-mapper-libs-1.02.78-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
device-mapper-event-1.02.78-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
device-mapper-event-libs-1.02.78-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013
cmirror-2.02.99-0.25.el7    BUILT: Tue Apr 23 19:27:37 CDT 2013


How reproducible:
Only once so far

Comment 1 Corey Marthaler 2013-04-25 16:20:41 UTC
This is reproducible.

Comment 2 Corey Marthaler 2013-05-01 16:18:16 UTC
This issue continues to appear during regression testing and blocks further progress.

Comment 3 Corey Marthaler 2013-05-01 19:41:04 UTC
All it takes to reproduce this is two raid10 create attempts.

[root@qalvm-02 ~]# lvcreate --type raid10 -m 1 -L 4G -n lv1 test
  Logical volume "lv1" created
[root@qalvm-02 ~]# lvcreate --type raid10 -m 1 -L 4G -n lv2 test
[DEADLOCK]


May  1 14:27:27 qalvm-02 systemd-udevd[461]: worker [1324] /devices/virtual/block/dm-19 timeout; kill it
May  1 14:27:27 qalvm-02 systemd-udevd[461]: seq 1977 '/devices/virtual/block/dm-19' killed
May  1 14:27:33 qalvm-02 systemd-udevd[461]: worker [1324] terminated by signal 9 (Killed)

Comment 4 Jonathan Earl Brassow 2013-05-14 18:17:04 UTC
I've been able to create 25 RAID10 LVs using the upstream kernel.

Is this reproducible on RHEL6?

I'll get RHEL7 installed and try it there.

Comment 5 Corey Marthaler 2013-05-15 16:55:41 UTC
This does not happen with the official RHEL 6.4 build.

[root@taft-01 ~]# lvcreate --type raid10 -m 1 -L 4G -n lv1 taft
  Logical volume "lv1" created
[root@taft-01 ~]# lvcreate --type raid10 -m 1 -L 4G -n lv2 taft
  Logical volume "lv2" created

2.6.32-354.el6.x86_64
lvm2-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
lvm2-libs-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
lvm2-cluster-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
udev-147-2.43.el6    BUILT: Thu Oct 11 05:59:38 CDT 2012
device-mapper-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
device-mapper-libs-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
device-mapper-event-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
device-mapper-event-libs-1.02.77-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013
cmirror-2.02.98-9.el6    BUILT: Wed Jan 23 10:06:55 CST 2013

Comment 6 Corey Marthaler 2013-05-15 18:50:06 UTC
I can no longer reproduce this with the latest build from today (15-May-2013 15:29). The second create still takes quite a while (in this case 2 minutes), apparently waiting for the first one to sync, but it no longer hangs indefinitely.

[root@qalvm-01 ~]# time lvcreate --type raid10 -m 1 -L 4G -n lv3 FOO
  Logical volume "lv3" created
real    0m2.107s
user    0m0.133s
sys     0m0.152s

[root@qalvm-01 ~]# time lvcreate --type raid10 -m 1 -L 4G -n lv4 FOO
  Logical volume "lv4" created
real    1m52.506s
user    0m0.162s
sys     0m0.157s


3.8.0-0.40.el7.x86_64
lvm2-2.02.99-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
lvm2-libs-2.02.99-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
lvm2-cluster-2.02.99-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
device-mapper-1.02.78-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
device-mapper-libs-1.02.78-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
device-mapper-event-1.02.78-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
device-mapper-event-libs-1.02.78-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013
cmirror-2.02.99-0.32.el7    BUILT: Wed May 15 08:28:08 CDT 2013

Comment 7 Jonathan Earl Brassow 2013-05-15 21:34:25 UTC
Let's keep this bug open for a bit.  The delay seems strange to me.

I'm having issues installing RHEL7, but I'll look at it when that's done.

Comment 8 Jonathan Earl Brassow 2013-05-28 03:22:40 UTC
Created attachment 753716 [details]
lvcreate verbose output of stuck RAID10 creation

Problem seems to be with udev/memlock.  Still needs more investigation to be sure.

Comment 9 Jonathan Earl Brassow 2013-05-29 14:52:59 UTC
The latest upstream LVM package fixes the indefinite hang, but it still takes quite some time to create the logical volumes: I get two LVs created quickly, and then it seems to wait for the sync before allowing new ones to be created.

I'd be willing to close this bug once the RPM is updated and the indefinite hang is cleared, but there are still a couple of questions:
1) Does this happen on RHEL6? (Either the hang or the long waits for creation.)
2) What upstream commit fixed the issue?
3) Is the problem due to unthrottled initialization or something else?

Comment 10 Jonathan Earl Brassow 2013-05-29 15:13:00 UTC
Never mind - I hit the same bug, with the same output as the attachment. Some of the questions in comment 9 are still valid, though. The same slowdown in creation is visible on RHEL6 (but not the hang), and it seems to be tied to waiting for initialization:

Below we can see that the first 3 RAID10 LVs were created quickly (all of them still just beginning initialization). The 4th create takes a long time and, when finished, shows that the other 3 RAID10 LVs have completed their sync.

Iteration #3: 
real    0m0.661s
user    0m0.184s
sys     0m0.186s
  Configuration setting "activation/thin_check_executable" unknown.
  Configuration setting "activation/thin_check_options" unknown.
  LV   VG   Attr      LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  lv1  vg   rwi-a-r-- 10.00g                                 0.00        
  lv2  vg   rwi-a-r-- 10.00g                                 0.00        
  lv3  vg   rwi-a-r-- 10.00g                                 0.00        

Iteration #4: 
real    1m34.160s
user    0m0.227s
sys     0m0.273s
  Configuration setting "activation/thin_check_executable" unknown.
  Configuration setting "activation/thin_check_options" unknown.
  LV   VG   Attr      LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  lv1  vg   rwi-a-r-- 10.00g                               100.00        
  lv2  vg   rwi-a-r-- 10.00g                               100.00        
  lv3  vg   rwi-a-r-- 10.00g                               100.00        
  lv4  vg   rwi-a-r-- 10.00g                                 0.00

Comment 11 Jonathan Earl Brassow 2013-05-29 15:32:05 UTC
On RHEL6, if the max_recovery_rate is throttled, the creates happen almost instantly. So it seems (at least on RHEL6) that the unbounded sync processes are crowding out the I/O needed to update the VG metadata, causing the creation process to appear slow.

Comment 13 Jonathan Earl Brassow 2013-05-30 21:16:21 UTC
Throttling eliminates the problem on RHEL7 too. However, LVM operations should not hang indefinitely when throttling is not used, so this still needs to be fixed even if throttling is introduced.
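
For reference, a sketch of the throttling being discussed (values are illustrative; the sysctl is the md-wide knob, the lvchange option is per-LV and only present in newer lvm2 builds):

  sysctl -w dev.raid.speed_limit_max=10000   # cap md resync bandwidth (KB/sec per device)
  lvchange --maxrecoveryrate 10M vg/lv1      # per-LV recovery throttle, if supported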

Comment 14 Jonathan Earl Brassow 2013-05-30 22:06:02 UTC
Removing the --enable-udev_sync option from 'configure' solves the problem. The issue is most certainly with udev. Again, you can see this in the attachment (attachment 753716 [details]).

To reproduce this issue, simply:
for i in $(seq 25) ; do
  echo
  echo -n "Iteration #$i: "
  time lvcreate --type raid10 -i 2 -m 1 -L 10G -n lv$i vg -vvvv >& output-$i.txt
  lvs vg
done

If '--enable-udev_sync' has been used, the third or fourth lvcreate will hang forever. (It is easier to reproduce if the LVs are at least 10G in size.)

Re-assigning to prajnoha for udev inspection.

Comment 15 Corey Marthaler 2013-06-04 22:03:24 UTC
*** Bug 951721 has been marked as a duplicate of this bug. ***

Comment 16 Peter Rajnoha 2013-07-15 14:26:15 UTC
Well, I've tried this on my VM and it happens right after the second lvcreate call! I put udevd into debug mode, and all the scanning bits (blkid etc.) just wait too long while trying to read the disk - it seems the raid sync is overkill here.
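
For reference, the udevd debug setup mentioned above can be reproduced with the standard udevadm interface:

  udevadm control --log-priority=debug   # make udevd log verbosely
  udevadm monitor --kernel --udev        # watch uevents and udev's processing of them
  udevadm settle --timeout=180           # wait for the event queue to drain (or time out)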

As a consequence, after the default 30 sec udev timeout (which was 3 min before they changed this!), the blkid call gets killed, and the whole udev RUN queue with it. Since the completion notification from udev sits at the end of that RUN queue, we miss it and end up with a hang. We already filed a bug asking for some mechanism in udev to handle such situations more gracefully - at least a chance to do some cleanup or give users a useful error message when the udev timeout fires (see also bug #918511).

The IO queues are just huge. This helps a bit (at least in my case):

  sysctl -w vm.dirty_background_ratio=1   # start background writeback almost immediately
  sysctl -w vm.dirty_ratio=1              # keep dirty pages low so reads aren't starved

Simply, the disk reads are blocked too much by that md raid sync... What we would probably need here is to ban any scanning of the raid LV until it is synced and ready, but I'm not sure whether that is possible. We'd need an event generated only at the end of the sync, not before it (or at least we'd need to skip any scanning until the LV is completely in sync, and then run the scans in udev after syncing). OR the md raid needs to be throttled better by default, so that other reads are allowed and nothing is starved.

Jon, any idea?

Comment 17 Jonathan Earl Brassow 2013-07-19 18:45:46 UTC
I'm not very keen on throttling just to keep udev from barfing.  Besides, what value should we set it to?

While it does slow down the process of creating many LVs in RHEL6, it still functions without throttling.  This is how things should work in RHEL7.  Do you know what has changed in udev in RHEL7 vs RHEL6 that is causing such a bad problem?

Comment 18 Peter Rajnoha 2013-07-22 14:07:56 UTC
(In reply to Jonathan Earl Brassow from comment #17)
> While it does slow down the process of creating many LVs in RHEL6, it still
> functions without throttling.  This is how things should work in RHEL7.  Do
> you know what has changed in udev in RHEL7 vs RHEL6 that is causing such a
> bad problem?

Unfortunately, I don't know exactly; it would need more inspection and comparison. However, we already hit an issue where udev processing was slow enough (or starved enough) that udev killed the RUN queue on timeout, and it appeared again later, for example in bug #885978 - that was 6.3 iirc, and we never found the exact change that caused it (though one of the main factors was udev parallelism). I guess it must be a set of system-wide changes that together shift the overall timing enough for the timeout to appear as a consequence... In this case it's very hard to track down the exact source :-/

Comment 19 Peter Rajnoha 2013-09-11 11:06:10 UTC
Well, I've reinstated the original 3 min timeout for LVM devices:

  https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=72a9d4f879751d58e531efbffaef1f7d325092c7
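
For illustration only - see the commit above for the actual change. The general shape is a udev rule option raising the per-event timeout for DM devices, e.g.:

  # hypothetical excerpt - the real rule is in the lvm2 commit linked above
  SUBSYSTEM=="block", KERNEL=="dm-*", OPTIONS+="event_timeout=180"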

Recent udev uses 30s, which is not enough when the system is under load while raid/mirror synchronization takes place. This is just a workaround until we can optimize things so that the 30s timeout is usable... Corey, could you please test the latest RHEL7 nightly build (Sep 11) to see whether the udev timeout/hang is gone? (Or I can do a scratch build if needed.)

If RHEL6 works, then this workaround should do it, as the udev timeout is then exactly the same as in RHEL6...

Comment 20 Peter Rajnoha 2013-09-11 11:09:47 UTC
I'm setting this one to POST for now - the other solution is to simply throttle the RAID processing. I believe the only reason it did not show up in RHEL6 is the change in the udev timeout.

Comment 22 Peter Rajnoha 2013-09-27 13:41:39 UTC
I need to put this back and add a fix - the timeout was not set correctly with the original patch. The fix:

https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=e4c7236c07056ddc9af3ba6ad2314d649d6814f9

Sorry.

Comment 23 Marian Csontos 2013-10-07 09:19:00 UTC
FYI, I am still running into the issue in 6.5 VMs (latest 6.5 compose with upstream LVM, including the 3 minute timeout). There are other VMs running on the host which are rather I/O intensive. This may not be a realistic scenario, but udev timing out may still happen under high load.

Comment 25 Marian Csontos 2013-10-15 08:41:25 UTC
After changing the I/O elevator to deadline on both the host and the VMs, this is no longer reproducible.
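
For reference, the elevator switch uses the usual sysfs interface (<dev> stands for each backing disk):

  cat /sys/block/<dev>/queue/scheduler               # current scheduler shown in brackets
  echo deadline > /sys/block/<dev>/queue/scheduler   # switch to deadline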

Comment 26 Nenad Peric 2013-11-13 12:28:50 UTC
Ran 25 iterations of the loop, and even though it is slow, it never got stuck. This was on real HW, not VMs.

Marking verified with lvm2-2.02.103-4.el7

Comment 27 Ludek Smid 2014-06-13 12:50:13 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.


Note You need to log in before you can comment on or make changes to this bug.