Bug 951721

Summary: multiple large raid volume create attempts can deadlock
Product: Red Hat Enterprise Linux 7
Component: lvm2
Sub component: Default / Unclassified
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Keywords: Triaged
Target Milestone: rc
Reporter: Corey Marthaler <cmarthal>
Assignee: LVM and device-mapper development team <lvm-team>
QA Contact: cluster-qe <cluster-qe>
CC: agk, cmarthal, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, zkabelac
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-06-04 22:03:24 UTC

Description Corey Marthaler 2013-04-12 21:13:41 UTC
Description of problem:
If I attempt multiple 3G or 4G raid creations back to back, the second attempt can deadlock. I have not been able to reproduce this with smaller 1G or 2G volumes.

[root@qalvm-01 ~]# pvscan
  PV /dev/vda1                      lvm2 [10.00 GiB]
  PV /dev/vdb1                      lvm2 [10.00 GiB]
  PV /dev/vdc1                      lvm2 [10.00 GiB]
  PV /dev/vdd1                      lvm2 [10.00 GiB]
  PV /dev/vde1                      lvm2 [10.00 GiB]
  PV /dev/vdf1                      lvm2 [10.00 GiB]
  PV /dev/vdg1                      lvm2 [10.00 GiB]
  PV /dev/vdh1                      lvm2 [10.00 GiB]
  Total: 9 [104.48 GiB] / in use: 1 [24.51 GiB] / in no VG: 8 [79.97 GiB]
[root@qalvm-01 ~]# vgcreate foo /dev/vd[abcdefgh]1
  Volume group "foo" successfully created

[root@qalvm-01 ~]# lvcreate --type raid10 -m 1 -L 3G -n lv1 foo
  Logical volume "lv1" created
[root@qalvm-01 ~]# lvcreate --type raid10 -m 1 -L 3G -n lv2 foo

[DEADLOCK]


 kernel: [  677.982083] md/raid10:mdX: not clean -- starting background reconstruction
 kernel: [  677.983045] md/raid10:mdX: active with 4 out of 4 devices
 kernel: [  677.983811] Choosing daemon_sleep default (5 sec)
 kernel: [  677.984476] created bitmap (3 pages) for device mdX
 kernel: [  681.270199] mdX: bitmap file is out of date, doing full recovery
 kernel: [  681.335417] mdX: bitmap initialized from disk: read 1 pages, set 6144 of 6144 bits
 lvm[996]: Monitoring RAID device foo-lv2 for events.
 kernel: [  682.854701] md: resync of RAID array mdX
 kernel: [  682.855475] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
 kernel: [  682.856450] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
 kernel: [  682.857954] md: using 128k window, over a total of 3145728k.
 systemd-udevd[367]: worker [1487] /devices/virtual/block/dm-19 timeout; kill it
 systemd-udevd[367]: seq 2247 '/devices/virtual/block/dm-19' killed
 systemd-udevd[367]: worker [1487] terminated by signal 9 (Killed)
 kernel: [  719.586068] md: mdX: resync done.
 lvm[996]: raid10 array, foo-lv1, is now in-sync.
 kernel: [  810.010242] md: mdX: resync done.
 lvm[996]: raid10 array, foo-lv2, is now in-sync.
 systemd[1]: Starting Cleanup of Temporary Directories...
 systemd[1]: Started Cleanup of Temporary Directories.


 kernel: [ 1688.401010] lvcreate        S ffff88007fd14180     0  1416    897 0x00000080
 kernel: [ 1688.401010]  ffff880079871cf8 0000000000000086 ffff88007c1bb440 ffff880079871fd8
 kernel: [ 1688.401010]  ffff880079871fd8 ffff880079871fd8 ffff88007c01ce60 ffff88007c1bb440
 kernel: [ 1688.401010]  ffff880079871ce8 ffff88007c1bb440 ffffffff819665c8 0000000000000000
 kernel: [ 1688.401010] Call Trace:
 kernel: [ 1688.401010]  [<ffffffff8160dba9>] schedule+0x29/0x70
 kernel: [ 1688.401010]  [<ffffffff812893fd>] sys_semtimedop+0x5fd/0x9c0
 kernel: [ 1688.401010]  [<ffffffff8113f1b1>] ? free_pages+0x61/0x70
 kernel: [ 1688.401010]  [<ffffffff8115d77a>] ? tlb_finish_mmu+0x3a/0x50
 kernel: [ 1688.401010]  [<ffffffff811651ab>] ? unmap_region+0xdb/0x120
 kernel: [ 1688.401010]  [<ffffffff812983e9>] ? security_ipc_permission+0x19/0x20
 kernel: [ 1688.401010]  [<ffffffff81285d0c>] ? ipcperms+0xac/0x130
 kernel: [ 1688.401010]  [<ffffffff81186c08>] ? kmem_cache_free+0x38/0x1d0
 kernel: [ 1688.401010]  [<ffffffff81298606>] ? security_sem_associate+0x16/0x20
 kernel: [ 1688.401010]  [<ffffffff81287889>] ? sem_security+0x9/0x10
 kernel: [ 1688.401010]  [<ffffffff81286001>] ? ipcget+0x101/0x210
 kernel: [ 1688.401010]  [<ffffffff812897d0>] sys_semop+0x10/0x20
 kernel: [ 1688.401010]  [<ffffffff81617359>] system_call_fastpath+0x16/0x1b
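
The trace shows lvcreate sleeping in sys_semtimedop, i.e. waiting on a SysV semaphore rather than on I/O; LVM uses such semaphores as udev "cookies" to wait for udev rule processing to finish. As a diagnostic sketch (not part of the original report), the outstanding cookies and semaphores can be inspected from another shell while the command is stuck:

  # List outstanding udev cookies and their semaphore IDs
  dmsetup udevcookies
  # List SysV semaphores; the cookie's semid should appear here
  ipcs -s
  # If udev never decrements the cookie (e.g. the worker was killed, as in the
  # systemd-udevd "timeout; kill it" messages above), the wait can be released with:
  dmsetup udevcomplete_all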



Version-Release number of selected component (if applicable):
3.8.0-0.40.el7.x86_64

lvm2-2.02.99-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
lvm2-libs-2.02.99-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
lvm2-cluster-2.02.99-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
device-mapper-1.02.78-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
device-mapper-libs-1.02.78-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
device-mapper-event-1.02.78-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
device-mapper-event-libs-1.02.78-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013
cmirror-2.02.99-0.10.el7    BUILT: Wed Apr  3 08:28:34 CDT 2013



How reproducible:
Often (but not always)
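
A loop along these lines (a sketch only; it reuses the VG and sizes from the description, and the LV names are illustrative) can be used to hit the window repeatedly:

  # Alternate creating two 3G raid10 LVs, then clean up; the second create is
  # the one that intermittently hangs while the first is still resyncing.
  while :; do
      lvcreate --type raid10 -m 1 -L 3G -n lv1 foo
      lvcreate --type raid10 -m 1 -L 3G -n lv2 foo
      lvremove -ff foo/lv1 foo/lv2
  done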

Comment 1 Jonathan Earl Brassow 2013-06-04 21:57:29 UTC
Is this a duplicate of bug 956387?

Looks like you are working with the same RAID type and everything...

Capture the '-vvvv' output of the lvcreate command.  If the last thing printed before getting stuck is:
#activate/fs.c:489         Syncing device names
#libdm-common.c:2033         Udev cookie 0xd4d22b7 (semid 688128) decremented to 1
#libdm-common.c:2290         Udev cookie 0xd4d22b7 (semid 688128) waiting for zero

then this bug should be closed as a duplicate.
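
A capture along these lines (a sketch; same invocation as in the description, log path is illustrative) would show whether the last lines match:

  lvcreate -vvvv --type raid10 -m 1 -L 3G -n lv2 foo 2>&1 | tee /tmp/lvcreate-vvvv.log
  # If the command hangs, check the end of the log from another terminal:
  tail -n 5 /tmp/lvcreate-vvvv.log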

Comment 2 Corey Marthaler 2013-06-04 22:03:24 UTC
I believe you are correct.

*** This bug has been marked as a duplicate of bug 956387 ***