Bug 1006945
Summary: | Udev resource limit introduced in recent versions of udev causing deadlock with lvm2+lvmetad+autoactivation | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Peter Rajnoha <prajnoha> |
Component: | lvm2 | Assignee: | Peter Rajnoha <prajnoha> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 6.5 | CC: | agk, cmarthal, dwysocha, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | lvm2-2.02.100-3.el6 | Doc Type: | Bug Fix |
Doc Text: |
This bug appeared during testing of unreleased version and it has been fixed. No doc text needed.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2013-11-21 23:27:49 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Peter Rajnoha
2013-09-11 14:32:05 UTC
To QA: to test this issue, the following steps should prove the fix: -> change /lib/udev/rules.d/69-dm-lvm-metad.rules and replace: RUN+="$env{LVM_SBIN_PATH}/lvm pvscan --cache --background --activate ay --major $major --minor $minor", ENV{LVM_SCANNED}="1" with RUN+="$env{LVM_SBIN_PATH}/lvm pvscan --cache --activate ay --major $major --minor $minor", ENV{LVM_SCANNED}="1" (iow, remove the --background option) -> kill existing udevd and run a new one with debug log [root@rhel6-a ~]# killall udevd [root@rhel6-a ~]# udevd --debug -> change max children to 1 (to create the tightest udev processing bottleneck) [root@rhel6-a ~]# udevadm control --max-childs=1 -> check udev log that the max childre is really 1 1383224102.414987 [2088] handle_ctrl_msg: udevd message (SET_MAX_CHILDS) received, max_childs=1 -> create VG/LV [root@rhel6-a ~]# vgcreate vg /dev/sda Volume group "vg" successfully created [root@rhel6-a ~]# lvcreate -l1 vg Logical volume "lvol0" created [root@rhel6-a ~]# lvchange -an vg -> check there are no dm udev_sync cookies in the system [root@rhel6-a ~]# dmsetup udevcookies Cookie Semid Value Last semop time Last change time -> trigger the add event (like it happens on device hotplug or coldplug) [root@rhel6-a ~]# echo add > /sys/block/sda/uevent -> now, the LV gets activated via the pvscan --cache called in udev rules, BUT there's a dangling symlink... [root@rhel6-a ~]# dmsetup udevcookies Cookie Semid Value Last semop time Last change time 0xd4d08b2 589831 1 Thu Oct 31 13:50:51 2013 Thu Oct 31 13:50:51 2013 -> also, there's blocked pvscan --cache call (from the udev rule) because udev can't do more than one process at a time (the pvscan --cache waits for udev notifying it about the LV activation completion, but LV activation is *another* event and that can't be handled because of the udev limit) [root@rhel6-a ~]# ps aux | grep lvm root 1028 0.0 1.5 184144 16296 ? Ssl 13:41 0:00 lvmetad root 2036 0.0 1.7 36148 17396 pts/0 S<+ 13:50 0:00 /sbin/lvm pvscan --cache --activate ay --major 8 --minor 0 root 2062 0.0 0.0 6392 664 pts/1 S+ 13:52 0:00 grep lvm -> after ~3 minutes, the timeout happens (code 0x100 is timeout) 1383224031.389869 [1921] handle_signal: worker [1927] exit 1383224031.389889 [1921] handle_signal: worker [1927] unexpectedly returned with status 0x0100 1383224031.389895 [1921] handle_signal: worker [1927] failed while handling '/devices/pci0000:00/0000:00:07.0/host2/target2:0:0/2:0:0:0/block/sda' 1383224031.390329 [1921] event_queue_delete: seq 1516 done with -32 1383224031.390339 [1921] worker_unref: worker [1927] cleaned up -> and leaving dangling cookies in the system as consequence and evidence Cookie Semid Value Last semop time Last change time 0xd4d08b2 589831 1 Thu Oct 31 13:50:51 2013 Thu Oct 31 13:50:51 2013 [root@rhel6-a ~]# rpm -q udev udev-147-2.51.el6.x86_64 Now, to test the fix, do the same with the new --background option back in the udev rule mentioned at the beginning. At this time, here should be no dangling dm udev_sync cookie seen in the dmsetup udevcookies (not taking into accound any old ones from previous test) and no timeout in udev... (or you can also do the same with echo 1 > /sys/block/sda/device/delete and then scsi-rescan --forcescan for the device to be plugged "more officially" then just issuing the "echo add" artificially) Tested with and without the fix prduced expected results. Without the fix a background process is stuck and a semaphore/cookie is still present and waiting: [root@virt-008 ~]# dmsetup udevcookies Cookie Semid Value Last semop time Last change time 0xd4d1bdb 294916 4 Thu Oct 31 15:12:16 2013 Thu Oct 31 15:12:16 2013 [root@virt-008 ~]# ps -ef | grep lvm root 743 1 0 14:56 ? 00:00:00 lvmetad root 2371 1 0 15:12 ttyS0 00:00:00 /sbin/lvm pvscan --cache --activate ay --major 8 --minor 209 After a few mins (udev debug log): 1383228916.230809 [2310] handle_signal: worker [2329] unexpectedly returned with status 0x0100 With the fix applied, it works differently: [root@virt-008 ~]# dmsetup udevcookies Cookie Semid Value Last semop time Last change time [root@virt-008 ~]# ps -ef | grep lvm root 743 1 0 14:56 ? 00:00:00 lvmetad Marking VERIFIED with lvm2-2.02.100-8.el6.x86_64 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1704.html |