Bug 816724
Summary: udev issues when running lvm regression tests

Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: high
Reporter: Corey Marthaler <cmarthal>
Assignee: Peter Rajnoha <prajnoha>
QA Contact: cluster-qe <cluster-qe>
CC: agk, bubrown, byodlows, cmackows, cmarthal, cww, daniel.brnak, dwysocha, heinzm, jbrassow, loberman, mcsontos, msnitzer, nperic, prajnoha, prockai, pvrabec, sauchter, sunair, thornber, yaoal1, zkabelac
Target Milestone: rc
Target Release: ---
Keywords: Reopened
Doc Type: If docs needed, set a value
Clones: 1173739 (view as bug list)
Bug Blocks: 464877, 886216, 1173739, 1374441
Type: Bug
Last Closed: 2017-06-14 19:01:39 UTC
Description
Corey Marthaler
2012-04-26 19:25:39 UTC
Another cmd that can cause this: deactivating VG mirror_1_1021 on taft-01

EXPORTING VOLUME GROUP on taft-01

Apr 26 14:43:29 taft-01 qarshd[5525]: Running cmdline: vgexport mirror_1_1021
Apr 26 14:46:29 taft-01 udevd[613]: worker [4595] unexpectedly returned with status 0x0100
Apr 26 14:46:29 taft-01 udevd[613]: worker [4595] failed while handling '/devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.'
Apr 26 14:46:29 taft-01 udevd[613]: worker [5288] unexpectedly returned with status 0x0100
Apr 26 14:46:29 taft-01 udevd[613]: worker [5288] failed while handling '/devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.'

I guess the udevd worker is stalled and then returns with error code 0x0100 because of the timeout - the process it forks is probably hung. That might be because of the global lock being held (we've already had one such situation with "pvmove").

I wonder if changing the line:

ENV{ID_FS_TYPE}=="LVM2_member|LVM1_member", RUN+="$env{LVM_SBIN_PATH}/lvm pvscan --cache --major $major --minor $minor"

in /lib/udev/rules.d/69-dm-lvm-lvmetad.rules into:

ENV{ID_FS_TYPE}=="LVM2_member|LVM1_member", RUN+="$env{LVM_SBIN_PATH}/lvm pvscan --cache --config 'global{locking_type=4}' --major $major --minor $minor"

...would help here (so changing the locking to "no locking"). Could you please try that? Thanks.

I'm no longer seeing these 'udevd[]: worker failed' messages while running with lvmetad on with the latest rpms, closing this WORKSFORME.

2.6.32-269.el6.x86_64
lvm2-2.02.95-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
lvm2-libs-2.02.95-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
lvm2-cluster-2.02.95-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
udev-147-2.41.el6 BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-libs-1.02.74-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-event-1.02.74-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-event-libs-1.02.74-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012
cmirror-2.02.95-8.el6 BUILT: Wed May 9 03:33:32 CDT 2012

I'm seeing these again while doing 6.3.z lvm regression testing.

lvm_cluster_seven_ten: SCENARIO - [split_using_invalid_lv]
lvm_cluster_seven_ten: Attempt to split using invalid LV name on taft-02
lvm_cluster_seven_ten: create a logical volume
lvm_cluster_seven_ten: deactivating seven/invalid
lvm_cluster_seven_ten: One or more specified logical volume(s) not found.
lvm_cluster_seven_ten: deactivation failed
lvm_cluster_seven_ten: Deactivating and removing volume groups...
lvm_cluster_seven_ten: Error locking on node taft-02: Command timed out
lvm_cluster_seven_ten: Error locking on node taft-01: Command timed out
lvm_cluster_seven_ten: Error locking on node taft-01: Command timed out
lvm_cluster_seven_ten: Unable to deactivate logical volume "linear"
lvm_cluster_seven_ten: Error locking on node taft-01: Command timed out
lvm_cluster_seven_ten: couldn't remove LVs from seven

Oct 19 14:03:31 taft-01 qarshd[19695]: Running cmdline: vgremove seven
Oct 19 14:03:32 taft-01 xinetd[1837]: START: qarsh pid=19714 from=::ffff:10.15.80.47
Oct 19 14:03:32 taft-01 xinetd[1837]: EXIT: qarsh status=0 pid=19695 duration=1(sec)
Oct 19 14:03:32 taft-01 qarshd[19714]: Talking to peer 10.15.80.47:45235 (IPv4)
Oct 19 14:03:32 taft-01 qarshd[19714]: Running cmdline: vgs ten
Oct 19 14:03:32 taft-01 xinetd[1837]: EXIT: qarsh status=0 pid=19714 duration=0(sec)
Oct 19 14:08:33 taft-01 udevd[613]: worker [19184] unexpectedly returned with status 0x0100
Oct 19 14:08:33 taft-01 udevd[613]: worker [19184] failed while handling '/devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.0/host3'
Oct 19 14:08:33 taft-01 udevd[613]: worker [19187] unexpectedly returned with status 0x0100
Oct 19 14:08:33 taft-01 udevd[613]: worker [19187] failed while handling '/devices/virtual/block/dm-3'

Comment #5 occurred with lvmetad turned off, and with the following rpms:

2.6.32-279.el6.x86_64
lvm2-2.02.95-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
lvm2-libs-2.02.95-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
lvm2-cluster-2.02.95-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
udev-147-2.41.el6 BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
device-mapper-libs-1.02.74-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
device-mapper-event-1.02.74-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
device-mapper-event-libs-1.02.74-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012
cmirror-2.02.95-10.el6_3.2 BUILT: Tue Oct 16 10:56:31 CDT 2012

Another test case that hit this:

lvm_cluster_seven_ten SCENARIO - [split_lv_from_vg_with_mirror]
lvm_cluster_seven_ten Split out a lv from vg with additional mirror on taft-02
lvm_cluster_seven_ten free pvs for taft-02: /dev/sde1 /dev/sdh1 /dev/sdg1 /dev/sdf1 /dev/sdd1
lvm_cluster_seven_ten create a linear and a mirror in the same vg (different pvs)
lvm_cluster_seven_ten Error locking on node taft-04: Command timed out
lvm_cluster_seven_ten Error locking on node taft-04: Command timed out
lvm_cluster_seven_ten Unable to drop cached metadata for VG seven.
lvm_cluster_seven_ten Failed to initialise mirror log.
lvm_cluster_seven_ten Error locking on node taft-04: Command timed out
lvm_cluster_seven_ten Unable to drop cached metadata for VG seven.
lvm_cluster_seven_ten Manual intervention may be required to remove abandoned LV(s) before retrying.
lvm_cluster_seven_ten Error locking on node taft-04: Command timed out
lvm_cluster_seven_ten couldn't create logical volume

Oct 19 14:46:58 taft-02 qarshd[12933]: Running cmdline: lvcreate -m 1 -n mirror -L 100M seven /dev/sdh1 /dev/sdg1 /dev/sdf1
Oct 19 14:46:55 taft-04 udevd[596]: worker [12628] unexpectedly returned with status 0x0100
Oct 19 14:46:55 taft-04 udevd[596]: worker [12628] failed while handling '/devices/virtual/block/dm-3'

Is this reproducible *without* /lib/udev/rules.d/69-dm-lvmetad.rules? Please try removing these rules temporarily and setting global/use_lvmetad=0.
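As a rough sketch of how the two suggestions above could be tried on a RHEL 6 test node (the rule file name, the sed patterns, and the lvm2-lvmetad service name are assumptions here; verify them against the installed packages before running anything):

# 1) Add the locking_type=4 override to the pvscan rule (keep a backup first)
cp /lib/udev/rules.d/69-dm-lvm-lvmetad.rules /root/69-dm-lvm-lvmetad.rules.bak
sed -i "s|pvscan --cache --major|pvscan --cache --config 'global{locking_type=4}' --major|" \
    /lib/udev/rules.d/69-dm-lvm-lvmetad.rules
udevadm control --reload-rules

# 2) Or take the lvmetad rules out of the picture entirely and disable lvmetad
mv /lib/udev/rules.d/69-dm-lvm-lvmetad.rules /root/69-dm-lvm-lvmetad.rules.disabled
udevadm control --reload-rules
sed -i 's/use_lvmetad = 1/use_lvmetad = 0/' /etc/lvm/lvm.conf   # or edit lvm.conf by hand
service lvm2-lvmetad stop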
Both scenarios are passing on:

a) latest RHEL6.4 builds (regardless of the lvmetad udev rules)
b) older builds with the udev rules removed (I have not verified the 6.3.z version)

If this problem is hit in a cluster, there should now be a warning message that lvmetad + cluster is not supported yet (bug #814779 tracks this). Please move this back to ASSIGNED if tests still fail.

I still have 100% success reproducing the issue with the latest RHEL6.4 build (lvm2-2.02.98-3.el6). With a cluster VG vg, run:

VG=vg
LV=lv1

lvcreate -n $LV -L 100M $VG
pvcreate /dev/$VG/$LV
lvchange -an $VG/$LV
lvchange -ay $VG/$LV

This usually results in a udev worker dying after pvscan fails due to a clvmd timeout, and it often leaves a udev cookie hanging around, blocking subsequent lvm calls.

Attempting to activate $VG/$LV with the stacked PV:
- acquires the dlm lock
- the lv node is created
- udev fires the 69* lvmetad rules
- pvscan --cache is called
- blocks on the dlm lock

Created attachment 641533 [details]
Do not take global lock when not needed

Also a scratch build with this patch: https://brewweb.devel.redhat.com/taskinfo?taskID=5067280

I've committed the fix upstream (just a temporary fix until we properly support lvmetad + cluster):
http://git.fedorahosted.org/cgit/lvm2.git/commit/?id=fc2644ae71a87ea81979f66022b98ec49c692a66

Looks good. I left tests running over the weekend and saw no related failures.

Ran both mcsontos' tests and a full seven_ten array of tests 10 times without issues or failures. Tested with:

lvm2-2.02.98-4.el6.x86_64
lvm2-cluster-2.02.98-4.el6.x86_64
cmirror-2.02.98-4.el6.x86_64
device-mapper-1.02.77-4.el6.x86_64
kernel-2.6.32-347.el6.x86_64

I'm still seeing these with the latest rpms, different test case. This was *WITH* lvmetad turned on and running.

============================================================
Iteration 2 of 2 started at Thu Dec 13 11:56:23 CST 2012
============================================================
SCENARIO - [open_EXT4_fsadm_resize_attempt]
Create mirror, add fs, and then attempt to resize it while it's mounted
taft-01: lvcreate -m 1 -n open_fsadm_resize -L 4G --nosync mirror_sanity
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Placing an EXT4 on open_fsadm_resize volume
mke2fs 1.41.12 (17-May-2010)
Attempt to resize the open mirrored filesystem multiple times with lvextend/fsadm on taft-01
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
(lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize)
resize2fs 1.41.12 (17-May-2010)
Deactivating mirror open_fsadm_resize...
[HANG] Dec 13 11:57:31 taft-01 qarshd[12094]: Running cmdline: lvchange -an /dev/mirror_sanity/open_fsadm_resize Dec 13 12:00:31 taft-01 udevd[589]: worker [10978] unexpectedly returned with status 0x0100 Dec 13 12:00:31 taft-01 udevd[589]: worker [10978] failed while handling '/devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.0/host3/rport-3:0-2/target3:0:0/3:0:0:3/block/sdd/sdd2' Dec 13 12:00:31 taft-01 udevd[589]: worker [11066] unexpectedly returned with status 0x0100 Dec 13 12:00:31 taft-01 udevd[589]: worker [11066] failed while handling '/devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.0/host3/rport-3:0-2/target3:0:0/3:0:0:2/block/sdc/sdc1' Dec 13 12:00:31 taft-01 udevd[589]: worker [11070] unexpectedly returned with status 0x0100 Dec 13 12:00:31 taft-01 udevd[589]: worker [11070] failed while handling '/devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.0/host3/rport-3:0-2/target3:0:0/3:0:0:6/block/sdg/sdg2' 2.6.32-343.el6.x86_64 lvm2-2.02.98-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 lvm2-libs-2.02.98-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 lvm2-cluster-2.02.98-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 udev-147-2.43.el6 BUILT: Thu Oct 11 05:59:38 CDT 2012 device-mapper-1.02.77-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 device-mapper-libs-1.02.77-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 device-mapper-event-1.02.77-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 device-mapper-event-libs-1.02.77-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 cmirror-2.02.98-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 (In reply to comment #18) > I'm still seeing these with the latest rpms, different test case. This was > *WITH* lvmetad turned on and running. > > ============================================================ > Iteration 2 of 2 started at Thu Dec 13 11:56:23 CST 2012 > ============================================================ > SCENARIO - [open_EXT4_fsadm_resize_attempt] > Create mirror, add fs, and then attempt to resize it while it's mounted > taft-01: lvcreate -m 1 -n open_fsadm_resize -L 4G --nosync mirror_sanity > WARNING: New mirror won't be synchronised. Don't read what you didn't > write! > Placing an EXT4 on open_fsadm_resize volume > mke2fs 1.41.12 (17-May-2010) > Attempt to resize the open mirrored filesystem multiple times with > lvextend/fsadm on taft-01 > (lvextend -L +3G -r /dev/mirror_sanity/open_fsadm_resize) > resize2fs 1.41.12 (17-May-2010) ... > Deactivating mirror open_fsadm_resize... > [HANG] > > > Dec 13 11:57:31 taft-01 qarshd[12094]: Running cmdline: lvchange -an > /dev/mirror_sanity/open_fsadm_resize > Dec 13 12:00:31 taft-01 udevd[589]: worker [10978] unexpectedly returned > with status 0x0100 > device-mapper-event-libs-1.02.77-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 > cmirror-2.02.98-5.el6 BUILT: Wed Dec 12 09:18:07 CST 2012 Unfortunately, I can't reproduce this no matter how hard I try. So, Corey, I'll need your assistance and help, please. Would it be possible to see where the deactivation hangs? It might be an LVM lock or udev_sync sempahore (try running the lvchange -an/vgchange -an with -vvvv and we should see more then I hope) Also, would it be possible to run this exact test case with what was suggested in comment #2 (to see whether lvm locks are interfering with each other in some way). And one more question, just to make sure - this is reproducible only with lvmetad on, right? Also, in addition, would it be possible to enable udev debugging for this test case? 
(setting the udev_log="debug" in /etc/udev/udev.conf and then restarting udevd daemon before the test) (...starting udev daemon with "udevd --debug" and grabbing its output) This problem might be related to the high number of events occuring and being processed, there's a very similar problem reported here as well: bug #885978. I believe the source of the problem is the same. I'm inspecting that right now... Corey, could you please send the list of processes (ps aux) when the hang occurs? This is definitely a timeout caused either by a process locking out (maybe the pvscan and the global lock that is used there) or just udev event processing taking too long (like that blkid in bug #885978). Thanks. SCENARIO (raid10) - [verify_copy_percents] Create nosync and regular raids and verify their correct copy percents Creating nosync... hayes-01: lvcreate --type raid10 -i 2 -n nosync -L 1G --nosync raid_sanity WARNING: New raid10 won't be synchronised. Don't read what you didn't write! Verifing percent is finished at 100% Creating regular raid... hayes-01: lvcreate --type raid10 -i 2 -n regular -L 45G raid_sanity Verifing percent is started at 0.XX% and removing Deactivating raid regular... [HANG] I'll attach the list of processes. Jan 10 14:53:42 hayes-01 qarshd[4292]: Running cmdline: lvchange -an /dev/raid_sanity/nosync Jan 10 14:56:42 hayes-01 udevd[519]: worker [4163] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4163] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p1' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4164] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4164] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p2' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4230] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4230] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p3' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4233] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4233] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p4' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4236] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4236] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p5' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4240] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4240] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p6' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4245] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4245] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p7' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4257] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4257] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p9' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4251] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4251] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p8' Jan 10 14:56:42 hayes-01 udevd[519]: worker [4278] unexpectedly returned with status 0x0100 Jan 10 14:56:42 hayes-01 udevd[519]: worker [4278] failed while handling '/devices/virtual/block/etherd!e1.1/etherd!e1.1p10' Jan 10 14:57:23 hayes-01 kernel: INFO: task 
lvchange:4293 blocked for more than 120 seconds. Jan 10 14:57:23 hayes-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 10 14:57:23 hayes-01 kernel: lvchange D 0000000000000002 0 4293 4292 0x00000080 Jan 10 14:57:23 hayes-01 kernel: ffff88011d0ffac8 0000000000000086 ffff88011d0ffa88 ffffffffa000441e Jan 10 14:57:23 hayes-01 kernel: ffff88011d0ffa98 0000000027535c1d ffff88011d0ffab8 ffff88011c479200 Jan 10 14:57:23 hayes-01 kernel: ffff88011cbb0638 ffff88011d0fffd8 000000000000fb88 ffff88011cbb0638 Jan 10 14:57:23 hayes-01 kernel: Call Trace: Jan 10 14:57:23 hayes-01 kernel: [<ffffffffa000441e>] ? dm_table_unplug_all+0x8e/0x100 [dm_mod] Jan 10 14:57:23 hayes-01 kernel: [<ffffffff810a1aa9>] ? ktime_get_ts+0xa9/0xe0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff8150d983>] io_schedule+0x73/0xc0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811be84e>] __blockdev_direct_IO_newtrunc+0x6de/0xb30 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811901ef>] ? do_lookup+0x9f/0x230 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811becfe>] __blockdev_direct_IO+0x5e/0xd0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811bb0d0>] ? blkdev_get_blocks+0x0/0xc0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811bc197>] blkdev_direct_IO+0x57/0x60 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811bb0d0>] ? blkdev_get_blocks+0x0/0xc0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff8111bb8b>] generic_file_aio_read+0x6bb/0x700 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff81223041>] ? avc_has_perm+0x71/0x90 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811bcb69>] ? __blkdev_get+0x1a9/0x3b0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811bb6e3>] blkdev_aio_read+0x53/0xc0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff81180cea>] do_sync_read+0xfa/0x140 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811bb5bc>] ? block_ioctl+0x3c/0x40 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff81194d22>] ? vfs_ioctl+0x22/0xa0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff81228a6b>] ? selinux_file_permission+0xfb/0x150 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff8121b946>] ? security_file_permission+0x16/0x20 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff811815d5>] vfs_read+0xb5/0x1a0 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff81181711>] sys_read+0x51/0x90 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff810dc565>] ? __audit_syscall_exit+0x265/0x290 Jan 10 14:57:23 hayes-01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Created attachment 676588 [details]
running processes during hang
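The information requested in the comments above can be collected on the hung node with standard tools; the following is only a sketch (the LV path is taken from the failing scenario, and the sysrq step assumes sysrq is enabled):

# Re-run the failing deactivation with maximum verbosity to see where it blocks
lvchange -an -vvvv /dev/mirror_sanity/open_fsadm_resize 2> /tmp/lvchange-vvvv.log

# Check for udev cookies (udev_sync semaphores) left behind by device-mapper
dmsetup udevcookies
ipcs -s

# Capture the process list and dump blocked (D state) tasks to the kernel log
ps auxww > /tmp/ps-during-hang.txt
echo w > /proc/sysrq-trigger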
Corey, there's a scratch build for udev from bug #885978 comment #25 where udev adds some throttling while doing forks. This may resolve the issue we see here as well; it might be worth a try...

The devel NAK still applies for 6.5. I believe the udev fix from comment #34 should help here as well. We need to retest...

I have been unable to reproduce this issue in any of the latest rhel6.5 builds. Closing and will reopen if ever seen again.

Back again in rhel6.6 while doing cache volume rename (with snapshots) testing.

SCENARIO - [cache_pool_rename]
[...]
syncing before snap creation...
Making 2nd snapshot of origin volume
lvcreate -s /dev/cache_sanity/rename_orig_B -c 64 -n snap2 -L 500M
[DEADLOCK]

Jun 27 16:48:00 host-002 qarshd[4076]: Running cmdline: lvcreate -s /dev/cache_sanity/rename_orig_B -c 64 -n snap2 -L 500M
Jun 27 16:48:01 host-002 lvm[3974]: No longer monitoring snapshot cache_sanity-snap1
Jun 27 16:48:02 host-002 lvm[3974]: Monitoring snapshot cache_sanity-snap1
Jun 27 16:48:02 host-002 lvm[3974]: Monitoring snapshot cache_sanity-snap2
Jun 27 16:48:02 host-002 kernel: device-mapper: cache: promotion failed; couldn't update on disk metadata
Jun 27 16:51:02 host-002 udevd[444]: worker [4097] unexpectedly returned with status 0x0100
Jun 27 16:51:02 host-002 udevd[444]: worker [4097] failed while handling '/devices/virtual/block/dm-6'
Jun 27 16:51:02 host-002 udevd[444]: worker [4098] unexpectedly returned with status 0x0100
Jun 27 16:51:02 host-002 udevd[444]: worker [4098] failed while handling '/devices/virtual/block/dm-9'

2.6.32-485.el6.x86_64
lvm2-2.02.107-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
lvm2-libs-2.02.107-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
lvm2-cluster-2.02.107-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
udev-147-2.55.el6 BUILT: Wed Jun 18 06:30:21 CDT 2014
device-mapper-1.02.86-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
device-mapper-libs-1.02.86-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
device-mapper-event-1.02.86-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
device-mapper-event-libs-1.02.86-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014
device-mapper-persistent-data-0.3.2-1.el6 BUILT: Fri Apr 4 08:43:06 CDT 2014
cmirror-2.02.107-1.el6 BUILT: Mon Jun 23 09:44:45 CDT 2014

Does this always happen when running that test (I haven't hit this yet), or only occasionally? Does it happen with the latest build, lvm2-2.02.109-2 (or a later build if there is one)? What were the exact steps taken that led to the problem? Was it:
- creating a cache pool
- creating a cache volume
- renaming the cache volume (or cache pool?)
- creating a snapshot of the cache volume

I've only seen these messages a few times lately, all three times during raid10 creations, and they never caused a problem or deadlock. There's really no new information on this in 6.7. I'm lowering the severity and priority until this can be reliably reproduced or starts causing deadlocks.
2.6.32-554.el6.x86_64 lvm2-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 lvm2-libs-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 lvm2-cluster-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 udev-147-2.61.el6 BUILT: Mon Mar 2 05:08:11 CST 2015 device-mapper-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-libs-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-event-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-event-libs-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-persistent-data-0.3.2-1.el6 BUILT: Fri Apr 4 08:43:06 CDT 2014 cmirror-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 May 6 16:53:48 host-082 qarshd[27772]: Running cmdline: lvcreate --type raid10 -i 3 -n non_synced_random_raid10_3legs_3 -L 3G black_bird /dev/sdb1:0-2400 /dev/sdc1:0-2400 /dev/sdf1:0-2400 /dev/sdd1:0-2400 /dev/sda1:0-2400 /dev/sdh1:0-2400 May 6 16:57:23 host-082 qarshd[27921]: Running cmdline: lvs -a -o +devices May 6 16:57:47 host-082 udevd[514]: worker [27838] unexpectedly returned with status 0x0100 May 6 16:57:47 host-082 udevd[514]: worker [27838] failed while handling '/devices/virtual/block/dm-40' May 5 22:24:45 host-073 qarshd[22526]: Running cmdline: lvcreate --type raid10 -i 3 -n non_synced_primary_raid10_3legs_3 -L 3G black_bird /dev/sdg1:0-2400 /dev/sdh1:0-2400 /dev/sde1:0-2400 /dev/sdd1:0-2400 /dev/sdb1:0-2400 /dev/sdf1:0-2400 May 5 22:27:47 host-073 qarshd[22676]: Running cmdline: lvs -a -o +devices May 5 22:28:15 host-073 udevd[514]: worker [22329] unexpectedly returned with status 0x0100 May 5 22:28:15 host-073 udevd[514]: worker [22329] failed while handling '/devices/virtual/block/dm-40' May 1 12:34:16 host-073 qarshd[9810]: Running cmdline: lvcreate --type raid10 -i 3 -n non_synced_random_raid10_3legs_3 -L 3G black_bird /dev/sdc1:0-2400 /dev/sdd1:0-2400 /dev/sdg1:0-2400 /dev/sde1:0-2400 /dev/sda1:0-2400 /dev/sdh1:0-2400 May 1 12:38:13 host-073 udevd[516]: worker [9350] unexpectedly returned with status 0x0100 May 1 12:38:13 host-073 udevd[516]: worker [9350] failed while handling '/devices/virtual/block/dm-40' This Bugzilla has been reviewed by Red Hat and is not planned on being addressed in Red Hat Enterprise Linux 6 and therefore will be closed. If this bug is critical to production systems, please contact your Red Hat support representative and provide sufficient business justification. This is pretty easy to reproduce in 6.8 if we want to try and solve this again. Scenario kill_secondary_log_2_legs_2_logs: Kill secondary log of synced 2 leg redundant log mirror(s) ********* Mirror hash info for this scenario ********* * names: syncd_secondary_log_2legs_2logs_1 * sync: 1 * striped: 0 * leg devices: /dev/sdf1 /dev/sdc1 * log devices: /dev/sda1 /dev/sdg1 * no MDA devices: * failpv(s): /dev/sdg1 * failnode(s): host-116.virt.lab.msp.redhat.com * lvmetad: 1 * leg fault policy: remove * log fault policy: remove ****************************************************** Creating mirror(s) on host-116.virt.lab.msp.redhat.com... 
host-116.virt.lab.msp.redhat.com: lvcreate --type mirror --mirrorlog mirrored -m 1 -n syncd_secondary_log_2legs_2logs_1 -L 500M helter_skelter /dev/sdf1:0-2400 /dev/sdc1:0-2400 /dev/sda1:0-150 /dev/sdg1:0-150 Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices syncd_secondary_log_2legs_2logs_1 mwi-a-m--- 500.00m 9.60 syncd_secondary_log_2legs_2logs_1_mimage_0(0),syncd_secondary_log_2legs_2logs_1_mimage_1(0) [syncd_secondary_log_2legs_2logs_1_mimage_0] Iwi-aom--- 500.00m /dev/sdf1(0) [syncd_secondary_log_2legs_2logs_1_mimage_1] Iwi-aom--- 500.00m /dev/sdc1(0) [syncd_secondary_log_2legs_2logs_1_mlog] mwi-aom--- 4.00m 100.00 syncd_secondary_log_2legs_2logs_1_mlog_mimage_0(0),syncd_secondary_log_2legs_2logs_1_mlog_mimage_1(0) [syncd_secondary_log_2legs_2logs_1_mlog_mimage_0] iwi-aom--- 4.00m /dev/sda1(0) [syncd_secondary_log_2legs_2logs_1_mlog_mimage_1] iwi-aom--- 4.00m /dev/sdg1(0) Waiting until all mirror|raid volumes become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) PV=/dev/sdg1 syncd_secondary_log_2legs_2logs_1_mlog_mimage_1: 2 PV=/dev/sdg1 syncd_secondary_log_2legs_2logs_1_mlog_mimage_1: 2 Writing verification patterns (bck) to mirror(s) on... Verifying block io data on... ---- host-116.virt.lab.msp.redhat.com ---- Verifying leg /dev/mapper/helter_skelter-syncd_secondary_log_2legs_2logs_1_mimage_1 Verifying leg /dev/mapper/helter_skelter-syncd_secondary_log_2legs_2logs_1_mimage_0 Disabling device sdg on host-116.virt.lab.msp.redhat.com Getting recovery check start time from /var/log/messages: Feb 5 17:09 Attempting I/O to cause mirror down conversion(s) on host-116.virt.lab.msp.redhat.com dd if=/dev/zero of=/dev/helter_skelter/syncd_secondary_log_2legs_2logs_1 count=1 [deadlock] Feb 5 17:09:14 host-116 qarshd[6991]: Running cmdline: dd if=/dev/zero of=/dev/helter_skelter/syncd_secondary_log_2legs_2logs_1 count=1 Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 lvm[1793]: Secondary mirror device 253:3 has failed (D). Feb 5 17:09:14 host-116 lvm[1793]: Device failure in helter_skelter-syncd_secondary_log_2legs_2logs_1_mlog. Feb 5 17:09:14 host-116 lvm[1793]: /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error Feb 5 17:09:14 host-116 lvm[1793]: No PV label found on /dev/sdg1. Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 lvm[1793]: WARNING: Device for PV o81C0N-nPr9-Ue8Y-V1yY-rv3Y-kC5v-WILrLs not found or rejected by a filter. Feb 5 17:09:14 host-116 lvm[1793]: /dev/sdg1: read failed after 0 of 512 at 4096: Input/output error Feb 5 17:09:14 host-116 lvm[1793]: WARNING: Failed to write an MDA of VG helter_skelter. Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 lvm[1793]: Log device 253:4 has failed (D). Feb 5 17:09:14 host-116 lvm[1793]: Device failure in helter_skelter-syncd_secondary_log_2legs_2logs_1. Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 lvm[1793]: WARNING: Failed to write an MDA of VG helter_skelter. Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 multipathd: dm-3: remove map (uevent) Feb 5 17:09:14 host-116 multipathd: dm-3: devmap not registered, can't remove Feb 5 17:09:14 host-116 lvm[1793]: WARNING: Failed to write an MDA of VG helter_skelter. 
Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 multipathd: dm-2: remove map (uevent) Feb 5 17:09:14 host-116 multipathd: dm-2: devmap not registered, can't remove Feb 5 17:09:14 host-116 lvm[1793]: WARNING: Failed to write an MDA of VG helter_skelter. Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 kernel: sd 9:0:0:1: rejecting I/O to offline device Feb 5 17:09:14 host-116 lvm[1793]: Mirror log status: 1 of 2 images failed. Feb 5 17:11:52 host-116 kernel: INFO: task dd:6992 blocked for more than 120 seconds. Feb 5 17:11:52 host-116 kernel: Not tainted 2.6.32-610.el6.x86_64 #1 Feb 5 17:11:52 host-116 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Feb 5 17:11:52 host-116 kernel: dd D 0000000000000000 0 6992 6991 0x00000080 Feb 5 17:11:52 host-116 kernel: ffff88001b29fbf8 0000000000000082 ffff880000000000 000956d0b5b04c29 Feb 5 17:11:52 host-116 kernel: ffff88001b29fbb8 ffff88003d1c20c0 00000000000951a4 ffffffffa94ad9e0 Feb 5 17:11:52 host-116 kernel: 0000000010a178a6 ffffffff81aa8340 ffff88003e1ab068 ffff88001b29ffd8 Feb 5 17:11:52 host-116 kernel: Call Trace: Feb 5 17:11:52 host-116 kernel: [<ffffffff8112e2a0>] ? sync_page+0x0/0x50 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112e2a0>] ? sync_page+0x0/0x50 Feb 5 17:11:52 host-116 kernel: [<ffffffff81545853>] io_schedule+0x73/0xc0 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112e2dd>] sync_page+0x3d/0x50 Feb 5 17:11:52 host-116 kernel: [<ffffffff8154633f>] __wait_on_bit+0x5f/0x90 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112e513>] wait_on_page_bit+0x73/0x80 Feb 5 17:11:52 host-116 kernel: [<ffffffff810a6b40>] ? wake_bit_function+0x0/0x50 Feb 5 17:11:52 host-116 kernel: [<ffffffff81144635>] ? pagevec_lookup_tag+0x25/0x40 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112e93b>] wait_on_page_writeback_range+0xfb/0x190 Feb 5 17:11:52 host-116 kernel: [<ffffffff81143694>] ? generic_writepages+0x24/0x30 Feb 5 17:11:52 host-116 kernel: [<ffffffff811436c1>] ? do_writepages+0x21/0x40 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112ea8b>] ? __filemap_fdatawrite_range+0x5b/0x60 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112e9ff>] filemap_fdatawait+0x2f/0x40 Feb 5 17:11:52 host-116 kernel: [<ffffffff8112f024>] filemap_write_and_wait+0x44/0x60 Feb 5 17:11:52 host-116 kernel: [<ffffffff811d7594>] __sync_blockdev+0x24/0x50 Feb 5 17:11:52 host-116 kernel: [<ffffffff811d75d3>] sync_blockdev+0x13/0x20 Feb 5 17:11:52 host-116 kernel: [<ffffffff811d7758>] __blkdev_put+0x178/0x1b0 Feb 5 17:11:52 host-116 kernel: [<ffffffff811dd843>] ? 
fsnotify+0x113/0x160 Feb 5 17:11:52 host-116 kernel: [<ffffffff811d77a0>] blkdev_put+0x10/0x20 Feb 5 17:11:52 host-116 kernel: [<ffffffff811d77e3>] blkdev_close+0x33/0x60 Feb 5 17:11:52 host-116 kernel: [<ffffffff8119b2f5>] __fput+0xf5/0x210 Feb 5 17:11:52 host-116 kernel: [<ffffffff8119b435>] fput+0x25/0x30 Feb 5 17:11:52 host-116 kernel: [<ffffffff8119655d>] filp_close+0x5d/0x90 Feb 5 17:11:52 host-116 kernel: [<ffffffff81196635>] sys_close+0xa5/0x100 Feb 5 17:11:52 host-116 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b 2.6.32-610.el6.x86_64 lvm2-2.02.140-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 lvm2-libs-2.02.140-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 lvm2-cluster-2.02.140-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 udev-147-2.66.el6 BUILT: Mon Jan 18 02:42:20 CST 2016 device-mapper-1.02.114-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 device-mapper-libs-1.02.114-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 device-mapper-event-1.02.114-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 device-mapper-event-libs-1.02.114-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 device-mapper-persistent-data-0.6.0-2.el6 BUILT: Thu Jan 21 02:40:25 CST 2016 cmirror-2.02.140-3.el6 BUILT: Thu Jan 21 05:40:10 CST 2016 Created attachment 1121528 [details]
udev debug log
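For reference, a udev debug log like the attachment above is usually captured in one of the two ways suggested earlier in this bug; the exact commands below are only a sketch and may need adjusting to the installed udev-147 build:

# Either: set udev_log="debug" in udev.conf and restart the daemon
sed -i 's/^udev_log=.*/udev_log="debug"/' /etc/udev/udev.conf
start_udev

# Or: stop the running udevd and run a debug instance in the foreground, capturing stderr
udevd --debug 2> /tmp/udevd-debug.log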
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification.

Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com