Bug 1173739

| Summary: | udev issues when running lvm regression tests | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Peter Rajnoha <prajnoha> |
| lvm2 sub component: | Changing Logical Volumes | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | agk, bubrown, cmackows, cmarthal, daniel.brnak, dwysocha, heinzm, jbrassow, loberman, mcsontos, msnitzer, mspqa-list, nperic, prajnoha, prockai, pvrabec, sauchter, sbhat, thornber, zkabelac |
| Version: | 7.1 | Keywords: | Reopened, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 816724 | Environment: | |
| Last Closed: | 2016-10-07 16:43:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 816724, 1073909 | | |
| Bug Blocks: | 464877, 835616, 886216, 1203710, 1205796 | | |
Comment 1
Corey Marthaler
2014-12-12 20:07:04 UTC
As we run into this every few months, it seems the main issue is how to make it traceable at all. IMO the only sure thing is that a customer will run into it sooner or later.

I have been running `lvchange -pr`/`-prw` for a day now with udev modified to use SIGABRT instead of SIGKILL, without hitting the issue. I may keep it running for a week.

To test my modification I created a udev rule running `sleep 3600` and found that it is the main udevd process killing the worker before the elapsed timeout of 180 seconds, not the worker killing the sleep process:

```
Dec 16 10:34:35 worker [15783] /devices/virtual/block/dm-15 timeout; kill it
Dec 16 10:34:35 seq 3464719 '/devices/virtual/block/dm-15' killed
```

And two minutes later the "killed" worker kills the spawned process:

```
Dec 16 10:36:36 timeout '/usr/bin/sleep 3600'
Dec 16 10:36:37 timeout: killing '/usr/bin/sleep 3600' [15870]
Dec 16 10:36:37 '/usr/bin/sleep 3600' [15870] terminated by signal 9 (Killed)
```

Three findings:

1. There may be a bug in udev: the spawned process is not killed; it is the worker which is killed instead (and after 30 s).
2. The udev worker cannot be killed by SEGV or ABRT. :-(
3. It looks like `OPTIONS+="event_timeout=180"` may not be what we thought it is, or the timeout is set incorrectly, or udev is wrong. Apparently, in Corey's last output the workers are killed after 30 s.

Assigning to Peter "The Udev Expert".

Corey, any chance the machine is still available? Or "at least" a full journal? `journalctl -p debug` could shed some light on the issue...

lvmetad is running; the cluster is not. The dmsetup output looks fine:

```
[root@host-112 ~]# dmsetup info -c
Name                                   Maj Min Stat Open Targ Event  UUID
rhel_host--112-swap                    253   0 L--w    2    1      0 LVM-KOv5yV1bb5WyLDQ8bMaT1hrBYhsu7DwMaxeJpWPE4An9sleJyFwy8pdCCqk6Heeg
rhel_host--112-root                    253   1 L--w    1    1      0 LVM-KOv5yV1bb5WyLDQ8bMaT1hrBYhsu7DwM9r8IdKgOp47euRvaklB2fkusQfxfEI2d
cache_6_9573-cache_6_95730_fast_cmeta  253   3 L--w    1    1      0 LVM-4YMuGkrrdnFWOe4tIDBfvlxnfTuBXH7qf7KV0G80fpE3b6T3feDPJEBoGMZVcHgW-cmeta
cache_6_9573-cache_6_95730_fast_cdata  253   2 L--w    1    1      0 LVM-4YMuGkrrdnFWOe4tIDBfvlxnfTuBXH7qdrRz528znTqVDe0gWZMMVdFt2Ukr7VeW-cdata
cache_6_9573-cache_6_95730_corig       253   4 L--w    1    1      0 LVM-4YMuGkrrdnFWOe4tIDBfvlxnfTuBXH7qQfgMC9zrMpSlk5BtzmF75JQxaYoYknc1-real
cache_6_9573-cache_6_95730             253  58 L--w    1    1      1 LVM-4YMuGkrrdnFWOe4tIDBfvlxnfTuBXH7qLJjXNwbP5B043Y1NcpUGu0tdSOfBJE0i

[root@host-112 ~]# dmsetup status
rhel_host--112-swap: 0 1679360 linear
rhel_host--112-root: 0 13983744 linear
cache_6_9573-cache_6_95730_fast_cmeta: 0 16384 linear
cache_6_9573-cache_6_95730_fast_cdata: 0 2097152 linear
cache_6_9573-cache_6_95730_corig: 0 8388608 linear
cache_6_9573-cache_6_95730: 0 8388608 cache 8 58/2048 128 0/16384 196 87 0 0 0 1 0 1 writethrough 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8

[root@host-112 ~]# dmsetup table
rhel_host--112-swap: 0 1679360 linear 252:2 2048
rhel_host--112-root: 0 13983744 linear 252:2 1681408
cache_6_9573-cache_6_95730_fast_cmeta: 0 16384 linear 8:81 2099200
cache_6_9573-cache_6_95730_fast_cdata: 0 2097152 linear 8:81 2048
cache_6_9573-cache_6_95730_corig: 0 8388608 linear 8:97 2048
cache_6_9573-cache_6_95730: 0 8388608 cache 253:3 253:2 253:4 128 1 writethrough mq 0

[root@host-112 ~]# dmsetup udevcookies
Cookie      Semid      Value  Last semop time           Last change time
0xd4d51eb   201490433  1      Fri Dec 12 19:41:18 2014  Fri Dec 12 19:41:18 2014
```
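For reference, the cookie's semaphore can also be inspected with the stock SysV IPC tools; a minimal sketch (the semid, 201490433, is the one reported above):

```
# List all SysV semaphore sets; the dm udev cookie created by the
# waiting LVM command should appear among them.
ipcs -s

# Show details for the cookie's semaphore set, using the semid
# printed by 'dmsetup udevcookies'.
ipcs -s -i 201490433
```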
It is just that lvchange is waiting for udev, which is waiting for... what exactly?

```
4 S root       482     1  0  80   0 - 11202 ep_pol Dec12 ?        00:00:31 /usr/lib/systemd/systemd-udevd
5 D root     16920   482  0  80   0 -     0 sleep_ Dec12 ?        00:00:00 [systemd-udevd]
5 D root     16924   482  0  80   0 - 11201 blkdev Dec12 ?        00:00:00 /usr/lib/systemd/systemd-udevd
5 D root     16927   482  0  80   0 - 11201 blkdev Dec12 ?        00:00:00 /usr/lib/systemd/systemd-udevd
```

It's the cache_6_9573-cache_6_95730 LV which is blocking. Why? Is it a kernel issue? All 3 udev workers are blocked, and there is nothing spawned. Were all the processes' children already killed by the kernel? What signal is the udev worker handling? SIGCHLD?
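The kernel's hung-task reports for the three workers follow. As a sketch, this is the kind of state collection one might do on a live reproducer (the PIDs are the blocked workers from the ps listing above; sysrq "w" produces exactly such hung-task dumps):

```
# List processes stuck in uninterruptible sleep (state D) plus the
# kernel function they are blocked in.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Dump the kernel stack of one blocked worker (needs root).
cat /proc/16920/stack

# Log the stacks of all blocked tasks to dmesg (requires sysrq enabled).
echo w > /proc/sysrq-trigger
```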
```
INFO: task systemd-udevd:16920 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
systemd-udevd   D ffff88003fc14680     0 16920    482 0x00000086
 ffff88002ab939a8 0000000000000046 ffff88003bbe2220 ffff88002ab93fd8
 ffff88002ab93fd8 ffff88002ab93fd8 ffff88003bbe2220 ffff88003fc14f48
 ffff88002ab93a30 0000000000000002 ffffffff81155fb0 ffff88002ab93a20
Call Trace:
 [<ffffffff81155fb0>] ? wait_on_page_read+0x60/0x60
 [<ffffffff81609e6d>] io_schedule+0x9d/0x130
 [<ffffffff81155fbe>] sleep_on_page+0xe/0x20
 [<ffffffff81607d7b>] __wait_on_bit_lock+0x5b/0xc0
 [<ffffffff811560d8>] __lock_page+0x78/0xa0
 [<ffffffff81098270>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff8116648e>] truncate_inode_pages_range+0x6fe/0x740
 [<ffffffff811664e5>] truncate_inode_pages+0x15/0x20
 [<ffffffff811ff1af>] kill_bdev+0x2f/0x40
 [<ffffffff812007e4>] __blkdev_put+0x64/0x1a0
 [<ffffffff8120126e>] blkdev_put+0x4e/0x140
 [<ffffffff81201415>] blkdev_close+0x25/0x30
 [<ffffffff811c7f49>] __fput+0xe9/0x270
 [<ffffffff811c820e>] ____fput+0xe/0x10
 [<ffffffff81093b44>] task_work_run+0xc4/0xe0
 [<ffffffff810745ab>] do_exit+0x2cb/0xa60
 [<ffffffff811e0e7f>] ? touch_atime+0x12f/0x160
 [<ffffffff81158198>] ? generic_file_aio_read+0x5b8/0x750
 [<ffffffff81074dbf>] do_group_exit+0x3f/0xa0
 [<ffffffff810850b0>] get_signal_to_deliver+0x1d0/0x6d0
 [<ffffffff811ffcfc>] ? blkdev_aio_read+0x4c/0x70
 [<ffffffff81013407>] do_signal+0x57/0x6c0
 [<ffffffff81013ad9>] do_notify_resume+0x69/0xb0
 [<ffffffff8161499d>] int_signal+0x12/0x17

INFO: task systemd-udevd:16924 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
systemd-udevd   D ffff88003fc14680     0 16924    482 0x00000084
 ffff88002abf3b10 0000000000000082 ffff88003bbe0b60 ffff88002abf3fd8
 ffff88002abf3fd8 ffff88002abf3fd8 ffff88003bbe0b60 ffff88003dc26d98
 ffff88003dc26d9c ffff88003bbe0b60 00000000ffffffff ffff88003dc26da0
Call Trace:
 [<ffffffff8160ad59>] schedule_preempt_disabled+0x29/0x70
 [<ffffffff81608aa5>] __mutex_lock_slowpath+0xc5/0x1c0
 [<ffffffff81607f0f>] mutex_lock+0x1f/0x2f
 [<ffffffff81200996>] __blkdev_get+0x76/0x4d0
 [<ffffffff81200fc5>] blkdev_get+0x1d5/0x360
 [<ffffffff812011fb>] blkdev_open+0x5b/0x80
 [<ffffffff811c403f>] do_dentry_open+0x18f/0x2b0
 [<ffffffff811d1e32>] ? __inode_permission+0x52/0xc0
 [<ffffffff812011a0>] ? blkdev_get_by_dev+0x50/0x50
 [<ffffffff811c4191>] finish_open+0x31/0x40
 [<ffffffff811d5aac>] do_last+0x57c/0x1220
 [<ffffffff811ab6de>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
 [<ffffffff811d6812>] path_openat+0xc2/0x4c0
 [<ffffffff811d742b>] do_filp_open+0x4b/0xb0
 [<ffffffff811e3c57>] ? __alloc_fd+0xa7/0x130
 [<ffffffff811c5643>] do_sys_open+0xf3/0x1f0
 [<ffffffff811c575e>] SyS_open+0x1e/0x20
 [<ffffffff816146e9>] system_call_fastpath+0x16/0x1b

INFO: task systemd-udevd:16927 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
systemd-udevd   D ffff88003fc14680     0 16927    482 0x00000084
 ffff88000d81fb10 0000000000000086 ffff88002b71a220 ffff88000d81ffd8
 ffff88000d81ffd8 ffff88000d81ffd8 ffff88002b71a220 ffff88003dc26d98
 ffff88003dc26d9c ffff88002b71a220 00000000ffffffff ffff88003dc26da0
Call Trace:
 [<ffffffff8160ad59>] schedule_preempt_disabled+0x29/0x70
 [<ffffffff81608aa5>] __mutex_lock_slowpath+0xc5/0x1c0
 [<ffffffff81607f0f>] mutex_lock+0x1f/0x2f
 [<ffffffff81200996>] __blkdev_get+0x76/0x4d0
 [<ffffffff81200fc5>] blkdev_get+0x1d5/0x360
 [<ffffffff812011fb>] blkdev_open+0x5b/0x80
 [<ffffffff811c403f>] do_dentry_open+0x18f/0x2b0
 [<ffffffff811d1e32>] ? __inode_permission+0x52/0xc0
 [<ffffffff812011a0>] ? blkdev_get_by_dev+0x50/0x50
 [<ffffffff811c4191>] finish_open+0x31/0x40
 [<ffffffff811d5aac>] do_last+0x57c/0x1220
 [<ffffffff811ab6de>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
 [<ffffffff811d6812>] path_openat+0xc2/0x4c0
 [<ffffffff811d742b>] do_filp_open+0x4b/0xb0
 [<ffffffff811e3c57>] ? __alloc_fd+0xa8/0x130
 [<ffffffff811c5643>] do_sys_open+0xf3/0x1f0
 [<ffffffff811c575e>] SyS_open+0x1e/0x20
 [<ffffffff816146e9>] system_call_fastpath+0x16/0x1b
```

The full log of what was executed:

```
Dec 12 19:40:59 host-112 qarshd[16540]: Running cmdline: pvcreate /dev/sdb2 /dev/sdb1 /dev/sda2 /dev/sda1 /dev/sdd2 /dev/sdd1 /dev/sde2 /dev/sde1 /dev/sdf2 /dev/sdf1 /dev/sdg2 /dev/sdg1
Dec 12 19:41:00 host-112 qarshd[16546]: Running cmdline: vgcreate cache_6_9573 /dev/sdb2 /dev/sdb1 /dev/sda2 /dev/sda1 /dev/sdd2 /dev/sdd1 /dev/sde2 /dev/sde1 /dev/sdf2 /dev/sdf1 /dev/sdg2 /dev/sdg1
Dec 12 19:41:00 host-112 qarshd[16559]: Running cmdline: pvs --noheadings
Dec 12 19:41:00 host-112 qarshd[16564]: Running cmdline: pvdisplay /dev/sda1
Dec 12 19:41:00 host-112 qarshd[16569]: Running cmdline: pvdisplay /dev/sda2
Dec 12 19:41:00 host-112 qarshd[16574]: Running cmdline: pvdisplay /dev/sdb1
Dec 12 19:41:00 host-112 qarshd[16579]: Running cmdline: pvdisplay /dev/sdb2
Dec 12 19:41:00 host-112 qarshd[16584]: Running cmdline: pvdisplay /dev/sdd1
Dec 12 19:41:00 host-112 qarshd[16588]: Running cmdline: pvdisplay /dev/sdd2
Dec 12 19:41:00 host-112 qarshd[16594]: Running cmdline: pvdisplay /dev/sde1
Dec 12 19:41:00 host-112 qarshd[16599]: Running cmdline: pvdisplay /dev/sde2
Dec 12 19:41:01 host-112 qarshd[16604]: Running cmdline: pvdisplay /dev/sdf1
Dec 12 19:41:01 host-112 qarshd[16609]: Running cmdline: pvdisplay /dev/sdf2
Dec 12 19:41:01 host-112 qarshd[16615]: Running cmdline: pvdisplay /dev/sdg1
Dec 12 19:41:01 host-112 qarshd[16620]: Running cmdline: pvdisplay /dev/sdg2
Dec 12 19:41:01 host-112 qarshd[16625]: Running cmdline: vgdisplay cache_6_9573
Dec 12 19:41:01 host-112 qarshd[16630]: Running cmdline: pvs --noheadings
Dec 12 19:41:01 host-112 qarshd[16635]: Running cmdline: lvcreate -L 4G -n cache_6_95730 cache_6_9573 /dev/sdg1
Dec 12 19:41:01 host-112 qarshd[16646]: Running cmdline: lvcreate -L 1G -n cache_6_95730_fast cache_6_9573 /dev/sdf1
Dec 12 19:41:02 host-112 qarshd[16659]: Running cmdline: lvcreate -L 8M -n cache_6_95730_fast_meta cache_6_9573 /dev/sdf1
Dec 12 19:41:02 host-112 qarshd[16671]: Running cmdline: lvconvert --type cache-pool --poolmetadata cache_6_9573/cache_6_95730_fast_meta cache_6_9573/cache_6_95730_fast --yes
Dec 12 19:41:02 host-112 qarshd[16706]: Running cmdline: lvconvert --type cache --cachepool cache_6_9573/cache_6_95730_fast cache_6_9573/cache_6_95730 --yes
Dec 12 19:41:03 host-112 qarshd[16741]: Running cmdline: vgcreate test /dev/xxxxx
Dec 12 19:41:03 host-112 qarshd[16746]: Running cmdline: pvs --noheadings
Dec 12 19:41:03 host-112 qarshd[16751]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16756]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16761]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16766]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16770]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16775]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16779]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16784]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:03 host-112 qarshd[16789]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:04 host-112 qarshd[16794]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:04 host-112 qarshd[16799]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:04 host-112 qarshd[16804]: Running cmdline: vgs -a -o +devices
Dec 12 19:41:04 host-112 qarshd[16809]: Running cmdline: pvchange --addtag P1/ --addtag P2= /dev/sdb2
Dec 12 19:41:04 host-112 qarshd[16812]: Running cmdline: pvs /dev/sdb2 --noheadings -o pv_tags
Dec 12 19:41:04 host-112 qarshd[16817]: Running cmdline: pvchange --deltag P1/ --addtag P3! /dev/sdb2
Dec 12 19:41:04 host-112 qarshd[16820]: Running cmdline: pvs /dev/sdb2 --noheadings -o pv_tags
Dec 12 19:41:04 host-112 qarshd[16826]: Running cmdline: pvchange --deltag P2= --deltag P3! /dev/sdb2
Dec 12 19:41:04 host-112 qarshd[16829]: Running cmdline: pvs /dev/sdb2 --noheadings -o pv_tags
Dec 12 19:41:05 host-112 qarshd[16835]: Running cmdline: vgchange --addtag V4: --addtag V5# cache_6_9573
Dec 12 19:41:05 host-112 qarshd[16838]: Running cmdline: vgs cache_6_9573 --noheadings -o vg_tags
Dec 12 19:41:05 host-112 qarshd[16844]: Running cmdline: vgchange --deltag V4: --addtag V6\& --addtag lvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_config
Dec 12 19:41:05 host-112 qarshd[16847]: Running cmdline: vgs cache_6_9573 --noheadings -o vg_tags
Dec 12 19:41:05 host-112 qarshd[16852]: Running cmdline: vgchange --deltag V5# --deltag V6\& --deltag lvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_config
Dec 12 19:41:05 host-112 qarshd[16855]: Running cmdline: vgs cache_6_9573 --noheadings -o vg_tags
Dec 12 19:41:05 host-112 qarshd[16861]: Running cmdline: lvchange --addtag L7/ --addtag L8= /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:05 host-112 qarshd[16864]: Running cmdline: lvs /dev/cache_6_9573/cache_6_95730 --noheadings -o lv_tags
Dec 12 19:41:06 host-112 qarshd[16870]: Running cmdline: lvchange --deltag L7/ --addtag L9! --addtag lvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configl
Dec 12 19:41:06 host-112 qarshd[16873]: Running cmdline: lvs /dev/cache_6_9573/cache_6_95730 --noheadings -o lv_tags
Dec 12 19:41:06 host-112 qarshd[16879]: Running cmdline: lvchange --deltag L8= --deltag L9! --deltag lvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configlvm_configl
Dec 12 19:41:06 host-112 qarshd[16882]: Running cmdline: lvs /dev/cache_6_9573/cache_6_95730 --noheadings -o lv_tags
Dec 12 19:41:06 host-112 qarshd[16887]: Running cmdline: pvs -a
Dec 12 19:41:06 host-112 qarshd[16892]: Running cmdline: vgs -a
Dec 12 19:41:06 host-112 qarshd[16897]: Running cmdline: lvs -a
Dec 12 19:41:06 host-112 qarshd[16902]: Running cmdline: lvdisplay /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:06 host-112 qarshd[16907]: Running cmdline: lvdisplay /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:12 host-112 qarshd[16912]: Running cmdline: lvdisplay /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:12 host-112 qarshd[16917]: Running cmdline: lvchange -aye -f -My --major 255 --minor 58 /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:13 host-112 qarshd[16978]: Running cmdline: dmsetup ls | grep cache_6_95730 | grep 58
Dec 12 19:41:13 host-112 qarshd[16985]: Running cmdline: lvs --noheadings -o lv_kernel_minor cache_6_9573/cache_6_95730 | grep 58
Dec 12 19:41:13 host-112 qarshd[16991]: Running cmdline: lvrename cache_6_9573 /dev/cache_6_9573/cache_6_95730 rename_107
Dec 12 19:41:14 host-112 qarshd[17025]: Running cmdline: lvrename cache_6_9573 rename_107 /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:15 host-112 qarshd[17059]: Running cmdline: vgchange -an cache_6_9573
Dec 12 19:41:16 host-112 qarshd[17077]: Running cmdline: vgexport cache_6_9573
Dec 12 19:41:16 host-112 qarshd[17080]: Running cmdline: vgchange -aye cache_6_9573
Dec 12 19:41:16 host-112 qarshd[17085]: Running cmdline: vgimport cache_6_9573
Dec 12 19:41:16 host-112 qarshd[17089]: Running cmdline: vgchange -ay cache_6_9573
Dec 12 19:41:16 host-112 qarshd[17128]: Running cmdline: lvchange -pr /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:17 host-112 qarshd[17155]: Running cmdline: lvdisplay /dev/cache_6_9573/cache_6_95730
Dec 12 19:41:18 host-112 qarshd[17159]: Running cmdline: vgchange -ay cache_6_9573
Dec 12 19:41:18 host-112 qarshd[17164]: Running cmdline: dd if=/dev/zero of=/dev/cache_6_9573/cache_6_95730 count=10
Dec 12 19:41:18 host-112 qarshd[17169]: Running cmdline: lvchange -prw /dev/cache_6_9573/cache_6_95730
```

I have found there are some left-over links in /dev/mapper:

```
lrwxrwxrwx. 1 root root 7 Dec 12 17:01 cache_5_2236-cache_5_22360_corig -> ../dm-4
lrwxrwxrwx. 1 root root 7 Dec 12 17:01 cache_5_2236-cache_5_22360_fast_cdata -> ../dm-2
lrwxrwxrwx. 1 root root 7 Dec 12 17:01 cache_5_2236-cache_5_22360_fast_cmeta -> ../dm-3
lrwxrwxrwx. 1 root root 8 Dec 12 19:41 cache_6_9573-cache_6_95730 -> ../dm-58
lrwxrwxrwx. 1 root root 7 Dec 12 19:41 cache_6_9573-cache_6_95730_corig -> ../dm-4
lrwxrwxrwx. 1 root root 7 Dec 12 19:41 cache_6_9573-cache_6_95730_fast_cdata -> ../dm-2
lrwxrwxrwx. 1 root root 7 Dec 12 19:41 cache_6_9573-cache_6_95730_fast_cmeta -> ../dm-3
```

Is it possible udev is trying to touch one device from multiple threads? Why were they not removed? What's multipathd doing there?

```
Dec 12 18:40:28 host-112 multipathd: dm-2: remove map (uevent)
Dec 12 18:40:28 host-112 multipathd: dm-2: devmap not registered, can't remove
Dec 12 18:40:28 host-112 multipathd: dm-3: remove map (uevent)
Dec 12 18:40:28 host-112 multipathd: dm-3: devmap not registered, can't remove
Dec 12 18:40:28 host-112 multipathd: dm-4: remove map (uevent)
Dec 12 18:40:28 host-112 multipathd: dm-4: devmap not registered, can't remove
```

Looks harmless, or am I wrong?
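As a side note on the stale links: a minimal sketch of how one might ask udev what it currently knows about the reused dm devices and have it re-run its rules. Whether this would actually clean up the stale names depends on the state of udev's device database; it is an assumption, not something the test did:

```
# Ask udev what it currently knows about one of the reused dm devices.
udevadm info --name=/dev/dm-2

# Replay a "change" uevent for all dm devices so udev re-runs its rules
# and rewrites the /dev/mapper symlinks from its database.
udevadm trigger --action=change --subsystem-match=block --sysname-match='dm-*'

# Wait for the event queue to drain.
udevadm settle
```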
What's the lvchange waiting for?

```
4 S root 17170 17169 17170 0 1 80 0 - 36512 SYSC_s Dec12 ? 00:00:00 lvchange -prw /dev/cache_6_9573/cache_6_95730
```

```
81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt full
#0  0x00007f271c4899a7 in semop () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f271c9aa1cc in _udev_wait (cookie=223171051) at libdm-common.c:2499
        semid = 201490433
        sb = {sem_num = 0, sem_op = 0, sem_flg = 0}
#2  dm_udev_wait (cookie=223171051) at libdm-common.c:2517
No locals.
#3  0x00007f271d7363ed in fs_unlock () at activate/fs.c:491
No locals.
#4  0x00007f271d73b845 in _file_lock_resource (cmd=<optimized out>, resource=<optimized out>, flags=256, lv=<optimized out>) at locking/file_locking.c:64
        lockfile = '\000' <repeats 1216 times>...
        origin_only = 0
        revert = 0
#5  0x00007f271d6dcef8 in _lock_vol (cmd=cmd@entry=0x7f271dce6060, resource=<optimized out>, resource@entry=0x7fffe79de380 "#sync_names", flags=flags@entry=256, lv_op=lv_op@entry=LV_NOOP, lv=lv@entry=0x0) at locking/locking.c:274
        lck_type = 0
        lck_scope = 0
        ret = 0
        __PRETTY_FUNCTION__ = "_lock_vol"
#6  0x00007f271d6dd833 in lock_vol (cmd=cmd@entry=0x7f271dce6060, vol=<optimized out>, vol@entry=0x7f271d75d0b9 "#sync_names", flags=flags@entry=256, lv=lv@entry=0x0) at locking/locking.c:354
        resource = "#sync_names", '\000' <repeats 246 times>
        lv_op = LV_NOOP
        lck_type = 0
#7  0x00007f271d6de510 in sync_dev_names (cmd=cmd@entry=0x7f271dce6060) at locking/locking.c:535
No locals.
#8  0x00007f271d6a7980 in _process_lv_vgnameid_list (arg_vgnames=0x7fffe79de560, process_single_lv=0x7f271d6829e0 <_lvchange_single>, handle=0x0, arg_tags=0x7fffe79de550, arg_lvnames=0x7fffe79de570, vgnameids_to_process=0x7fffe79de590, flags=1048576, cmd=0x7f271dce6060) at toollib.c:1967
        sl = <optimized out>
        tags_arg = <optimized out>
        vg_uuid = 0x0
        vg = 0x7f271dd94dd0
        vgn = <optimized out>
        lvn = <optimized out>
        ret_max = 1
        ret = <optimized out>
        vgnl = 0x7f271dd6eaa8
        lvnames = {n = 0x7fffe79de5a0, p = 0x7fffe79de5a0}
        vg_name = 0x7f271dd6eaca "cache_6_9573"
        skip = 0
#9  process_each_lv (cmd=0x7f271dce6060, argc=<optimized out>, argv=<optimized out>, flags=1048576, handle=0x0, process_single_lv=0x7f271d6829e0 <_lvchange_single>) at toollib.c:2030
        arg_tags = {n = 0x7fffe79de550, p = 0x7fffe79de550}
        arg_vgnames = {n = 0x7f271dd6ea48, p = 0x7f271dd6ea48}
        arg_lvnames = {n = 0x7f271dd6ea88, p = 0x7f271dd6ea88}
        vgnameids_on_system = {n = 0x7fffe79de580, p = 0x7fffe79de580}
        vgnameids_to_process = {n = 0x7f271dd6eaa8, p = 0x7f271dd6eaa8}
        enable_all_vgs = 0
        ret = <optimized out>
#10 0x00007f271d69968a in lvm_run_command (cmd=cmd@entry=0x7f271dce6060, argc=1, argc@entry=3, argv=0x7fffe79de8f8, argv@entry=0x7fffe79de8e8) at lvmcmdline.c:1452
        config_string_cft = <optimized out>
        config_profile_command_cft = <optimized out>
        config_profile_metadata_cft = <optimized out>
        ret = <optimized out>
        locking_type = -1
        monitoring = 1
#11 0x00007f271d699d49 in lvm2_main (argc=3, argv=0x7fffe79de8e8) at lvmcmdline.c:1907
        base = <optimized out>
        ret = <optimized out>
        alias = 1
        cmd = 0x7f271dce6060
#12 0x00007f271c3b3af5 in __libc_start_main (main=0x7f271d681fa0 <main>, argc=3, ubp_av=0x7fffe79de8e8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffe79de8d8) at libc-start.c:274
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, -4926053904182868928, 139805973815208, 140737079273696, 0, 0, 4926070653777413184, 5038714776698160192}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7f271d44a4e3 <_dl_init+275>, 0x7f271d65e208}, data = {prev = 0x0, cleanup = 0x0, canceltype = 491037923}}}
        not_first_call = <optimized out>
#13 0x00007f271d681fd1 in _start ()
No symbol table info available.
```
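Frame #1 shows lvchange parked in semop() on the udev cookie semaphore (semid 201490433, matching the `dmsetup udevcookies` output earlier), waiting for a notification that can never arrive if the worker was killed. For reference, a sketch of the manual recovery discussed below:

```
# Wake up every process waiting on an outstanding dm udev cookie.
# This only cleans up the cookie semaphores; any udev rule processing
# lost when the workers were killed is not redone.
dmsetup udevcomplete_all

# Afterwards there should be no cookies left:
dmsetup udevcookies
```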
Peter, shall I try to run udevcomplete_all and see if it recovers? Won't we lose some interesting state?

The EVENT_TIMEOUT rule is not recognized anymore in recent versions; it was removed with systemd v216: http://cgit.freedesktop.org/systemd/systemd/commit/?id=dd5eddd28a74a49607a8fffcaf960040dba98479

The default timeout is 30 s, and the event timeout can now be changed only for the udev daemon as a whole: http://cgit.freedesktop.org/systemd/systemd/tree/man/systemd-udevd.service.xml?id=dd5eddd28a74a49607a8fffcaf960040dba98479#n100

The multipath thing:

```
Dec 12 18:40:28 host-112 multipathd: dm-2: remove map (uevent)
Dec 12 18:40:28 host-112 multipathd: dm-2: devmap not registered, can't remove
Dec 12 18:40:28 host-112 multipathd: dm-3: remove map (uevent)
Dec 12 18:40:28 host-112 multipathd: dm-3: devmap not registered, can't remove
Dec 12 18:40:28 host-112 multipathd: dm-4: remove map (uevent)
Dec 12 18:40:28 host-112 multipathd: dm-4: devmap not registered, can't remove
```

...is not interesting for us. It is just a spurious message, not a problem: multipathd cannot remove a dm mapping it does not own or has not claimed before, so it is harmless.

As for using udevcomplete_all: if the events have already timed out, you cannot do any more harm with it. It just cleans up the semaphores and unblocks the waiting LVM processes (the notification will never come if the udev event worker has already been shot down).

We need to assess whether this is still reproducible with a recent release. Unfortunately, if this is a udev timeout, we are no longer able to change the timeout via a udev rule; with recent udev versions it can be set only when the udev daemon starts. So if this is still a problem and the default timeout is not enough, we also need to reproduce with exact timestamps to see where things are stalling, whether any parallel jobs running during the tests may be slowing things down, and what kind of devices we are working with (e.g. if it is iSCSI, are there any lags?).

Corey, have you hit this with recent releases? If this is still reproducible, we might as well try setting "udev.event-timeout=<more_than_default_30_seconds>" on the kernel command line (or, alternatively, editing /usr/lib/systemd/system/systemd-udevd.service and its "ExecStart=/usr/lib/systemd/systemd-udevd --event-timeout=<number_of_seconds>" line; see the sketch at the end of this comment) and check whether we still hit the issue during testing (and also whether the timeout really is the number of seconds we set for udev!).

It is also important to assess the overall system load when these udev issues happen, and whether the workload is not simply too artificial (too many commands run in a very short time) compared to what happens in the real world.

No further information - closing as per comment 15.
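A minimal sketch of the timeout change suggested above, done as a systemd drop-in rather than by editing the unit file under /usr/lib (the drop-in file name and the 300-second value are only examples):

```
# /etc/systemd/system/systemd-udevd.service.d/event-timeout.conf
[Service]
# An empty ExecStart= clears the original command line before
# replacing it with one that raises the per-event timeout.
ExecStart=
ExecStart=/usr/lib/systemd/systemd-udevd --event-timeout=300
```

Then run `systemctl daemon-reload && systemctl restart systemd-udevd`. The equivalent boot-time setting is `udev.event-timeout=300` on the kernel command line, as noted above.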