Bug 1263356
Summary: | "flood" of reporting cmds causes a deadlock | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal>
Component: | lvm2 | Assignee: | David Teigland <teigland>
lvm2 sub component: | Clustering / clvmd | QA Contact: | cluster-qe <cluster-qe>
Status: | CLOSED WONTFIX | Docs Contact: |
Severity: | low | |
Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, teigland, zkabelac
Version: | 7.2 | |
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-08-20 21:02:31 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Corey Marthaler
2015-09-15 15:48:17 UTC
To be more clear, this has been failing in the 7.2 builds, not just the recent 130-2 build. (Maybe also bug #1194667.) This is still very easily reproduced with the latest rpms.

3.10.0-418.el7.x86_64

lvm2-2.02.152-2.el7                        BUILT: Thu May  5 02:33:28 CDT 2016
lvm2-libs-2.02.152-2.el7                   BUILT: Thu May  5 02:33:28 CDT 2016
lvm2-cluster-2.02.152-2.el7                BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-1.02.124-2.el7               BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-libs-1.02.124-2.el7          BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-event-1.02.124-2.el7         BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-event-libs-1.02.124-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc8.el7    BUILT: Wed May  4 02:56:34 CDT 2016
cmirror-2.02.152-2.el7                     BUILT: Thu May  5 02:33:28 CDT 2016
sanlock-3.3.0-1.el7                        BUILT: Wed Feb 24 09:52:30 CST 2016
sanlock-lib-3.3.0-1.el7                    BUILT: Wed Feb 24 09:52:30 CST 2016
lvm2-lockd-2.02.152-2.el7                  BUILT: Thu May  5 02:33:28 CDT 2016

[root@mckinley-03 ~]# pvs -o pv_mda_size -O pv_mda_count
  Error locking on node 4: Command timed out
  Unable to obtain global lock.
[root@mckinley-03 ~]# dlm_tool lockdump clvmd && grep "" /sys/kernel/debug/dlm/*
id 0000000f gr PR rq IV pid 7978 master 1 "P_#orphans"
id 0000000e gr PR rq IV pid 7973 master 1 "P_#orphans"
id 0000000d gr PR rq IV pid 7969 master 1 "P_#orphans"
id 0000000c gr PR rq IV pid 7981 master 1 "P_#orphans"
id 00000009 gr PR rq IV pid 7981 master 1 "P_#orphans"
id 00000008 gr PR rq IV pid 7969 master 1 "P_#orphans"
id 00000006 gr PR rq IV pid 7973 master 1 "P_#orphans"
id 00000003 gr PR rq IV pid 7978 master 1 "P_#orphans"
/sys/kernel/debug/dlm/clvmd:
/sys/kernel/debug/dlm/clvmd:Resource ffff880ffd6a0780 Name (len=10) "P_#orphans"
/sys/kernel/debug/dlm/clvmd:Local Copy, Master is node 1
/sys/kernel/debug/dlm/clvmd:Granted Queue
/sys/kernel/debug/dlm/clvmd:0000000f PR Master: 00000035
/sys/kernel/debug/dlm/clvmd:0000000e PR Master: 00000034
/sys/kernel/debug/dlm/clvmd:0000000d PR Master: 00000033
/sys/kernel/debug/dlm/clvmd:0000000c PR Master: 00000032
/sys/kernel/debug/dlm/clvmd:00000009 PR Master: 00000031
/sys/kernel/debug/dlm/clvmd:00000008 PR Master: 00000027
/sys/kernel/debug/dlm/clvmd:00000006 PR Master: 00000026
/sys/kernel/debug/dlm/clvmd:00000003 PR Master: 00000021
/sys/kernel/debug/dlm/clvmd:Conversion Queue
/sys/kernel/debug/dlm/clvmd:Waiting Queue
/sys/kernel/debug/dlm/clvmd_all:version rsb 1.1 lvb 1.1 lkb 1.1
/sys/kernel/debug/dlm/clvmd_all:rsb ffff880ffd6a0780 1 0 0 0 0 0 10 str P_#orphans
/sys/kernel/debug/dlm/clvmd_all:lkb f 1 35 7978 0 0 1 2 3 -1 0 0 0 0 1027921252123 0
/sys/kernel/debug/dlm/clvmd_all:lkb e 1 34 7973 0 0 1 2 3 -1 0 0 0 0 1027921246812 0
/sys/kernel/debug/dlm/clvmd_all:lkb d 1 33 7969 0 0 1 2 3 -1 0 0 0 0 1027920985613 0
/sys/kernel/debug/dlm/clvmd_all:lkb c 1 32 7981 0 0 1 2 3 -1 0 0 0 0 1027920982663 0
/sys/kernel/debug/dlm/clvmd_all:lkb 9 1 31 7981 0 0 1 2 3 -1 0 0 0 0 967540896588 0
/sys/kernel/debug/dlm/clvmd_all:lkb 8 1 27 7969 0 0 1 2 3 -1 0 0 0 0 967540597918 0
/sys/kernel/debug/dlm/clvmd_all:lkb 6 1 26 7973 0 0 1 2 3 -1 0 0 0 0 967540593158 0
/sys/kernel/debug/dlm/clvmd_all:lkb 3 1 21 7978 0 0 1 2 3 -1 0 0 0 0 967540480118 0
/sys/kernel/debug/dlm/clvmd_locks:id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
/sys/kernel/debug/dlm/clvmd_locks:f 1 35 7978 0 0 1 2 3 -1 149254003107 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:e 1 34 7973 0 0 1 2 3 -1 149254003116 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:d 1 33 7969 0 0 1 2 3 -1 149254003379 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:c 1 32 7981 0 0 1 2 3 -1 149254003384 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:9 1 31 7981 0 0 1 2 3 -1 149314383473 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:8 1 27 7969 0 0 1 2 3 -1 149314383773 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:6 1 26 7973 0 0 1 2 3 -1 149314383780 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:3 1 21 7978 0 0 1 2 3 -1 149314383895 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_toss:version 4 rsb 2
/sys/kernel/debug/dlm/clvmd_toss:rsb ffff880ff7abdb80 1 1 3 3 4295610045 0 64 str t0c3BXzhUkI75NESqFdX3YtvX32NJQHZfaVVH3kMgmDrSc9gfpgekmY9nm09EBR6
/sys/kernel/debug/dlm/clvmd_toss:rsb ffff880ff7abde00 1 1 3 3 4295609541 0 64 str t0c3BXzhUkI75NESqFdX3YtvX32NJQHZahnDEX5nmaknWmGiBidUFCfZbbmFMao0

Please check all the nodes and confirm that the *only* active lvm commands running at the point of failure are all 'pvs'. (I notice you mention 'pvscan' as well.) In other words, what does 'ps' on every node show running? Is there another process actually already causing the lock to be held? If so, what is blocking it? The dlm does not show any locks being waited for (dlm_tool *lockdebug* is required, not *lockdump*).

As expected, this looks like a clvmd problem:

[root@mckinley-03 ~]# pvs -vvvv
#lvmcmdline.c:1582 DEGRADED MODE. Incomplete RAID LVs will be processed.
#libdm-config.c:1036 Setting activation/monitoring to 1
#lvmcmdline.c:1588 Processing: pvs -vvvv
#lvmcmdline.c:1589 system ID:
#lvmcmdline.c:1592 O_DIRECT will be used
#libdm-config.c:972 Setting global/locking_type to 3
#libdm-config.c:1036 Setting global/wait_for_locks to 1
#locking/locking.c:156 Cluster locking selected.
#libdm-config.c:1036 Setting global/use_lvmlockd to 0
#libdm-config.c:1049 report/aligned not found in config: defaulting to 1
#libdm-config.c:1049 report/buffered not found in config: defaulting to 1
#libdm-config.c:1049 report/headings not found in config: defaulting to 1
#libdm-config.c:950 report/separator not found in config: defaulting to
#libdm-config.c:1049 report/prefixes not found in config: defaulting to 0
#libdm-config.c:1049 report/quoted not found in config: defaulting to 1
#libdm-config.c:1049 report/colums_as_rows not found in config: defaulting to 0
#libdm-config.c:950 report/pvs_sort not found in config: defaulting to pv_name
#libdm-config.c:950 report/pvs_cols_verbose not found in config: defaulting to pv_name,vg_name,pv_fmt,pv_attr,pv_size,pv_free,dev_size,pv_uuid
#locking/cluster_locking.c:512 Locking VG P_#global CR (VG) (0x1)   (waits here for a long timeout period)
#locking/cluster_locking.c:398 Error locking on node 4: Command timed out
#locking/cluster_locking.c:398 Error locking on node 1: Command timed out
#locking/locking.c:284 <backtrace>
#locking/locking.c:356 <backtrace>
#reporter.c:903 Unable to obtain global lock.
#libdm-config.c:1036 Setting global/notify_dbus to 1
#lvmcmdline.c:1716 Completed: pvs -vvvv

While pvs above is waiting for the long timeout period, clvmd has these threads, where the first thread is the one running the command:

(gdb) thread apply all bt

Thread 6 (Thread 0x7f8025aed700 (LWP 18104)):
#0  0x00007f8029a916d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f802ad8e2cd in pre_and_post_thread ()
#2  0x00007f8029a8ddc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f80293afe2d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f80262ee700 (LWP 3410)):
#0  0x00007f8029a9422d in read () from /lib64/libpthread.so.0
#1  0x00007f8029ca3945 in do_dlm_dispatch_v6 () from /lib64/libdlm.so.3
#2  0x00007f8029ca3d6f in dlm_recv_thread () from /lib64/libdlm.so.3
#3  0x00007f8029a8ddc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f80293afe2d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f802abe6700 (LWP 3411)):
#0  0x00007f8029a916d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f802ad8d8ef in lvm_thread_fn ()
#2  0x00007f8029a8ddc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f80293afe2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f802ad4d880 (LWP 3404)):
#0  0x00007f80293a7553 in select () from /lib64/libc.so.6
#1  0x00007f802ad8b2a2 in main ()

If anyone is interested in seeing more info from clvmd, I saved a core file of clvmd from comment 8.
[root@mckinley-01 ~]# dlm_tool lockdebug clvmd
Resource len 64 "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master:2
Granted
00000002 CR Master: 2 00000004
Resource len 10 "P_#orphans"
Master
Granted
00000007 PR Remote: 2 00000012
0000003f PR Remote: 2 00000008
-----------------------------------------------------------------------------------
[root@mckinley-02 ~]# dlm_tool lockdebug clvmd
Resource len 64 "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master
Granted
00000004 CR Remote: 1 00000002
00000003 CR Remote: 4 00000001
00000002 CR Remote: 3 00000001
00000001 CR
Resource len 10 "P_#orphans"
Master:1
Granted
00000012 PR Master: 1 00000007
00000008 PR Master: 1 0000003f
-----------------------------------------------------------------------------------
[root@mckinley-03 ~]# dlm_tool lockdebug clvmd
Resource len 64 "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master:2
Granted
00000001 CR Master: 2 00000002
Resource len 10 "P_#orphans"
Master:1
Dir
-----------------------------------------------------------------------------------
[root@mckinley-04 ~]# dlm_tool lockdebug clvmd
Resource len 64 "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master:2
Granted
00000001 CR Master: 2 00000003

clvmd holding two orphan locks on mckinley-02 does not look right. I think there must be a threading race in clvmd which causes it to acquire two locks on the same resource.

The last lvm commands run by the test on that node, presumably the last before it got stuck, are two concurrent pvs commands:

Jun  9 11:30:26 mckinley-02 qarshd[8532]: Running cmdline: pvs -o pv_attr -O pv_attr
Jun  9 11:30:26 mckinley-02 qarshd[8531]: Running cmdline: pvs -o dev_size -O pv_tags

[root@mckinley-02 ~]# pvs -vvvv
#lvmcmdline.c:1582 DEGRADED MODE. Incomplete RAID LVs will be processed.
#libdm-config.c:1036 Setting activation/monitoring to 1
#lvmcmdline.c:1588 Processing: pvs -vvvv
#lvmcmdline.c:1589 system ID:
#lvmcmdline.c:1592 O_DIRECT will be used
#libdm-config.c:972 Setting global/locking_type to 3
#libdm-config.c:1036 Setting global/wait_for_locks to 1
#locking/locking.c:156 Cluster locking selected.
#libdm-config.c:1036 Setting global/use_lvmlockd to 0
#libdm-config.c:1049 report/aligned not found in config: defaulting to 1
#libdm-config.c:1049 report/buffered not found in config: defaulting to 1
#libdm-config.c:1049 report/headings not found in config: defaulting to 1
#libdm-config.c:950 report/separator not found in config: defaulting to
#libdm-config.c:1049 report/prefixes not found in config: defaulting to 0
#libdm-config.c:1049 report/quoted not found in config: defaulting to 1
#libdm-config.c:1049 report/colums_as_rows not found in config: defaulting to 0
#libdm-config.c:950 report/pvs_sort not found in config: defaulting to pv_name
#libdm-config.c:950 report/pvs_cols_verbose not found in config: defaulting to pv_name,vg_name,pv_fmt,pv_attr,pv_size,pv_free,dev_size,pv_uuid
#locking/cluster_locking.c:512 Locking VG P_#global CR (VG) (0x1)   (waiting)
#locking/cluster_locking.c:398 Error locking on node 3: Command timed out
#locking/locking.c:284 <backtrace>
#locking/locking.c:356 <backtrace>
#reporter.c:903 Unable to obtain global lock.
#libdm-config.c:1036 Setting global/notify_dbus to 1
#lvmcmdline.c:1716 Completed: pvs -vvvv

While pvs is waiting, the local dlm lock state is:

[root@mckinley-02 ~]# dlm_tool lockdebug clvmd
Resource len 9 "P_#global"
Master
Granted
00000005 PR
Resource len 64 "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master
Granted
00000004 CR Remote: 1 00000002
00000003 CR Remote: 4 00000001
00000002 CR Remote: 3 00000001
00000001 CR
Resource len 10 "P_#orphans"
Master:1
Granted
00000012 PR Master: 1 00000007
00000008 PR Master: 1 0000003f

Which means it has been granted the "P_#global" lock in PR mode (not CR mode, as the log message seems to imply) before blocking. Is that just a mistake in the message logging?

Next I tried running 'vgs -vvvv' and that got stuck in other ways:

...
#toollib.c:1894 Process single VG stripe_5_666
#metadata/vg.c:88 Freeing VG stripe_5_666 at 0x7f5ff17fec60.
#mm/memlock.c:557 Unlock: Memlock counters: locked:0 critical:0 daemon:0 suspended:0
#locking/cluster_locking.c:432 Requesting sync names.   (long wait)
#locking/cluster_locking.c:398 Error locking on node 3: Command timed out
#locking/locking.c:284 <backtrace>
#locking/locking.c:356 <backtrace>
#toollib.c:1905 <backtrace>
#locking/cluster_locking.c:512 Locking VG V_stripe_5_666 UN (VG) (0x6)
...
#toollib.c:1894 Process single VG rhel_mckinley-02
#metadata/vg.c:88 Freeing VG rhel_mckinley-02 at 0x7f5ff1802c80.
#mm/memlock.c:557 Unlock: Memlock counters: locked:0 critical:0 daemon:0 suspended:0
#locking/cluster_locking.c:432 Requesting sync names.   (long wait)
#locking/cluster_locking.c:398 Error locking on node 3: Command timed out
#locking/locking.c:284 <backtrace>
#locking/locking.c:356 <backtrace>
#toollib.c:1905 <backtrace>
#locking/cluster_locking.c:512 Locking VG V_rhel_mckinley-02 UN (VG) (0x6)
#device/dev-io.c:614 Closed /dev/sda2
#device/dev-io.c:614 Closed /dev/sdb1
#metadata/vg.c:88 Freeing VG rhel_mckinley-02 at 0x7f5ff17fac60.
#metadata/vg.c:88 Freeing VG rhel_mckinley-02 at 0x7f5ff17f6c50.
#libdm-config.c:1049 report/compact_output not found in config: defaulting to 0
#libdm-config.c:950 report/compact_output_cols not found in config: defaulting to
#libdm-report.c:4083 VG Attr Ext #PV #LV #SN VSize VFree VG UUID VProfile
#libdm-report.c:4411 rhel_mckinley-02 wz--n- 4.00m 2 3 0 557.91g 0 2DpBYR-3tuh-8zJC-3Hpt-vyzj-0zdK-YI8uaf
#libdm-report.c:4411 stripe_5_666 wz--nc 4.00m 10 1 0 1.22t 1.22t OX2lt6-T8CS-ljIP-AzdS-iC15-GAxi-pborgh
#libdm-config.c:1036 Setting global/notify_dbus to 1
#lvmcmdline.c:1716 Completed: vgs -vvvv

(gdb) thread apply all bt

Thread 7 (Thread 0x7f7bcc547700 (LWP 24125)):
#0  0x00007f7bd050c6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7bd18092cd in pre_and_post_thread (arg=0x7f7bd33e5b60) at clvmd.c:1741
#2  0x00007f7bd0508dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7bcfe2ae2d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f7bccd69700 (LWP 3352)):
#0  0x00007f7bd050f22d in read () from /lib64/libpthread.so.0
#1  0x00007f7bd071e945 in do_dlm_dispatch_v6 () from /lib64/libdlm.so.3
#2  0x00007f7bd071ed6f in dlm_recv_thread () from /lib64/libdlm.so.3
#3  0x00007f7bd0508dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7bcfe2ae2d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f7bd1661700 (LWP 3353)):
#0  0x00007f7bd050c6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7bd18088ef in lvm_thread_fn (arg=<optimized out>) at clvmd.c:2069
#2  0x00007f7bd0508dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7bcfe2ae2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f7bd17c8880 (LWP 3346)):
#0  0x00007f7bcfe22553 in select () from /lib64/libc.so.6
#1  0x00007f7bd18062a2 in main_loop (cmd_timeout=60) at clvmd.c:903
#2  main (argc=<optimized out>, argv=<optimized out>) at clvmd.c:627

I was unable to reproduce this bug with clvmd running in debug mode.
I'll let the tests run over night...