Bug 1263356 - "flood" of reporting cmds causes a deadlock
Summary: "flood" of reporting cmds causes a deadlock
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-09-15 15:48 UTC by Corey Marthaler
Modified: 2021-09-03 12:41 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-20 21:02:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Corey Marthaler 2015-09-15 15:48:17 UTC
Description of problem:
One of our regression tests keeps 25 active lvm reporting commands (pvs, vgs, lvs) running on each of the nodes in a cluster. However, they have been failing in the latest build. This is technically a "regression"; however, since this is not a very important test case, I'll leave that keyword off, as this bug should probably not block any release.
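For reference, the regression load can be approximated with a minimal shell sketch. The 25-command count and the pvs field names are taken from this report; the loop structure and the DRY_RUN guard (default on, so it only prints the commands) are assumptions for illustration. It would be run on each node.

```shell
#!/bin/sh
# Sketch: keep up to 25 concurrent lvm reporting commands running,
# pairing output fields (-o) with sort fields (-O) as in the report.
fields="pvseg_size pv_mda_count pv_mda_free pv_pe_count pv_used"
started=0
for out_field in $fields; do
    for sort_field in $fields; do
        [ "$started" -ge 25 ] && break 2
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "pvs -o $out_field -O $sort_field"   # dry run: print only
        else
            pvs -o "$out_field" -O "$sort_field" >/dev/null 2>&1 &
        fi
        started=$((started + 1))
    done
done
wait   # in a real run, wait for all background pvs commands
```

With DRY_RUN unset this just prints the 25 command lines; setting DRY_RUN=0 on a cluster node launches them in parallel, which is the load pattern that triggers the timeouts below.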

[root@mckinley-01 ~]# pcs status
Cluster name: MCKINLEY
Last updated: Tue Sep 15 10:34:46 2015          Last change: Mon Sep 14 19:42:22 2015 by root via cibadmin on mckinley-01
Stack: corosync
Current DC: mckinley-04 (version 1.1.13-44eb2dd) - partition with quorum
4 nodes and 9 resources configured

Online: [ mckinley-01 mckinley-02 mckinley-03 mckinley-04 ]

Full list of resources:

 mckinley-apc   (stonith:fence_apc):    Started mckinley-01
 Clone Set: dlm-clone [dlm]
     Started: [ mckinley-01 mckinley-02 mckinley-03 mckinley-04 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ mckinley-01 mckinley-02 mckinley-03 mckinley-04 ]

PCSD Status:
  mckinley-01: Offline
  mckinley-02: Offline
  mckinley-03: Offline
  mckinley-04: Offline

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: unknown/disabled




# Commands that failed in this case:
pvs -o pvseg_size -O pv_mda_count
pvs -o pvseg_size -O pv_pe_alloc_count
pvs -o pv_mda_count -O pv_used
pvs -o pv_mda_free -O pv_pe_count


<start name="pvs_dev_size_pvseg_start_0PT5" pid="15034" time="Tue Sep 15 10:15:31 2015" type="cmd" />
[pvs_pvseg_size_pv_mda_count_0PT5]   Error locking on node 4: Command timed out
[pvs_pvseg_size_pv_mda_count_0PT5]   Unable to obtain global lock.
<fail name="pvs_pvseg_size_pv_mda_count_0PT5" pid="15000" time="Tue Sep 15 10:18:35 2015" type="cmd" duration="186" ec="5" />
PAN2: ALL STOP!!!
[pvs_pvseg_size_pv_pe_alloc_count_0PT5]   Error locking on node 4: Command timed out
[pvs_pvseg_size_pv_pe_alloc_count_0PT5]   Unable to obtain global lock.
<fail name="pvs_pvseg_size_pv_pe_alloc_count_0PT5" pid="14996" time="Tue Sep 15 10:18:35 2015" type="cmd" duration="186" ec="5" />
[pvs_pv_mda_count_pv_used_0PT5]   Error locking on node 4: Command timed out
[pvs_pv_mda_count_pv_used_0PT5]   Error locking on node 4: Command timed out
[pvs_pv_mda_count_pv_used_0PT5]   Can't get lock for #orphans_lvm1
[pvs_pv_mda_count_pv_used_0PT5]   Cannot process volume group #orphans_lvm1
<stop name="pvs_pv_mda_count_pv_used_0PT5" pid="15010" time="Tue Sep 15 10:18:35 2015" type="cmd" duration="185" />
[pvs_pv_mda_free_pv_pe_count_0PT5]   Error locking on node 4: Command timed out
[pvs_pv_mda_free_pv_pe_count_0PT5]   Error locking on node 4: Command timed out
[pvs_pv_mda_free_pv_pe_count_0PT5]   Can't get lock for #orphans_lvm1
[pvs_pv_mda_free_pv_pe_count_0PT5]   Cannot process volume group #orphans_lvm1


# After this a pvscan is also hung
[root@mckinley-02 ~]# pvscan
[deadlock]


Version-Release number of selected component (if applicable):
3.10.0-306.el7.x86_64

lvm2-2.02.130-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
lvm2-libs-2.02.130-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
lvm2-cluster-2.02.130-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
device-mapper-1.02.107-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
device-mapper-libs-1.02.107-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
device-mapper-event-1.02.107-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
device-mapper-event-libs-1.02.107-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
device-mapper-persistent-data-0.5.5-1.el7    BUILT: Thu Aug 13 09:58:10 CDT 2015
cmirror-2.02.130-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015
sanlock-3.2.4-1.el7    BUILT: Fri Jun 19 12:48:49 CDT 2015
sanlock-lib-3.2.4-1.el7    BUILT: Fri Jun 19 12:48:49 CDT 2015
lvm2-lockd-2.02.130-2.el7    BUILT: Tue Sep 15 07:15:40 CDT 2015


How reproducible:
Often

Comment 1 Corey Marthaler 2015-09-15 15:50:00 UTC
To be more clear, this has been failing in the 7.2 builds, not just the recent 130-2 build.

Comment 3 Peter Rajnoha 2015-10-08 13:48:55 UTC
(Maybe also bug #1194667.)

Comment 4 Corey Marthaler 2016-06-08 16:47:10 UTC
This is very easily reproduced still with the latest rpms. 

3.10.0-418.el7.x86_64
lvm2-2.02.152-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
lvm2-libs-2.02.152-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
lvm2-cluster-2.02.152-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-1.02.124-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-libs-1.02.124-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-event-1.02.124-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-event-libs-1.02.124-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc8.el7    BUILT: Wed May  4 02:56:34 CDT 2016
cmirror-2.02.152-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016
sanlock-3.3.0-1.el7    BUILT: Wed Feb 24 09:52:30 CST 2016
sanlock-lib-3.3.0-1.el7    BUILT: Wed Feb 24 09:52:30 CST 2016
lvm2-lockd-2.02.152-2.el7    BUILT: Thu May  5 02:33:28 CDT 2016




[root@mckinley-03 ~]# pvs -o pv_mda_size -O pv_mda_count
  Error locking on node 4: Command timed out
  Unable to obtain global lock.

[root@mckinley-03 ~]# dlm_tool lockdump clvmd && grep "" /sys/kernel/debug/dlm/*
id 0000000f gr PR rq IV pid 7978 master 1 "P_#orphans"
id 0000000e gr PR rq IV pid 7973 master 1 "P_#orphans"
id 0000000d gr PR rq IV pid 7969 master 1 "P_#orphans"
id 0000000c gr PR rq IV pid 7981 master 1 "P_#orphans"
id 00000009 gr PR rq IV pid 7981 master 1 "P_#orphans"
id 00000008 gr PR rq IV pid 7969 master 1 "P_#orphans"
id 00000006 gr PR rq IV pid 7973 master 1 "P_#orphans"
id 00000003 gr PR rq IV pid 7978 master 1 "P_#orphans"
/sys/kernel/debug/dlm/clvmd:
/sys/kernel/debug/dlm/clvmd:Resource ffff880ffd6a0780 Name (len=10) "P_#orphans"  
/sys/kernel/debug/dlm/clvmd:Local Copy, Master is node 1
/sys/kernel/debug/dlm/clvmd:Granted Queue
/sys/kernel/debug/dlm/clvmd:0000000f PR Master:     00000035
/sys/kernel/debug/dlm/clvmd:0000000e PR Master:     00000034
/sys/kernel/debug/dlm/clvmd:0000000d PR Master:     00000033
/sys/kernel/debug/dlm/clvmd:0000000c PR Master:     00000032
/sys/kernel/debug/dlm/clvmd:00000009 PR Master:     00000031
/sys/kernel/debug/dlm/clvmd:00000008 PR Master:     00000027
/sys/kernel/debug/dlm/clvmd:00000006 PR Master:     00000026
/sys/kernel/debug/dlm/clvmd:00000003 PR Master:     00000021
/sys/kernel/debug/dlm/clvmd:Conversion Queue
/sys/kernel/debug/dlm/clvmd:Waiting Queue
/sys/kernel/debug/dlm/clvmd_all:version rsb 1.1 lvb 1.1 lkb 1.1
/sys/kernel/debug/dlm/clvmd_all:rsb ffff880ffd6a0780 1 0 0 0 0 0 10 str P_#orphans
/sys/kernel/debug/dlm/clvmd_all:lkb f 1 35 7978 0 0 1 2 3 -1 0 0 0 0 1027921252123 0
/sys/kernel/debug/dlm/clvmd_all:lkb e 1 34 7973 0 0 1 2 3 -1 0 0 0 0 1027921246812 0
/sys/kernel/debug/dlm/clvmd_all:lkb d 1 33 7969 0 0 1 2 3 -1 0 0 0 0 1027920985613 0
/sys/kernel/debug/dlm/clvmd_all:lkb c 1 32 7981 0 0 1 2 3 -1 0 0 0 0 1027920982663 0
/sys/kernel/debug/dlm/clvmd_all:lkb 9 1 31 7981 0 0 1 2 3 -1 0 0 0 0 967540896588 0
/sys/kernel/debug/dlm/clvmd_all:lkb 8 1 27 7969 0 0 1 2 3 -1 0 0 0 0 967540597918 0
/sys/kernel/debug/dlm/clvmd_all:lkb 6 1 26 7973 0 0 1 2 3 -1 0 0 0 0 967540593158 0
/sys/kernel/debug/dlm/clvmd_all:lkb 3 1 21 7978 0 0 1 2 3 -1 0 0 0 0 967540480118 0
/sys/kernel/debug/dlm/clvmd_locks:id nodeid remid pid xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
/sys/kernel/debug/dlm/clvmd_locks:f 1 35 7978 0 0 1 2 3 -1 149254003107 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:e 1 34 7973 0 0 1 2 3 -1 149254003116 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:d 1 33 7969 0 0 1 2 3 -1 149254003379 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:c 1 32 7981 0 0 1 2 3 -1 149254003384 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:9 1 31 7981 0 0 1 2 3 -1 149314383473 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:8 1 27 7969 0 0 1 2 3 -1 149314383773 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:6 1 26 7973 0 0 1 2 3 -1 149314383780 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_locks:3 1 21 7978 0 0 1 2 3 -1 149314383895 1 10 "P_#orphans"
/sys/kernel/debug/dlm/clvmd_toss:version 4 rsb 2
/sys/kernel/debug/dlm/clvmd_toss:rsb ffff880ff7abdb80 1 1 3 3 4295610045 0 64 str t0c3BXzhUkI75NESqFdX3YtvX32NJQHZfaVVH3kMgmDrSc9gfpgekmY9nm09EBR6
/sys/kernel/debug/dlm/clvmd_toss:rsb ffff880ff7abde00 1 1 3 3 4295609541 0 64 str t0c3BXzhUkI75NESqFdX3YtvX32NJQHZahnDEX5nmaknWmGiBidUFCfZbbmFMao0

Comment 5 Alasdair Kergon 2016-06-09 11:58:03 UTC
Please check all the nodes and confirm that the *only* active lvm commands running at the point of failure are all 'pvs'.  (I notice you mention 'pvscan' as well.)

Comment 6 Alasdair Kergon 2016-06-09 12:02:25 UTC
Or in other words, what does 'ps' on every node show is running?  Is there another process actually already causing the lock to be held?  If so, what is blocking it?
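A survey like the one requested here could be gathered with a short loop (node names come from the pcs output above; passwordless ssh is an assumption, so the DRY_RUN guard, on by default, just prints what would be run):

```shell
#!/bin/sh
# Sketch: list lvm-related processes on every cluster node.
nodes="mckinley-01 mckinley-02 mckinley-03 mckinley-04"
checked=0
for node in $nodes; do
    cmd="ssh $node ps -eo pid,stat,etime,args"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$cmd"                                        # dry run: print only
    else
        echo "== $node =="
        $cmd | grep -E '[p]vs|[v]gs|[l]vs|[p]vscan|[c]lvmd'
    fi
    checked=$((checked + 1))
done
```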

Comment 7 David Teigland 2016-06-09 18:20:34 UTC
The dlm does not show any locks being waited for (dlm_tool *lockdebug* is required, not *lockdump*).

As expected, this looks like a clvmd problem:

[root@mckinley-03 ~]# pvs -vvvv
#lvmcmdline.c:1582         DEGRADED MODE. Incomplete RAID LVs will be processed.
#libdm-config.c:1036       Setting activation/monitoring to 1
#lvmcmdline.c:1588         Processing: pvs -vvvv
#lvmcmdline.c:1589         system ID: 
#lvmcmdline.c:1592         O_DIRECT will be used
#libdm-config.c:972       Setting global/locking_type to 3
#libdm-config.c:1036       Setting global/wait_for_locks to 1
#locking/locking.c:156       Cluster locking selected.
#libdm-config.c:1036       Setting global/use_lvmlockd to 0
#libdm-config.c:1049       report/aligned not found in config: defaulting to 1
#libdm-config.c:1049       report/buffered not found in config: defaulting to 1
#libdm-config.c:1049       report/headings not found in config: defaulting to 1
#libdm-config.c:950       report/separator not found in config: defaulting to  
#libdm-config.c:1049       report/prefixes not found in config: defaulting to 0
#libdm-config.c:1049       report/quoted not found in config: defaulting to 1
#libdm-config.c:1049       report/colums_as_rows not found in config: defaulting to 0
#libdm-config.c:950       report/pvs_sort not found in config: defaulting to pv_name
#libdm-config.c:950       report/pvs_cols_verbose not found in config: defaulting to pv_name,vg_name,pv_fmt,pv_attr,pv_size,pv_free,dev_size,pv_uuid
#locking/cluster_locking.c:512       Locking VG P_#global CR (VG) (0x1)

(waits here for a long timeout period)

#locking/cluster_locking.c:398   Error locking on node 4: Command timed out
#locking/cluster_locking.c:398   Error locking on node 1: Command timed out
#locking/locking.c:284         <backtrace>
#locking/locking.c:356         <backtrace>
#reporter.c:903   Unable to obtain global lock.
#libdm-config.c:1036       Setting global/notify_dbus to 1
#lvmcmdline.c:1716         Completed: pvs -vvvv

Comment 8 David Teigland 2016-06-09 18:28:49 UTC
While pvs above is waiting for the long timeout period, clvmd has these threads, where the first thread is the one running for the command:


(gdb) thread apply all bt
Thread 6 (Thread 0x7f8025aed700 (LWP 18104)):
#0  0x00007f8029a916d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f802ad8e2cd in pre_and_post_thread ()
#2  0x00007f8029a8ddc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f80293afe2d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f80262ee700 (LWP 3410)):
#0  0x00007f8029a9422d in read () from /lib64/libpthread.so.0
#1  0x00007f8029ca3945 in do_dlm_dispatch_v6 () from /lib64/libdlm.so.3
#2  0x00007f8029ca3d6f in dlm_recv_thread () from /lib64/libdlm.so.3
#3  0x00007f8029a8ddc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f80293afe2d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f802abe6700 (LWP 3411)):
#0  0x00007f8029a916d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f802ad8d8ef in lvm_thread_fn ()
#2  0x00007f8029a8ddc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f80293afe2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f802ad4d880 (LWP 3404)):
#0  0x00007f80293a7553 in select () from /lib64/libc.so.6
#1  0x00007f802ad8b2a2 in main ()

Comment 9 David Teigland 2016-06-09 18:42:31 UTC
If anyone is interested in seeing more info from clvmd I saved a core file of clvmd from comment 8.
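A core like this can be captured without stopping the daemon; the following is a hedged sketch using standard tools (gcore and gdb), which skips the capture when no clvmd is running:

```shell
#!/bin/sh
# Sketch: snapshot a running clvmd and dump all thread backtraces.
if pid=$(pidof clvmd 2>/dev/null); then
    gcore -o /tmp/clvmd.core "$pid"                      # core without killing the daemon
    gdb -batch -ex 'thread apply all bt' \
        "$(command -v clvmd)" "/tmp/clvmd.core.$pid"     # backtraces as in comment 8
    captured=yes
else
    captured=no                                          # daemon absent; nothing to capture
fi
echo "core captured: $captured"
```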

Comment 10 David Teigland 2016-06-09 19:09:32 UTC
[root@mckinley-01 ~]# dlm_tool lockdebug clvmd

Resource len 64  "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master:2         
Granted
00000002 CR      Master:   2 00000004          

Resource len 10  "P_#orphans"
Master           
Granted
00000007 PR      Remote:   2 00000012          
0000003f PR      Remote:   2 00000008          

-----------------------------------------------------------------------------------

[root@mckinley-02 ~]# dlm_tool lockdebug clvmd

Resource len 64  "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master           
Granted
00000004 CR      Remote:   1 00000002          
00000003 CR      Remote:   4 00000001          
00000002 CR      Remote:   3 00000001          
00000001 CR                                    

Resource len 10  "P_#orphans"
Master:1         
Granted
00000012 PR      Master:   1 00000007          
00000008 PR      Master:   1 0000003f          

-----------------------------------------------------------------------------------

[root@mckinley-03 ~]# dlm_tool lockdebug clvmd

Resource len 64  "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master:2         
Granted
00000001 CR      Master:   2 00000002          

Resource len 10  "P_#orphans"
Master:1 Dir

-----------------------------------------------------------------------------------

[root@mckinley-04 ~]# dlm_tool lockdebug clvmd

Resource len 64  "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master:2         
Granted
00000001 CR      Master:   2 00000003

Comment 11 David Teigland 2016-06-09 19:18:52 UTC
clvmd holding two orphan locks on mckinley-02 does not look right.  I think there must be a threading race in clvmd that causes it to acquire two locks on the same resource.

The last lvm commands run by the test on that node, presumably the last before it got stuck, are two concurrent pvs commands:

Jun  9 11:30:26 mckinley-02 qarshd[8532]: Running cmdline: pvs -o pv_attr -O pv_attr
Jun  9 11:30:26 mckinley-02 qarshd[8531]: Running cmdline: pvs -o dev_size -O pv_tags

Comment 12 David Teigland 2016-06-09 19:30:13 UTC
[root@mckinley-02 ~]# pvs -vvvv
#lvmcmdline.c:1582         DEGRADED MODE. Incomplete RAID LVs will be processed.
#libdm-config.c:1036       Setting activation/monitoring to 1
#lvmcmdline.c:1588         Processing: pvs -vvvv
#lvmcmdline.c:1589         system ID: 
#lvmcmdline.c:1592         O_DIRECT will be used
#libdm-config.c:972       Setting global/locking_type to 3
#libdm-config.c:1036       Setting global/wait_for_locks to 1
#locking/locking.c:156       Cluster locking selected.
#libdm-config.c:1036       Setting global/use_lvmlockd to 0
#libdm-config.c:1049       report/aligned not found in config: defaulting to 1
#libdm-config.c:1049       report/buffered not found in config: defaulting to 1
#libdm-config.c:1049       report/headings not found in config: defaulting to 1
#libdm-config.c:950       report/separator not found in config: defaulting to  
#libdm-config.c:1049       report/prefixes not found in config: defaulting to 0
#libdm-config.c:1049       report/quoted not found in config: defaulting to 1
#libdm-config.c:1049       report/colums_as_rows not found in config: defaulting to 0
#libdm-config.c:950       report/pvs_sort not found in config: defaulting to pv_name
#libdm-config.c:950       report/pvs_cols_verbose not found in config: defaulting to pv_name,vg_name,pv_fmt,pv_attr,pv_size,pv_free,dev_size,pv_uuid
#locking/cluster_locking.c:512       Locking VG P_#global CR (VG) (0x1)

(waiting)

#locking/cluster_locking.c:398   Error locking on node 3: Command timed out
#locking/locking.c:284         <backtrace>
#locking/locking.c:356         <backtrace>
#reporter.c:903   Unable to obtain global lock.
#libdm-config.c:1036       Setting global/notify_dbus to 1
#lvmcmdline.c:1716         Completed: pvs -vvvv


While pvs is waiting, the local dlm lock state is:

[root@mckinley-02 ~]# dlm_tool lockdebug clvmd

Resource len  9  "P_#global"
Master           
Granted
00000005 PR                                    

Resource len 64  "OX2lt6T8CSljIPAzdSiC15GAxipborghhHFZcZ9K5U3TSFtOVjUHaSU6Bow6Tohe"
Master           
Granted
00000004 CR      Remote:   1 00000002          
00000003 CR      Remote:   4 00000001          
00000002 CR      Remote:   3 00000001          
00000001 CR                                    

Resource len 10  "P_#orphans"
Master:1         
Granted
00000012 PR      Master:   1 00000007          
00000008 PR      Master:   1 0000003f          


This means it has been granted the "P_#global" lock in PR mode, not the CR mode that the log message seems to imply before blocking. Is that just a mistake in the message logging?

Next I tried running 'vgs -vvvv', and that got stuck in other ways:

...
#toollib.c:1894       Process single VG stripe_5_666
#metadata/vg.c:88         Freeing VG stripe_5_666 at 0x7f5ff17fec60.
#mm/memlock.c:557         Unlock: Memlock counters: locked:0 critical:0 daemon:0 suspended:0
#locking/cluster_locking.c:432       Requesting sync names.

(long wait)

#locking/cluster_locking.c:398   Error locking on node 3: Command timed out
#locking/locking.c:284         <backtrace>
#locking/locking.c:356         <backtrace>
#toollib.c:1905         <backtrace>
#locking/cluster_locking.c:512       Locking VG V_stripe_5_666 UN (VG) (0x6)
...

#toollib.c:1894       Process single VG rhel_mckinley-02
#metadata/vg.c:88         Freeing VG rhel_mckinley-02 at 0x7f5ff1802c80.
#mm/memlock.c:557         Unlock: Memlock counters: locked:0 critical:0 daemon:0 suspended:0
#locking/cluster_locking.c:432       Requesting sync names.

(long wait)

#locking/cluster_locking.c:398   Error locking on node 3: Command timed out
#locking/locking.c:284         <backtrace>
#locking/locking.c:356         <backtrace>
#toollib.c:1905         <backtrace>
#locking/cluster_locking.c:512       Locking VG V_rhel_mckinley-02 UN (VG) (0x6)
#device/dev-io.c:614         Closed /dev/sda2
#device/dev-io.c:614         Closed /dev/sdb1
#metadata/vg.c:88         Freeing VG rhel_mckinley-02 at 0x7f5ff17fac60.
#metadata/vg.c:88         Freeing VG rhel_mckinley-02 at 0x7f5ff17f6c50.
#libdm-config.c:1049       report/compact_output not found in config: defaulting to 0
#libdm-config.c:950       report/compact_output_cols not found in config: defaulting to 
#libdm-report.c:4083   VG               Attr   Ext   #PV #LV #SN VSize   VFree VG UUID                                VProfile
#libdm-report.c:4411   rhel_mckinley-02 wz--n- 4.00m   2   3   0 557.91g    0  2DpBYR-3tuh-8zJC-3Hpt-vyzj-0zdK-YI8uaf         
#libdm-report.c:4411   stripe_5_666     wz--nc 4.00m  10   1   0   1.22t 1.22t OX2lt6-T8CS-ljIP-AzdS-iC15-GAxi-pborgh         
#libdm-config.c:1036       Setting global/notify_dbus to 1
#lvmcmdline.c:1716         Completed: vgs -vvvv

Comment 13 David Teigland 2016-06-09 19:43:08 UTC
(gdb) thread apply all bt

Thread 7 (Thread 0x7f7bcc547700 (LWP 24125)):
#0  0x00007f7bd050c6d5 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f7bd18092cd in pre_and_post_thread (arg=0x7f7bd33e5b60) at clvmd.c:1741
#2  0x00007f7bd0508dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7bcfe2ae2d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f7bccd69700 (LWP 3352)):
#0  0x00007f7bd050f22d in read () from /lib64/libpthread.so.0
#1  0x00007f7bd071e945 in do_dlm_dispatch_v6 () from /lib64/libdlm.so.3
#2  0x00007f7bd071ed6f in dlm_recv_thread () from /lib64/libdlm.so.3
#3  0x00007f7bd0508dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f7bcfe2ae2d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f7bd1661700 (LWP 3353)):
#0  0x00007f7bd050c6d5 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f7bd18088ef in lvm_thread_fn (arg=<optimized out>) at clvmd.c:2069
#2  0x00007f7bd0508dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7bcfe2ae2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f7bd17c8880 (LWP 3346)):
#0  0x00007f7bcfe22553 in select () from /lib64/libc.so.6
#1  0x00007f7bd18062a2 in main_loop (cmd_timeout=60) at clvmd.c:903
#2  main (argc=<optimized out>, argv=<optimized out>) at clvmd.c:627

Comment 14 Corey Marthaler 2016-06-09 22:53:48 UTC
I was unable to reproduce this bug with clvmd running in debug mode. I'll let the tests run over night...

