> Description of problem:

In bz1353738 the pcs and pacemaker components were modified to improve the process of stopping a cluster, to avoid hangs when shared storage is used. I was running a start/stop loop to verify that no further problems arise; however, after several hours one of the nodes was fenced. Running the loop several more times, I observed two situations where clvmd either hangs or causes a kernel panic during the stop action, and I gathered tracebacks from the servers' serial consoles (attached later). The cluster runs a cman+pacemaker stack consisting of 8 nodes. I also tried disabling fencing in pacemaker, but fencing still happened, so it might have been invoked by dlm instead.

The lvm config looks like this on each machine:

[root@virt-155 ~]# grep " use_lvmetad\| locking_type" /etc/lvm/lvm.conf
    locking_type = 3
    use_lvmetad = 0

[root@virt-155 ~]# lvs; vgs; pvs
  LV        VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  clusterLV clusterVG  -wi-a-----   2.00g
  lv_root   vg_virt155 -wi-ao----   7.25g
  lv_swap   vg_virt155 -wi-ao---- 880.00m
  VG         #PV #LV #SN Attr   VSize VFree
  clusterVG    1   1   0 wz--nc 2.00g    0
  vg_virt155   1   2   0 wz--n- 8.11g    0
  PV        VG         Fmt  Attr PSize PFree
  /dev/sda  clusterVG  lvm2 a--u 2.00g    0
  /dev/vda2 vg_virt155 lvm2 a--u 8.11g    0

> Version-Release number of selected component (if applicable):

lvm2-cluster-2.02.143-7.el6_8.1.x86_64
kernel-2.6.32-663.el6.x86_64, kernel-2.6.32-667.el6.x86_64
udev-147-2.73.el6_8.2.x86_64
device-mapper-1.02.117-7.el6_8.1.x86_64
pacemaker-1.1.15-2.el6.x86_64
cman-3.0.12.1-79.el6.x86_64
pcs-0.9.154-1.el6.x86_64

> How reproducible:

Not easily; it may take several hours to reproduce when running a start/stop loop.

> Steps to Reproduce:

while true; do
  echo -n "$(date) "
  pcs cluster start --all --wait >/dev/null
  sleep 30
  echo -n "$(crm_mon -s) "
  echo "$(date) "
  pcs cluster stop --all >/dev/null
  sleep 10
done

(A variant of this loop with per-iteration logging is sketched below.)

> Actual results:

The loop above prints timestamps for each attempt and hangs after a certain amount of time because one of the nodes gets fenced.

> Expected results:

The loop runs forever, never hangs, and no fencing occurs.
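For longer unattended runs, a variant of the loop like the sketch below can make the point of failure easier to spot, since a stuck 'pcs cluster stop' gets logged instead of blocking silently. This is only a sketch: the log path and the 600-second limit are arbitrary choices, and it assumes coreutils' timeout(1) is available (it is on RHEL 6).

#!/bin/sh
# Sketch: the reproducer loop with per-iteration logging and a bounded
# stop, so a hang shows up in the log rather than freezing the loop.
# LOG path and the 600s limit are arbitrary; adjust as needed.
LOG=/var/log/cluster-loop.log
i=0
while true; do
  i=$((i + 1))
  echo "$(date) iteration $i: start" >> "$LOG"
  pcs cluster start --all --wait > /dev/null
  sleep 30
  crm_mon -s >> "$LOG"                  # one-line cluster status summary
  echo "$(date) iteration $i: stop" >> "$LOG"
  if ! timeout 600 pcs cluster stop --all > /dev/null; then
    echo "$(date) iteration $i: stop timed out (hang or fence?)" >> "$LOG"
    break
  fi
  sleep 10
done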
A clvmd process hang, observed more often and on multiple nodes:

INFO: task clvmd:26112 blocked for more than 120 seconds.
      Not tainted 2.6.32-667.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd         D 0000000000000000     0 26112      1 0x00000080
 ffff880039c2fc80 0000000000000082 ffff88003e38f180 ffff880039c2fbe8
 ffffea00004fc920 ffff8800000116c0 ffff880000019b00 ffff880039c2fc18
 ffffffff811455de 0000000000000286 ffff880039dbf068 ffff880039c2ffd8
Call Trace:
 [<ffffffff811455de>] ? add_page_to_unevictable_list+0x9e/0xc0
 [<ffffffff81159b5f>] ? handle_pte_fault+0x4af/0xb20
 [<ffffffff8154be35>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff811779ca>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8154bfc6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff812a83f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8154b4c4>] ? down_read+0x24/0x30
 [<ffffffffa0bd7257>] dlm_user_request+0x47/0x1b0 [dlm]
 [<ffffffff8115e80b>] ? __split_vma+0x20b/0x280
 [<ffffffff81180323>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
 [<ffffffffa0be4177>] device_write+0x5c7/0x720 [dlm]
 [<ffffffff8123ab16>] ? security_file_permission+0x16/0x20
 [<ffffffff81199d28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b216>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a861>] sys_write+0x51/0xb0
 [<ffffffff8154ca4e>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
A kernel panic caused by clvmd (observed just once; I'll try to get a kdump next time if possible):

DLM (built Oct 14 2016 11:08:59) installed
dlm: Using TCP for communications
dlm: got connection from 3
dlm: got connection from 6
dlm: got connection from 8
dlm: got connection from 1
dlm: connecting to 7
dlm: got connection from 7
dlm: connecting to 5
dlm: got connection from 5
dlm: connecting to 2
dlm: canceled rwork for node 5
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/clvmd/event_done
CPU 0
Modules linked in: dlm configfs iptable_filter ip_tables autofs4 sg sd_mod crc_t10dif be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 dm_multipath microcode i6300esb virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: configfs]

Pid: 27774, comm: clvmd Not tainted 2.6.32-663.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff8154c055>]  [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
RSP: 0018:ffff88003db6bca8  EFLAGS: 00010087
RAX: 0000000000010000 RBX: 0000000000000000 RCX: 000000000000003f
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff10003deafd48
RBP: ffff88003db6bca8 R08: ffffffff8160da00 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88003a74ac90 R12: ffff88003e09bbc0
R13: ffffffff8160da00 R14: ffff88003a958d00 R15: 0000000000000000
FS:  00007f54e8d5d7a0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc6f3d18000 CR3: 000000003a911000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process clvmd (pid: 27774, threadinfo ffff88003db68000, task ffff88003a944040)
Stack:
 ffff88003db6bd48 ffffffff810a096d ffff10003deafd48 ffff88003db6bcf8
<d> ffff88003e09bbc8 0000000000000000 ffff88003db6bcf8 ffffffff8113eacf
<d> ffffea0000caa378 0000000000000000 ffff88003db6bd28 ffffffff8113eb60
Call Trace:
 [<ffffffff810a096d>] __cancel_work_timer+0xad/0x1e0
 [<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
 [<ffffffff810a0ad0>] cancel_work_sync+0x10/0x20
 [<ffffffffa065c594>] close_connection+0xb4/0x120 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cf15>] free_conn+0x25/0x90 [dlm]
 [<ffffffffa065b12a>] foreach_conn+0x3a/0x60 [dlm]
 [<ffffffffa065b553>] dlm_lowcomms_stop+0x73/0x90 [dlm]
 [<ffffffffa0658e4a>] threads_stop+0x1a/0x30 [dlm]
 [<ffffffffa0658fc5>] dlm_release_lockspace+0x165/0x480 [dlm]
 [<ffffffffa0661e72>] device_write+0x2b2/0x720 [dlm]
 [<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
 [<ffffffff81199ca8>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a7e1>] sys_write+0x51/0xb0
 [<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: c1 74 0e f3 90 66 66 90 66 90 83 3f 00 75 f4 eb df 48 89 d0 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 66 90 66 66 90 b8 00 00 01 00 <3e> 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 66 66 90 66 90
RIP  [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
 RSP <ffff88003db6bca8>
---[ end trace 951a2fb0d63bf236 ]---
Kernel panic - not syncing: Fatal exception
Pid: 27774, comm: clvmd Tainted: G      D    -- ------------    2.6.32-663.el6.x86_64 #1
Call Trace:
 [<ffffffff81548551>] ? panic+0xa7/0x179
 [<ffffffff8154d3e4>] ? oops_end+0xe4/0x100
 [<ffffffff8101102b>] ? die+0x5b/0x90
 [<ffffffff8154cea2>] ? do_general_protection+0x152/0x160
 [<ffffffff8154c615>] ? general_protection+0x25/0x30
 [<ffffffff8154c055>] ? _spin_lock_irq+0x15/0x40
 [<ffffffff810a096d>] ? __cancel_work_timer+0xad/0x1e0
 [<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
 [<ffffffff810a0ad0>] ? cancel_work_sync+0x10/0x20
 [<ffffffffa065c594>] ? close_connection+0xb4/0x120 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cf15>] ? free_conn+0x25/0x90 [dlm]
 [<ffffffffa065b12a>] ? foreach_conn+0x3a/0x60 [dlm]
 [<ffffffffa065b553>] ? dlm_lowcomms_stop+0x73/0x90 [dlm]
 [<ffffffffa0658e4a>] ? threads_stop+0x1a/0x30 [dlm]
 [<ffffffffa0658fc5>] ? dlm_release_lockspace+0x165/0x480 [dlm]
 [<ffffffffa0661e72>] ? device_write+0x2b2/0x720 [dlm]
 [<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
 [<ffffffff81199ca8>] ? vfs_write+0xb8/0x1a0
 [<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a7e1>] ? sys_write+0x51/0xb0
 [<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
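To capture a vmcore if the panic reproduces, a rough outline of how kdump would be set up on these RHEL 6 guests follows; this is only a sketch, and the crashkernel size and dump path are guesses to be adjusted per machine.

#!/bin/sh
# Sketch: enable kdump on RHEL 6 so a panic produces a vmcore.
# The crashkernel size and dump path below are assumptions.
#
# 1) Reserve memory for the capture kernel by appending to the kernel
#    line in /boot/grub/grub.conf, e.g.:
#        crashkernel=128M
# 2) Set the dump target in /etc/kdump.conf, e.g.:
#        path /var/crash
# 3) Enable the service; the memory reservation takes effect after a reboot:
chkconfig kdump on
# (reboot here so crashkernel= is honored)
service kdump start
service kdump status   # should report the kdump kernel is loaded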
More info is surely needed here. Please provide a complete trace from:

  echo t > /proc/sysrq-trigger

Also try attaching gdb to clvmd in case it blocks, and capture a backtrace of all threads:

  (gdb) thread apply all backtrace

Plus the output of:

  dmsetup table
  dmsetup status
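Once clvmd blocks, the requested data could be gathered with something like the sketch below; the output directory and the pidof lookup are assumptions to adjust for the actual environment.

#!/bin/sh
# Sketch: collect the diagnostics requested above after clvmd blocks.
OUT=/tmp/clvmd-debug-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"

# Dump the state of all tasks into the kernel ring buffer, then save it.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > "$OUT/sysrq-t.txt"

# Backtrace of every clvmd thread, taken non-interactively.
PID=$(pidof clvmd)
gdb -p "$PID" -batch -ex 'thread apply all backtrace' > "$OUT/gdb-clvmd.txt" 2>&1

# Device-mapper state.
dmsetup table  > "$OUT/dmsetup-table.txt"
dmsetup status > "$OUT/dmsetup-status.txt"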
From comment 2 it looks like a kernel issue in the 'dlm' code.
Looks like a duplicate of bug 1365204.
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation.

Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/