Bug 1392908

Summary: clvmd occasionally panics or hangs when stopping a pacemaker-based cluster
Product: Red Hat Enterprise Linux 6
Component: lvm2
lvm2 sub component: Clustering / clvmd (RHEL6)
Version: 6.9
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Reporter: Radek Steiger <rsteiger>
Assignee: David Teigland <teigland>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, rsteiger, zkabelac
Flags: zkabelac: needinfo? (rsteiger)
Type: Bug
Last Closed: 2017-12-06 10:37:33 UTC

Description Radek Steiger 2016-11-08 13:09:19 UTC
> Description of problem:

In bz1353738 the pcs and pacemaker components were modified to improve the process of stopping a cluster and avoid hangs when shared storage is used. I was running a start/stop loop to verify that no further problems arise; however, after several hours one of the nodes got fenced. Repeating the run several more times, I observed two situations where clvmd either hangs or causes a kernel panic during the stop action, and gathered tracebacks from the servers' serial consoles (attached below).

The cluster is running a cman+pacemaker stack consisting of 8 nodes. I've also tried disabling fencing in pacemaker, but fencing still happened, so it might have been invoked by dlm instead. The LVM config looks like this on each machine:

[root@virt-155 ~]# grep "  use_lvmetad\|  locking_type" /etc/lvm/lvm.conf
    locking_type = 3
    use_lvmetad = 0

[root@virt-155 ~]# lvs; vgs; pvs
  LV        VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  clusterLV clusterVG  -wi-a-----   2.00g                                                    
  lv_root   vg_virt155 -wi-ao----   7.25g                                                    
  lv_swap   vg_virt155 -wi-ao---- 880.00m                                                    
  VG         #PV #LV #SN Attr   VSize VFree
  clusterVG    1   1   0 wz--nc 2.00g    0 
  vg_virt155   1   2   0 wz--n- 8.11g    0 
  PV         VG         Fmt  Attr PSize PFree
  /dev/sda   clusterVG  lvm2 a--u 2.00g    0 
  /dev/vda2  vg_virt155 lvm2 a--u 8.11g    0


> Version-Release number of selected component (if applicable):

lvm2-cluster-2.02.143-7.el6_8.1.x86_64
kernel-2.6.32-663.el6.x86_64, kernel-2.6.32-667.el6.x86_64
udev-147-2.73.el6_8.2.x86_64
device-mapper-1.02.117-7.el6_8.1.x86_64
pacemaker-1.1.15-2.el6.x86_64
cman-3.0.12.1-79.el6.x86_64
pcs-0.9.154-1.el6.x86_64


> How reproducible:

Not easily; it may take several hours to reproduce when running a start/stop loop.


> Steps to Reproduce:

while true; do
  echo -n "$(date) ";
  pcs cluster start --all --wait >/dev/null;
  sleep 30;
  echo -n "$(crm_mon -s) ";
  echo "$(date) ";
  pcs cluster stop --all >/dev/null;
  sleep 10;
done
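
For unattended runs, a logging variant of the loop above makes it easier to match the last successful iteration against the serial console tracebacks afterwards. This is just a sketch; the log path is arbitrary:

LOG=/var/log/cluster-startstop.log   # arbitrary location
while true; do
  echo "$(date) starting cluster" >> "$LOG";
  pcs cluster start --all --wait >/dev/null;
  sleep 30;
  crm_mon -s >> "$LOG" 2>&1;
  echo "$(date) stopping cluster" >> "$LOG";
  pcs cluster stop --all >/dev/null;
  sleep 10;
done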


> Actual results:

The loop above prints timestamps for each attempt and, after a certain amount of time, hangs because one of the nodes gets fenced.


> Expected results:

The loop keeps running indefinitely, with no hangs and no fencing.

Comment 1 Radek Steiger 2016-11-08 13:10:13 UTC
A clvmd process hang, observed more frequently and on multiple nodes:

INFO: task clvmd:26112 blocked for more than 120 seconds.
      Not tainted 2.6.32-667.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd         D 0000000000000000     0 26112      1 0x00000080
 ffff880039c2fc80 0000000000000082 ffff88003e38f180 ffff880039c2fbe8
 ffffea00004fc920 ffff8800000116c0 ffff880000019b00 ffff880039c2fc18
 ffffffff811455de 0000000000000286 ffff880039dbf068 ffff880039c2ffd8
Call Trace:
 [<ffffffff811455de>] ? add_page_to_unevictable_list+0x9e/0xc0
 [<ffffffff81159b5f>] ? handle_pte_fault+0x4af/0xb20
 [<ffffffff8154be35>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff811779ca>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8154bfc6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff812a83f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8154b4c4>] ? down_read+0x24/0x30
 [<ffffffffa0bd7257>] dlm_user_request+0x47/0x1b0 [dlm]
 [<ffffffff8115e80b>] ? __split_vma+0x20b/0x280
 [<ffffffff81180323>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
 [<ffffffffa0be4177>] device_write+0x5c7/0x720 [dlm]
 [<ffffffff8123ab16>] ? security_file_permission+0x16/0x20
 [<ffffffff81199d28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b216>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a861>] sys_write+0x51/0xb0
 [<ffffffff8154ca4e>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Comment 2 Radek Steiger 2016-11-08 13:11:25 UTC
A kernel panic caused by clvmd (observed just once; I'll try to get a kdump next time if possible, see the kdump setup sketch after the trace below):

DLM (built Oct 14 2016 11:08:59) installed
dlm: Using TCP for communications
dlm: got connection from 3
dlm: got connection from 6
dlm: got connection from 8
dlm: got connection from 1
dlm: connecting to 7
dlm: got connection from 7
dlm: connecting to 5
dlm: got connection from 5
dlm: connecting to 2
dlm: canceled rwork for node 5
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/kernel/dlm/clvmd/event_done
CPU 0 
Modules linked in: dlm configfs iptable_filter ip_tables autofs4 sg sd_mod crc_t10dif be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 dm_multipath microcode i6300esb virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: configfs]

Pid: 27774, comm: clvmd Not tainted 2.6.32-663.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff8154c055>]  [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
RSP: 0018:ffff88003db6bca8  EFLAGS: 00010087
RAX: 0000000000010000 RBX: 0000000000000000 RCX: 000000000000003f
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff10003deafd48
RBP: ffff88003db6bca8 R08: ffffffff8160da00 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88003a74ac90 R12: ffff88003e09bbc0
R13: ffffffff8160da00 R14: ffff88003a958d00 R15: 0000000000000000
FS:  00007f54e8d5d7a0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc6f3d18000 CR3: 000000003a911000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process clvmd (pid: 27774, threadinfo ffff88003db68000, task ffff88003a944040)
Stack:
 ffff88003db6bd48 ffffffff810a096d ffff10003deafd48 ffff88003db6bcf8
<d> ffff88003e09bbc8 0000000000000000 ffff88003db6bcf8 ffffffff8113eacf
<d> ffffea0000caa378 0000000000000000 ffff88003db6bd28 ffffffff8113eb60
Call Trace:
 [<ffffffff810a096d>] __cancel_work_timer+0xad/0x1e0
 [<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
 [<ffffffff810a0ad0>] cancel_work_sync+0x10/0x20
 [<ffffffffa065c594>] close_connection+0xb4/0x120 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cf15>] free_conn+0x25/0x90 [dlm]
 [<ffffffffa065b12a>] foreach_conn+0x3a/0x60 [dlm]
 [<ffffffffa065b553>] dlm_lowcomms_stop+0x73/0x90 [dlm]
 [<ffffffffa0658e4a>] threads_stop+0x1a/0x30 [dlm]
 [<ffffffffa0658fc5>] dlm_release_lockspace+0x165/0x480 [dlm]
 [<ffffffffa0661e72>] device_write+0x2b2/0x720 [dlm]
 [<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
 [<ffffffff81199ca8>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a7e1>] sys_write+0x51/0xb0
 [<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: c1 74 0e f3 90 66 66 90 66 90 83 3f 00 75 f4 eb df 48 89 d0 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 66 90 66 66 90 b8 00 00 01 00 <3e> 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 66 66 90 66 90 
RIP  [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
 RSP <ffff88003db6bca8>
---[ end trace 951a2fb0d63bf236 ]---
Kernel panic - not syncing: Fatal exception
Pid: 27774, comm: clvmd Tainted: G      D    -- ------------    2.6.32-663.el6.x86_64 #1
Call Trace:
 [<ffffffff81548551>] ? panic+0xa7/0x179
 [<ffffffff8154d3e4>] ? oops_end+0xe4/0x100
 [<ffffffff8101102b>] ? die+0x5b/0x90
 [<ffffffff8154cea2>] ? do_general_protection+0x152/0x160
 [<ffffffff8154c615>] ? general_protection+0x25/0x30
 [<ffffffff8154c055>] ? _spin_lock_irq+0x15/0x40
 [<ffffffff810a096d>] ? __cancel_work_timer+0xad/0x1e0
 [<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
 [<ffffffff810a0ad0>] ? cancel_work_sync+0x10/0x20
 [<ffffffffa065c594>] ? close_connection+0xb4/0x120 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cf15>] ? free_conn+0x25/0x90 [dlm]
 [<ffffffffa065b12a>] ? foreach_conn+0x3a/0x60 [dlm]
 [<ffffffffa065b553>] ? dlm_lowcomms_stop+0x73/0x90 [dlm]
 [<ffffffffa0658e4a>] ? threads_stop+0x1a/0x30 [dlm]
 [<ffffffffa0658fc5>] ? dlm_release_lockspace+0x165/0x480 [dlm]
 [<ffffffffa0661e72>] ? device_write+0x2b2/0x720 [dlm]
 [<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
 [<ffffffff81199ca8>] ? vfs_write+0xb8/0x1a0
 [<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a7e1>] ? sys_write+0x51/0xb0
 [<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
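
For the record, a minimal sketch of enabling kdump on a RHEL 6 node so the next panic produces a vmcore (assuming the stock defaults: dump target /var/crash, configuration in /etc/kdump.conf):

# Install the kdump tooling.
yum install -y kexec-tools
# Reserve crash-kernel memory: add "crashkernel=auto" (or e.g. crashkernel=128M)
# to the kernel line in /boot/grub/grub.conf, then reboot.
# Enable and start the service.
chkconfig kdump on
service kdump start
# Confirm it is operational.
service kdump status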

Comment 3 Zdenek Kabelac 2016-11-08 13:17:07 UTC
More info is surely needed here.


Please provide the complete trace from:

'echo t > /proc/sysrq-trigger'


Also try attaching gdb to clvmd when it blocks, and capture a backtrace of all threads:

(gdb) thread apply all backtrace

Please also include the output of:

dmsetup table
dmsetup status
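
A rough sketch of collecting all of the above on a node where clvmd is blocked (output file names are arbitrary):

# Dump kernel stacks of all tasks; the output lands in the kernel log.
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t.txt

# Backtraces of all clvmd threads, non-interactively.
gdb -p "$(pidof clvmd)" -batch -ex 'thread apply all backtrace' > /tmp/clvmd-gdb.txt 2>&1

# Device-mapper state at the time of the hang.
dmsetup table  > /tmp/dmsetup-table.txt
dmsetup status > /tmp/dmsetup-status.txt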

Comment 4 Zdenek Kabelac 2016-11-08 13:21:56 UTC
From comment 2 it looks like a kernel issue in the 'dlm' code.

Comment 5 David Teigland 2016-11-08 17:41:54 UTC
Looks like a duplicate of bug 1365204.

Comment 9 Jan Kurik 2017-12-06 10:37:33 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/