> Description of problem:

In bz1353738 the pcs and pacemaker components were modified to improve the process of stopping a cluster, to avoid hangs when shared storage is used. I was running a start/stop loop to verify that no further problems arise; however, after several hours one of the nodes was fenced. Running the loop several more times, I observed two situations where clvmd either hangs or causes a kernel panic during the stop action, and I gathered tracebacks from the servers' serial consoles (attached later). The cluster runs a cman+pacemaker stack consisting of 8 nodes. I also tried disabling fencing in pacemaker, but fencing still happened, so it might have been invoked by dlm instead.

The lvm config looks like this on each machine:

[root@virt-155 ~]# grep " use_lvmetad\| locking_type" /etc/lvm/lvm.conf
    locking_type = 3
    use_lvmetad = 0

[root@virt-155 ~]# lvs; vgs; pvs
  LV        VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  clusterLV clusterVG  -wi-a-----   2.00g
  lv_root   vg_virt155 -wi-ao----   7.25g
  lv_swap   vg_virt155 -wi-ao---- 880.00m
  VG         #PV #LV #SN Attr   VSize VFree
  clusterVG    1   1   0 wz--nc 2.00g    0
  vg_virt155   1   2   0 wz--n- 8.11g    0
  PV        VG         Fmt  Attr PSize PFree
  /dev/sda  clusterVG  lvm2 a--u 2.00g    0
  /dev/vda2 vg_virt155 lvm2 a--u 8.11g    0

> Version-Release number of selected component (if applicable):

lvm2-cluster-2.02.143-7.el6_8.1.x86_64
kernel-2.6.32-663.el6.x86_64, kernel-2.6.32-667.el6.x86_64
udev-147-2.73.el6_8.2.x86_64
device-mapper-1.02.117-7.el6_8.1.x86_64
pacemaker-1.1.15-2.el6.x86_64
cman-3.0.12.1-79.el6.x86_64
pcs-0.9.154-1.el6.x86_64

> How reproducible:

Not easily; it may take several hours to reproduce when running a start/stop loop.

> Steps to Reproduce:

while true; do
  echo -n "$(date) "
  pcs cluster start --all --wait >/dev/null
  sleep 30
  echo -n "$(crm_mon -s) "
  echo "$(date) "
  pcs cluster stop --all >/dev/null
  sleep 10
done

(A variant of this loop with per-iteration logging is sketched below.)

> Actual results:

The loop above prints timestamps for each attempt and hangs after a certain amount of time because one of the nodes gets fenced.

> Expected results:

The loop runs forever, never hangs, and no fencing occurs.
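For longer unattended runs, a variant of the loop like the sketch below can make the point of failure easier to spot, since a stuck 'pcs cluster stop' gets logged instead of blocking silently. This is only a sketch: the log path and the 600-second limit are arbitrary choices, and it assumes coreutils' timeout(1) is available (it is on RHEL 6).

#!/bin/sh
# Sketch: the reproducer loop with per-iteration logging and a bounded
# stop, so a hang shows up in the log rather than freezing the loop.
# LOG path and the 600s limit are arbitrary; adjust as needed.
LOG=/var/log/cluster-loop.log
i=0
while true; do
  i=$((i + 1))
  echo "$(date) iteration $i: start" >> "$LOG"
  pcs cluster start --all --wait > /dev/null
  sleep 30
  crm_mon -s >> "$LOG"                  # one-line cluster status summary
  echo "$(date) iteration $i: stop" >> "$LOG"
  if ! timeout 600 pcs cluster stop --all > /dev/null; then
    echo "$(date) iteration $i: stop timed out (hang or fence?)" >> "$LOG"
    break
  fi
  sleep 10
done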
A clvmd process hang, observed more often and on multiple nodes:

INFO: task clvmd:26112 blocked for more than 120 seconds.
      Not tainted 2.6.32-667.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd         D 0000000000000000     0 26112      1 0x00000080
 ffff880039c2fc80 0000000000000082 ffff88003e38f180 ffff880039c2fbe8
 ffffea00004fc920 ffff8800000116c0 ffff880000019b00 ffff880039c2fc18
 ffffffff811455de 0000000000000286 ffff880039dbf068 ffff880039c2ffd8
Call Trace:
 [<ffffffff811455de>] ? add_page_to_unevictable_list+0x9e/0xc0
 [<ffffffff81159b5f>] ? handle_pte_fault+0x4af/0xb20
 [<ffffffff8154be35>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff811779ca>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8154bfc6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff812a83f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8154b4c4>] ? down_read+0x24/0x30
 [<ffffffffa0bd7257>] dlm_user_request+0x47/0x1b0 [dlm]
 [<ffffffff8115e80b>] ? __split_vma+0x20b/0x280
 [<ffffffff81180323>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
 [<ffffffffa0be4177>] device_write+0x5c7/0x720 [dlm]
 [<ffffffff8123ab16>] ? security_file_permission+0x16/0x20
 [<ffffffff81199d28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b216>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a861>] sys_write+0x51/0xb0
 [<ffffffff8154ca4e>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
A kernel panic caused by clvmd (observed just once; I'll try to get a kdump next time if possible):

DLM (built Oct 14 2016 11:08:59) installed
dlm: Using TCP for communications
dlm: got connection from 3
dlm: got connection from 6
dlm: got connection from 8
dlm: got connection from 1
dlm: connecting to 7
dlm: got connection from 7
dlm: connecting to 5
dlm: got connection from 5
dlm: connecting to 2
dlm: canceled rwork for node 5
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/clvmd/event_done
CPU 0
Modules linked in: dlm configfs iptable_filter ip_tables autofs4 sg sd_mod crc_t10dif be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 dm_multipath microcode i6300esb virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: configfs]

Pid: 27774, comm: clvmd Not tainted 2.6.32-663.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff8154c055>]  [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
RSP: 0018:ffff88003db6bca8  EFLAGS: 00010087
RAX: 0000000000010000 RBX: 0000000000000000 RCX: 000000000000003f
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff10003deafd48
RBP: ffff88003db6bca8 R08: ffffffff8160da00 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88003a74ac90 R12: ffff88003e09bbc0
R13: ffffffff8160da00 R14: ffff88003a958d00 R15: 0000000000000000
FS:  00007f54e8d5d7a0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc6f3d18000 CR3: 000000003a911000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process clvmd (pid: 27774, threadinfo ffff88003db68000, task ffff88003a944040)
Stack:
 ffff88003db6bd48 ffffffff810a096d ffff10003deafd48 ffff88003db6bcf8
<d> ffff88003e09bbc8 0000000000000000 ffff88003db6bcf8 ffffffff8113eacf
<d> ffffea0000caa378 0000000000000000 ffff88003db6bd28 ffffffff8113eb60
Call Trace:
 [<ffffffff810a096d>] __cancel_work_timer+0xad/0x1e0
 [<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
 [<ffffffff810a0ad0>] cancel_work_sync+0x10/0x20
 [<ffffffffa065c594>] close_connection+0xb4/0x120 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cf15>] free_conn+0x25/0x90 [dlm]
 [<ffffffffa065b12a>] foreach_conn+0x3a/0x60 [dlm]
 [<ffffffffa065b553>] dlm_lowcomms_stop+0x73/0x90 [dlm]
 [<ffffffffa0658e4a>] threads_stop+0x1a/0x30 [dlm]
 [<ffffffffa0658fc5>] dlm_release_lockspace+0x165/0x480 [dlm]
 [<ffffffffa0661e72>] device_write+0x2b2/0x720 [dlm]
 [<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
 [<ffffffff81199ca8>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a7e1>] sys_write+0x51/0xb0
 [<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: c1 74 0e f3 90 66 66 90 66 90 83 3f 00 75 f4 eb df 48 89 d0 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 66 90 66 66 90 b8 00 00 01 00 <3e> 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 66 66 90 66 90
RIP  [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
 RSP <ffff88003db6bca8>
---[ end trace 951a2fb0d63bf236 ]---
Kernel panic - not syncing: Fatal exception
Pid: 27774, comm: clvmd Tainted: G      D    -- ------------    2.6.32-663.el6.x86_64 #1
Call Trace:
 [<ffffffff81548551>] ? panic+0xa7/0x179
 [<ffffffff8154d3e4>] ? oops_end+0xe4/0x100
 [<ffffffff8101102b>] ? die+0x5b/0x90
 [<ffffffff8154cea2>] ? do_general_protection+0x152/0x160
 [<ffffffff8154c615>] ? general_protection+0x25/0x30
 [<ffffffff8154c055>] ? _spin_lock_irq+0x15/0x40
 [<ffffffff810a096d>] ? __cancel_work_timer+0xad/0x1e0
 [<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
 [<ffffffff810a0ad0>] ? cancel_work_sync+0x10/0x20
 [<ffffffffa065c594>] ? close_connection+0xb4/0x120 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
 [<ffffffffa065cf15>] ? free_conn+0x25/0x90 [dlm]
 [<ffffffffa065b12a>] ? foreach_conn+0x3a/0x60 [dlm]
 [<ffffffffa065b553>] ? dlm_lowcomms_stop+0x73/0x90 [dlm]
 [<ffffffffa0658e4a>] ? threads_stop+0x1a/0x30 [dlm]
 [<ffffffffa0658fc5>] ? dlm_release_lockspace+0x165/0x480 [dlm]
 [<ffffffffa0661e72>] ? device_write+0x2b2/0x720 [dlm]
 [<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
 [<ffffffff81199ca8>] ? vfs_write+0xb8/0x1a0
 [<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a7e1>] ? sys_write+0x51/0xb0
 [<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
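To capture a vmcore if the panic reproduces, a rough outline of how kdump would be set up on these RHEL 6 guests follows; this is only a sketch, and the crashkernel size and dump path are guesses to be adjusted per machine.

#!/bin/sh
# Sketch: enable kdump on RHEL 6 so a panic produces a vmcore.
# The crashkernel size and dump path below are assumptions.
#
# 1) Reserve memory for the capture kernel by appending to the kernel
#    line in /boot/grub/grub.conf, e.g.:
#        crashkernel=128M
# 2) Set the dump target in /etc/kdump.conf, e.g.:
#        path /var/crash
# 3) Enable the service; the memory reservation takes effect after a reboot:
chkconfig kdump on
# (reboot here so crashkernel= is honored)
service kdump start
service kdump status   # should report the kdump kernel is loaded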
More info is surely needed here. Please provide a complete trace from:

  echo t > /proc/sysrq-trigger

Also try attaching gdb to clvmd in case it blocks, and capture a backtrace of all threads:

  (gdb) thread apply all backtrace

Plus the output of:

  dmsetup table
  dmsetup status
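Once clvmd blocks, the requested data could be gathered with something like the sketch below; the output directory and the pidof lookup are assumptions to adjust for the actual environment.

#!/bin/sh
# Sketch: collect the diagnostics requested above after clvmd blocks.
OUT=/tmp/clvmd-debug-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"

# Dump the state of all tasks into the kernel ring buffer, then save it.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > "$OUT/sysrq-t.txt"

# Backtrace of every clvmd thread, taken non-interactively.
PID=$(pidof clvmd)
gdb -p "$PID" -batch -ex 'thread apply all backtrace' > "$OUT/gdb-clvmd.txt" 2>&1

# Device-mapper state.
dmsetup table  > "$OUT/dmsetup-table.txt"
dmsetup status > "$OUT/dmsetup-status.txt"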
From comment 2 it looks like a kernel issue in the 'dlm' code.
Looks like a duplicate of bug 1365204.
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation.

Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/