| Summary: | clvmd occasionally panics or hangs when stopping pacemaker-based cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Radek Steiger <rsteiger> |
| Component: | lvm2 | Assignee: | David Teigland <teigland> |
| lvm2 sub component: | Clustering / clvmd (RHEL6) | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, rsteiger, zkabelac |
| Version: | 6.9 | Flags: | zkabelac: needinfo? (rsteiger) |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-06 10:37:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Radek Steiger
2016-11-08 13:09:19 UTC
A clvmd process hang has been observed repeatedly, on multiple nodes:
INFO: task clvmd:26112 blocked for more than 120 seconds.
Not tainted 2.6.32-667.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd D 0000000000000000 0 26112 1 0x00000080
ffff880039c2fc80 0000000000000082 ffff88003e38f180 ffff880039c2fbe8
ffffea00004fc920 ffff8800000116c0 ffff880000019b00 ffff880039c2fc18
ffffffff811455de 0000000000000286 ffff880039dbf068 ffff880039c2ffd8
Call Trace:
[<ffffffff811455de>] ? add_page_to_unevictable_list+0x9e/0xc0
[<ffffffff81159b5f>] ? handle_pte_fault+0x4af/0xb20
[<ffffffff8154be35>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff811779ca>] ? alloc_pages_current+0xaa/0x110
[<ffffffff8154bfc6>] rwsem_down_read_failed+0x26/0x30
[<ffffffff812a83f4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff8154b4c4>] ? down_read+0x24/0x30
[<ffffffffa0bd7257>] dlm_user_request+0x47/0x1b0 [dlm]
[<ffffffff8115e80b>] ? __split_vma+0x20b/0x280
[<ffffffff81180323>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
[<ffffffffa0be4177>] device_write+0x5c7/0x720 [dlm]
[<ffffffff8123ab16>] ? security_file_permission+0x16/0x20
[<ffffffff81199d28>] vfs_write+0xb8/0x1a0
[<ffffffff8119b216>] ? fget_light_pos+0x16/0x50
[<ffffffff8119a861>] sys_write+0x51/0xb0
[<ffffffff8154ca4e>] ? do_device_not_available+0xe/0x10
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
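The report does not list the exact shutdown commands. As a rough illustration only (assuming the stack is managed with pcs, as is typical for a pacemaker-based RHEL 6 cluster), the hang above and the panic below are hit while stopping the cluster stack, roughly like this:

```
# Stop the cluster stack on every node; clvmd/DLM teardown happens during this step.
pcs cluster stop --all

# Or stop only the local node.
pcs cluster stop
```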
A kernel panic caused by clvmd (observed just once, I'll try to get a kdump next time if possible):

DLM (built Oct 14 2016 11:08:59) installed
dlm: Using TCP for communications
dlm: got connection from 3
dlm: got connection from 6
dlm: got connection from 8
dlm: got connection from 1
dlm: connecting to 7
dlm: got connection from 7
dlm: connecting to 5
dlm: got connection from 5
dlm: connecting to 2
dlm: canceled rwork for node 5
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/clvmd/event_done
CPU 0
Modules linked in: dlm configfs iptable_filter ip_tables autofs4 sg sd_mod crc_t10dif be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 dm_multipath microcode i6300esb virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: configfs]
Pid: 27774, comm: clvmd Not tainted 2.6.32-663.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff8154c055>] [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
RSP: 0018:ffff88003db6bca8 EFLAGS: 00010087
RAX: 0000000000010000 RBX: 0000000000000000 RCX: 000000000000003f
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff10003deafd48
RBP: ffff88003db6bca8 R08: ffffffff8160da00 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88003a74ac90 R12: ffff88003e09bbc0
R13: ffffffff8160da00 R14: ffff88003a958d00 R15: 0000000000000000
FS: 00007f54e8d5d7a0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc6f3d18000 CR3: 000000003a911000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process clvmd (pid: 27774, threadinfo ffff88003db68000, task ffff88003a944040)
Stack:
ffff88003db6bd48 ffffffff810a096d ffff10003deafd48 ffff88003db6bcf8
<d> ffff88003e09bbc8 0000000000000000 ffff88003db6bcf8 ffffffff8113eacf
<d> ffffea0000caa378 0000000000000000 ffff88003db6bd28 ffffffff8113eb60
Call Trace:
[<ffffffff810a096d>] __cancel_work_timer+0xad/0x1e0
[<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
[<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
[<ffffffff810a0ad0>] cancel_work_sync+0x10/0x20
[<ffffffffa065c594>] close_connection+0xb4/0x120 [dlm]
[<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
[<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
[<ffffffffa065cf15>] free_conn+0x25/0x90 [dlm]
[<ffffffffa065b12a>] foreach_conn+0x3a/0x60 [dlm]
[<ffffffffa065b553>] dlm_lowcomms_stop+0x73/0x90 [dlm]
[<ffffffffa0658e4a>] threads_stop+0x1a/0x30 [dlm]
[<ffffffffa0658fc5>] dlm_release_lockspace+0x165/0x480 [dlm]
[<ffffffffa0661e72>] device_write+0x2b2/0x720 [dlm]
[<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
[<ffffffff81199ca8>] vfs_write+0xb8/0x1a0
[<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
[<ffffffff8119a7e1>] sys_write+0x51/0xb0
[<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: c1 74 0e f3 90 66 66 90 66 90 83 3f 00 75 f4 eb df 48 89 d0 c9 c3 55 48 89 e5 0f 1f 44 00 00 fa 66 66 90 66 66 90 b8 00 00 01 00 <3e> 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 66 66 90 66 90
RIP [<ffffffff8154c055>] _spin_lock_irq+0x15/0x40
RSP <ffff88003db6bca8>
---[ end trace 951a2fb0d63bf236 ]---
Kernel panic - not syncing: Fatal exception
Pid: 27774, comm: clvmd Tainted: G D -- ------------ 2.6.32-663.el6.x86_64 #1
Call Trace:
[<ffffffff81548551>] ? panic+0xa7/0x179
[<ffffffff8154d3e4>] ? oops_end+0xe4/0x100
[<ffffffff8101102b>] ? die+0x5b/0x90
[<ffffffff8154cea2>] ? do_general_protection+0x152/0x160
[<ffffffff8154c615>] ? general_protection+0x25/0x30
[<ffffffff8154c055>] ? _spin_lock_irq+0x15/0x40
[<ffffffff810a096d>] ? __cancel_work_timer+0xad/0x1e0
[<ffffffff8113eacf>] ? free_hot_page+0x2f/0x60
[<ffffffff8113eb60>] ? __free_pages+0x60/0xa0
[<ffffffff810a0ad0>] ? cancel_work_sync+0x10/0x20
[<ffffffffa065c594>] ? close_connection+0xb4/0x120 [dlm]
[<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
[<ffffffffa065cef0>] ? free_conn+0x0/0x90 [dlm]
[<ffffffffa065cf15>] ? free_conn+0x25/0x90 [dlm]
[<ffffffffa065b12a>] ? foreach_conn+0x3a/0x60 [dlm]
[<ffffffffa065b553>] ? dlm_lowcomms_stop+0x73/0x90 [dlm]
[<ffffffffa0658e4a>] ? threads_stop+0x1a/0x30 [dlm]
[<ffffffffa0658fc5>] ? dlm_release_lockspace+0x165/0x480 [dlm]
[<ffffffffa0661e72>] ? device_write+0x2b2/0x720 [dlm]
[<ffffffff81247e1b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff8123aa96>] ? security_file_permission+0x16/0x20
[<ffffffff81199ca8>] ? vfs_write+0xb8/0x1a0
[<ffffffff8119b196>] ? fget_light_pos+0x16/0x50
[<ffffffff8119a7e1>] ? sys_write+0x51/0xb0
[<ffffffff810ee28e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b

More info is surely needed here. Please provide a complete trace from 'echo t > /proc/sysrq-trigger'. Also try to attach gdb to clvmd in case it blocks, and capture a backtrace of all threads: gdb> 'thread apply all backtrace'. Please also include the output of 'dmsetup table' and 'dmsetup status'.
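For reference, a minimal sketch of gathering the requested data on an affected node; the /tmp output file names are only illustrative assumptions, and installing the matching debuginfo packages makes the gdb thread backtraces far more useful:

```
# Dump the state of all tasks into the kernel log, then save the log.
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t.txt

# Attach gdb to the blocked clvmd and capture backtraces of all its threads.
gdb -p "$(pidof clvmd)" --batch -ex 'thread apply all backtrace' > /tmp/clvmd-threads.txt 2>&1

# Record the device-mapper state.
dmsetup table  > /tmp/dmsetup-table.txt
dmsetup status > /tmp/dmsetup-status.txt
```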
From comment 2 this looks like a kernel issue in the 'dlm' code.

Looks like a duplicate of bug 1365204.

Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation.

Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/