Bug 228070
| Summary: | DLM assertion when running GFS I/O during cmirror leg failure | ||
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4 | CC: | ccaulfie, cfeist |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2008-08-05 21:43:02 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Corey Marthaler
2007-02-09 20:29:21 UTC
Looks like the node was probably kicked out of the cluster:
dlm: dlm_unlock: lkid 20196 lockspace not found
We have the "emergency shutdown" messages to make it more obvious when that happens. Are there none of those, or any cman messages about removing the node from the cluster?
Feb 9 08:58:46 link-04 kernel: CMAN: removing node link-02 from the cluster : Missed too many heartbeats
Feb 9 08:31:36 link-02 kernel: CMAN: Being told to leave the cluster by node 3
Feb 9 08:31:36 link-02 kernel: CMAN: we are leaving the cluster.
Feb 9 08:31:36 link-02 kernel: WARNING: dlm_emergency_shutdown
Feb 9 08:31:36 link-02 kernel: WARNING: dlm_emergency_shutdown
Feb 9 08:31:36 link-02 kernel: SM: 00000005 sm_stop: SG still joined
Feb 9 08:31:36 link-02 kernel: SM: 01000006 sm_stop: SG still joined
Feb 9 08:31:36 link-02 kernel: SM: 02000009 sm_stop: SG still joined
Feb 9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb 9 08:31:36 link-02 kernel:
Feb 9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb 9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Feb 9 08:31:36 link-02 kernel: device-mapper: recovery failed on region 10714
Feb 9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb 9 08:31:36 link-02 kernel:
Feb 9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb 9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Feb 9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb 9 08:31:36 link-02 kernel:
Feb 9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb 9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Reproduced with a slightly different assert, but again the same scenario: nodes get told to leave the cluster as a result of a cmirror device failing and slowing down the machines.
lock_dlm: Assertion failed on line 432 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4300566187
gfs3: num=2,18 err=-22 cur=0 req=5 lkf=4
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:432
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) qla2300 qla2xxx
scsi_transport_fc dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport
autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random
k8_edac edac_mc tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sd_mod
scsi_mod
Pid: 6201, comm: lock_dlm1 Tainted: G M 2.6.9-46.ELsmp
RIP: 0010:[<ffffffffa0064a0c>] <ffffffffa0064a0c>{:lock_dlm:do_dlm_lock+366}
RSP: 0000:00000100300adde8 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000020000
RDX: 00000000001b2865 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 000001003aae3680 R08: 00000000fffffffb R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 0000010037e15e00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS: 0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000059ba08 CR3: 0000000000101000 CR4: 00000000000006e0
Process lock_dlm1 (pid: 6201, threadinfo 00000100300ac000, task 0000010036a35030)
Stack: 0000000000000005 0000000000000004 3220202020202020 2020202020202020
3831202020202020 ffffffff80130018 000001003aae3680 0000010037e15e00
0000000000000000 0000000000000000
Call Trace:<ffffffff80130018>{ia32_setup_arg_pages+422}
<ffffffffa0064f93>{:lock_dlm:process_submit+43}
<ffffffffa0068ab8>{:lock_dlm:dlm_async+2020}
<ffffffff801346b1>{__wake_up_common+67}
<ffffffff80134660>{default_wake_function+0}
<ffffffff8014be24>{keventd_create_kthread+0}
<ffffffffa00682d4>{:lock_dlm:dlm_async+0}
<ffffffff8014be24>{keventd_create_kthread+0}
<ffffffff8014bdfb>{kthread+200} <ffffffff80110f47>{child_rip+8}
<ffffffff8014be24>{keventd_create_kthread+0} <ffffffff8014bd33>{kthread+0}
<ffffffff80110f3f>{child_rip+0}
Code: 0f 0b 08 92 06 a0 ff ff ff ff b0 01 48 c7 c7 0d 92 06 a0 31
RIP <ffffffffa0064a0c>{:lock_dlm:do_dlm_lock+366} RSP <00000100300adde8>
<3>device-mapper: Unable to read from primary mirror during recovery
<0>Kernel panic - not syncing: Oops
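As a reading aid for the assertion output above: the `err=-22` in the lock_dlm line corresponds to EINVAL on Linux, and the low 32 bits of RBX in the register dump (`ffffffea`) appear to be the same value read as a signed 32-bit integer. That RBX holds the error code is an observation from the dump, not something the report states. A minimal Python check of the arithmetic (the helper name `to_signed32` is hypothetical):

```python
import errno
import os


def to_signed32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# RBX value copied from the register dump in this report.
err = to_signed32(0x00000000FFFFFFEA)

# Look up the symbolic name for the positive errno value.
name = errno.errorcode[-err]

print(err, name, os.strerror(-err))
```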
I propose setting a restriction that mirrors are limited to 2 sides for 4.5. This would defuse this bug. Once we agree on that, I'll open an RFE for 4.6 and make this bug dependent on that.
I found a simpler way to reproduce this issue. Do I/O on a cmirror from a node other than the server, then run 'cman_tool leave force' on the log server node.
When a log server drops out of the cluster, it ignores any requests, forcing the clients to retry. Unfortunately, the clients never ran another election, causing operations to stall. The server now replies that it cannot handle the requests, which causes proper initiation of elections.
assigned -> needinfo
needinfo -> modified
This should be in MODIFIED.
Fix verified in latest packages.
Fixed in current release (4.7).
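The stall described in the fix comment (a departed log server ignores requests, so clients keep retrying and never hold an election) can be illustrated with a toy model. This is only a sketch under stated assumptions: every class, function, and reply name below is hypothetical, none of it is dm-cmirror code, and the retry loop is bounded so the simulation terminates, whereas the real clients retried indefinitely.

```python
"""Toy model of the client / log-server interaction from the fix comment.

All names are hypothetical; this is not dm-cmirror code.
"""
from enum import Enum


class Reply(Enum):
    OK = "ok"
    NOT_SERVER = "not_server"  # explicit "I cannot handle this request"


class LogServer:
    def __init__(self, explicit_refusal):
        self.in_cluster = True
        self.explicit_refusal = explicit_refusal

    def handle(self, request):
        if self.in_cluster:
            return Reply.OK
        # After leaving the cluster: the old behaviour drops the request
        # (modelled as None), the new behaviour sends an explicit refusal.
        return Reply.NOT_SERVER if self.explicit_refusal else None


class Client:
    def __init__(self, server, standby, max_retries=5):
        self.server = server
        self.standby = standby
        self.max_retries = max_retries
        self.elections = 0

    def send(self, request):
        for _ in range(self.max_retries):
            reply = self.server.handle(request)
            if reply is Reply.OK:
                return "completed"
            if reply is Reply.NOT_SERVER:
                # A refusal triggers an election; this toy election
                # simply promotes the standby server.
                self.elections += 1
                self.server = self.standby
                continue
            # reply is None: the request was ignored, so retry the same server.
        return "stalled"


def simulate(explicit_refusal):
    old_server = LogServer(explicit_refusal)
    new_server = LogServer(explicit_refusal)
    client = Client(old_server, new_server)
    old_server.in_cluster = False  # models the log server leaving the cluster
    return client.send("log_request"), client.elections
```

Calling `simulate(False)` models the old behaviour, where the client exhausts its retries without ever holding an election; `simulate(True)` models the fixed behaviour, where the refusal prompts one election and the request then completes on the new server.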