Bug 387081
| Summary: | node told to leave cluster due to inconsistent view ends up panicking | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> | ||||
| Component: | cman-kernel | Assignee: | Christine Caulfield <ccaulfie> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 4 | CC: | cluster-maint, teigland | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2007-11-21 15:24:04 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
This is one of those unfortunate situations where it's almost impossible to debug retrospectively. The inconsistent view error is really there to catch bugs: the nodes should never get out of step with their cluster view, but odd things can happen (I suspect) if network packets get horribly delayed during transition. As we can assume that the nodes had a consistent view going into transition, it must be something in that process that has screwed up. So it's either that or a corrupted packet somewhere (or somehow) that has probably caused this. Without a tcpdump of the transition it's impossible to say more, I'm afraid :(

This is reproducible with the following:
2.6.9-67.EL
cman-kernel-2.6.9-54.1
dlm-kernel-2.6.9-52.2

Created attachment 261521
Here's the tcp dump during the failures.

Did you verify that this is caused by the new packages with the "missing messages" patch?

OK, I've found the problem. Rip that patch out for 4.6 and I'll redo it for 4.7.

A fix for the fix, on the RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.31; previous revision: 1.42.2.30
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v <-- membership.c
new revision: 1.44.2.29; previous revision: 1.44.2.28
done

Original change backed out of the release, so moving this to be a 4.7 request.

This BZ might as well be closed then. The bug was caused by an inadequate patch for bz#373671. If that patch doesn't exist anywhere then neither does this bug. I'll copy the checkin record above into that BZ and set this to MODIFIED. Corey can close it if he's happy with that.

Closing...

Description of problem:
I was running revolver on the latest 4.6 builds on my 6 node cluster and after one node was shot and brought back up, another node was told to leave the cluster due to an inconsistent view and panicked.

One of the nodes telling grant-01 to leave:
Nov 16 09:22:42 link-02 kernel: CMAN: Started transition, generation 13
Nov 16 09:22:42 link-02 kernel: CMAN: node grant-03 rejoining
Nov 16 09:22:43 link-02 kernel: CMAN: Finished transition, generation 13
Nov 16 09:22:58 link-02 kernel: CMAN: Started transition, generation 15
Nov 16 09:23:04 link-02 kernel: CMAN: node grant-01 has been removed from the cluster : Inconsistent cluster view
Nov 16 09:23:13 link-02 kernel: CMAN: Initiating transition, generation 16
Nov 16 09:23:13 link-02 kernel: CMAN: Initiating transition, generation 17
Nov 16 09:23:15 link-02 kernel: CMAN: Completed transition, generation 17
Nov 16 09:23:45 link-02 fenced[7690]: grant-01 not a cluster member after 30 sec post_fail_delay
Nov 16 09:23:45 link-02 fenced[7690]: fencing node "grant-01"
Nov 16 09:23:45 link-02 fenced[7690]: fence "grant-01" success

[root@link-02 sbin]# cman_tool nodes
Node Votes Exp Sts Name
1 1 6 M link-02
2 1 6 M grant-02
3 1 6 M grant-03
4 1 6 M link-07
5 1 6 X grant-01
6 1 6 M link-08

[root@link-02 sbin]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 2 2 run - [1 4 6 2 3]
DLM Lock Space: "clvmd" 3 3 run - [1 4 6 2 3]
DLM Lock Space: "LINK_1280" 9 5 run - [1 4 6 2]
DLM Lock Space: "LINK_1281" 11 7 run - [1 4 6 2]
DLM Lock Space: "LINK_1282" 13 9 run - [1 4 6 2]
GFS Mount Group: "LINK_1280" 10 6 run - [1 4 6 2]
GFS Mount Group: "LINK_1281" 12 8 run - [1 4 6 2]
GFS Mount Group: "LINK_1282" 14 10 run - [1 4 6 2]

Grant-01:
Nov 16 09:22:42 grant-01 kernel: CMAN: node grant-03 rejoining
Nov 16 09:22:42 grant-01 kernel: CMAN: Initiating transition, generation 13
Nov 16 09:22:43 grant-01 kernel: CMAN: Completed transition, generation 13
Nov 16 09:22:58 grant-01 kernel: CMAN:
Initiating transition, generation 15 WARNING: dlm_emergency_shutdown dlm: process_cluster_request invalid lockspace 100000b from 2 req 1 WARNING: dlm_emergency_shutdown SM: 00000002 sm_stop: SG still joined SM: 01000003 sm_stop: SG still joined SM: 0200000a sm_stop: SG still joined dlm: dlm_lock: no lockspace 2 resend 34004c lq 1 flg 200008 node -1/-1 " 8 LINK_1282 resend 3c0261 lq 1 flg 200000 node -1/-1 " 5 LINK_1282 resent 3 requests LINK_1282 recover event 36 finished LINK_1280 move flags 0,0,1 ids 22,36,36 LINK_1280 process held requests LINK_1280 processed 0 requests LINK_1280 resend marked requests LINK_1280 resend e0317 lq 1 flg 200000 node -1/-1 " 7 LINK_1280 resent 1 requests LINK_1280 recover event 36 finished LINK_1281 mark waiting requests LINK_1281 mark 18004b lq 1 nodeid -1 LINK_1281 marked 1 requests LINK_1281 purge locks of departed nodes LINK_1281 purged 1 locks LINK_1281 update remastered resources LINK_1281 updated 1 resources LINK_1281 rebuild locks LINK_1281 rebuilt 1 locks LINK_1281 recover event 36 done LINK_1281 move flags 0,0,1 ids 28,36,36 LINK_1281 process held requests LINK_1281 processed 0 requests LINK_1281 resend marked requests LINK_1281 resend 18004b lq 1 flg 200000 node -1/-1 " 2 LINK_1281 resent 1 requests LINK_1281 recover event 36 finished ,2797006 8924 remove 7,2797006 8924 ex punlock 0 8924 en plock 7,279700c 8924 req 7,279700c ex 0-7fffffffffffffff lkf 2000 wait 1 8924 ex plock 0 8920 en punlock 7,279700b 8920 remove 7,279700b 8920 ex punlock 0 8920 en plock 7,2797006 8920 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1 8924 en punlock 7,279700c 8924 remove 7,279700c 8924 ex punlock 0 8924 en plock 7,279700b 8924 req 7,279700b ex 0-7fffffffffffffff lkf 2000 wait 1 8924 ex plock 0 8924 en punlock 7,279700b 8924 remove 7,279700b 8924 ex punlock 0 8924 en plock 7,2797006 8913 ex plock 0 8907 ex plock 0 8920 ex plock 0 8908 ex plock 0 8913 en punlock 7,279701b 8920 en punlock 7,2797006 8907 en punlock 7,27970d8 8920 remove 
7,2797006 8908 en punlock 7,27a7190 8908 remove 7,27a7190 8920 ex punlock 0 8920 en plock 7,279700c 8920 req 7,279700c ex 0-7fffffffffffffff lkf 2000 wait 1 8908 ex punlock 0 8908 en plock 7,27970ce 8924 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1 8913 remove 7,279701b 8913 ex punlock 0 8913 en plock 7,2797017 8913 req 7,2797017 ex 0-7fffffffffffffff lkf 2000 wait 1 8912 req 7,279701b ex 0-7fffffffffffffff lkf 2000 wait 1 8920 ex plock 0 8907 remove 7,27970d8 8907 ex punlock 0 8907 en plock 7,4223afe 8908 req 7,27970ce ex 0-7fffffffffffffff lkf 2000 wait 1 8913 ex plock 0 8907 req 7,4223afe ex 0-7fffffffffffffff lkf 2000 wait 1 8908 ex plock 0 8907 ex plock 0 8924 ex plock 0 8912 ex plock 0 8913 en punlock 7,2797017 8908 en punlock 7,27970ce 8920 en punlock 7,279700c 8907 en punlock 7,4223afe 8908 remove 7,27970ce 8920 remove 7,279700c 8908 ex punlock 0 8908 en plock 7,4223ada 8920 ex punlock 0 8920 en plock 7,2797006 8913 remove 7,2797017 8913 ex punlock 0 8913 en plock 7,279701b 8907 remove 7,4223afe 8907 ex punlock 0 8907 en plock 7,27a7190 8908 req 7,4223ada ex 0-7fffffffffffffff lkf 2000 wait 1 8907 req 7,27a7190 ex 0-7fffffffffffffff lkf 2000 wait 1 8908 ex plock 0 8907 ex plock 0 8942 en plock 7,2797006 8942 req 7,2797006 ex 9580a8-12756d54 lkf 2000 wait 1 8942 ex plock 0 8912 en punlock 7,279701b 8908 en punlock 7,4223ada 8907 en punlock 7,27a7190 8908 remove 7,4223ada 8924 en punlock 7,2797006 8907 remove 7,27a7190 8924 remove 7,2797006 8908 ex punlock 0 8908 en plock 7,27a71a8 8907 ex punlock 0 8907 en plock 7,27970ce 8924 ex punlock 0 8924 en plock 7,279700c 8924 req 7,279700c ex 0-7fffffffffffffff lkf 2000 wait 1 8920 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1 8912 remove 7,279701b 8912 ex punlock 0 8912 en plock 7,279701d 8912 req 7,279701d ex 0-7fffffffffffffff lkf 2000 wait 1 8913 req 7,279701b ex 0-7fffffffffffffff lkf 2000 wait 1 8912 ex plock 0 8924 ex plock 0 8920 ex plock 0 8908 req 7,27a71a8 ex 0-7fffffffffffffff lkf 2000 wait 
1 8907 req 7,27970ce ex 0-7fffffffffffffff lkf 2000 wait 1 8908 ex plock 0 8907 ex plock 0 8913 ex plock 0 8912 en punlock 7,279701d 8924 en punlock 7,279700c 8908 en punlock 7,27a71a8 8924 remove 7,279700c 8907 en punlock 7,27970ce 8924 ex punlock 0 8924 en plock 7,2797006 8912 remove 7,279701d 8912 ex punlock 0 8912 en plock 7,2797019 8912 req 7,2797019 ex 0-7fffffffffffffff lkf 2000 wait 1 8912 ex plock 0 8908 remove 7,27a71a8 8908 ex punlock 0 8907 remove 7,27970ce 8907 ex punlock 0 8907 en plock 7,4223ada 8907 req 7,4223ada ex 0-7fffffffffffffff lkf 2000 wait 1 8907 ex plock 0 8913 en punlock 7,279701b 8920 en punlock 7,2797006 8920 remove 7,2797006 8907 en punlock 7,4223ada 8920 ex punlock 0 8920 en plock 7,2797008 8920 req 7,2797008 ex 0-7fffffffffffffff lkf 2000 wait 1 8924 req 7,2797006 ex 0-7fffffffffffffff lkf 2000 wait 1 8913 remove 7,279701b 8920 ex plock 0 8913 ex punlock 0 8913 en plock 7,279701d 8924 ex plock 0 8913 req 7,279701d ex 0-7fffffffffffffff lkf 2000 wait 1 8913 ex plock 0 8907 remove 7,4223ada 8907 ex punlock 0 8942 en punlock 7,2797006 8942 remove 7,2797006 8942 ex punlock 0 8912 en punlock 7,2797019 8912 remove 7,2797019 8912 ex punlock 0 8912 en plock 7,279701f 8912 req 7,279701f ex 0-7fffffffffffffff lkf 2000 wait 1 8912 ex plock 0 8920 en punlock 7,2797008 8920 remove 7,2797008 8920 ex punlock 0 8920 en plock 7,2797006 8942 en plock 7,2797006 lock_dlm: Assertion failed on line 432 of file /builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c lock_dlm: assertion: "!error" lock_dlm: time = 4297038051 LINK_1281: num=11,2797006 err=-22 cur=0 req=5 lkf=4 ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at lock:432 invalid operand: 0000 [1] CPU 0 Modules linked in: lock_dlm(U) dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) qlad Pid: 8942, comm: doio Not tainted 2.6.9-67.EL RIP: 0010:[<ffffffffa0349327>] <ffffffffa0349327>{:lock_dlm:do_dlm_lock+363} RSP: 0018:00000101e2133c28 
EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 000000000003f733
RDX: 00000000ffffff01 RSI: 000000000003f733 RDI: ffffffff8043d300
RBP: 00000101ed1196c0 R08: 00000000000927bf R09: 00000000000927c0
R10: 0000000000000246 R11: 0000ffff8045c520 R12: 0000010210106c00
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000012756d54
FS: 0000002a95562b00(0000) GS:ffffffff80554580(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000007fbfffb8e0 CR3: 0000000000101000 CR4: 00000000000006e0
Process doio (pid: 8942, threadinfo 00000101e2132000, task 00000101e28dcce0)
Stack: 0000000000000005 0000000000000004 3131202020202020 2020202020202020 3630303739373220 0000000000000018 00000101ed1196c0 0000010169a3fc00 00000101ed119718 0000010210106c00
Call Trace:<ffffffffa03493a3>{:lock_dlm:do_dlm_lock_sync+85}
<ffffffffa034babf>{:lock_dlm:lock_resource+127}
<ffffffffa034d13c>{:lock_dlm:lm_dlm_plock+601}
<ffffffff801355dd>{default_wake_function+0}
<ffffffffa03493ab>{:lock_dlm:do_dlm_lock_sync+93}
<ffffffffa02ddb2f>{:gfs:gfs_lm_plock+45}
<ffffffffa02ea053>{:gfs:gfs_lock+196}
<ffffffff801aa772>{fcntl_setlk+311}
<ffffffff8035fb2c>{thread_return+0}
<ffffffff801a5eb5>{sys_fcntl+1163}
<ffffffff80110a92>{system_call+126}
Code: 0f 0b b6 ed 34 a0 ff ff ff ff b0 01 48 c7 c7 bb ed 34 a0 31
RIP <ffffffffa0349327>{:lock_dlm:do_dlm_lock+363} RSP <00000101e2133c28>
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
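
For anyone comparing the per-node views by hand, the transition generations can be pulled out of the syslog excerpts quoted in the description. The following is only a rough sketch, not a cman tool: the function names `transition_generations` and `skipped_generations` are made up for illustration, the regular expression assumes the message wording shown in this report, and whether a skipped generation number actually means anything is not established here.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the log lines quoted in this report, e.g.
#   Nov 16 09:22:42 link-02 kernel: CMAN: Started transition, generation 13
TRANSITION_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+CMAN:\s+"
    r"(?:Started|Initiating|Finished|Completed) transition, "
    r"generation (?P<gen>\d+)"
)


def transition_generations(lines):
    """Return {host: sorted generation numbers seen in CMAN transition messages}."""
    seen = defaultdict(set)
    for line in lines:
        match = TRANSITION_RE.match(line.strip())
        if match:
            seen[match.group("host")].add(int(match.group("gen")))
    return {host: sorted(gens) for host, gens in seen.items()}


def skipped_generations(gens):
    """List generation numbers missing between the lowest and highest seen."""
    if not gens:
        return []
    present = set(gens)
    return [g for g in range(min(gens), max(gens) + 1) if g not in present]


# Log lines copied from the description above (both nodes).
SAMPLE = [
    "Nov 16 09:22:42 link-02 kernel: CMAN: Started transition, generation 13",
    "Nov 16 09:22:42 link-02 kernel: CMAN: node grant-03 rejoining",
    "Nov 16 09:22:43 link-02 kernel: CMAN: Finished transition, generation 13",
    "Nov 16 09:22:58 link-02 kernel: CMAN: Started transition, generation 15",
    "Nov 16 09:23:04 link-02 kernel: CMAN: node grant-01 has been removed "
    "from the cluster : Inconsistent cluster view",
    "Nov 16 09:23:13 link-02 kernel: CMAN: Initiating transition, generation 16",
    "Nov 16 09:23:13 link-02 kernel: CMAN: Initiating transition, generation 17",
    "Nov 16 09:23:15 link-02 kernel: CMAN: Completed transition, generation 17",
    "Nov 16 09:22:42 grant-01 kernel: CMAN: node grant-03 rejoining",
    "Nov 16 09:22:42 grant-01 kernel: CMAN: Initiating transition, generation 13",
    "Nov 16 09:22:43 grant-01 kernel: CMAN: Completed transition, generation 13",
    "Nov 16 09:22:58 grant-01 kernel: CMAN: Initiating transition, generation 15",
]

if __name__ == "__main__":
    for host, gens in transition_generations(SAMPLE).items():
        print(host, gens, "skipped:", skipped_generations(gens))
```

Run against full /var/log/messages files from every node, this gives a quick side-by-side of which generations each node logged; it does not replace the tcpdump of the transition itself.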