Description of problem:

I saw machines assert quite a few times last night, after seeing these "network interfaces" messages, with the latest build (cluster-i686-2005-08-03-1549) while running recovery tests on an x86 two-node cluster (link-10, link-11). Revolver had gone through 207 iterations of killing (reboot -fin) one node at a time, which caused these to happen:

link-11:

Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: No functional network interfaces, leaving cluster
Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: we are leaving the cluster.
Aug 3 21:53:32 link-11 kernel: Synchronizing SCSI cache for disk sdd:
Aug 3 21:53:32 link-11 clurgmgrd[2817]: <warning> #67: Shutting down uncleanly
Aug 3 21:53:32 link-11 kernel: SM: 00000001 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 01000002 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 02000005 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 03000003 sm_stop: SG still joined
Aug 3 21:53:32 link-11 ccsd[2278]: Cluster manager shutdown. Attemping to reconnect...
Aug 3 21:53:32 link-11 kernel: SM: send_nodeid_message error -107 to 1
Aug 3 21:53:33 link-11 kernel: sts
Aug 3 21:53:34 link-11 kernel: clvmd recover event 10 finished
Aug 3 21:53:34 link-11 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug 3 21:53:34 link-11 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug 3 21:53:34 link-11 kernel: gfs0 move use event 11
Aug 3 21:53:34 link-11 kernel: gfs0 recover event 11
Aug 3 21:53:34 link-11 kernel: gfs0 add node 1
Aug 3 21:53:34 link-11 kernel: gfs0 total nodes 2
Aug 3 21:53:34 link-11 kernel: gfs0 rebuild resource directory
Aug 3 21:53:34 link-11 kernel: gfs0 rebuilt 1163 resources
Aug 3 21:53:34 link-11 kernel: gfs0 purge requests
Aug 3 21:53:34 link-11 kernel: gfs0 purged 0 requests
[...] eventually asserts

link-10:

Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: No functional network interfaces, leaving cluster
Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: we are leaving the cluster.
Aug 3 22:21:04 link-10 clurgmgrd[2854]: <warning> #67: Shutting down uncleanly
Aug 3 22:21:04 link-10 kernel: SM: 00000001 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 01000002 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 02000005 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 03000003 sm_stop: SG still joined
Aug 3 22:21:04 link-10 ccsd[2279]: Cluster manager shutdown. Attemping to reconnect...
Aug 3 22:21:04 link-10 kernel: SM: send_nodeid_message error -107 to 2
Aug 3 22:21:05 link-10 kernel: ests
Aug 3 22:21:05 link-10 kernel: clvmd recover event 10 finished
Aug 3 22:21:05 link-10 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug 3 22:21:05 link-10 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug 3 22:21:05 link-10 kernel: gfs0 move use event 11
Aug 3 22:21:06 link-10 kernel: gfs0 recover event 11
Aug 3 22:21:06 link-10 kernel: gfs0 add node 2
Aug 3 22:21:06 link-10 kernel: gfs0 total nodes 2
Aug 3 22:21:06 link-10 kernel: gfs0 rebuild resource directory
Aug 3 22:21:06 link-10 kernel: gfs0 rebuilt 1139 resources
Aug 3 22:21:06 link-10 kernel: gfs0 purge requests
Aug 3 22:21:06 link-10 kernel: gfs0 purged 0 requests
Aug 3 22:21:06 link-10 kernel: gfs0 mark waiting requests
Aug 3 22:21:06 link-10 kernel: gfs0 marked 0 requests
[...] also eventually asserts

Version-Release number of selected component (if applicable):

CMAN 2.6.9-38.2 (built Aug 3 2005 16:22:35)

[root@link-10 ~]# cman_tool -V
cman_tool 1.0.2 (built Aug 1 2005 14:45:22)
Copyright (C) Red Hat, Inc. 2004 All rights reserved.

Actual results:
assertion

Expected results:
withdraw from the cluster
Err, can you post the assert, please?
Aug 3 22:02:31 link-10 kernel: 3638 req 7,3b ex 0-651e77 lkf 2000 wait 1
Aug 3 22:02:31 link-10 kernel: 3638 ex plock 0
Aug 3 22:02:31 link-10 kernel: 3638 en punlock 7,3b
Aug 3 22:02:31 link-10 kernel: 3638 remove 7,3b
Aug 3 22:02:31 link-10 kernel: 3638 ex punlock 0
Aug 3 22:02:31 link-10 kernel: 3638 en plock 7,3b
Aug 3 22:02:31 link-10 kernel: 3638 req 7,3b ex 651e78-6901fb lkf 2000 wait 1
Aug 3 22:02:31 link-10 kernel: 3638 ex plock 0
Aug 3 22:02:31 link-10 kernel: 3638 en punlock 7,3b
Aug 3 22:02:31 link-10 kernel:
Aug 3 22:02:31 link-10 kernel: lock_dlm: Assertion failed on line 428 of file /usr/src/build/598004-i686/BUILD/gfs-kernel-2.6.9-36/smp/src/dlm/lock.c
Aug 3 22:02:31 link-10 kernel: lock_dlm: assertion: "!error"
Aug 3 22:02:31 link-10 kernel: lock_dlm: time = 4294825539
Aug 3 22:02:31 link-10 kernel: gfs0: num=11,3b err=-22 cur=0 req=5 lkf=4
Aug 3 22:02:31 link-10 kernel:
Aug 3 22:02:31 link-10 kernel: ------------[ cut here ]------------
Aug 3 22:02:31 link-10 kernel: kernel BUG at /usr/src/build/598004-i686/BUILD/gfs-kernel-2.6.9-36/smp/src/dlm/lock.c:428!
Aug 3 22:02:31 link-10 kernel: invalid operand: 0000 [#1]
Aug 3 22:02:31 link-10 kernel: SMP
Aug 3 22:02:31 link-10 kernel: Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dm_mirror dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Aug 3 22:02:31 link-10 kernel: CPU: 0
Aug 3 22:02:31 link-10 kernel: EIP: 0060:[<e0374779>] Not tainted VLI
Aug 3 22:02:31 link-10 kernel: EFLAGS: 00010246 (2.6.9-12.ELsmp)
Aug 3 22:02:31 link-10 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
Aug 3 22:02:31 link-10 kernel: eax: 00000001 ebx: ffffffea ecx: dd328e4c edx: e0379175
Aug 3 22:02:31 link-10 kernel: esi: e0374798 edi: df700e00 ebp: d9886800 esp: dd328e48
Aug 3 22:02:31 link-10 kernel: ds: 007b es: 007b ss: 0068
Aug 3 22:02:31 link-10 kernel: Process doio (pid: 3638, threadinfo=dd328000 task=d79188b0)
Aug 3 22:02:31 link-10 kernel: Stack: e0379175 20202020 31312020 20202020 20202020 20202020 62332020 00000018
Aug 3 22:02:31 link-10 kernel: 00000000 dead4ead d9886800 d9886840 d988683c e03747d0 dd328e98 d44aa580
Aug 3 22:02:31 link-10 kernel: dd328f08 dcc1112c e037626d d9886800 0000003b 00000000 00000011 d44aa580
Aug 3 22:02:31 link-10 kernel: Call Trace:
Aug 3 22:02:31 link-10 kernel: [<e03747d0>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
Aug 3 22:02:31 link-10 kernel: [<e037626d>] lock_resource+0x70/0x93 [lock_dlm]
Aug 3 22:02:31 link-10 kernel: [<e0377b38>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
Aug 3 22:02:31 link-10 kernel: [<e03ba684>] gfs_lm_punlock+0x30/0x3f [gfs]
Aug 3 22:02:31 link-10 kernel: [<e03c59f0>] gfs_lock+0xd4/0xf2 [gfs]
Aug 3 22:02:31 link-10 kernel: [<e03c591c>] gfs_lock+0x0/0xf2 [gfs]
Aug 3 22:02:31 link-10 kernel: [<c016cb66>] fcntl_setlk+0x156/0x291
Aug 3 22:02:31 link-10 kernel: [<c0168fdf>] do_fcntl+0x10c/0x155
Aug 3 22:02:31 link-10 kernel: [<c01690ee>] sys_fcntl64+0x6c/0x7d
Aug 3 22:02:31 link-10 kernel: [<c02d0ecf>] syscall_call+0x7/0xb
Aug 3 22:02:31 link-10 kernel: [<c02d007b>] __lock_text_end+0x326/0x101b
Aug 3 22:02:31 link-10 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 a0 92 37 e0 e8 9e db da df 83 c4 38 68 75 91 37 e0 e8 91 db da df <0f> 0b ac 01 b2 90 37 e0 68 77 91 37 e0 e8 4c d3 da df 83 c4 20
Aug 3 22:02:31 link-10 kernel: <0>Fatal exception: panic in 5 seconds
Oh, this is the "ape shit" bug we had ages ago that got marked "resolved" even though it never was. When cman shuts down peremptorily, the subsystems can't cope. It needs reassigning to whatever component lock_dlm is part of.
Looks like lock_dlm needs to do something a little less drastic when it gets EINVAL back from the DLM.
There have been so many bz entries dealing with this one problem that it's hard to know which to pick. I tried once to consolidate them into bz 148788, but that didn't help very much. For the record again: when cman shuts down, it also shuts down the dlm. This causes gfs's lock requests to return an error. GFS/lock_dlm can do little when lock requests fail except panic. Cman has shut down, of course, because the network has failed. So, ultimately: network failure -> gfs assertion failure/panic. The one alternative to panicking is to just sit and wait to be fenced.
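To make that "wait to be fenced" alternative concrete, here is a minimal user-space sketch of the idea; the type, the stub, and the message are assumptions, not the shipped lock_dlm code:

/* Sketch only: when the DLM returns -EINVAL because cman has already
 * shut it down, hand the error back to GFS (failing the fcntl() with
 * an errno) instead of asserting "!error" and taking the kernel down. */
#include <errno.h>
#include <stdio.h>

struct gdlm_lock { int lkid; };                /* stand-in type */

static int dlm_lock_stub(struct gdlm_lock *lp) /* stands in for dlm_lock() */
{
        (void)lp;
        return -EINVAL;                        /* DLM already torn down */
}

static int do_dlm_lock(struct gdlm_lock *lp)
{
        int error = dlm_lock_stub(lp);

        /* Today this path is effectively an assert/BUG() on any error. */
        if (error == -EINVAL) {
                /* Cluster manager is gone: fail the request and let the
                 * node sit and wait to be fenced instead of panicking. */
                fprintf(stderr,
                        "lock_dlm: lock failed (%d); waiting for fencing\n",
                        error);
                return error;
        }
        return error;
}

int main(void)
{
        struct gdlm_lock lk = { 0 };
        return do_dlm_lock(&lk) == 0 ? 0 : 1;
}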
I hit this again twice over the weekend. After reading your last comment, is this just going to be the expected behavior when the network interface "fails"? We just assert or panic? If so, shouldn't this bug be closed NOTABUG (expected behavior) or WONTFIX? What was checked in to get it marked MODIFIED? And if it is not supposed to behave like this, then it should go to FAILSQA or back to ASSIGNED.

Is the network actually failing, or is it just that one node can't talk to the other nodes? If it is really failing, then why is this only seen during recovery tests?

CMAN: sendmsg failed: -101
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -101
CMAN: sendmsg failed: -101
1 add node 2
gfs1 total nodes 3
gfs1 rebuild resource
SM: 00000001 sm_stop: SG still joined
SM: 01000003 sm_stop: SG still joined
SM: 02000005 sm_stop: SG still joined
SM: 0300000a sm_stop: SG still joined
directory
gfs1 rebuilt 365 resources
gfs1 purge requests
gfs1 purged 0 requests
gfs2 recover event 11 done
gfs1 mark waiting requests
gfs1 marked 0 requests
gfs1 recover event 12 done
gfs2 move flags 0,0,1 ids 0,11,11
gfs2 process held requests
gfs2 processed 0 requests
gfs2 recover event 11 finished
gfs1 move flags 0,0,1 ids 7,12,12
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resent 0 requests
gfs1 recover event 12 finished
gfs2 move flags 1,0,0 ids 11,11,11
gfs2 move flags 0,1,0 ids 11,15,11
gfs2 move use event 15
gfs2 recover event 15
gfs2 add node 2
gfs2 total nodes 3
gfs2 rebuild resource directory
gfs2 rebuilt 388 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 15 done
gfs2 move flags 0,0,1 ids 11,15,15
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 15 finished
71 add node 2
gfs1 total nodes 3
gfs1 rebuild resource directory
gfs1 rebuilt 365 resources
gfs1 purge requests
gfs1 purged 0 requests
gfs2 recover event 11 done
gfs1 mark waiting requests
gfs1 marked 0 requests
gfs1 recover event 12 done
gfs2 move flags 0,0,1 ids 0,11,11
gfs2 process held requests
[...]
3768 req 7,3c ex 418c1a-55d28b lkf 2000 wait 1
3731 req 7,fffc ex 74affa-752481 lkf 2000 wait 1
3768 ex plock 0
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3768 en punlock 7,3c
3731 ex punlock 0
3731 en plock 7,fffc
[From here the console output of the two asserting CPUs is interleaved character-by-character. De-interleaved, the legible portion of the first assertion reads:]
lock_dlm: Assertion failed on line 428 of file /usr/src/build/603160-x86_64/BUILD/gfs-kernel-2.6.9-39/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 42949874... [exact value garbled]
gfs2: num=11,fffc err=-22 cur=0 req=5 lkf=4
[...interleaved plock debug from pids 3731 and 3768 omitted...]
3731 req 7,fffc ex 52459e-58333e lkf 2000 wait 1
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3731 ex punlock 0
3731 en plock 7,fffc
3731 req 7,fffc ex 73fb89-743b9e lkf 2000 wait 1
3731 ex plock 0
3768 en punlock 7,3c
3768 remove 7,3c
3731 en punlock 7,fffc
3731 remove 7,fffc
3768 ex punlock 0
3731 ex punlock 0
3731 en plock 7,fffc
3731 req 7,fffc ex 752978-752cec lkf 2000 wait 1
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3731 ex punlock 0
3731 en plock 7,fffc
3731 req 7,fffc ex 752cec-752f52 lkf 2000 wait 1
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3731 ex punlock 0
3768 en plock 7,3c
3768 req 7,3c ex 74f39b-750714 lkf 2000 wait 1
[...]
3731 ex punlock 0
3731 en plock 7,fffc
3768 req 7,3c ex 418c1a-55d28b lkf 2000 wait 1
3731 req 7,fffc ex 74affa-752481 lkf 2000 wait 1
3768 ex plock 0
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3768 en punlock 7,3c
3731 ex punlock 0
3731 en plock 7,fffc

lock_dlm: Assertion failed on line 428 of file /usr/src/build/603160-x86_64/BUILD/gfs-kernel-2.6.9-39/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4294987247
gfs1: num=11,fffc err=-22 cur=0 req=5 lkf=4

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]
SMP
CPU 0
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 3731, comm: doio Not tainted 2.6.9-16.ELsmp
RIP: 0010:[<ffffffffa02389e7>] <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:0000010033b47c78 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000020000
RDX: 0000000000006d42 RSI: 0000000000000246 RDI: ffffffff803d7be0
RBP: 0000010033f3c780 R08: 00000000fffffffb R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 00000100022cce00
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000752481
FS: 0000002a95576b00(0000) GS:ffffffff804d3300(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003a1b929000 CR3: 0000000000101000 CR4: 00000000000006e0
Process doio (pid: 3731, threadinfo 0000010033b46000, task 0000010033a62030)
Stack: 0000000000000005 0000000000000004 3131202020202020 2020202020202020
6366666620202020 0000000000000018 0000010033f3c780 0000010030638880
0000010033f3c7d8 00000100022cce00
Call Trace:
<ffffffffa0238a44>{:lock_dlm:do_dlm_lock_sync+54}
<ffffffffa023a660>{:lock_dlm:lock_resource+129}
<ffffffffa023b909>{:lock_dlm:lm_dlm_plock+588}
<ffffffff8013199d>{try_to_wake_up+734}
<ffffffff80134797>{autoremove_wake_function+9}
<ffffffff80132f4a>{__wake_up_common+67}
<ffffffffa01ff9bb>{:gfs:gfs_lm_plock+46}
<ffffffffa020a6a0>{:gfs:gfs_lock+191}
<ffffffff8018b66a>{fcntl_setlk+296}
<ffffffff80302ea4>{thread_return+42}
<ffffffff80187a34>{sys_fcntl+733}
<ffffffff80110052>{system_call+126}
Code: 0f 0b 52 d1 23 a0 ff ff ff ff ac 01 48 c7 c7 57 d1 23 a0 31
RIP <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365} RSP <0000010033b47c78>
[Interleaved with the second CPU's oops banner; de-interleaved:]
<0>Kernel panic - not syncing: Oops
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [2]
SMP
CPU 1
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 3768, comm: doio Not tainted 2.6.9-16.ELsmp
RIP: 0010:[<ffffffffa02389e7>] <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:00000100323d1d48 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000000246
RDX: 0000000000006db6 RSI: 0000000000000246 RDI: ffffffff803d7be0
RBP: 0000010031493d80 R08: 000000000000000d R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 000001003ff52200
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000418c1a
FS: 0000002a95576b00(0000) GS:ffffffff804d3380(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006c1a10 CR3: 000000003ff58000 CR4: 00000000000006e0
Process doio (pid: 3768, threadinfo 00000100323d0000, task 0000010032361030)
Stack: 0000000000000005 0000000000000004 3131202020202020 2020202020202020
6333202020202020 0000000000550018 0000010031493d80 00000100301891c0
0000010031493dd8 0000010031a06060
Call Trace:
<ffffffffa0238a44>{:lock_dlm:do_dlm_lock_sync+54}
<ffffffffa023a660>{:lock_dlm:lock_resource+129}
<ffffffffa023b213>{:lock_dlm:lm_dlm_punlock+207}
<ffffffffa01ff9f6>{:gfs:gfs_lm_punlock+46}
<ffffffffa020a68e>{:gfs:gfs_lock+173}
<ffffffff8018b66a>{fcntl_setlk+296}
<ffffffff80302ea4>{thread_return+42}
<ffffffff80187a34>{sys_fcntl+733}
<ffffffff80110052>{system_call+126}
Code: 0f 0b 52 d1 23 a0 ff ff ff ff ac 01 48 c7 c7 57 d1 23 a0 31
RIP <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365} RSP <00000100323d1d48>
Badness in do_unblank_screen at drivers/char/vt.c:2876
Call Trace:
<ffffffff80231642>{do_unblank_screen+61}
<ffffffff80122e60>{bust_spinlocks+28}
<ffffffff80111834>{oops_end+18}
<ffffffff80111961>{die+54}
<ffffffff80111d24>{do_invalid_op+145}
<ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
<ffffffff8013725a>{release_console_sem+369}
<ffffffff80137488>{vprintk+498}
<ffffffff80137532>{printk+141}
<ffffffff80110aed>{error_exit+0}
<ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
<ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
<ffffffffa0238a44>{:lock_dlm:do_dlm_lock_sync+54}
<ffffffffa023a660>{:lock_dlm:lock_resource+129}
<ffffffffa023b213>{:lock_dlm:lm_dlm_punlock+207}
<ffffffffa01ff9f6>{:gfs:gfs_lm_punlock+46}
<ffffffffa020a68e>{:gfs:gfs_lock+173}
<ffffffff8018b66a>{fcntl_setlk+296}
<ffffffff80302ea4>{thread_return+42}
<ffffffff80187a34>{sys_fcntl+733}
<ffffffff80110052>{system_call+126}
Marking this FAILSQA in order to get it back on the radar
and reassigning to Patrick...
I've no idea what I'm expected to do with this. If I leave cman running when the network goes away*, I get a bug saying that cman doesn't shut down and loops with an error message. If I make it shut down, I get a bugzilla saying that someone else's code is panicking!

Think about this sensibly, please. If the cluster manager cannot continue because its communication paths are severed, what should happen? cman itself shuts down; it notifies (as best it can) the subsystems that rely on it. It can do no more. What those subsystems do then is up to them.

This problem is not going to go away by being batted around between different people who deny it's their problem, and it's certainly not going to go away with the new code for RHEL5. Cluster managers shut down; code that depends on them needs to be able to deal with that.

Now, if we need a channel from the cluster manager that says explicitly "I am shutting down NOW, whether you like it or not", then I can provide that - but the code that hooks into it needs to know what it will do under those circumstances. Whatever that is, I imagine it won't be pretty, but perhaps it needn't be as ugly as a kernel panic - even if it is functionally the same thing ;-)

* By that I mean the interface has been downed by ifconfig or similar, or the whole IP stack has been shut down. Though this "bug" also happens when cman is killed for other, more clustery reasons, such as being isolated from a quorate cluster for too long.
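For what such a channel might look like, a hedged user-space sketch follows; none of these names (cman_register_emergency_cb and so on) exist in cman - they only illustrate the shape of a notifier that dependent subsystems could hook:

/* Sketch only: an "I am shutting down NOW" notifier that subsystems
 * (lock_dlm, clvmd, fenced, ...) could register with. */
#include <stdio.h>

#define MAX_SUBSYS 8

typedef void (*cman_emergency_cb)(void *private);

static struct {
        cman_emergency_cb cb;
        void *private;
} subsys[MAX_SUBSYS];
static int nr_subsys;

int cman_register_emergency_cb(cman_emergency_cb cb, void *private)
{
        if (nr_subsys >= MAX_SUBSYS)
                return -1;
        subsys[nr_subsys].cb = cb;
        subsys[nr_subsys].private = private;
        nr_subsys++;
        return 0;
}

/* Called once cman decides it cannot continue (e.g. ENETUNREACH on
 * every interface): tell the dependents before the sockets go away. */
void cman_emergency_shutdown(void)
{
        for (int i = nr_subsys - 1; i >= 0; i--)
                subsys[i].cb(subsys[i].private);
}

static void lock_dlm_quiesce(void *p)
{
        (void)p;
        /* Stop submitting lock requests and fail new ones with an errno
         * instead of asserting when they bounce off a dead DLM. */
        printf("lock_dlm: quiescing, waiting to be fenced\n");
}

int main(void)
{
        cman_register_emergency_cb(lock_dlm_quiesce, NULL);
        cman_emergency_shutdown();
        return 0;
}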
I just don't understand why the network interface is going "down" on these machines once they come back up from being rebooted.

Cluster = smith-01, smith-02, smith-03

smith-01 and smith-02 are rebooted by revolver, taken out of the cluster, come back up, rejoin the cluster, start the proper services, and mount the filesystems. Then all of a sudden (in one in 20 to 50 revolver iterations) you see this "CMAN: sendmsg failed: -101" (isn't that supposed to be fixed by bz 163587?) followed by the "No functional network interfaces" message (on smith-01). The network really dies? It was fine just a second ago and is still fine on the other nodes in the cluster (smith-02, smith-03). :)

This causes the next node (which just came back up) to notice the dead node and then die as well:

smith-02:

CMAN: removing node smith-01 from the cluster : Missed too many heartbeats
GFS: fsid=smith-cluster:gfs1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replayed 26 of 243 blocks
GFS: fsid=smith-cluster:gfs0.0: jid=1: replays = 26, skips = 77, sames = 140
GFS: fsid=smith-cluster:gfs0.0: jid=1: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs0.0: jid=1: Done
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replayed 2317 of 2317 blocks
GFS: fsid=smith-cluster:gfs1.0: jid=1: replays = 2317, skips = 0, sames = 0
GFS: fsid=smith-cluster:gfs1.0: jid=1: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs1.0: jid=1: Done
CMAN: removing node smith-01 from the cluster : Missed too many heartbeats
CMAN: removing node smith-03 from the cluster : No response to messages
CMAN: quorum lost, blocking activity
CMAN: quorum regained, resuming activity
GFS: fsid=smith-cluster:gfs1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replayed 1557 of 1712 blocks
GFS: fsid=smith-cluster:gfs1.0: jid=1: replays = 1557, skips = 141, sames = 14
GFS: fsid=smith-cluster:gfs1.0: jid=1: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs1.0: jid=1: Done
GFS: fsid=smith-cluster:gfs1.0: jid=2: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Replaying journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replayed 1177 of 2572 blocks
GFS: fsid=smith-cluster:gfs0.0: jid=1: replays = 1177, skips = 910, sames = 485
GFS: fsid=smith-cluster:gfs0.0: jid=1: Journal replayed in 2s
GFS: fsid=smith-cluster:gfs0.0: jid=1: Done
GFS: fsid=smith-cluster:gfs0.0: jid=2: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs0.0: jid=2: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Replayed 555 of 579 blocks
GFS: fsid=smith-cluster:gfs1.0: jid=2: replays = 555, skips = 1, sames = 23
GFS: fsid=smith-cluster:gfs0.0: jid=2: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=2: Replaying journal...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs1.0: jid=2: Done
GFS: fsid=smith-cluster:gfs0.0: jid=2: Replayed 111 of 111 blocks
GFS: fsid=smith-cluster:gfs0.0: jid=2: replays = 111, skips = 0, sames = 0
GFS: fsid=smith-cluster:gfs0.0: jid=2: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs0.0: jid=2: Done
CMAN: sendmsg failed: -101
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -101
CMAN: sendmsg failed: -101

The node which has been up this whole time again notices that they are dead, again:

smith-03:

CMAN: removing node smith-01 from the cluster : Missed too many heartbeats
CMAN: removing node smith-02 from the cluster : No response to messages
CMAN: quorum lost, blocking activity

Could there be a bigger Linux networking issue here?
Hang on, which bug is this? If it's "CMAN: sendmsg -101", then it's a duplicate of 163587 (and see my latest comments there). I thought it was about the lock_dlm assertion. I know they get seen at the same time (because one is often "caused" by the other), but we need to try to keep them separate or we're all going to get confused.
/me doesn't know which bug this is anymore. :) All I know is that I can no longer run extended recovery stress testing on cman/dlm clusters without hitting this network issue/message, which then causes the notorious cman assertion/panic. I wish I knew why cman thinks the network is dead. Nothing I did caused the network to die. The nodes come up after being rebooted, and the init scripts join the cluster and mount gfs just fine. So why, after that, are there network problems for cman, which cause panics?
The thing is, there's nothing I've done to cman that would change the frequency of occurrence of this message. All I've done is add a colon to it and make it exit (it doesn't panic; lock_dlm panics!) rather than looping endlessly. Now, I suppose it's possible that you were getting them before, but then the network came back up again so it went unnoticed. I've no idea how that would happen, though.

I suppose the thing to do is one or more of the following:
- Check the interfaces & routes as soon as possible after cman quits (a standalone check like the sketch below would do).
- tcpdump-check for ICMP messages that might tell the kernel that the broadcast route is down (I have no idea if these even exist, let alone where they might come from).
- Remove the "quit_threads=1" from the code and see if the messages still appear (I can't see how they would not) and whether the system recovers.

If the latter happens, then I suppose we could make it dependent on 2 or 3 fatal network errors before quitting, but in that case the node is still probably going to be fenced anyway on three heartbeat failures.
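For the first item, a standalone user-space check along these lines could be run the moment cman exits; this is purely illustrative and not part of cman:

/* Dump interface state right after cman quits, to see whether the
 * kernel really thinks the interfaces are gone. */
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
        struct ifaddrs *ifa, *p;

        if (getifaddrs(&ifa) < 0) {
                perror("getifaddrs");
                return 1;
        }
        for (p = ifa; p; p = p->ifa_next) {
                /* IFF_UP/IFF_RUNNING (plus the routing table) are what
                 * a -101 (ENETUNREACH) usually hinges on. */
                printf("%-8s up=%d running=%d broadcast=%d\n",
                       p->ifa_name,
                       !!(p->ifa_flags & IFF_UP),
                       !!(p->ifa_flags & IFF_RUNNING),
                       !!(p->ifa_flags & IFF_BROADCAST));
        }
        freeifaddrs(ifa);
        return 0;
}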
Oh, one other thing: does it make a difference whether you're using broadcast or multicast? (And I still think this is the wrong bug report to discuss this problem!)
Right, I saw the error this morning using revolver, but it was on a node that was being shut down, so this may not be relevant. Now, my test nodes never shut down properly anyway, for a number of reasons unrelated to this, but can you check that the node that dies after the -101 message isn't one that revolver has just tried to kill?
See comment #10 and comment #12. The nodes which see the -101 messages and then end up panicking _are indeed_ the node(s) which revolver just shot. HOWEVER, after being shot by revolver and *before* the -101 messages and panic, these nodes come back up, just like they are supposed to, join the cluster, and mount the filesystems, and everything appears just fine for a moment - until that message happens and the machine(s) panic and go down *again*, for good. This only happens when revolver shoots more than one node and quorum is lost, and since I don't run with multicast, this is only seen with broadcast. Again, from the time these nodes are shot by revolver to the time they see that -101 message and panic is about 4 or 5 minutes, during which time the nodes come back up and join the cluster.
The reasons for -101 (ENETUNREACH) are often to do with routing. Is there any chance that something is mucking about with the routing tables? I can stick a MSG_DONTROUTE flag on the messages when we get a failure, just to try to force the messages out in that instance - if you could test it.
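A hedged sketch of that fallback in plain user-space sockets - the real change would live in cman's kernel send path, and the helper name and port number here are assumptions:

/* On ENETUNREACH (-101), retry the same datagram with MSG_DONTROUTE
 * to bypass a possibly-mangled routing table. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t cluster_sendmsg(int sock, struct msghdr *msg)
{
        ssize_t ret = sendmsg(sock, msg, 0);

        if (ret < 0 && errno == ENETUNREACH) {
                /* Route lookup failed; the interface itself may still
                 * be up, so try to push the datagram out directly. */
                ret = sendmsg(sock, msg, MSG_DONTROUTE);
        }
        return ret;
}

int main(void)
{
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = { 0 };
        char payload[] = "heartbeat";
        struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
        struct msghdr msg = { 0 };

        dst.sin_family = AF_INET;
        dst.sin_port = htons(6809);          /* assumed cman port */
        dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        msg.msg_name = &dst;
        msg.msg_namelen = sizeof(dst);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        if (cluster_sendmsg(sock, &msg) < 0)
                perror("cluster_sendmsg");
        close(sock);
        return 0;
}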
Setting to NEEDINFO, as no one has yet answered any of the questions I've asked in this bz.
OK, I did a bit more work on this today. The only way I can reproduce the -101 is to actually take the interface down. However, it only takes one instance of this to bring CMAN down, which is obviously too intolerant. This checkin should give it "max_retries" attempts before giving up.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.17; previous revision: 1.42.2.16
done
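Since the diff itself isn't quoted here, the following is a hedged user-space reconstruction of the behaviour described; the names, the default of 3, and the reset-on-success rule are assumptions, not the actual cnxman.c change:

/* Tolerate max_retries consecutive fatal send errors before declaring
 * the network dead and leaving the cluster. */
#include <stdio.h>

#define MAX_RETRIES 3   /* assumed default */

static int send_failures;

/* Call with the (negative errno) result of each heartbeat sendmsg(). */
static int note_send_result(int err)
{
        if (err >= 0) {
                send_failures = 0;      /* one success resets the count */
                return 0;
        }
        if (++send_failures < MAX_RETRIES) {
                fprintf(stderr, "CMAN: sendmsg failed: %d (attempt %d)\n",
                        err, send_failures);
                return 0;               /* keep trying */
        }
        fprintf(stderr,
                "CMAN: No functional network interfaces, leaving cluster\n");
        return -1;                      /* now give up */
}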
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0166.html