Bug 133543

Summary: cman_comms kernel oops after seeing different cluster view from recovery
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: GFS Bugs <gfs-bugs>
CC: teigland
Doc Type: Bug Fix
Last Closed: 2004-10-27 22:11:55 UTC

Description Corey Marthaler 2004-09-24 18:09:34 UTC
Description of problem:
Hit this on morph-01 after taking down morph-05 and morph-06 and then
attempting to bring them back into the cluster with a cman_tool join.

Sep 24 11:28:42 morph-01 kernel: CMAN: node morph-06 rejoining
Sep 24 11:28:43 morph-01 kernel: CMAN: nmembers in HELLO message does not match our view (got 4, exp 5)
Sep 24 11:28:50 morph-01 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000014
Sep 24 11:28:50 morph-01 kernel:  printing eip:
Sep 24 11:28:50 morph-01 kernel: f8a6b448
Sep 24 11:28:50 morph-01 kernel: *pde = 00000000
Sep 24 11:28:50 morph-01 kernel: Oops: 0000 [#1]
Sep 24 11:28:50 morph-01 kernel: SMP
Sep 24 11:28:50 morph-01 kernel: Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Sep 24 11:28:50 morph-01 kernel: CPU:    0
Sep 24 11:28:50 morph-01 kernel: EIP:    0060:[<f8a6b448>]    Not tainted
Sep 24 11:28:50 morph-01 kernel: EFLAGS: 00010293   (2.6.8.1)
Sep 24 11:28:50 morph-01 kernel: EIP is at kcl_sendmsg+0x28/0x100 [cman]
Sep 24 11:28:50 morph-01 kernel: eax: 00000000   ebx: f6c84780   ecx: 00000002   edx: ffffffea
Sep 24 11:28:50 morph-01 kernel: esi: f76b7f98   edi: f76b7f98   ebp: 00010000   esp: f76b7f38
Sep 24 11:28:50 morph-01 kernel: ds: 007b   es: 007b   ss: 0068
Sep 24 11:28:50 morph-01 kernel: Process cman_comms (pid: 3674, threadinfo=f76b6000 task=f731b7b0)
Sep 24 11:28:50 morph-01 kernel: Stack: c033e180 f731b7b0 00000002 f76b7f96 00000000 f76b7fa8 c02f1b2d f509f800
Sep 24 11:28:50 morph-01 kernel:        000005dc 00000000 00000000 00000246 00000000 00000000 00000000 f6c84780
Sep 24 11:28:50 morph-01 kernel:        00000008 f76b7f98 f76b6000 f8a6f80d f76b7f98 00000008 00010000 000300cd
Sep 24 11:28:50 morph-01 kernel: Call Trace:
Sep 24 11:28:50 morph-01 kernel:  [<c02f1b2d>] schedule+0x2dd/0x5d0
Sep 24 11:28:50 morph-01 kernel:  [<f8a6f80d>] send_leave+0x11d/0x150 [cman]
Sep 24 11:28:50 morph-01 kernel:  [<f8a68493>] cluster_kthread+0x2b3/0x320 [cman]
Sep 24 11:28:50 morph-01 kernel:  [<c011efb0>] default_wake_function+0x0/0x10
Sep 24 11:28:50 morph-01 kernel:  [<f8a681e0>] cluster_kthread+0x0/0x320 [cman]
Sep 24 11:28:50 morph-01 kernel:  [<c01042b5>] kernel_thread_helper+0x5/0x10
Sep 24 11:28:50 morph-01 kernel: Code: 8b 48 14 7f 66 a1 b4 4f a8 f8 ba 95 ff ff ff 85 c0 74 58 0f
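
A note on reading the oops above, offered as a sketch rather than a definitive analysis. It assumes the Code: dump begins at the faulting EIP, which the numbers seem to support: the first bytes, 8b 48 14, decode as a 32-bit load from [eax + 0x14] into ecx, and with eax = 00000000 that gives the reported fault address 00000014. In other words, kcl_sendmsg appears to read a field at offset 0x14 of a NULL structure pointer. The helper names below (decode_modrm, effective_addr_disp8) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 ModR/M byte (the byte following opcode 0x8b, MOV r32, r/m32). */
struct modrm {
    unsigned mod; /* addressing mode: 1 means base register + 8-bit displacement */
    unsigned reg; /* destination register number: 1 is ecx */
    unsigned rm;  /* base register number: 0 is eax */
};

/* Split a ModR/M byte into its three bit fields. */
static struct modrm decode_modrm(uint8_t b)
{
    struct modrm m = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return m;
}

/* Effective address for the mod=1 form: base plus sign-extended disp8. */
static uint32_t effective_addr_disp8(uint32_t base, uint8_t disp8)
{
    return base + (uint32_t)(int32_t)(int8_t)disp8;
}
```

Feeding in the ModR/M byte 0x48, the eax value from the register dump, and the displacement 0x14 reproduces the address the kernel reported.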

How reproducible:
Didn't try

Comment 1 Christine Caulfield 2004-09-27 10:27:10 UTC
There was a race between the two main cman processes shutting down. If
membership went down first, it closed a socket that comms wanted to use.

The solution is to make sure that only membership uses its socket (of
course). Some previous changes have made this easier and tidier than
it used to be, so this should be fixed now.
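
For readers unfamiliar with this kind of bug, the ownership rule described above can be sketched in user-space C. This is not the code checked in to cnxman.c or membership.c; every name here (fake_sock, membership, membership_send, membership_shutdown, comms_send_leave) is hypothetical, and the model is single-threaded, so it shows only the ordering problem and not the locking a real kernel fix would need.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel socket; not the real struct socket. */
struct fake_sock {
    int is_open; /* nonzero while the socket is usable */
    int sent;    /* count of messages sent, for illustration */
};

/* Hypothetical model: the membership side is the socket's sole owner. */
struct membership {
    struct fake_sock *sock; /* NULL once membership has shut down */
};

/* The only function that touches the socket; it checks for shutdown first. */
static int membership_send(struct membership *m, const char *msg, size_t len)
{
    (void)msg;
    (void)len;
    if (m->sock == NULL || !m->sock->is_open)
        return -ENOTCONN; /* owner already gone: fail cleanly, no dereference */
    m->sock->sent++;
    return 0;
}

/* Membership closes its socket and forgets the pointer. */
static void membership_shutdown(struct membership *m)
{
    if (m->sock != NULL) {
        m->sock->is_open = 0;
        m->sock = NULL;
    }
}

/* Comms holds no socket pointer of its own; it goes through the owner. */
static int comms_send_leave(struct membership *m)
{
    static const char leave[] = "LEAVE";
    return membership_send(m, leave, sizeof leave);
}
```

In this sketch, a leave sent after membership has shut down returns -ENOTCONN instead of dereferencing a stale or NULL socket, which is the failure seen in the oops.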

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.24; previous revision: 1.23
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.20; previous revision: 1.19
done


Comment 2 Corey Marthaler 2004-10-27 22:11:55 UTC
fix verified.

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:12:34 UTC
Updating version to the right level in the defects.  Sorry for the storm.