Bug 233521

Summary: cman panic in start_transition after cmirror device failure caused node to shut down
Product: [Retired] Red Hat Cluster Suite
Component: cman-kernel
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Fixed In Version: RHBA-2007-0990
Doc Type: Bug Fix
Last Closed: 2007-11-21 21:54:06 UTC

Description Corey Marthaler 2007-03-22 21:26:23 UTC
Description of problem:
I was running on a 4-node x86_64 cluster (link-02,04,07,08) and created one
cluster mirror as follows:

[root@link-02 ~]# lvs -a -o +devices
  LV              VG         Attr   LSize  Origin Snap%  Move Log       Copy%  Devices
  LogVol00        VolGroup00 -wi-ao 35.16G                                     /dev/hda2(0)
  LogVol01        VolGroup00 -wi-ao  1.94G                                     /dev/hda2(1125)
  test            vg         mwi-a-  1.00G                    test_mlog 100.00 test_mimage_0(0),test_mimage_1(0)
  [test_mimage_0] vg         iwi-ao  1.00G                                     /dev/sda1(0)
  [test_mimage_1] vg         iwi-ao  1.00G                                     /dev/sdb1(0)
  [test_mlog]     vg         lwi-ao  4.00M                                     /dev/sdh1(0)

I then started a small amount of I/O with the following:
[root@link-02 ~]# dd if=/dev/zero of=/dev/vg/test bs=4M

And then failed the primary leg of that cmirror on all nodes:
[root@link-02 ~]# echo offline > /sys/block/sda/device/state

This caused link-02 to appear hung for a while, then link-04 withdrew
from the cluster, and finally link-08 panicked.


LINK-04:
[...]
Mar 22 11:18:01 link-04 kernel: scsi1 (1:1): rejecting I/O to offline device
Mar 22 11:19:35 link-04 kernel: CMAN: Being told to leave the cluster by node 3
Mar 22 11:19:35 link-04 kernel: CMAN: we are leaving the cluster.
Mar 22 11:19:35 link-04 kernel: WARNING: dlm_emergency_shutdown
Mar 22 11:19:35 link-04 kernel: WARNING: dlm_emergency_shutdown
Mar 22 11:19:35 link-04 kernel: SM: 00000003 sm_stop: SG still joined
Mar 22 11:19:35 link-04 kernel: SM: 01000179 sm_stop: SG still joined


LINK-07:
[...]
Mar 22 16:30:25 link-07 kernel: scsi3 (0:1): rejecting I/O to offline device
Mar 22 16:30:25 link-07 last message repeated 31 times
Mar 22 16:30:41 link-07 kernel: CMAN: node link-04 has been removed from the cluster : Shutdown
Mar 22 16:31:03 link-07 kernel: CMAN: node link-02 has been removed from the cluster : No response to messages
Mar 22 16:31:04 link-07 kernel: CMAN: quorum lost, blocking activity
Mar 22 16:31:37 link-07 kernel: CMAN: removing node link-08 from the cluster : Missed too many heartbeats



LINK-08:
[...]
CMAN: removing node link-04 from the cluster : Shutdown
CMAN: removing node link-02 from the cluster : No response to messages
CMAN: quorum lost, blocking activity
Unable to handle kernel NULL pointer dereference at 000000000000003c RIP:
<ffffffffa0239ff1>{:cman:start_transition+447}
PML4 1a490067 PGD 1a484067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: dm_cmirror(U) dlm(U) cman(U) qla2300 qla2xxx
scsi_transport_fc md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptsas mptspi mptscsi
mptbase sd_mod scsi_mod
Pid: 2616, comm: cman_memb Not tainted 2.6.9-50.ELsmp
RIP: 0010:[<ffffffffa0239ff1>] <ffffffffa0239ff1>{:cman:start_transition+447}
RSP: 0018:0000010015807d08  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000000000000005 RCX: 0000000000000007
RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000800030304 R09: 0000000000000040
R10: 00000000000005dc R11: 000001001efc3060 R12: 0000010015807da8
R13: ffffffffa02547a0 R14: 0000000000000000 R15: 0000010015807dc8
FS:  0000002a95562b00(0000) GS:ffffffff804ee200(0000) knlGS:00000000f7ff1b00
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000003c CR3: 0000000000101000 CR4: 00000000000006e0
Process cman_memb (pid: 2616, threadinfo 0000010015806000, task 00000100302e8030)
Stack: 00000000000005dc 0000000000000000 0000010015807e38 0000010015807da8
       0000000000000008 ffffffffa0253e60 0000010015807dc8 ffffffffa023c7ac
       00000100302e8030 0000010001008c60
Call Trace:<ffffffffa023c7ac>{:cman:dispatch_messages+5705}
<ffffffff8030c531>{thread_return+0}
       <ffffffff8030c589>{thread_return+88}
<ffffffffa023d3f1>{:cman:membership_kthread+3033}
       <ffffffff80134660>{default_wake_function+0}
<ffffffff80134660>{default_wake_function+0}
       <ffffffff80134660>{default_wake_function+0}
<ffffffff801333cb>{schedule_tail+55}
       <ffffffff80110f47>{child_rip+8}
<ffffffffa023c818>{:cman:membership_kthread+0}
       <ffffffff80110f3f>{child_rip+0}

Code: 8b 45 3c 41 bc 14 00 00 00 41 bf 00 00 20 00 41 88 45 03 8b
RIP <ffffffffa0239ff1>{:cman:start_transition+447} RSP <0000010015807d08>
CR2: 000000000000003c
 <0>Kernel panic - not syncing: Oops


Version-Release number of selected component (if applicable):
[root@link-07 ~]# uname -ar
Linux link-07 2.6.9-50.ELsmp #1 SMP Tue Mar 6 18:04:58 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@link-07 ~]# rpm -q cman
cman-1.0.16-0
[root@link-07 ~]# rpm -q cman-kernel
cman-kernel-2.6.9-49.1
cman-kernel-2.6.9-49.2

Comment 1 Corey Marthaler 2007-03-22 21:37:45 UTC
Here's a little more info from the only node still "in" the cluster

[root@link-07 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   X   link-08
   2    1    4   M   link-07
   3    1    4   X   link-02
   4    1    4   X   link-04
[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 recover 0 -
[2]

DLM Lock Space:  "clvmd"                           377 257 recover 0 -
[2]

DLM Lock Space:  "clustered_log"                   379 259 recover 0 -
[2]

[root@link-07 cluster]# cat status
Protocol version: 5.0.1
Config version: 2
Cluster name: LINK_128
Cluster ID: 19208
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 4
Total_votes: 1
Quorum: 3  Activity blocked
Active subsystems: 4
Node name: link-07
Node ID: 2
Node addresses: 10.15.89.157
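
As a rough cross-check of the numbers above, the sketch below recomputes the vote total and quorum threshold from the `cman_tool nodes` table. It assumes that status M means a current member and X a node that is no longer one, and that quorum is a simple majority of expected votes (expected_votes // 2 + 1); neither assumption is stated in this report. The helper names are made up for illustration and are not part of cman.

```python
# Output of `cman_tool nodes` on link-07, copied from this comment.
NODES_OUTPUT = """\
Node  Votes Exp Sts  Name
   1    1    4   X   link-08
   2    1    4   M   link-07
   3    1    4   X   link-02
   4    1    4   X   link-04
"""


def parse_nodes(text):
    """Parse the table into a list of dicts; the header row is skipped."""
    nodes = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore blank or unexpected lines
        node_id, votes, expected, status, name = fields
        nodes.append({
            "id": int(node_id),
            "votes": int(votes),
            "expected": int(expected),
            "status": status,
            "name": name,
        })
    return nodes


def member_votes(nodes):
    """Sum the votes of nodes with status M (ASSUMPTION: M = member)."""
    return sum(n["votes"] for n in nodes if n["status"] == "M")


def quorum_threshold(expected_votes):
    """ASSUMPTION: simple majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(nodes, expected_votes):
    """True when the remaining members hold enough votes for quorum."""
    return member_votes(nodes) >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    nodes = parse_nodes(NODES_OUTPUT)
    print("total votes:", member_votes(nodes))
    print("quorum:", quorum_threshold(4))
    print("quorate:", is_quorate(nodes, 4))
```

Under those assumptions the result agrees with the status file: one vote in total against a quorum of 3, which is consistent with "Activity blocked" on link-07.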


[root@link-07 cluster]# cat sm_debug
count 3
00000003 remove node 4 count 2
0100017b remove node 3 count 3
0100017b remove node 4 count 2
01000179 remove node 3 count 3
01000179 remove node 4 count 2
00000003 remove node 1 count 1
0100017b remove node 1 count 1
01000179 remove node 1 count 1


Comment 2 Corey Marthaler 2007-03-22 21:40:53 UTC
Just another note... I have automated tests that do this sort of thing over and
over, so it's surprising that this occurred the one time I did it by hand.

To date, this has only been seen one time.

Comment 3 Christine Caulfield 2007-03-23 11:48:00 UTC
I think it's one of those fluke occurrences...

Looking at the code it seems most likely that a NULL node structure address was
passed into start_transition(), but it's really not clear to me how that can happen.

Putting extra debugging into the code is almost certainly going to hide the
problem, if it's even reproducible at all, which seems unlikely given what
you've said.
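
The register dump in the link-08 oops appears to support the NULL-pointer reading. The sketch below decodes only the first instruction on the Code: line, assuming those bytes start at the faulting RIP (the matching CR2 value suggests they do). The decoder is a hypothetical helper that handles just this one encoding (`8b /r` with an 8-bit displacement); it is not a general disassembler.

```python
# Values copied from the link-08 oops in the description.
OOPS_CODE = "8b 45 3c 41 bc 14 00 00 00 41 bf 00 00 20 00 41 88 45 03 8b"
RBP = 0x0000000000000000
CR2 = 0x000000000000003C

# Register names indexed by the 3-bit ModRM fields (no REX prefix present).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load_disp8(code_bytes):
    """Decode `8b /r` (MOV r32, r/m32) with mod=01: base register + disp8.

    Returns (destination register, base register, signed displacement).
    """
    opcode, modrm, disp = code_bytes[0], code_bytes[1], code_bytes[2]
    if opcode != 0x8B:
        raise ValueError("not a MOV r32, r/m32 instruction")
    mod = modrm >> 6
    reg = (modrm >> 3) & 7
    rm = modrm & 7
    if mod != 1 or rm == 4:
        # Other modes, or an SIB byte, are outside the scope of this sketch.
        raise ValueError("addressing form not handled by this sketch")
    if disp >= 0x80:
        disp -= 0x100  # disp8 is sign-extended
    return REG32[reg], REG64[rm], disp


if __name__ == "__main__":
    raw = bytes.fromhex(OOPS_CODE.replace(" ", ""))
    dest, base, disp = decode_mov_load_disp8(raw)
    print(f"mov {disp:#x}(%{base}),%{dest}")
    print("fault address matches CR2:", RBP + disp == CR2)
```

Read this way, the faulting instruction loads a 32-bit field at offset 0x3c from the pointer held in RBP. RBP is 0 in the dump, so the access lands on address 0x3c, which is the value reported in CR2. Which structure member sits at that offset cannot be determined from the report alone.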

Comment 4 Christine Caulfield 2007-05-03 10:20:23 UTC
I found one unchecked use of the node structure being passed into
start_transition(). It's pretty unlikely, but this doesn't seem to be a common
bug, so they might be related!

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.27; previous revision: 1.44.2.26
done

Feel free, of course, to reopen this if it happens again.

Comment 5 Chris Feist 2007-08-17 21:34:21 UTC
Setting flags for 4.6.

Comment 7 Corey Marthaler 2007-11-05 15:32:35 UTC
Marking this bug verified as it hasn't been seen in over 7 months.

Comment 9 errata-xmlrpc 2007-11-21 21:54:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0990.html