Bug 199433

Summary: NULL pointer dereference in cman:process_messages for cmirror
Product: [Retired] Red Hat Cluster Suite Reporter: Corey Marthaler <cmarthal>
Component: cman-kernelAssignee: David Teigland <teigland>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: high    
Version: 4CC: cluster-maint, jbrassow
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0990 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-21 21:54:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
debugging patch
none
debugging patch none

Description Corey Marthaler 2006-07-19 15:54:57 UTC
Description of problem:
Hit this on taft-01 (in a cluster with taft-02, -03, -04) while doing cmirror
operations on taft-04

Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa021bfad>{:cman:process_messages+849}
PML4 215bfb067 PGD 0
Oops: 0000 [1] SMP
CPU 2
Modules linked in: dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport
autofs4 i2c_dev i2c_d
Pid: 4314, comm: cman_serviced Tainted: GF     2.6.9-42.BRsmp
RIP: 0010:[<ffffffffa021bfad>] <ffffffffa021bfad>{:cman:process_messages+849}
RSP: 0018:0000010212409e68  EFLAGS: 00010202
RAX: 0000000000000002 RBX: 0000000000000001 RCX: 0000010200792000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa022f7a0
RBP: 0000010212409eb8 R08: 0000010212408000 R09: ffffffff80415ba0
R10: ffffffff80415ba0 R11: 0000000000000000 R12: 0000010213aed7c8
R13: 0000000000000004 R14: 0000010213aed7b8 R15: ffffffff8014b4f0
FS:  0000000000000000(0000) GS:ffffffff804e5180(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000000dff9e000 CR4: 00000000000006e0
Process cman_serviced (pid: 4314, threadinfo 0000010212408000, task
0000010216936030)
Stack: 0000000000000001 0000010216c65e80 0000002716930107 0000000000000de2
       0000020252201e91 00000100069397f0 00000102169362c8 00000100010707e0
       0000000000000001 0000000000000000
Call Trace:<ffffffffa0219831>{:cman:serviced+0}
<ffffffff8014b4f0>{keventd_create_kthread+0}
       <ffffffffa021985f>{:cman:serviced+46} <ffffffff8014b4c7>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014b4f0>{keventd_create_kthread+0}
       <ffffffff8014b3ff>{kthread+0} <ffffffff80110f3f>{child_rip+0}


Code: 8b 42 20 f6 c4 02 8b 42 28 74 17 83 f8 0b 77 17 44 39 ee 72
RIP <ffffffffa021bfad>{:cman:process_messages+849} RSP <0000010212409e68>
CR2: 0000000000000020
 <0>Kernel panic - not syncing:Jul 19 04:28:21 taft-01 kernel: Unable to handle
kernel NULL pointer
Jul 19 04:28:21 taft-01 kernel: <ffffffffa021bfad>{: cman:process_messages+849}

Message from syslogd@taft-01 at Wed Jul 19 04:28:21 2006 ...
taft-01 kernel: Oops: 0000 [1] SMP
Oops


Version-Release number of selected component (if applicable):
This was while running Jon's 2.6.9-42.BRsmp kernel

Comment 1 David Teigland 2006-07-19 16:25:33 UTC
SM messages are only sent/recvd when joining/leaving a group, I'm
not sure why normal cmirror use would involve joining/leaving a group.
process_messages() is a pretty short function, and I don't see any
pointers dereferenced by 0x20.  I wonder if it's a duff backtrace.


Comment 2 Corey Marthaler 2006-07-19 16:41:47 UTC
probably doesn't need to be set to high since it's only been seen once.

Comment 3 Jonathan Earl Brassow 2006-07-19 16:45:15 UTC
cmirror will do a kcl_join_service on the first mirror creation and do a kcl_leave_service on the last 
mirror remove.

So, if you are doing a looping mirror create/remove, it will be joining and leaving all the time.


Comment 4 Kiersten (Kerri) Anderson 2006-09-22 19:13:47 UTC
Devel ACK for 4.5, needs to be resolved with respect to cluster mirroring.

Comment 5 David Teigland 2006-10-19 16:26:06 UTC
Is this still happening?

Comment 6 Corey Marthaler 2006-11-22 00:16:29 UTC
Have not seen this in 4 months, will reopen if reproduced.

Comment 7 Corey Marthaler 2007-02-01 15:39:26 UTC
Hit this last night while doing GFS I/O to a cmirror, while also doing looping
up and down mirror convertions of that cmirror. This was in a two node cluster.

Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa02251d3>{:cman:process_messages+849}
PML4 72474067 PGD 72c7e067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U)
cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd ehci_hcd k8_edac edac_mc tg3 qla2400
qla2xxx scsi_transport_fc dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv
libata sd_mod scsi_mod
Pid: 4015, comm: cman_serviced Not tainted 2.6.9-43.ELsmp
RIP: 0010:[<ffffffffa02251d3>] <ffffffffa02251d3>{:cman:process_messages+849}
RSP: 0018:000001007aaf5e68  EFLAGS: 00010202
RAX: 0000000000000002 RBX: 0000000000000001 RCX: 000001007d1a1200
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa0239020
RBP: 000001007aaf5eb8 R08: 000001007aaf4000 R09: ffffffff804192c0
R10: ffffffff804192c0 R11: 0000000000000000 R12: 0000010078bed7c8
R13: 0000000000000002 R14: 0000010078bed7b8 R15: ffffffff8014be28
FS:  0000002a95562b00(0000) GS:ffffffff804ed280(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000000101000 CR4: 00000000000006e0
Process cman_serviced (pid: 4015, threadinfo 000001007aaf4000, task
000001007ab367f0)
Stack: 0000010001021a80 000001006e8bfe80 0000166c7ab30107 00000000000001c9
       00001acd6cf8f797 000001007b90a7f0 000001007ab36a88 0000010001020760
       0000010001020760 0000000000000000
Call Trace:<ffffffffa0222a31>{:cman:serviced+0}
<ffffffff8014be28>{keventd_create_kthread+0}
       <ffffffffa0222a5f>{:cman:serviced+46} <ffffffff8014bdff>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014be28>{keventd_create_kthread+0}
       <ffffffff8014bd37>{kthread+0} <ffffffff80110f3f>{child_rip+0}


Code: 8b 42 20 f6 c4 02 8b 42 28 74 17 83 f8 0b 77 17 44 39 ee 72
RIP <ffffffffa02251d3>{:cman:process_messages+849} RSP <000001007aaf5e68>
CR2: 0000000000000020
 <0>Kernel panic - not syncing: Oops
                                                                               
        




Comment 8 Corey Marthaler 2007-02-01 15:40:40 UTC
bumping priority now that this is reproducable.

Comment 9 David Teigland 2007-02-01 16:08:07 UTC
Created attachment 147109 [details]
debugging patch

This patch adds some printks that will hopefully
give us some clues about what's wrong if we can
reproduce with it.

Comment 10 David Teigland 2007-02-01 16:18:18 UTC
Created attachment 147110 [details]
debugging patch

this one compiles

Comment 11 David Teigland 2007-02-02 18:55:47 UTC
We need to get a rhel4.5 cmirror cluster set up and running with
my patch and corey's test.


Comment 12 David Teigland 2007-02-06 15:24:04 UTC
I have a cluster we could potentially use for testing/reproducing
this, but it has one (big) LUN on the SAN -- is that enough to set
up and test cmirror?


Comment 13 Corey Marthaler 2007-02-06 16:40:34 UTC
we could dice up that big LUN and then use it for cmirror testing, yes.

Comment 14 Corey Marthaler 2007-02-12 17:49:09 UTC
I'm going to move this to NEEDINFO as I can't seem to reproduce this on Dave's
cluster with the debugging patches.

Comment 15 Corey Marthaler 2007-02-28 19:39:28 UTC
Reproduced this today while running the clvm test activator. Unfortunately, I
don't have anymore info then the same panic messages/stack trace as in the
original bz report, but this bug is still out there.

Kernel 2.6.9-48.ELsmp on an x86_64
RHEL4-U5-re20070216.2

Comment 16 Corey Marthaler 2007-02-28 22:13:46 UTC
The lvm test activator seems to be a good way to reproduce this bug as I just
hit it again.

Comment 17 Jonathan Earl Brassow 2007-03-01 15:53:44 UTC
I'll see if I can grab that and get it going on my cluster


Comment 18 Jonathan Earl Brassow 2007-03-01 17:03:34 UTC
wait a minute....  has dct's patch been tried on the cluster that we know
produces the problem?  I think I'll wait until that's been done...


Comment 19 Kiersten (Kerri) Anderson 2007-03-13 15:56:55 UTC
Moving to NEEDINFO waiting on recreating the defect with the instrumentation.

Comment 20 Corey Marthaler 2007-04-03 16:31:26 UTC
Just an update that the test used to reproduce this bug ran 10000 iterations
over the entire weekend without any problems on the 2-node bull cluster, running
Dave's patch. 

Could the extra output that this patch produces some how affect the ability for
the panic to occur?

I'll attempt to reproduce with out the patch again. 

Comment 21 Corey Marthaler 2007-04-03 16:32:20 UTC
moving back to NEEDINFO.

Comment 22 Corey Marthaler 2007-04-10 14:55:50 UTC
Nice! I finally hit this damn bug again, on the bull nodes no less:


bull-01:
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa025d22b>{:cman:process_messages+849}
PML4 125376067 PGD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport
autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac
ohci_hcd ehci_hcd k8_edac edac_mc tg3 dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod sata_nv libata qla2400 qla2xxx scsi_transport_fc mptscsih mptsas mptspi
mptscsi mptbase sd_mod scsi_mod
Pid: 4892, comm: cman_serviced Not tainted 2.6.9-50.ELsmp
RIP: 0010:[<ffffffffa025d22b>] <ffffffffa025d22b>{:cman:process_messages+849}
RSP: 0018:00000101395ffe68  EFLAGS: 00010297
RAX: 0000000000000002 RBX: 0000000000000001 RCX: 000001007e2e6400
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa02710a0
RBP: 00000101395ffeb8 R08: 00000101395fe000 R09: ffffffff80419a80
R10: ffffffff80419a80 R11: 0000000000000000 R12: 0000010138c037c8
R13: 0000000000000001 R14: 0000010138c037b8 R15: ffffffff8014be24
FS:  0000002a95562b00(0000) GS:ffffffff804ee200(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000000101000 CR4: 00000000000006e0
Process cman_serviced (pid: 4892, threadinfo 00000101395fe000, task
0000010139f4f030)
Stack: 0000010008002a80 0000010123037280 00000efa39f40107 000000000000035a
       0000167ec2177e92 000001013a7cf030 0000010139f4f2c8 0000010008001760
       0000010008001760 0000000000000000
Call Trace:<ffffffffa025aa89>{:cman:serviced+0}
<ffffffff8014be24>{keventd_create_kthread+0}
       <ffffffffa025aab7>{:cman:serviced+46} <ffffffff8014bdfb>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014be24>{keventd_create_kthread+0}
       <ffffffff8014bd33>{kthread+0} <ffffffff80110f3f>{child_rip+0}


Code: 8b 42 20 f6 c4 02 8b 42 28 74 17 83 f8 0b 77 17 44 39 ee 72
RIP <ffffffffa025d22b>{:cman:process_messages+849} RSP <00000101395ffe68>
CR2: 0000000000000020
 <0>Kernel panic - not syncing: Oops


Comment 23 David Teigland 2007-04-11 13:55:22 UTC
I logged on to bull-02 and /var/log/messages appears to show that the last
cman module loaded there was from the rpm, not my debug version:

Apr  9 16:11:02 bull-02 kernel: CMAN 2.6.9-49.2 (built Mar 13 2007 17:14:14)
installed

Once we reboot bull-01 I suspect we'll see the same thing.  The debug
cman module should be loaded from here: /root/cluster/cman-kernel/src/cman.ko
and should show this when loaded:

Apr  3 13:01:51 bull-02 kernel: CMAN <CVS> (built Mar 29 2007 14:10:27) installed

The debug cman module checks for this null pointer before dereferencing it
and if it's null prints out some stuff and then panics.


Comment 24 David Teigland 2007-04-17 14:13:27 UTC
I've verified that the cman modules that were loaded when we last hit
the panic were from the rpm, not the debug versions I built.


Comment 25 David Teigland 2007-04-17 14:29:26 UTC
Here's the debug patch, it adds two checks for a NULL msg and if it finds
one, prints out a little more info about it before panicking.  Since the
modules I build with this patch seem to always work fine, perhaps I should
just check this into 4.5 and have you test with the next build.

diff -u -r1.4.2.3 sm_message.c
--- sm_message.c        1 Dec 2006 20:40:00 -0000       1.4.2.3
+++ sm_message.c        17 Apr 2007 14:25:35 -0000
@@ -717,6 +717,18 @@
        while (1) {
                re = NULL;
 
+               if (!re->msg) {
+                       log_print("process_messages: re %p len %d nodeid %d",
+                                 re, re->len, re->nodeid);
+                       SM_ASSERT(0, printk("NULL MSG\n"););
+                       /* FIXME: we might just kfree(re); continue; here */
+               }
+               /*
+               m = (sm_msg_t *) re->msg;
+               printk("take re %p len %d nodeid %d msg %p type %d\n",
+                       re, re->len, re->nodeid, re->msg, m->ms_type);
+               */
+
                spin_lock(&message_lock);
                if (!list_empty(&messages)) {
                        re = list_entry(messages.next, rq_entry_t, list);
@@ -749,6 +761,13 @@
        re->len = len;
        re->nodeid = nodeid;
 
+       SM_ASSERT(re->msg, printk("re %p len %d nodeid %d\n", re,len,nodeid););
+       /*
+       m = (sm_msg_t *) re->msg;
+       printk("save re %p len %d nodeid %d msg %p type %d\n",
+               re, re->len, re->nodeid, re->msg, m->ms_type);
+       */
+
        spin_lock(&message_lock);
        list_add_tail(&re->list, &messages);
        spin_unlock(&message_lock);


Comment 26 Corey Marthaler 2007-04-17 22:23:21 UTC
I filed bz 236844 for the clvmd memory leak issue which may be related or what
is helping cause this issue.

Comment 27 Jonathan Earl Brassow 2007-04-24 20:14:14 UTC
more likely is the cmirror leak which has been checked-in as part of 237028 and
a partial fix for this bug.

While cman's response to the low memory condition may need to be tightened, I'm
hoping the memory leak fix to cmirror will cause this bug not to get hit again.
It might be reasonable to mark this as modified... although, cman changes may
still be wanted/required.


Comment 28 Corey Marthaler 2007-05-09 15:00:26 UTC
After weeks of running tests specifically to trip this bug and unsuccessfully
doing so, I was able to reproduce this last night by doing looping up and down
mirror converts as well as creation/deletions on all four nodes in the cluster. 

So, this bug still exists, but seems unlikely that a customer will hit it.
Should we take this off the blocker list? 

Comment 29 Kiersten (Kerri) Anderson 2007-05-21 22:17:50 UTC
Moving this bug to the 4.6 release since 4.5 has already been released.

Comment 30 David Teigland 2007-08-14 17:08:37 UTC
If this happens again, I expect it should be a hang accompanied by the new
error message instead of the oops.

Branch:         RHEL4
Changes by:     teigland 2007-08-14 17:05:10

Modified files:
        cman-kernel/src: sm_message.c

Log message:
        bz 199433: NULL pointer dereference in cman:process_messages for cmirror

        Adds a check for a null message, and if it finds one, prints an error
        and ignores it instead of oopsing.  This may help us get further in
        discovering the real problem.  Ignoring the null message will probably
        lead to a hang of some kind, which is better and easier to debug than
        an oopsed machine.


Comment 31 David Teigland 2007-10-02 20:44:33 UTC
Corey hit this again and didn't see the new message that was added.

I disassembled the cman module and discovered that the compiler has
inlined a bunch of functions into process_messages().  So, the problem
isn't actually in the function identified by the oops message.

After studying the assembly, I think the oops is likely to be from
dereferencing an "sev" structure that's not checked for NULL before being
used in process_leave_request().  process_join_request() does the same thing.
I've checked in code that will verify these pointers before using them.

On 64 bit machines:
offending instruction offset 892 = 0xE25E

    e255:       74 60                   je     e2b7 <process_messages+0x3d5>
    e257:       48 8b 91 88 00 00 00    mov    0x88(%rcx),%rdx
    e25e:       8b 42 20                mov    0x20(%rdx),%eax
    e261:       f6 c4 02                test   $0x2,%ah
    e264:       8b 42 28                mov    0x28(%rdx),%eax

I think that 0x88 is probably the sg->sevent pointer being loaded, and
0x20 is sev->se_flags which oopses given the null sev.

Checking in sm_message.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/sm_message.c,v  <--  sm_message.c
new revision: 1.4.2.5; previous revision: 1.4.2.4


Comment 33 Corey Marthaler 2007-11-05 16:01:58 UTC
Marking this bug verified. It hasn't been seen in the past month of heavy
cmirror testing with a fixed check in. 

Comment 35 errata-xmlrpc 2007-11-21 21:54:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0990.html