Bug 199433
Summary: | NULL pointer dereference in cman:process_messages for cmirror | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> | ||||||
Component: | cman-kernel | Assignee: | David Teigland <teigland> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 4 | CC: | cluster-maint, jbrassow | ||||||
Target Milestone: | --- | Keywords: | Reopened | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | RHBA-2007-0990 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2007-11-21 21:54:05 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Corey Marthaler
2006-07-19 15:54:57 UTC
SM messages are only sent/recvd when joining/leaving a group, I'm not sure why normal cmirror use would involve joining/leaving a group. process_messages() is a pretty short function, and I don't see any pointers dereferenced by 0x20. I wonder if it's a duff backtrace. probably doesn't need to be set to high since it's only been seen once. cmirror will do a kcl_join_service on the first mirror creation and do a kcl_leave_service on the last mirror remove. So, if you are doing a looping mirror create/remove, it will be joining and leaving all the time. Devel ACK for 4.5, needs to be resolved with respect to cluster mirroring. Is this still happening? Have not seen this in 4 months, will reopen if reproduced. Hit this last night while doing GFS I/O to a cmirror, while also doing looping up and down mirror convertions of that cmirror. This was in a two node cluster. Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: <ffffffffa02251d3>{:cman:process_messages+849} PML4 72474067 PGD 72c7e067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd ehci_hcd k8_edac edac_mc tg3 qla2400 qla2xxx scsi_transport_fc dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata sd_mod scsi_mod Pid: 4015, comm: cman_serviced Not tainted 2.6.9-43.ELsmp RIP: 0010:[<ffffffffa02251d3>] <ffffffffa02251d3>{:cman:process_messages+849} RSP: 0018:000001007aaf5e68 EFLAGS: 00010202 RAX: 0000000000000002 RBX: 0000000000000001 RCX: 000001007d1a1200 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa0239020 RBP: 000001007aaf5eb8 R08: 000001007aaf4000 R09: ffffffff804192c0 R10: ffffffff804192c0 R11: 0000000000000000 R12: 0000010078bed7c8 R13: 0000000000000002 R14: 0000010078bed7b8 R15: ffffffff8014be28 FS: 0000002a95562b00(0000) GS:ffffffff804ed280(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000020 CR3: 0000000000101000 CR4: 00000000000006e0 Process cman_serviced (pid: 4015, threadinfo 000001007aaf4000, task 000001007ab367f0) Stack: 0000010001021a80 000001006e8bfe80 0000166c7ab30107 00000000000001c9 00001acd6cf8f797 000001007b90a7f0 000001007ab36a88 0000010001020760 0000010001020760 0000000000000000 Call Trace:<ffffffffa0222a31>{:cman:serviced+0} <ffffffff8014be28>{keventd_create_kthread+0} <ffffffffa0222a5f>{:cman:serviced+46} <ffffffff8014bdff>{kthread+200} <ffffffff80110f47>{child_rip+8} <ffffffff8014be28>{keventd_create_kthread+0} <ffffffff8014bd37>{kthread+0} <ffffffff80110f3f>{child_rip+0} Code: 8b 42 20 f6 c4 02 8b 42 28 74 17 83 f8 0b 77 17 44 39 ee 72 RIP <ffffffffa02251d3>{:cman:process_messages+849} RSP <000001007aaf5e68> CR2: 0000000000000020 <0>Kernel panic - not syncing: Oops bumping priority now that this is reproducable. Created attachment 147109 [details]
debugging patch
This patch adds some printks that will hopefully
give us some clues about what's wrong if we can
reproduce with it.
Created attachment 147110 [details]
debugging patch
this one compiles
We need to get a rhel4.5 cmirror cluster set up and running with my patch and corey's test. I have a cluster we could potentially use for testing/reproducing this, but it has one (big) LUN on the SAN -- is that enough to set up and test cmirror? we could dice up that big LUN and then use it for cmirror testing, yes. I'm going to move this to NEEDINFO as I can't seem to reproduce this on Dave's cluster with the debugging patches. Reproduced this today while running the clvm test activator. Unfortunately, I don't have anymore info then the same panic messages/stack trace as in the original bz report, but this bug is still out there. Kernel 2.6.9-48.ELsmp on an x86_64 RHEL4-U5-re20070216.2 The lvm test activator seems to be a good way to reproduce this bug as I just hit it again. I'll see if I can grab that and get it going on my cluster wait a minute.... has dct's patch been tried on the cluster that we know produces the problem? I think I'll wait until that's been done... Moving to NEEDINFO waiting on recreating the defect with the instrumentation. Just an update that the test used to reproduce this bug ran 10000 iterations over the entire weekend without any problems on the 2-node bull cluster, running Dave's patch. Could the extra output that this patch produces some how affect the ability for the panic to occur? I'll attempt to reproduce with out the patch again. moving back to NEEDINFO. Nice! I finally hit this damn bug again, on the bull nodes no less: bull-01: Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: <ffffffffa025d22b>{:cman:process_messages+849} PML4 125376067 PGD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd ehci_hcd k8_edac edac_mc tg3 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata qla2400 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod Pid: 4892, comm: cman_serviced Not tainted 2.6.9-50.ELsmp RIP: 0010:[<ffffffffa025d22b>] <ffffffffa025d22b>{:cman:process_messages+849} RSP: 0018:00000101395ffe68 EFLAGS: 00010297 RAX: 0000000000000002 RBX: 0000000000000001 RCX: 000001007e2e6400 RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffffa02710a0 RBP: 00000101395ffeb8 R08: 00000101395fe000 R09: ffffffff80419a80 R10: ffffffff80419a80 R11: 0000000000000000 R12: 0000010138c037c8 R13: 0000000000000001 R14: 0000010138c037b8 R15: ffffffff8014be24 FS: 0000002a95562b00(0000) GS:ffffffff804ee200(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000020 CR3: 0000000000101000 CR4: 00000000000006e0 Process cman_serviced (pid: 4892, threadinfo 00000101395fe000, task 0000010139f4f030) Stack: 0000010008002a80 0000010123037280 00000efa39f40107 000000000000035a 0000167ec2177e92 000001013a7cf030 0000010139f4f2c8 0000010008001760 0000010008001760 0000000000000000 Call Trace:<ffffffffa025aa89>{:cman:serviced+0} <ffffffff8014be24>{keventd_create_kthread+0} <ffffffffa025aab7>{:cman:serviced+46} <ffffffff8014bdfb>{kthread+200} <ffffffff80110f47>{child_rip+8} <ffffffff8014be24>{keventd_create_kthread+0} <ffffffff8014bd33>{kthread+0} <ffffffff80110f3f>{child_rip+0} Code: 8b 42 20 f6 c4 02 8b 42 28 74 17 83 f8 0b 77 17 44 39 ee 72 RIP <ffffffffa025d22b>{:cman:process_messages+849} RSP <00000101395ffe68> CR2: 0000000000000020 <0>Kernel panic - not syncing: Oops I logged on to bull-02 and /var/log/messages appears to show that the last cman module loaded there was from the rpm, not my debug version: Apr 9 16:11:02 bull-02 kernel: CMAN 2.6.9-49.2 (built Mar 13 2007 17:14:14) installed Once we reboot bull-01 I suspect we'll see the same thing. The debug cman module should be loaded from here: /root/cluster/cman-kernel/src/cman.ko and should show this when loaded: Apr 3 13:01:51 bull-02 kernel: CMAN <CVS> (built Mar 29 2007 14:10:27) installed The debug cman module checks for this null pointer before dereferencing it and if it's null prints out some stuff and then panics. I've verified that the cman modules that were loaded when we last hit the panic were from the rpm, not the debug versions I built. Here's the debug patch, it adds two checks for a NULL msg and if it finds one, prints out a little more info about it before panicking. Since the modules I build with this patch seem to always work fine, perhaps I should just check this into 4.5 and have you test with the next build. diff -u -r1.4.2.3 sm_message.c --- sm_message.c 1 Dec 2006 20:40:00 -0000 1.4.2.3 +++ sm_message.c 17 Apr 2007 14:25:35 -0000 @@ -717,6 +717,18 @@ while (1) { re = NULL; + if (!re->msg) { + log_print("process_messages: re %p len %d nodeid %d", + re, re->len, re->nodeid); + SM_ASSERT(0, printk("NULL MSG\n");); + /* FIXME: we might just kfree(re); continue; here */ + } + /* + m = (sm_msg_t *) re->msg; + printk("take re %p len %d nodeid %d msg %p type %d\n", + re, re->len, re->nodeid, re->msg, m->ms_type); + */ + spin_lock(&message_lock); if (!list_empty(&messages)) { re = list_entry(messages.next, rq_entry_t, list); @@ -749,6 +761,13 @@ re->len = len; re->nodeid = nodeid; + SM_ASSERT(re->msg, printk("re %p len %d nodeid %d\n", re,len,nodeid);); + /* + m = (sm_msg_t *) re->msg; + printk("save re %p len %d nodeid %d msg %p type %d\n", + re, re->len, re->nodeid, re->msg, m->ms_type); + */ + spin_lock(&message_lock); list_add_tail(&re->list, &messages); spin_unlock(&message_lock); I filed bz 236844 for the clvmd memory leak issue which may be related or what is helping cause this issue. more likely is the cmirror leak which has been checked-in as part of 237028 and a partial fix for this bug. While cman's response to the low memory condition may need to be tightened, I'm hoping the memory leak fix to cmirror will cause this bug not to get hit again. It might be reasonable to mark this as modified... although, cman changes may still be wanted/required. After weeks of running tests specifically to trip this bug and unsuccessfully doing so, I was able to reproduce this last night by doing looping up and down mirror converts as well as creation/deletions on all four nodes in the cluster. So, this bug still exists, but seems unlikely that a customer will hit it. Should we take this off the blocker list? Moving this bug to the 4.6 release since 4.5 has already been released. If this happens again, I expect it should be a hang accompanied by the new error message instead of the oops. Branch: RHEL4 Changes by: teigland 2007-08-14 17:05:10 Modified files: cman-kernel/src: sm_message.c Log message: bz 199433: NULL pointer dereference in cman:process_messages for cmirror Adds a check for a null message, and if it finds one, prints an error and ignores it instead of oopsing. This may help us get further in discovering the real problem. Ignoring the null message will probably lead to a hang of some kind, which is better and easier to debug than an oopsed machine. Corey hit this again and didn't see the new message that was added. I disassembled the cman module and discovered that the compiler has inlined a bunch of functions into process_messages(). So, the problem isn't actually in the function identified by the oops message. After studying the assembly, I think the oops is likely to be from dereferencing an "sev" structure that's not checked for NULL before being used in process_leave_request(). process_join_request() does the same thing. I've checked in code that will verify these pointers before using them. On 64 bit machines: offending instruction offset 892 = 0xE25E e255: 74 60 je e2b7 <process_messages+0x3d5> e257: 48 8b 91 88 00 00 00 mov 0x88(%rcx),%rdx e25e: 8b 42 20 mov 0x20(%rdx),%eax e261: f6 c4 02 test $0x2,%ah e264: 8b 42 28 mov 0x28(%rdx),%eax I think that 0x88 is probably the sg->sevent pointer being loaded, and 0x20 is sev->se_flags which oopses given the null sev. Checking in sm_message.c; /cvs/cluster/cluster/cman-kernel/src/Attic/sm_message.c,v <-- sm_message.c new revision: 1.4.2.5; previous revision: 1.4.2.4 Marking this bug verified. It hasn't been seen in the past month of heavy cmirror testing with a fixed check in. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0990.html |