193275 – "Kernel BUG at prio_tree:<n>" on RHEL4

Summary: "Kernel BUG at prio_tree:<n>" on RHEL4

Keywords:
Status:	CLOSED DUPLICATE of bug 171778
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-05-26 18:46 UTC by John Caruso
Modified:	2007-11-30 22:07 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-05-26 19:01:23 UTC
Target Upstream Version:
Embargoed:

Attachments
prio tree BUG fix (2.14 KB, patch) 2006-05-26 19:00 UTC, Jason Baron	no flags

Description John Caruso 2006-05-26 18:46:18 UTC

Description of problem:
An Oracle database server running RHEL4 x86_64 generated the following output on
its netdump server on 5/21 at 04:02:

===========================================================================
audit(1148209321.828:4): avc:  denied  { append } for  pid=1851 comm="syslogd"
name="messages" dev=sda2 ino=65292 scontext=user_u:system_r:syslogd_t
tcontext=root:object_r:tmp_t tclass=file
audit(1148209322.046:5): avc:  denied  { ioctl } for  pid=1851 comm="syslogd"
name="messages" dev=sda2 ino=65292 scontext=user_u:system_r:syslogd_t
tcontext=root:object_r:tmp_t tclass=file
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at prio_tree:528
invalid operand: 0000 [1] SMP
CPU 5
Modules linked in: mptctl netconsole netdump nfs lockd md5 ipv6 autofs4 i2c_dev
i2c_core sunrpc dm_mirror dm_mod button battery ac ohci_hcd hw_random tg3 e1000
floppy sg ext3 jbd mptscsih mptbase sd_mod scsi_mod
Pid: 23963, comm: oracle Not tainted 2.6.9-22.0.1.ELsmp
RIP: 0010:[<ffffffff8015e780>] <ffffffff8015e780>{vma_prio_tree_add+70}
RSP: 0018:000001037c1e5e80  EFLAGS: 00010202
RAX: 000000000000002f RBX: 00000104de824268 RCX: 0000000000000000
RDX: 00000000000001ff RSI: 000001029aac25d8 RDI: 00000104de824268
RBP: 0000010098a8bb40 R08: 000000000000002f R09: 0000000000000000
R10: ffffffff803ef900 R11: 0000010001003950 R12: 0000010073c90178
R13: 00000107fedbc240 R14: 0000010073c90188 R15: 0000010073c90148
FS:  0000000004ec6ae0(005b) GS:ffffffff804d3300(0000) knlGS:00000000f7ea06c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002a98aca018 CR3: 00000001fff9a000 CR4: 00000000000006e0
Process oracle (pid: 23963, threadinfo 000001037c1e4000, task 00000105317377f0)
Stack: ffffffff80169022 0000000000000000 00000104de824268 0000000000000000
       0000000000100073 00000103e256aac0 0000002a9903b000 0000000000200000
       ffffffff8016a9a7 0000000000000200
Call Trace:<ffffffff80169022>{vma_link+204} <ffffffff8016a9a7>{do_mmap_pgoff+1444}
       <ffffffff80116b0d>{sys_mmap+159} <ffffffff80110052>{system_call+126}

Code: 0f 0b 4e eb 31 80 ff ff ff ff 10 02 48 c7 47 60 00 00 00 00
RIP <ffffffff8015e780>{vma_prio_tree_add+70} RSP <000001037c1e5e80>
===========================================================================

[ NOTE: the two "audit" lines and the separator were generated in the netdump
output file itself; I'm guessing they're just cruft from the netdump buffers and
that the "kernel BUG" stanza is the only interesting part, but I'm including
them for completeness. ]
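
For anyone reading the trace: the bytes on the "Code:" line appear to carry the BUG() location themselves. The sketch below is only an illustration and rests on an assumption not stated in this report, namely that this kernel's x86_64 BUG() emits a packed frame made of a 2-byte ud2 opcode, an 8-byte little-endian pointer to the file name string, and a 2-byte line number. The helper name decode_bug_frame is hypothetical.

```python
import struct

# Bytes copied verbatim from the "Code:" line of the oops above.
code = bytes.fromhex("0f 0b 4e eb 31 80 ff ff ff ff 10 02 48 c7 47 60 00 00 00 00")


def decode_bug_frame(buf):
    """Decode an assumed packed bug frame: ud2 (2 bytes), pointer (8), line (2)."""
    # "<" selects little-endian with no padding, so the frame is 12 bytes.
    opcode, filename_ptr, line = struct.unpack_from("<2sQH", buf, 0)
    return opcode, filename_ptr, line


opcode, filename_ptr, line = decode_bug_frame(code)
print(opcode.hex(), hex(filename_ptr), line)
# prints: 0f0b 0xffffffff8031eb4e 528
```

Under that assumption the decoded line number (528) agrees with the "Kernel BUG at prio_tree:528" message, and the pointer falls in the kernel text/data address range seen elsewhere in the trace.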

The server in question is a Sun v40z with four dual-core AMD Opteron CPUs and
32GB of RAM.  This server hung hard today (5/26) at 09:17--no output in any of
the logs, nothing on the netdump server, no response either via the network or
on the serial console.  A powercycle was required to bring it back to life. 
It's not clear if the hang was related in any way to the "kernel BUG" above, but
it's certainly suggestive that there had been a kernel bug reported during the
server's current uptime interval.

Also, this same server crashed last week (on 5/17) with no indication of why,
since it didn't have a netdump server at the time.  Again, we're not sure
whether that crash is related to the kernel bug above or to today's hang, but
it seems likely: we've seen many different kinds of instability with RHEL (both
3 and 4) on this hardware platform.  It took two months and multiple rounds of
system rearchitecture to arrive at a platform that seemed stable, and then once
it went into production it crashed on 5/17, produced the kernel bug above on
5/21, and hung today.

Regarding the timing of the kernel bug report (04:02): that's precisely when the
cron.daily job is launched, so it corresponds to a burst of activity on the
system.  The Oracle database in this case is hosted on NFS, so the system
doesn't generally touch the local disk much at all.
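
As a cross-check on the timing, the epoch value in the first "audit(...)" line of the netdump output can be converted to a wall-clock time. This is a minimal sketch: the UTC-7 offset is purely an assumption (the report does not say which time zone the server uses), and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Whole-second part of "audit(1148209321.828:4)" from the netdump output.
audit_epoch = 1148209321

# Interpret the epoch value as UTC.
utc_time = datetime.fromtimestamp(audit_epoch, tz=timezone.utc)

# ASSUMPTION: a fixed UTC-7 offset, chosen only for illustration.
assumed_offset = timezone(timedelta(hours=-7))
local_guess = utc_time.astimezone(assumed_offset)

print(utc_time.isoformat())                     # 2006-05-21T11:02:01+00:00
print(local_guess.strftime("%Y-%m-%d %H:%M"))   # 2006-05-21 04:02
```

If the server's clock was in fact seven hours behind UTC, the audit lines fall at 04:02 on 5/21, which is consistent with the cron.daily launch time noted above.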

I see one report of a similar issue on the 2.6 kernel, but on SuSE:
http://lists.suse.com/archive/suse-oracle/2005-Jul/0220.html.  Interestingly,
that server was running Oracle as well (9.2.0.6 vs 10.1.0.4 on this server).

Version-Release number of selected component (if applicable):
kernel 2.6.9-22.0.1.ELsmp (x86_64)

How reproducible:
Good question.  We're trying to reproduce it on a similarly-configured (but
non-production) server by running the busiest cron.daily jobs in a tight loop.

Additional info:
Note that we're unable to use the 2.6.9-34ELsmp kernel (or its descendants) on
this server because of bug 193077.  I know that might make y'all twitchy (and
reasonably so), but the bug report in this situation is so specific that I
thought it would be useful no matter what.

I'm marking this as priority "high", but in fact getting RHEL4 to be stable on
this server is "urgent" for us.  The only reason I'm not marking it as such is
that we don't know for sure that this particular bug is actually the root
cause of the other instability we've seen on this server with RHEL4.  We're
getting to the point where we may have no choice but to abandon RHEL entirely
and try moving to Solaris 10, since RHEL has exhibited lots of problems running
Oracle in our environment (see bug 117902, bug 139113, and particularly bug
141394 for examples).

Comment 1 Jason Baron 2006-05-26 18:58:30 UTC

hi John,

thanks for this bug report...this is a duplicate of bug 171778, which is fixed
in U3. i'm posting the patch for that here, until we can get to the bottom of
bug 193077...thanks.

Comment 2 Jason Baron 2006-05-26 19:00:32 UTC

Created attachment 130043
prio tree BUG fix

Comment 3 Jason Baron 2006-05-26 19:01:23 UTC


*** This bug has been marked as a duplicate of 171778 ***
