Description of problem:

An Oracle database server running RHEL4 x86_64 generated the following output on its netdump server on 5/21 at 04:02:

===========================================================================
audit(1148209321.828:4): avc: denied { append } for pid=1851 comm="syslogd" name="messages" dev=sda2 ino=65292 scontext=user_u:system_r:syslogd_t tcontext=root:object_r:tmp_t tclass=file
audit(1148209322.046:5): avc: denied { ioctl } for pid=1851 comm="syslogd" name="messages" dev=sda2 ino=65292 scontext=user_u:system_r:syslogd_t tcontext=root:object_r:tmp_t tclass=file
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at prio_tree:528
invalid operand: 0000 [1] SMP
CPU 5
Modules linked in: mptctl netconsole netdump nfs lockd md5 ipv6 autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button battery ac ohci_hcd hw_random tg3 e1000 floppy sg ext3 jbd mptscsih mptbase sd_mod scsi_mod
Pid: 23963, comm: oracle Not tainted 2.6.9-22.0.1.ELsmp
RIP: 0010:[<ffffffff8015e780>] <ffffffff8015e780>{vma_prio_tree_add+70}
RSP: 0018:000001037c1e5e80 EFLAGS: 00010202
RAX: 000000000000002f RBX: 00000104de824268 RCX: 0000000000000000
RDX: 00000000000001ff RSI: 000001029aac25d8 RDI: 00000104de824268
RBP: 0000010098a8bb40 R08: 000000000000002f R09: 0000000000000000
R10: ffffffff803ef900 R11: 0000010001003950 R12: 0000010073c90178
R13: 00000107fedbc240 R14: 0000010073c90188 R15: 0000010073c90148
FS: 0000000004ec6ae0(005b) GS:ffffffff804d3300(0000) knlGS:00000000f7ea06c0
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002a98aca018 CR3: 00000001fff9a000 CR4: 00000000000006e0
Process oracle (pid: 23963, threadinfo 000001037c1e4000, task 00000105317377f0)
Stack: ffffffff80169022 0000000000000000 00000104de824268 0000000000000000
       0000000000100073 00000103e256aac0 0000002a9903b000 0000000000200000
       ffffffff8016a9a7 0000000000000200
Call Trace:<ffffffff80169022>{vma_link+204} <ffffffff8016a9a7>{do_mmap_pgoff+1444}
       <ffffffff80116b0d>{sys_mmap+159} <ffffffff80110052>{system_call+126}

Code: 0f 0b 4e eb 31 80 ff ff ff ff 10 02 48 c7 47 60 00 00 00 00
RIP <ffffffff8015e780>{vma_prio_tree_add+70} RSP <000001037c1e5e80>
===========================================================================

[ NOTE: the two "audit" lines and the separator were generated in the netdump output file itself; I'm guessing they're just cruft from the netdump buffers and that the "kernel BUG" stanza is the only interesting part, but I'm including them for completeness. ]

The server in question is a Sun v40z with four dual-core AMD Opteron CPUs and 32GB of RAM.

This server hung hard today (5/26) at 09:17--no output in any of the logs, nothing on the netdump server, no response either via the network or on the serial console. A powercycle was required to bring it back to life. It's not clear if the hang was related in any way to the "kernel BUG" above, but it's certainly suggestive that there had been a kernel bug reported during the server's current uptime interval.

Also, this same server crashed last week (on 5/17) with no indication of why, since it didn't have a netdump server at the time. Again, we're not sure if that crash is related in any way to the kernel bug above or to the hang today, but it does seem likely, since we've experienced a large measure of various different kinds of instability with RHEL (both 3 and 4) on this hardware platform (it took two months and multiple rounds of system rearchitecture to arrive at a platform that seemed stable--and then when it went into production it crashed on 5/17, produced the kernel bug above on 5/21, and then hung today).

Regarding the timing of the kernel bug report (04:02): that's precisely when the cron.daily job is launched, so it corresponds to a burst of activity on the system. The Oracle database in this case is hosted on NFS, so the system doesn't generally touch the local disk much at all.

I see one report of a similar issue on the 2.6 kernel, but on SuSE: http://lists.suse.com/archive/suse-oracle/2005-Jul/0220.html. Interestingly, that server was running Oracle as well (9.2.0.6 vs 10.1.0.4 on this server).

Version-Release number of selected component (if applicable):

kernel 2.6.9-22.0.1.ELsmp (x86_64)

How reproducible:

Good question. We're trying to reproduce it on a similarly-configured (but non-production) server by running the busiest cron.daily jobs in a tight loop.

Additional info:

Note that we're unable to use the 2.6.9-34ELsmp kernel (or its descendants) on this server because of bug 193077. I know that might make y'all twitchy (and reasonably so), but the bug report in this situation is so specific that I thought it would be useful no matter what.

I'm marking this as priority "high", but in fact getting RHEL4 to be stable on this server is "urgent" for us. The only reason I'm not marking it as such is because we don't know for sure that this particular bug is actually the root cause of the other instability we've seen on this server with RHEL4. We're getting to the point where we may have no choice but to abandon RHEL entirely and try moving to Solaris 10, since RHEL has exhibited lots of problems running Oracle in our environment (see bug 117902, bug 139113, and particularly bug 141394 for examples).
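
For reference, the reproduction attempt mentioned under "How reproducible" could be sketched roughly as below. This is only an illustration: it assumes the jobs live in /etc/cron.daily and that running every executable there back to back approximates the 04:02 burst. The function name run_jobs_in_loop and the iteration count are hypothetical, and nothing here is known to trigger the BUG.

```shell
#!/bin/sh
# Hypothetical reproduction sketch: run every executable job in a
# directory repeatedly. Directory and pass count are assumptions.

run_jobs_in_loop() {
    job_dir=$1        # directory holding the jobs, e.g. /etc/cron.daily
    iterations=$2     # number of passes over the directory
    i=0
    while [ "$i" -lt "$iterations" ]; do
        for job in "$job_dir"/*; do
            # Skip anything that is not a regular executable file.
            [ -f "$job" ] && [ -x "$job" ] || continue
            # Discard job output; a failing job must not stop the loop.
            "$job" >/dev/null 2>&1 || true
        done
        i=$((i + 1))
    done
}

# Example (as root, on the non-production test server only):
# run_jobs_in_loop /etc/cron.daily 1000
```

Restricting job_dir to a directory containing only the busiest jobs would match the description more closely than looping over all of cron.daily.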
hi John, thanks for this bug report...this is a duplicate of bug 171778, which is fixed in U3. i'm posting the patch for that here, until we can get to the bottom of bug 193077...thanks.
Created attachment 130043
prio tree BUG fix
*** This bug has been marked as a duplicate of 171778 ***