Bug 149087 - Kernel panic regression in 2.4.21-27.0.2.ELsmp
Kernel panic regression in 2.4.21-27.0.2.ELsmp
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel (Show other bugs)
3.0
i686 Linux
medium Severity medium
: ---
: ---
Assigned To: Dave Anderson
Brian Brock
: Regression
Depends On:
Blocks: 156321
  Show dependency treegraph
 
Reported: 2005-02-18 11:25 EST by Phil Mayers
Modified: 2007-11-30 17:07 EST (History)
12 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-05-25 12:42:37 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Screenshot of (remote) console showing kernel panic (46.00 KB, image/png)
2005-02-18 11:25 EST, Phil Mayers
no flags Details
Syslog output for a bootup with the crashing kernel (26.67 KB, text/plain)
2005-02-18 11:28 EST, Phil Mayers
no flags Details
Screenshot of console during panic (53.05 KB, image/jpeg)
2005-03-04 14:52 EST, Charles C. Figueiredo
no flags Details
netdump log of "qt8" crashing (13.75 KB, text/plain)
2005-03-04 16:12 EST, Charles C. Figueiredo
no flags Details
netdump log of "qt9" panicing (14.04 KB, text/plain)
2005-03-04 16:13 EST, Charles C. Figueiredo
no flags Details
screenshot of "qt10" crash (56.32 KB, image/jpeg)
2005-03-06 00:04 EST, Charles C. Figueiredo
no flags Details
netdump log of "qt10" shortly preceding crash (50.95 KB, text/plain)
2005-03-06 00:16 EST, Charles C. Figueiredo
no flags Details
netdump log of "qt10" shortly preceding crash (50.95 KB, text/plain)
2005-03-06 00:16 EST, Charles C. Figueiredo
no flags Details
messages file from "qt10" (3.63 MB, application/x-gzip)
2005-03-06 00:39 EST, Charles C. Figueiredo
no flags Details

  None (edit)
Description Phil Mayers 2005-02-18 11:25:14 EST
Description of problem:

"Kernel panic: Fatal exception" in "handle_mm_fault"

Version-Release number of selected component (if applicable):

2.4.21-27.0.2.ELsmp

How reproducible:

We have been running Exim 4.44 with Sophie antivirus and SpamAssasin, variously
communicating over Unix sockets. This setup has been rock solid. We applied the
kernel errata listed above yesterday (1750 17 Feb) and today at 1445 the server
halted with a kernel panic at the console.

Unfortunately the only access we have to the console is via graphical ILO
management, and there is no scrollback on the remote console, so we only have
part of the traceback; however it seems to indicate a double fault?

Screenshot of the ILO console attached.

Steps to Reproduce:
1. Boot 2.4.21-27.0.2.ELsmp
2. Allow service to run
3. Crash
  
Actual results:

Kernel panic

Expected results:

Normal function

Additional info:

We've rolled back to 2.4.21-20.ELsmp on this box for the moment. This is our
production mail service, one of 4 near-identical machines, all of which have
been upgraded.

The load on the machine was rising at the time of the crash, so possibly it's a
load-induced crash?

Hardware is a dual-processor HP DL380 G3 with fairly standard server-level
components; kernel bootup output attached for the *crashing* kernel
Comment 1 Phil Mayers 2005-02-18 11:25:14 EST
Created attachment 111210 [details]
Screenshot of (remote) console showing kernel panic
Comment 2 Phil Mayers 2005-02-18 11:28:05 EST
Created attachment 111211 [details]
Syslog output for a bootup with the crashing kernel
Comment 3 Dave Anderson 2005-03-01 12:12:11 EST
In handling a page fault for a user address in the "sophie" process,
handle_mm_fault() call pte_alloc_map() to finf, or potentially create,
the page table containing the pte_t for the faulting address, kmap()
it, and add the offset to the pte_t entry within the kmap'd page table
page, returning the temporary mapping in "pte":

handle_mm_fault()
...
        if (pmd) {
                pte_t * pte = pte_alloc_map(mm, pmd, address);
                if (pte)              <<line 1935 in memory.c>>
                        return handle_pte_fault(mm, vma, address,
write_access, pte, pmd);
        }
...

"pte" was passed as an argument to handle_pte_fault(), which is
is inlined into handle_mm_fault(), and the instruction causing the 
fault is the attempted assignment of what pte points to into
the "entry" variable:

static inline int handle_pte_fault(struct mm_struct *mm,
        struct vm_area_struct * vma, unsigned long address,
        int write_access, pte_t *pte, pmd_t *pmd)
{
        pte_t entry;

        entry = *pte;                      <<line 1876 in memory.c>>
        if (!pte_present(entry)) {
...

Line 1935 is the "if (pte)" in handle_mm_fault(), the oops occurred
in the mov intruction below at line 1876 in handle_pte_fault():

/usr/src/debug/kernel-2.4.21/linux-2.4.21/mm/memory.c: 1935
0xc0143bea <handle_mm_fault+0x7a>:      je     0xc0143dd0
<handle_mm_fault+0x260>
/usr/src/debug/kernel-2.4.21/linux-2.4.21/mm/memory.c: 1876
0xc0143bf0 <handle_mm_fault+0x80>:      mov    (%eax),%ecx

Although the oops message is clipped, I'm presuming that it
indicated a faulting address of fffb8d20, since that's what's
in EAX.  That would be impossible as a kmap address, because
the range of available kmap addresses are fixed:

#define PKMAP_BASE (0xff000000UL)
#define NR_SHARED_PMDS ((0xffffffff-PKMAP_BASE+1)/PMD_SIZE)
#ifdef CONFIG_X86_PAE
#define LAST_PKMAP 512
#else
#define LAST_PKMAP 1024
#endif
#define LAST_PKMAP_MASK (LAST_PKMAP-1)
#define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))

With PAE turned on, as is the case in an smp kernel, then, there
could be a maximum of 512 pages available to kmap(), starting at 
the PKMAP_BASE address of 0xff000000, and ranging to a maximum of 
PKMAP_BASE plus PKMAP_ADDR(511) or ff1ff000.  A kmap address of 
fffb8d20 is way out of range.

That all being said, it's impossible to determine how that happened
with a simple oops display.  

We would either need to be given a reproducer, or have you reproduce
it with either netdump or diskdump enabled, and hope that a dumpfile
would afford us additional clues, in order to have a prayer of
debugging this any further.

Would either of those two options be amenable to you?



Comment 4 Charles C. Figueiredo 2005-03-04 14:52:13 EST
Created attachment 111676 [details]
Screenshot of console during panic
Comment 5 Charles C. Figueiredo 2005-03-04 14:53:33 EST
I believe I've seen this as well -- I'm attaching a screenshot of the console 
with panic. I've also configured a crontab with a battery of echo "foo" 
> /proc/sysrq-trigger to get more meat on this. Unfortunately, with netdump 
configured, my systems seem to reboot automagically while trying to send a 
vmcore over -- the most I get is some oops info. I'm running the sysrq's every 
2 seconds to try to get the state of the system right before it happens.

If this happens again and I get anything useful out of it, I'll attach it here 
as well. I have a group of systems panicing every couple of days or weeks with 
3 or 4 different sets of call traces, all running U4. The only common factor is 
that these systems are heavy NFS clients and run large java JVMs (i.e. multiple 
2.3GB processes).

-charles.
Comment 6 Dave Anderson 2005-03-04 15:03:08 EST
Take a look in your netdump-server's /var/log/messages for the 
reason why it sent the reboot request to the panicking client.

Once a panicking client starts a netdump operation, it will only
reboot automatically upon receiving a command to do so from the
netdump-server.

Comment 7 Dave Anderson 2005-03-04 15:46:43 EST
Also, even though the vmcore file wasn't created, you mention that
there is the log file on the netdump-server.  In that log file, you
should see the same output as the screenshot of the console.  What
does the line just above the "CPU: 1" indicate?  It should be 
something of the sort:

  kernel BUG in header file at line xxx

Comment 8 Charles C. Figueiredo 2005-03-04 16:01:19 EST

In this particular instance, this host was not configured as a 
netdump client, although now it is. If it panics again, I'll have the 
output. I do have netdump logs of the 2 other systems and will attach.

I find it too coincidental that 3 machines in the same group panic 
with different call traces running the same jobs. This is all heavy 
NFS / large java jvm process / perl working with large datasets.

-charles.
Comment 9 Dave Anderson 2005-03-04 16:06:08 EST
Gotcha -- thanks...

Further note -- the out_of_line() reference made me realize that
my ramblings about the mapped addresses starting at PKMAP_BASE going
up to PKMAP_ADDR(511) are incorrect.  pte_alloc_map() uses
kmap_atomic(), which use a fixed-set of specific-use, per-cpu, pte's
existing at the fixed address range (by my calculations) from ff02a000
up to ff469000.  Actually only NR_CPUS-worth of pages in that
range are used by pte_alloc_map(), but it's the base-to-top values
that still would exclude fffb8d20.
Comment 10 Charles C. Figueiredo 2005-03-04 16:12:53 EST
Created attachment 111679 [details]
netdump log of "qt8" crashing
Comment 11 Charles C. Figueiredo 2005-03-04 16:13:50 EST
Created attachment 111680 [details]
netdump log of "qt9" panicing
Comment 12 Dave Anderson 2005-03-04 17:09:24 EST
Those two look to be completely unrelated to the handle_mm_fault
issue we're trying to track down here.

In fact, they look suspiciously like the issue in bugzilla #132838.
In that bugzilla, comment #61 has pointers to test kernels that
may address the problems seen in qt8 and qt9.  However, if your
system is not accessing /proc/kcore (perhaps when doing a tar of the
whole system from /) then it's not going to be the same issue.

In any case, let's keep this case specific to panics occurring
from a user-space do_page_fault() leading to handle_mm_fault(),
with the panic occuring there.

Comment 13 Charles C. Figueiredo 2005-03-04 17:15:41 EST
Sure thing, I'll append the next time I see the handle_mm_fault() 
panic again. I've got my netdump clients waiting.

I read through the other bugzilla for more details, but I can't think 
of any scenario where tar would be accessing /proc/kcore on these 
systems. I merely included those two netdump logs because I thought 
it was too coincidental that 3 boxes in the same group, running the 
same jobs were crashing within days of each other. Perhaps this is 
just the servers are having more than 1 issue manifest itself here.

-charles.
Comment 14 Charles C. Figueiredo 2005-03-05 07:43:41 EST

Ok, this didn't take long -- I got another panic with handle_mm_fault() in the 
call trace. Unfortunately, no vmcore was transferred, the console shows "tg3: 
transmission timeout" or some such. I'm appending the screenshot, the netdump 
log file, and the /var/adm/messages with a battery of sysrq-triggers every 2 
minutes.

-charles.
Comment 15 Charles C. Figueiredo 2005-03-06 00:04:30 EST
Created attachment 111700 [details]
screenshot of "qt10" crash
Comment 16 Charles C. Figueiredo 2005-03-06 00:16:21 EST
Created attachment 111701 [details]
netdump log of "qt10" shortly preceding crash


The last log sent to netdump doesn't seem to actually contain the panic & call
trace. From the screen shot of the crash, you can see the tg3 error following
the call trace complaining about a timeout. As it stands, I can't get a vmcore
or even much of any oops info during the actual crash.

-charles.
Comment 17 Charles C. Figueiredo 2005-03-06 00:16:54 EST
Created attachment 111702 [details]
netdump log of "qt10" shortly preceding crash


The last log sent to netdump doesn't seem to actually contain the panic & call
trace. From the screen shot of the crash, you can see the tg3 error following
the call trace complaining about a timeout. As it stands, I can't get a vmcore
or even much of any oops info during the actual crash.

-charles.
Comment 18 Charles C. Figueiredo 2005-03-06 00:39:47 EST
Created attachment 111703 [details]
messages file from "qt10"

Here's a messages file corresponding to the following cronjob:

[root@qt9 ccf]# crontab -l
*/2 * * * *	 /bin/echo "m" > /proc/sysrq-trigger
*/2 * * * *	 /bin/echo "t" > /proc/sysrq-trigger
*/2 * * * *	 /bin/echo "w" > /proc/sysrq-trigger
*/2 * * * *	 /bin/echo "p" > /proc/sysrq-trigger

Not sure if it'll help, but I'm trying to gather as much debugging data as
possible.

-charles.
Comment 19 Dave Anderson 2005-03-07 09:16:16 EST
Right -- there's no oops message anywhere here to work with.

The "handle_mm_fault" seen in the "echo" process trace in the
"qt10" attachment appears to be a stale leftover return address
on the stack, and the process 8185 has successfully made it
to schedule(). 

The abundance of sysrq-trigger writes looks to be overwhelming
the tg3, and possibly keeping the subsequent netdump operation
from proceeding.  Just let the system run normally with netdump
enabled, and wait for one of the real handle_mm_fault faults to
occur.
Comment 21 Charles C. Figueiredo 2005-03-07 10:27:14 EST

Ok, I've disabled all the cronjobs and made sure netdump was running 
on the 3 crashing systems.

Is there anything else I can do to ensure that a vmcore gets 
transferred? I haven't been able to get a single vmcore transferred 
yet -- I get the occasional vmcore-incomplete file.

-charles.
Comment 22 Dave Anderson 2005-03-07 10:50:28 EST
If the vmcore-incomplete files are only 4k in length, then there
is a problem with packets being sent back from the netdump-server
to the panicking client being dropped.  If the file has more than
4k, and up to ~1GB in length, then it's very possible that the
vmcore-incomplete may still be able to be analyzed.

The /var/log/messages file on the netdump-server will have error
messages from the netdump-server daemon indicating why it bailed
out.

Comment 23 Charles C. Figueiredo 2005-03-07 10:56:38 EST

Unless I'm mistaken, the files are 0 bytes in size. I also checked 
the messages file and see no netdump log entries other than when 
clients connect:

[root@master 03]# grep netdump /var/log/messages
Mar  7 10:08:03 master sshd[30847]: pam_krb5: authentication fails 
for `netdump'
Mar  7 10:08:03 master sshd[30847]: Accepted password for netdump 
from 65.202.163.175 port 33354
Mar  7 10:08:03 master sshd(pam_unix)[30883]: session opened for user 
netdump by (uid=34)
Mar  7 10:08:03 master sshd(pam_unix)[30883]: session closed for user 
netdump
Mar  7 10:08:38 master sshd[31005]: pam_krb5: authentication fails 
for `netdump'
Mar  7 10:08:38 master sshd[31005]: Accepted password for netdump 
from 65.202.163.174 port 32789
Mar  7 10:08:38 master sshd(pam_unix)[31014]: session opened for user 
netdump by (uid=34)
Mar  7 10:08:38 master sshd(pam_unix)[31014]: session closed for user 
netdump
Mar  7 10:24:05 master sshd[3405]: pam_krb5: authentication fails for 
`netdump'
Mar  7 10:24:05 master sshd[3405]: Accepted password for netdump from 
65.202.163.159 port 32798
Mar  7 10:24:05 master sshd(pam_unix)[3418]: session opened for user 
netdump by (uid=34)
Mar  7 10:24:05 master sshd(pam_unix)[3418]: session closed for user 
netdump

-charles.
Comment 24 Charles C. Figueiredo 2005-03-07 11:01:10 EST

Some more info... here's dmesg on a client:

netlog: using network device <eth0>
netlog: using source IP 65.202.163.159
netlog: using source UDP port: 6666
netlog: using netdump target IP 65.202.162.18
netlog: using netlog target IP 65.202.162.18
netlog: using target UDP port: 6666
netlog: using target UDP port: 6666
netlog: using netdump target ethernet address 00:0e:7f:fe:df:4f.
netlog: using netdump target ethernet address 00:0e:7f:fe:df:4f.
netlog: using broadcast ethernet frames to send syslog packets.
netlog: network logging started up successfully!

but here's what /var/crash/65.202.163.159-2005-03-07-10:50/log says:

[...network console shutdown...]

Is this normal?
Comment 25 Dave Anderson 2005-03-07 11:18:52 EST
I'm not sure of the timeline here.  If you got a "network console
shutdown" message from 65.202.163.159, then the netconsole module
on that client was explicitly shutdown during a module unload: 

static void cleanup_netconsole(void)
{
        printk(KERN_INFO "netlog: network logging shut down.\n");
        unregister_console(&netconsole);

#define SHUTDOWN_MSG "[...network console shutdown...]\n"
        write_netconsole_msg(NULL, SHUTDOWN_MSG, strlen(SHUTDOWN_MSG));
        dev_put(netconsole_dev);
        netconsole_dev = NULL;
        netdump_unregister_hooks();
}

The write_netconsole_msg() would have sent the message that was
put into the remote log file, but there should also have been
the "netlog: network logging shut down." message on the client,
or at least in the client's /var/log/messages.
Comment 26 Charles C. Figueiredo 2005-03-07 11:26:19 EST

This is strange -- [...network console shutdown...] shows up 
immediately after a service netdump start or restart. I can reproduce 
this consistently. dmesg on the client shows a successful startup and 
the server immediately shows a console shutdown message.

i.e.

[root@qt9 root]# date
Mon Mar  7 11:23:58 EST 2005
[root@qt9 root]# service netdump restart
disabling netdump                                          [  OK  ]
netdump@65.202.162.18's password:
initializing netdump                                       [  OK  ]
[root@qt9 root]# date
Mon Mar  7 11:24:11 EST 2005

[root@master crash]# ls -l
total 24
drwx------    2 netdump  netdump      4096 Mar  7 10:50 
65.202.163.159-2005-03-07-10:50
drwx------    2 netdump  netdump      4096 Mar  7 11:19 
65.202.163.159-2005-03-07-11:19
drwx------    2 netdump  netdump      4096 Mar  7 11:20 
65.202.163.174-2005-03-07-11:20
drwx------    2 netdump  netdump      4096 Mar  7 11:24 
65.202.163.174-2005-03-07-11:24
drwx------    2 netdump  netdump      4096 Mar  4 15:52 magic
drwxr-xr-x    2 netdump  netdump      4096 Jun 17  2004 scripts
[root@master crash]# cat 65.202.163.174-2005-03-07-11:24/log
[...network console shutdown...]
[root@master crash]#

[root@qt9 root]# lsmod | grep netconsole
netconsole             16268   0  (unused)

-charles.

Comment 27 Charles C. Figueiredo 2005-03-07 11:29:28 EST
Nevermind, this is just a goof on my part. I restart, when I thought 
I did a start had me thinking that it was shutting down immediately 
after starting when it was just logging the shutdown preceding the 
startup. :-)

-charles.
Comment 28 Dave Anderson 2005-03-07 16:41:31 EST
An update to my previous updates...

My claim that the kmap'd PTE address of fffb8d20 (as seen in
attachment 111210 [details]) being out of range is incorrect.  It absolutely is
correct for a fault occurring on cpu 0, in which case the temporary
kmap'ing of page table pages would be fffb8000.  (My two previous
calculations were done incorrectly.)  That being said, it is not clear
why the reference to it would cause an oops, especially considering
that it was just mapped, and there are no places where it could have
blocked.

The second handle_mm_fault() in attachment 111676 [details] is obviously
related, but again it's not clear how it happened.  In that
case, the out_of_line() call was made after the kmap'd vaddr was
calculated, where it verifies that the PTE that it is about to set
up for that vaddr is not empty, i.e., !pte_none():

static inline void *__kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#else
...

The PTE *should* have been cleared by the prior user.  Again, there's
no obvious explanation.
Comment 30 Dave Anderson 2005-05-10 08:50:59 EDT
Does the hardware in this case have greater than 4GB of memory?
Comment 31 Charles C. Figueiredo 2005-05-10 11:13:23 EDT
All of my effected systems have >4GB -- most are 8GB, some are 6GB.

Thanks,

-charles.
Comment 32 Dave Anderson 2005-05-10 11:44:31 EDT
There's a very strong possibility that the corruption issue
that you are seeing will be resolved in the upcoming RHEL3-U5
kernel.  The fix is being tested at several customer sites
at this time; an official Red Hat kernel will be made available
when the fix is confirmed.

The problem has only been reproduced on systems:

1. having > 4GB
2. running RHEL3-U4 (2.4.21-27) or later.
3. typically running large numbers of java and/or oracle processes with
   large virtual address spaces.

In those situations, user data from a java process gets written
to essentially random locations in the kernel data section,
such as in the mem_map[] array, the dentry_hashtable[] array, the
page_hash_table[] array or in any number of different memory slab
pages.  The eventual crash/oops has therefore resulted in unrelated 
back traces because a random, innocent, kernel data structure has been
corrupted.
  
While it cannot be absolutely pinned on that particular bug
without a netdump/diskdump to view the corruption around the
kernel data which was innocently corrupted, if the requirements
above match the systems exhibiting the problem, there's a better
than good chance that this is the problem.

  

Comment 33 Charles C. Figueiredo 2005-05-10 11:46:59 EDT

I have been running the -31 beta kernel on some of these hosts for a few weeks 
now without a single crash. I'm remaining hopeful so far. :-)

-charles.
Comment 34 Dave Anderson 2005-05-10 13:33:07 EDT
Well, actually the -31 beta kernel doesn't have the fix
I'm referring to, but there are several other potential
corruptor fixes in there as well.

Comment 35 Charles C. Figueiredo 2005-05-10 14:17:52 EDT

Oh, I see. Is there any rough estimate as to when the U5 kernel will be 
available? I'd like to provide some visibility to my user-base as to when we 
can potentially have this resolved on all systems.

Thanks,

-charles.
Comment 36 Dave Anderson 2005-05-10 15:18:10 EDT
It's soon, but I'll have to defer that question to the RHEL3 maintainer
(Ernie Petrides) who is on the cc list...


Comment 37 Ernie Petrides 2005-05-10 16:46:22 EDT
The current U5 beta kernel is version 2.4.21-32.EL, and it is currently
scheduled for official release on 18-May-2005 (next Wednesday).
Comment 40 Dave Anderson 2005-05-10 17:07:24 EDT
My mistake -- the current RHEL3-U5 candidate (2.4.21-32.EL) does not
have the fix I referred to in comment #32.
Comment 43 Ernie Petrides 2005-05-14 02:32:26 EDT
Hello, Phil and Charles.  We have finally resolved the problem of a very
obscure PTE race condition that could cause arbitrary memory corruption
(in either user or kernel data) on SMP x86 systems with greater than 4G of
memory (with either the smp or hugemem configs).  We have already verified
definitively that our fix corrects two open bugzilla reports (151865
and 156023), and we are very interested if this fix resolves the data
corruption problems you have encountered.

The fix has just been committed to the RHEL3 U6 patch pool this evening
(in kernel version 2.4.21-32.4.EL).  Although this is just an interim
Engineering build for U6 development, I have made the following kernel
RPMs available on my "people" page for you to test:

http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-source-2.4.21-32.4.EL.i386.rpm

We intend to incorporate this bug fix in our next post-U5 security errata
release (which has not yet been built, but which is likely to enter formal
Q/A as soon as U5 is shipped on Wednesday, May 18th).

In the meantime, please test this interim U6 build as soon as is practical
so that we can determine if the fix addresses this bugzilla.

Thanks in advance.  -ernie
Comment 44 Ernie Petrides 2005-05-17 18:31:40 EDT
The fix (referred to above) for a data corruption problem has also
just been committed to the RHEL3 E6 patch pool (in kernel version
2.4.21-32.0.1.EL).

I'm tentatively moving this BZ to MODIFIED state and will list it
in the associated security advisory that we intend to release post-U5.

Please still follow up with test results as requested in comment #43.

Thanks.  -ernie
Comment 45 Charles C. Figueiredo 2005-05-18 17:15:15 EDT

Thanks, Ernie -- we'll load this up and test. It sounds like this issue may 
also be responsible for the random JDK core dumps we experience on our x86 
systems with >4GB RAM. I've had a tough time keeping our developers from 
crawling back to their Solaris boxes, between the panics and JDK core dumps. We 
don't really see the issue on x86_64 systems, which makes sense, apparently.

Thanks,

-charles.
Comment 46 Ernie Petrides 2005-05-19 17:43:50 EDT
Charles, I've added a detailed explanation of the problem in bug 141394
comments 129 and 130.  The info you just provided correlates perfectly
with our understanding of the bug, so I'm very hopeful that our fix will
resolve the problems you encountered.

The release of the 2.4.21-32.0.1.EL kernel is tentatively targeted for
the end of next week.  (The -32.4.EL pre-U6 kernel I've made available
to you is a strict superset of the -32.0.1.EL security errata kernel.)
Comment 47 Josh Bressers 2005-05-25 12:42:37 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-472.html
Comment 49 Charles C. Figueiredo 2005-06-16 09:49:27 EDT
Does this issue also apply to x86_64 systems?

Thanks,

-charles.
Comment 50 Dave Anderson 2005-06-16 11:10:51 EDT
No -- it's only on 32-bit x86 machines with PAE enabled, which means that
they utilize 64-bit PTEs.

x86_64 machines update their 64-bit PTEs atomically with one memory access.

Note You need to log in before you can comment on or make changes to this bug.