Description of problem: Apparent memory corruption with kernel 2.4.21-25.ELhugemem (also occurred under 2.4.21-20.ELhugemem).

Version-Release number of selected component (if applicable): kernel-hugemem-2.4.21-25.EL

How reproducible: Occurs randomly.

Additional info:

RHEL3 kernels have exhibited a problem in which portions of memory appear to become corrupted. We first ran into this problem on 11/5/2004 on a system running 2.4.21-20.ELhugemem, and saw it again today on a different (identically configured) system running 2.4.21-25.EL. The servers in question host Oracle databases. These are the details for each event:

---------------------

11/5/2004: We began to see internal Oracle errors occurring on one datafile of a database. However, the Oracle DBVERIFY utility showed that there was no corruption in the datafile on disk. Reviewing the system logs showed the following error (which occurred just before we started seeing the Oracle internal errors):

Nov 5 07:55:42 servername kernel: memory.c:189: bad pmd d151aff8 (2f00000000000000).

This made us suspect either bad memory or a kernel bug, so we shut down the database and brought it back up on an identically configured server--which made the problem disappear. In retrospect it's possible that the "bad pmd" error was not the cause of the problem, but rather a result of it (or was unrelated).

11/30/2004: We saw an anomalous error from tripwire--two files were marked as modified, despite the fact that every ancillary check I could do on the files verified that they had NOT changed, and that they were still identical to the same two files on the companion system. Eventually this problem simply went away on its own. At about the same time we also began to see a problem with Oracle that was identical to the problem we'd seen on 11/5/2004--internal Oracle errors on one datafile in the database. In this case, though, the DBVERIFY utility *did* show corruption in the datafile in question. However, this datafile is hosted on NFS, and running DBVERIFY on the other database server against the exact same datafile showed NO errors. And after an hour or so, running DBVERIFY on the original server again showed no errors--so the error had "corrected" itself, just as the tripwire error had done. There were no apparent errors in the system logs related to this issue. Also, the system had been rebooted just two days before, and had been running the -25.EL kernel for about two weeks.

---------------------

So a few points about these problems:

1) They occurred on two different but identically-configured servers
2) They affected both Oracle and tripwire
3) They involved both NFS-hosted and local files
4) Manifestations of the problem went away without any action on our part (on 11/30/2004)
5) They occurred on two different versions of the RHEL3 kernel

This would seem to rule out the problem being caused by hardware or an Oracle bug, or being specifically related to either NFS or a local filesystem. It may be that the root problem is corruption of cached filesystem data, and so the resulting problems "correct" themselves when the filesystem data in question gets flushed from memory.

I'll attach output of AltSysrq-m, -w, and -t as well as /proc/slabinfo from the server on which we're currently seeing the problem. If you need additional information, please let me know ASAP, since we'll be moving the Oracle database to the other server tonight to clear this issue.
Created attachment 107655 [details] Output of AltSysrq-m, -w, and -t @ 11:44am on 11/30/2004
Created attachment 107656 [details] Contents of /proc/slabinfo @ 11:42am on 11/30/2004
John, are you using the e100 network driver? We have found some corruption related to that during the U4 beta and have rolled it back to the version in U3. As for the -20.EL kernel, I am aware of one VM bug (admittedly very obscure; I am not aware of anybody having triggered it outside of a stress test on a 16-way SMP system), which got fixed for U4.
Nope, not using e100 (but we are using e1000).
John, do you know if the corruption was in an NFS-mounted file or in a locally mounted file? The reason I am asking is that the corruption does not make it back to the file on disk, and the fact that it "corrected itself" is indicative of corruption in the pagecache. This would cause it to show up while the corrupted page was still in the cache, but once it was displaced and subsequently re-read the corruption would be gone. Larry Woodman
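One way to test the pagecache theory directly, while a file is still showing corruption, is to read the same block of the suspect file twice--once through the pagecache and once bypassing it--and compare the two copies. Below is a minimal sketch of that check (not something from this bug's attachments); it assumes the filesystem in question supports O_DIRECT, that a 4096-byte block satisfies its alignment rules, and that the requested block is not the short tail of the file:

/* Compare one 4096-byte block of a file as read through the pagecache
 * against the same block read with O_DIRECT (straight from disk or the
 * NFS server).  A mismatch points at corruption in the cached copy; a
 * match means cache and backing store agree.  Note the comparison is
 * only meaningful if the block is actually resident in the pagecache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 4096

int main(int argc, char **argv)
{
        char cached[BLK];
        void *direct;
        off_t off;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <file> <block#>\n", argv[0]);
                return 1;
        }
        off = (off_t)atol(argv[2]) * BLK;

        /* normal read: satisfied from the pagecache if the data is resident */
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || pread(fd, cached, BLK, off) != BLK) {
                perror("cached read");
                return 1;
        }
        close(fd);

        /* O_DIRECT read: bypasses the pagecache, needs an aligned buffer */
        if (posix_memalign(&direct, BLK, BLK) != 0) {
                perror("posix_memalign");
                return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0 || pread(fd, direct, BLK, off) != BLK) {
                perror("O_DIRECT read");
                return 1;
        }
        close(fd);

        if (memcmp(cached, direct, BLK))
                printf("block %s: cached copy differs from on-disk copy\n", argv[2]);
        else
                printf("block %s: cached copy matches on-disk copy\n", argv[2]);
        free(direct);
        return 0;
}

Run against the NFS-mounted datafile on the client that is throwing the Oracle errors, a mismatch would point at the client's cache rather than the copy on the filer.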
I'd agree that pagecache corruption seems like a likely explanation-- that was our assumption. The incident this time around involved both an NFS-mounted file and two local files (as I mentioned in the longish initial comment). The first incident (on 11/5) involved only an NFS-mounted file, but that's not surprising on this system, since the database is NFS-mounted and is by far the lion's share of filesystem data in the working set. Also, with the local files, tripwire reported changes *only* in the CRC32 and MD5 properties--not any of the file times or other file metadata. Which would indicate that it was just a result of tripwire reading the corrupted data from the pagecache. And as I said, these tripwire errors just went away on their own after a few hours.
John, I am building a kernel with slab debugging turned on to see if we can catch someone using a kernel data structure after it's freed, or accessing beyond the bounds of a piece of kernel-allocated memory. The bad_pmd message with the extraneous "2f000..." doesn't sound very good either, so we might catch whoever did that as well. Are you willing to try out a kernel with slab debugging turned on? Also, how long did it take for these errors to show up? Is that version of Oracle using direct_IO or AIO? Larry
John, hugemem and smp kernels with slab debugging turned on are located here:

http://people.redhat.com/~lwoodman/.for_debug_purposes_only/

When you get a chance, please try to reproduce this corruption with both kernels so we can also determine whether the different address-space layout between 4G/4G and 3G/1G has anything to do with this problem as well. Thanks, Larry Woodman
John, another request: Is it possible to get the actual corruption that shows up from Oracle and tripwire? It's slightly possible that it will be obvious what caused the corruption that you are seeing. Larry
Some answers:

- We're not using direct or async I/O with this Oracle installation.

- The errors showed up after two days of uptime in this case, but before that these two systems had been running the database for 25 days on various kernels (-20 and higher) without any apparent corruption issues. Also, they'd been running on -20.EL for over a month before the corruption issue on 11/5/2004. We have no idea how to reproduce it--it's completely random.

- Also, it only happens on our production database server (with the hugemem kernel), so there's no way for us to try to reproduce it for you on an SMP kernel. In fact we're running the -25.ELsmp kernel on our dev/QA database servers, and we haven't seen the problem there yet, though it's possible that it's happened but wasn't noticed.

- Unfortunately I don't have any way at this point to get you copies of the actual corrupted data. If the tripwire issue comes up again that would be a great way to catch it, assuming I can get a copy of the corrupted version of the file before it gets flushed from the pagecache (as apparently happened this time).

Regarding the kernel with slab debugging: we're willing to try it, but what will be the effects of running that kernel? These are production database servers, so if the debug kernel would significantly affect performance or cause other operational issues we couldn't risk it.
Larry, you may have missed that last question: what will be the effects of running the test kernel? These are production database servers, so if the debug kernel would significantly affect performance or cause other operational issues we couldn't risk it. This Sunday is our window to put the new kernel in place, so if you could let me know ASAP I'd appreciate it.
Also, what do we need to do to get you the data you need when we see the problem happening? We can't leave the system in the corrupt state--this is a production system, and the corruption is directly affecting our users. So it's best if you tell me exactly how to collect the slab debug info you need when the problem is happening. It recurred today just after 4pm, BTW, but it's gone already.
Sorry, I missed the update to this bug, period! There is a slight performance degradation associated with slab debug. Basically it writes patterns to the kernel data structures as it frees them so it can detect corruption outside of allocated memory. I think you should expect to see about a 5% performance degradation running Oracle. If the kernel detects any corruption it simply crashes, and that's all we need. If you can get me the exact data that was corrupted, it might provide some hint as to where the corruption came from. Larry
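For readers unfamiliar with slab debugging, the detection idea Larry describes can be illustrated with a few lines of userspace C. This is only a sketch of the technique (fill freed objects with a known poison pattern, verify the pattern when the object is handed out again), not the kernel's actual slab code; the pattern byte, object size, and one-object "cache" below are made up for the illustration:

/* Sketch of slab-style free poisoning: anything that writes to an
 * object after it was freed destroys the poison pattern, and the
 * damage is reported at the next allocation of that object. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POISON_FREE 0x6b        /* pattern written into freed objects */
#define OBJ_SIZE    64

static unsigned char pool[OBJ_SIZE];    /* one-object "cache" for the sketch */

static void poison_free(void *obj)
{
        memset(obj, POISON_FREE, OBJ_SIZE);
}

static void *check_alloc(void)
{
        int i;

        for (i = 0; i < OBJ_SIZE; i++) {
                if (pool[i] != POISON_FREE) {
                        fprintf(stderr,
                                "slab debug: corruption at offset %d (0x%02x)\n",
                                i, pool[i]);
                        abort();        /* the real debug kernel crashes here */
                }
        }
        return pool;
}

int main(void)
{
        poison_free(pool);      /* object freed and poisoned */
        pool[40] = 0x42;        /* a buggy user scribbles on freed memory */
        check_alloc();          /* detected at the next allocation */
        return 0;
}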
Ack--crashing is bad. This is a production database. This will immediately affect all our customers, rather than just the ones whose data is corrupted. And are you saying you'd get data from the crash somehow? Or is it just that the fact that it crashed would tell you that it was indeed slab corruption that had caused the problem? If that's all it is, it'd be worlds better if we could use a kernel that had some less violent way of signalling that it had found the error :-). A file in /proc whose contents change from 0 to 1, or something, say.
This system just crashed (possibly due to the same bug, possibly due to a different bug). Atypically, this one did leave a minor trace of what happened, in /var/log/kernel:

Dec 5 14:39:07 servername kernel: Page has mapping still set. This is a serious situation. However if you

That was the full message. Given what the message says, I thought it might be related to this problem.
This is the whole message; it's from freeing a page that has the mapping still set:

Page has mapping still set. This is a serious situation. However if you are using the NVidia binary only module please report this bug to NVidia and not to the linux kernel mailinglist.

This was part of a BUG; the entire traceback would have told us exactly who freed the page in that state. Do you have serial consoles attached to these machines? If not, can you set one up? And yes, it might be related to the original crash, but we need the traceback to tell for sure. Larry
Larry, please also respond to note 14. We can't use the slabdebug kernel if it's going to crash the server. In this case the crash was followed by a reboot, so the serial console wouldn't have helped. And yes, we do have serial consoles, but these are IBM Bladecenters where the serial console is accessed via a VNC-based Java applet, so you get just one screen's worth of output...and thanks to Linux's console screen blanking code, you usually can't see a thing anyway. I've now added escape sequences to /etc/issue to circumvent the screen blanking code and have also set up netdump to log to our syslog server, so we'll see if that improves things. I take it the AltSysrq and /proc/slabinfo output I attached to the case didn't help? BTW, there is no "original crash" here; the original problem is memory corruption. The reason I posted note 15 was because it seemed possible that the "Page has mapping still set" message was actually a bogus error, caused by memory corruption in some page that triggered that failure path in the kernel.
The slab debug kernel will crash the server, but only if it detects fatal memory corruption. In other words, it will be crashing anyway (or, even worse, silently corrupting data), but with slab debugging turned on we will be crashing a bit earlier, before the consequential damage occurs. Larry
The three times we've seen this memory corruption, though, it *hasn't* crashed the server. Silently corrupting data is less awful than having the entire system crash in this case, because if it happens within Oracle's memory space Oracle will notice it and throw errors; the result is that a few customers may be affected, but not all of them. Crashing is far worse, because all users are screwed for several minutes at the least. So unless the slab debugging kernel can be made to signal its results in a less drastic way, we can't use it on this server.
FYI, we've fallen back to the SMP kernel on this system to see if that resolves the issue (which is painful since it requires us to trim the Oracle SGA significantly). So far we haven't seen any recurrences of the corruption problems, but it's only been a few days now.
So: today we hit the memory corruption bug on a different server, which is running the 2.4.21-25.ELsmp kernel. This is a web server development system, and it shares no common user programs with the database servers on which we've seen the problem previously. This is the same server from bug 141905, so it looks as though that may indeed be a duplicate of this bug (or at least I'm fine with treating it that way until this bug is resolved). This means that this bug is NOT specific to the hugemem kernel, nor is it related to Oracle. We caught the bug today via tripwire again--a "change" in /usr/X11R6/bin/Xvfb (from XFree86-Xvfb-4.3.0-68.EL). I have a copy of the corrupted file if you want it, but here's the full cmp -l output (produced on a different server):

# cmp -l /usr/X11R6/bin/Xvfb Xvfb.CORRUPT
1334321 10 2
1334322 213 0
1334323 135 0
1334324 370 0

If you want me to attach either the corrupt or good versions of Xvfb to the case, let me know. I also have AltSysrq-m, -w, -t and /proc/slabinfo output if that's useful--let me know. And the server is still running and still has the corruption of Xvfb in memory at the moment (no guarantees on how long that'll last though), so if you want me to get further debugging info from it just tell me what it is you want. And this is a development system, so violent tests (i.e. ones that crash) would be acceptable.
*** Bug 141905 has been marked as a duplicate of this bug. ***
[ Oops--just added this to bug 141905. Here you go. ] We just experienced a kernel panic on the database server of this pair which was NOT running the database--in other words, it was sitting idle except for VCS and periodic tripwire runs. Since it's possible that this was caused by the memory corruption bug, I'll give you the info here--but if I'm wrong about that, just say so and I'll file yet another bug. Here's the panic info (we don't have a memory dump for it):

----------------------------------------------------
Unable to handle kernel NULL pointer dereference at virtual address 0000002d
printing eip:
021491e4
*pde = 00003001
*pte = 00000000
Oops: 0000
nfs lockd sunrpc gab llt netconsole autofs4 audit tg3 e1000 sg sr_mod cdrom usb-storage keybdev mousedev hid input usb-ohci usbcore ext3 jbd mptscsih mptbase
CPU: 3
EIP: 0060:[<021491e4>] Tainted: PF
EFLAGS: 00010206
EIP is at do_generic_file_read [kernel] 0x174 (2.4.21-25.ELhugemem/i686)
eax: 0000001d ebx: 00000016 ecx: 1312b680 edx: 0000001d
esi: dfb4e1c4 edi: 12ed2c94 ebp: 000000de esp: cea33ef4
ds: 0068 es: 0068 ss: 0068
Process tripwire (pid: 25603, stackpage=cea33000)
Stack: dfb4e100 08208590 00000000 00001000 00000000 00001000 00000000 00000000
       00000000 dfb4e100 fffffff2 00001000 df368d80 ffffffea 00001000 02149e35
       df368d80 df368da0 cea33f5c 02149c80 00000000 02439680 00002710 cea32000
Call Trace: [<02149e35>] generic_file_new_read [kernel] 0xc5 (0xcea33f30)
[<02149c80>] file_read_actor [kernel] 0x0 (0xcea33f40)
[<02149f5f>] generic_file_read [kernel] 0x2f (0xcea33f7c)
[<02164ea3>] sys_read [kernel] 0xa3 (0xcea33f94)
Code: Bad EIP value.
CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is executing netdump.
CPU#4 is frozen.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
I've added a fair amount of info to this case (an example of the data corruption, another panic trace), but haven't heard any word from Redhat on this lately. What's the status? Also, I see that one of the fixes in the U4 kernel release is "Data corruption on RHEL3 U4 beta -26.EL kernel," but I can't view the associated bug (bug 140022). Does this bug have anything to do with the one I've reported here? BTW, from our experience so far with the SMP kernel on various servers, it appears that the data corruption happens less frequently with the SMP kernel than with the hugemem kernel--though it does happen with both of them. Don't know if this helps.
John, please try to reproduce this problem with the official RHEL3-U4 kernel. It's now available on RHN. Larry Woodman
We'll try to do that (I admit I'm surprised Redhat would release U4 with a known data corruption bug in it). I'd appreciate answers to my questions, though.
Update: today the system from comment 21 crashed again, while running the -26.slabdebug kernel that you'd provided. And we're in the process of rolling out the U4/-27 kernel to all RHEL3 systems. To reiterate my questions:

1) You asked for an example of the actual corruption, and I gave you one (Xvfb, in comment 21). Did that info tell you anything?

2) I also gave you yet another kernel panic trace in comment 23. Did that tell you anything?

3) A bug identified as "Data corruption on RHEL3 U4 beta -26.EL kernel" (bug 140022) was listed as fixed in the U4 kernel release summary. What is that bug, and is it the same as the one I've reported here?
John:

1.) The Xvfb file diffs really didn't tell me anything. Evidently a few (3) bytes of that file are different than the original. Do you know if this data is NFS-mounted, and if it's different on the server and the client?

2.) The panic in do_generic_file_read is due to the page->next_hash containing a 0x1d instead of a valid page pointer. This is corruption in the page struct in the mem_map array.

3.) The data corruption in the -26.EL kernel (bug 140022) was due to a bug in the latest e100 driver. We did fix that bug in -27.EL, but you said that you are not using e100, right?

So far, this is the only report of corruption outside of the e100 driver. Any help in reproducing this problem would be extremely helpful. Larry
Thanks for the responses. The Xvfb file was on a local filesystem (ext3). The cmp -l output I sent you reflected the corrupted version on one system compared to a known-good version from another system (and after a reboot, the version of the file on the system where the corruption had occurred reverted to its original form). That was the full extent of the corruption--4 bytes on a 16-byte boundary. And nope, we don't use the e100 driver on these systems. In fact the system from comment 21 doesn't even use the e1000 driver (though our database servers do). We'd love to give you a way to reproduce this...we'd love to have one ourselves. But the behavior so far is just random. I'm guessing that you've not had other reports of this problem because it's very difficult to diagnose correctly; if it corrupts normal file data it may simply go unnoticed, and if it causes a crash, the crash may be written off as something else or may be misdiagnosed as having some other cause. We were only able to determine that it was random memory corruption thanks to the fact that tripwire MD5/CRC values were changing for files when no other file metadata changes were being flagged, and Oracle was reporting corrupted data blocks even though the Oracle DBVERIFY utility didn't see corruption when run against that same (NFS-mounted) file from another system. We're now running -27 everywhere except the production database servers, and we'll swap those over to -27 this weekend. We're also running tripwire in a tight loop on our quiescent database server, to see if it'll flush out the error.
Corruption again on one of the database servers, detected by tripwire reporting only MD5/CRC32 changes, this time in a static text file (/usr/share/ImageMagick/www/api/types/ImageAttribute.html). Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /usr/share/ImageMagick/www/api/types/ImageAttribute.html ImageAttribute.html.CORRUPT
1177 144 150
1178 164 161
1179 150 5
1180 75 11

Again, just 4 bytes have changed, though this time on an 8-byte boundary. So it looks like some memory management routine may have an off-by-one error (one word, that is). This system is still running the -25 hugemem kernel (it can't be updated until this weekend, during a downtime window). We've definitely got strong evidence now that the problem happens far more frequently with the hugemem kernel than with the SMP kernel; we'd gone for two weeks with no recurrences on the SMP kernel, but after switching back to hugemem this weekend we've already seen two occurrences of corruption (including this one). And this is the same pattern we've seen in the past with the hugemem kernel.
Corruption again on one of the database servers, detected by tripwire reporting only MD5/CRC32 changes, this time in /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/libgcj.a. Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/libgcj.a libgcj.a.CORRUPT
19040081 0 2
19040083 100 0
19040084 1 0

Note that it's only reporting three bytes of difference this time...but the odd byte out (19040082) is a null, so it's likely that the corrupted version and the original version just coincidentally both had nulls in that byte position. This was still on -25.ELhugemem.
We now have a bug on the -27 (hugemem) kernel as well, though I can't be certain if it's caused by the same corruption issue. I've been running tripwire in a loop on our quiescent database server, and a few hours ago it started hanging on the file /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py. At this point, any attempt to read this file causes the calling process to hang (unkillably). Here's an strace of an attempted access:

# strace -f cat /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py
execve("/bin/cat", ["cat", "/usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py"], [/* 29 vars */]) = 0
uname({sys="Linux", node="bom-db02", ...}) = 0
brk(0) = 0x9bce000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=36895, ...}) = 0
old_mmap(NULL, 36895, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf65f1000
close(3) = 0
open("/lib/tls/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200X\1"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1568924, ...}) = 0
old_mmap(NULL, 1276876, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xc6a000
old_mmap(0xd9c000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0xd9c000
old_mmap(0xda0000, 7116, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xda0000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf65f0000
set_thread_area({entry_number:-1 -> 6, base_addr:0xf65f0520, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
munmap(0xf65f1000, 36895) = 0
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=32148976, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf63f0000
mmap2(NULL, 204800, PROT_READ, MAP_PRIVATE, 3, 0x9c4) = 0xf63be000
brk(0) = 0x9bce000
brk(0x9bef000) = 0x9bef000
brk(0) = 0x9bef000
mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0xa12) = 0xf63bd000
close(3) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
open("/usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=3190, ...}) = 0
read(3,

That's the full output--the process hangs at that point. I'm reporting it as part of this bug because it seems possible that the memory corruption bug has affected some in-memory structure associated with this file, and that's causing problems for processes trying to access the file. I'll leave the system in this state for the moment; please let me know ASAP if there's any information you'd like to see that might help to identify what's going on here.
I know you all may be on vacation, but if not I really need to know as soon as possible what you'd like me to do about the system on which reading of the file /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py is causing processes to hang (as per comment 32). It's an inactive system for the moment, but that can change at any time, and I don't want to leave it in this state any longer than necessary. BTW, every process that's currently hung while accessing this file shows up in a ps listing with a state code of "D" (uninterruptible sleep), and each one is also permanently counted in the load average.
Hi John, yes most of us are on vacation this week. Please get me an AltSysrq-T output so I can see internally why the processes are in D state. Thanks, Larry
Created attachment 109166 [details] AltSysrq-T output

AltSysrq-T output from the system on which various processes (cat, less, tripwire) are hung while trying to read /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py.
Just to narrow things down a bit more, is the bug reproducible without syscall auditing enabled? IIRC the syscall auditing code has had problems (and might still have, I'm not sure), so it would be good to find out if the bug can also be reproduced without syscall auditing. Also, what is the "llt" driver? I don't think I have seen that one before...
Yep, I can easily hang any process just by trying to open and read that file (which is a local file, BTW). The only reason a few of them have syscall in the stack traces is because I either ran them with strace or strace'd an already-running process, to try to see what was happening. LLT (low-level transport) is a part of Veritas Cluster Server, which runs on this server and its companion server. If there's any debugging you want me to do, this system is basically wide open for testing (intrusive or otherwise).
Alas, spoke too soon. I started one more cat process (straight, no strace/ltrace), then went to get another AltSysrq-T output...and unfortunately, that hung the entire machine. Here was the last bit of output on the console:

bash R current 1728 10624 10623 (NOTLB)
Call Trace: [<f8debb60>] netconsole [netconsole] 0x0 (0xa9a37e84)
[<021299af>] __call_console_drivers [kernel] 0x5f (0xa9a37e94)
[<f8debb60>] netconsole [netconsole] 0x0 (0xa9a37e98)
[<02129ab3>] call_console_drivers [kernel] 0x63 (0xa9a37eb0)
[<02129de1>] printk [kernel] 0x151 (0xa9a37ee8)
[<02129de1>] printk [kernel] 0x151 (0xa9a37efc)
[<0210c8a9>] show_trace [kernel] 0xd9 (0xa9a37f08)
[<0210c8a9>] show_trace [kernel] 0xd9 (0xa9a37f10)
[<02126032>] show_state [kernel] 0x62 (0xa9a37f20)
[<021cffba>] __handle_sysrq_nolock [kernel] 0x7a (0xa9a37f34)
[<021cff1d>] handle_sysrq [kernel] 0x5d (0xa9a37f54)
[<02199a2c>] write_sysrq_trigger [kernel] (0xa9a37f78)
[<02165123>] sys_write [kernel] 0xa3 (0xa9a37f94)

The system was toast at this point, so the only thing I could do was powercycle it. You'll doubtless be unsurprised to hear that after the reboot, I can now cat /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py to my heart's content, without having it hang.
Verified: the U4 kernel also has this bug. Detected this time by tripwire reporting only MD5/CRC32 changes on our primary development server (2.4.21-27.ELhugemem, no Oracle, no VCS), this time in the file /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/jc1. Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/jc1 jc1.CORRUPT
3587121 114 2
3587122 135 0
3587123 311 0
3587124 377 0

So the corruption does seem to be consistently just 4 bytes (and always at least word-aligned). The file is still in this state on this server--i.e. the in-memory copy is still corrupted--so if there's anything you want us to do to get more info, now's the time.
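Since the corruption now looks like a single aligned run of a few bytes each time, a small helper along the lines of the sketch below reports each differing run's offset, length, and alignment in one pass. It is only an illustration of the cmp -l comparisons already being done here; note that it prints 0-based offsets, whereas cmp -l prints 1-based ones:

/* Compare a known-good copy against the suspect copy and report each
 * run of differing bytes with its 0-based offset, length, and whether
 * the run starts on a 4- or 8-byte boundary.  Stops at the shorter
 * file, which is fine for same-size good/corrupt pairs. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        FILE *a, *b;
        long off = 0, run_start = -1, run_len = 0;
        int ca, cb;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <good-file> <corrupt-file>\n", argv[0]);
                return 1;
        }
        a = fopen(argv[1], "rb");
        b = fopen(argv[2], "rb");
        if (!a || !b) {
                perror("fopen");
                return 1;
        }

        while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
                if (ca != cb) {
                        if (run_start < 0)
                                run_start = off;
                        run_len++;
                } else if (run_start >= 0) {
                        printf("diff run at offset %ld, length %ld, "
                               "4-byte aligned: %s, 8-byte aligned: %s\n",
                               run_start, run_len,
                               run_start % 4 == 0 ? "yes" : "no",
                               run_start % 8 == 0 ? "yes" : "no");
                        run_start = -1;
                        run_len = 0;
                }
                off++;
        }
        if (run_start >= 0)
                printf("diff run at offset %ld, length %ld\n", run_start, run_len);
        fclose(a);
        fclose(b);
        return 0;
}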
Is this using ServerWorks IDE for storage?
No. The servers that have exhibited the data corruption bug are all IBM blades--two HS40s (8839) and one HS20 (8832). The HS40s use SCSI drives via an IBM SCSI daughter card, and the HS20 uses IDE drives that use the HS20's onboard IDE controller. Also, the HS20 doesn't run Oracle or VCS. All three servers do mount various volumes via NFS, though--but the corruption has shown up both on NFS-mounted files and on local files, and the corruption in comment 40 was associated with a local file.
Another instance of the corruption bug on 2.4.21-27.ELhugemem, this time discovered by Oracle rather than tripwire (on one of the HS40s). We resolved this instance of corruption, but the corruption I mentioned in comment 40 is still there. I'm happy to do whatever you need in that case to get you the info you need to fix this problem.
We had two extremely harmful system interruptions today, presumably both caused by this issue. The second led to a system lockup and was accompanied by the following info in the netdump log:

Unable to handle kernel NULL pointer dereference at virtual address 00000032
printing eip:
02153aef
*pde = 00003001
*pte = 00000000

Unfortunately that's all we got from these two problems. What's the status on all this? We're at the point where we're seriously looking at migrating all our servers to another hardware/OS platform, despite the huge amount of effort involved, because with this bug RHEL3 is unsuitable for serious use. I've done everything you've asked and provided every kind of info I can--is it helping at all? Is there anything else you need? If so, what? I really cannot possibly overstate the criticality of this bug.
John, do you have the full console output/stack traceback for the last "Unable to handle kernel NULL pointer dereference at virtual address 00000032" you received? Larry Woodman
John, is this reproducible without the LLT part of Veritas Cluster Server running? Larry Woodman
As noted in comment 45, the output I included was all we got from those two problems. There was nothing else on the console, on the netdump server, or in the system logs (local or remote). Also, as noted in comment 40 and comment 42, one of the servers on which we've been seeing the corruption does not run VCS (of which LLT is a part)--so yes, it happens without LLT. The most notable commonality between the HS20 and the two HS40s is the use of NFS--though as I've mentioned, the problems occur both with NFS-mounted files and local files.
Why is this case still marked NEEDINFO? I've answered Larry's questions from comment 46 and comment 47. For our part, I'd appreciate a response to the following (from comment 44): What's the status on all this? We're at the point where we're seriously looking at migrating all our servers to another hardware/OS platform, despite the huge amount of effort involved, because with this bug RHEL3 is unsuitable for serious use. I've done everything you've asked and provided every kind of info I can--is it helping at all? Is there anything else you need? If so, what? I really cannot possibly overstate the criticality of this bug.
Hello, John. I put this BZ into NEEDINFO state after Larry posted his question in comment #47. For some reason, this didn't get reverted to ASSIGNED state when you provided the answer in comment #48, but the state is now ASSIGNED. Let me reassure you that this problem is being given our utmost attention. Two of our top engineers are assigned to this case full-time, and several others (including me) have been running experiments to try to narrow down the problem space. I'll let you know as soon as we understand the problem well enough to propose and test a possible fix. Thanks again for all your assistance.
Ok, great, thanks for the update. I suspected that it staying in NEEDINFO might have been a glitch of some sort.... I'll be waiting for word on what we can do. Believe me, we're highly motivated. :-)
We are, too! (We can now reproduce a data structure corruption in about half an hour, which we believe is related to the problem you're seeing.)
Hi, John. We believe we have found the source of data corruption, which results from any access to /proc/kcore when the number of vmalloc'd regions in the kernel is in a specific range. I'm in the process of verifying the fix. I would like to build you a test kernel based on the latest released RHEL3 kernel (2.4.21-27.0.2.EL, which is U5 + two security updates). Which RPM(s) do you need to verify that our fix resolves the data corruption problem that you reported? (Which config do you want?)
Created attachment 110040 [details] proposed fixes for memory corruption caused by /proc/kcore access

This is the /proc/kcore patch that is currently being built into externally releasable test kernels. The first patch hunk is the critical fix for the ELF header buffer sizing problem. These fixes have been verified to fix potential data corruption due to /proc/kcore under certain memory conditions in local testing.
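For anyone reading along without the attachment, the failure mode described (an ELF header buffer for /proc/kcore sized from a count of vmalloc'd regions, where that count can change before the buffer is filled) boils down to the pattern sketched below. This is a generic illustration of the bug class, not the actual fs/proc/kcore.c code or the patch itself; the 16-byte per-region entry size and the region counts are made up:

/* Sketch of a header-buffer sizing bug: the buffer is sized from the
 * region count at one moment, but the fill pass runs against a larger
 * count, so filling it would write past the allocation and corrupt
 * whatever happens to sit next to it in memory. */
#include <stdio.h>
#include <stdlib.h>

#define ENTRY 16        /* made-up size of one per-region header entry */

int main(void)
{
        int regions_at_sizing = 4;
        int regions_at_filling;
        size_t bufsize, needed;
        char *buf;

        /* step 1: size the header buffer from the current region count */
        bufsize = (size_t)regions_at_sizing * ENTRY;
        buf = malloc(bufsize);
        if (!buf)
                return 1;

        /* step 2: by the time the buffer is filled, more regions exist */
        regions_at_filling = 6;
        needed = (size_t)regions_at_filling * ENTRY;

        if (needed > bufsize)
                printf("need %lu bytes but allocated %lu: filling the header "
                       "would overrun the buffer and corrupt adjacent memory\n",
                       (unsigned long)needed, (unsigned long)bufsize);

        free(buf);
        return 0;
}

In general terms the cure is to make the sizing and the fill agree on one view of the region list, or to bound the fill by the allocated size; the attached patch is the authoritative description of what the RHEL3 fix actually does.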
Both smp and hugemem would be good. We've found that the corruption errors occur much more frequently with hugemem, so I suppose if we want to shake this out quickly that's what we'll go with. Has that been the case in your testing as well (hugemem is more fragile than smp)? It might also help if you could provide us with your testing methodology, so we can see if it seems to reproduce the problem in our environment as well.
John, the following test kernels (plus the kernel-source RPM) are available under my Red Hat people page here:

http://people.redhat.com/~petrides/.kcore/kernel-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-smp-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-hugemem-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-source-2.4.21-27.0.2.EL.ernie.kcore.1.i386.rpm

Please let us know whether this proposed fix resolves the data corruption problem you've encountered. If you need a different RPM, just list that here and one of us will make it available to you. Thanks in advance. -ernie
Ah, missed your comment and the BZ state change just now. In answer to your question, I was easily able to reproduce a corruption with the i686 smp config kernel on a system booted with "maxcpus=1 mem=384m" in the kernel boot args. My test case was using "tar" from /proc/kcore.
Ok, in a testament to our motivation, our production database server is already running this test kernel--and the hugemem version to boot, to give it the best chance to fail, if that's what it's going to do. Not that I *want* that to happen, mind you.... If it croaks, burps, or so much as hiccups, you will of course be the first to know.
Hi John, has the test kernel solved your problem? Have you had any data corruptions since installing it? Regards, Peter
We've not had any instances of corruption yet on the new kernel. However, that doesn't mean much; it will take weeks before we can say with any confidence that this kernel is actually addressing the problem (the SMP kernel previously went without failure for over a month, and the hugemem kernel went for nearly that long).
Hi, John. I'm going to commit our /proc/kcore memory corruption fix to U5 later this evening. Since we haven't heard of any recent crashes on your systems since you've been running the test kernel, we're hoping that this problem was the root cause of your instances of memory corruption. I'll be transitioning this BZ to MODIFIED once I do the commit. But if you do encounter another crash while continuing to run the test kernel, please put this BZ back into ASSIGNED state so that we'd know it still needs work.
A fix for the /proc/kcore memory corruption bug, which we believe is the root cause of this problem, has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.10.EL).
Created attachment 110473 [details] We're not out of the woods yet

The database server running the 2.4.21-27.0.2.EL.ernie.kcore.1hugemem kernel crashed unexpectedly today at 4pm. I'm attaching the complete "log" file generated on the netdump server; this is all that there was, other than a zero-length file called vmcore-incomplete (netdump isn't all it's cracked up to be on RHEL3, unfortunately).
Ugh, bad news. But that's an oops in rt_check_expire__thr(), which is not part of the RHEL3 kernel. Is that in the module that tainted your kernel?
VCS loads modules which taint the kernel, I believe, but I don't know whether or not that particular routine is part of the VCS code. If it's still the memory corruption bug, though, it could hit anywhere at all. Also, just as a reminder: we've had numerous instances of the corruption bug in the past on a system that does not run VCS (or Oracle), and which in fact runs a pretty plain vanilla application mix (NFS + web services).
Understood, John. Well, we have fixed a memory corruption bug caused by /proc/kcore access (which is the only one we could reproduce), but it seems that you are encountering a different problem. So, I'm changing this back out of MODIFIED state. Since we don't have access to the source code that oops'ed, you should try to get the appropriate 3rd party involved in the debugging. Please let us know whether you can get any clues. Also, please continue to run the kcore-fix kernel(s) to eliminate that as a possible culprit in future crashes. And please also try to collect a crash dump if possible so that we have something to go on. I'll bounce this back to DaveA and leave it in NEEDINFO state until we have something to work with.
The system that crashed runs VCS, Oracle, and nothing else beyond standard RHEL3 stuff. A google search on "rt_check_expire__thr" shows a lot of references to it in contexts that don't particularly look like they'd be associated with VCS (and in which I don't see any references to llt or gab, the two main kernel modules loaded by VCS). A system-wide grep finds it only in these files:

boot/System.map-2.4.21-15.0.4.ELhugemem
boot/System.map-2.4.21-25.ELhugemem
boot/vmlinux-2.4.21-25.ELhugemem
boot/System.map-2.4.21-25.ELsmp
boot/vmlinux-2.4.21-25.ELsmp
boot/vmlinux-2.4.21-15.0.4.ELhugemem
boot/System.map-2.4.21-20.ELhugemem
boot/vmlinux-2.4.21-20.ELhugemem
boot/System.map-2.4.21-27.ELhugemem
boot/vmlinux-2.4.21-27.ELhugemem
boot/System.map-2.4.21-27.ELsmp
boot/vmlinux-2.4.21-27.ELsmp
boot/System.map-2.4.21-27.0.2.EL.ernie.kcore.1hugemem
boot/vmlinux-2.4.21-27.0.2.EL.ernie.kcore.1hugemem

Are you sure this isn't a standard kernel routine? Doesn't the fact that that string is showing up in the vmlinux files (and nowhere else outside of /boot) imply otherwise? In any case, I don't know who else to involve. If there's info I can give you from these systems that might tell you more about it, though, feel free to ask. Or if you can tell me exactly what you want me to ask Veritas about this routine, I can run it past them and see if they know about it. As for trying to get a crash dump, we are (and have been) running netdump, but while we do occasionally get crash logs, it's been a complete bust when it comes to producing crash dumps. If there's some other way to get them that you can suggest, that'd be great.
John, you are correct, rt_check_expire__thr is a standard kernel routine. It's hand-crafted from this in include/linux/interrupt.h:

#define SMP_TIMER_NAME(name) name##__thr

and found in net/ipv4/route.c; the oops is happening on the line indicated below:

/* This runs via a timer and thus is always in BH context. */
static void SMP_TIMER_NAME(rt_check_expire)(unsigned long dummy)
{
        static int rover;
        int i = rover, t;
        struct rtable *rth, **rthp;
        unsigned long now = jiffies;

        for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;
             t -= ip_rt_gc_timeout) {
                unsigned long tmo = ip_rt_gc_timeout;

                i = (i + 1) & rt_hash_mask;
                rthp = &rt_hash_table[i].chain;

                write_lock(&rt_hash_table[i].lock);
                while ((rth = *rthp) != NULL) {
======================> if (rth->u.dst.expires) {
                                /* Entry is expired even if it is in use */
                                if (time_before_eq(now, rth->u.dst.expires)) {
                                        tmo >>= 1;
                                        rthp = &rth->u.rt_next;
                                        continue;
                                }
                        }

Unfortunately without a dump there's not much to go on. Is the netdump-server on the same subnet? If you run anything other than the hugemem kernel, getting at least 1GB saved into the vmcore-incomplete may be enough to work with, since it will gather all of lowmem, which contains all of the kernel static memory and slabcache memory. A vmcore-incomplete of a hugemem kernel would, in most cases, need to be at least 4GB in size.
The netdump server for this particular server is on the same subnet. We are setting NETDUMPADDR and SYSLOGADDR to different values, however (and the syslog server is on a different subnet); I could try unsetting SYSLOGADDR if you think it would help, since based on bug 142921 it would seem that RHEL3 netdump doesn't like this sort of config. We can run the SMP kernel if you'd prefer it for smaller crash dump purposes, but the catch there is that the SMP kernel seemed to hit the corruption much less often than the hugemem kernel. So in that case it could easily be 3-4 weeks before we'd get another corruption event. Since the main reason for choosing the SMP kernel would be netdump, I may try intentionally panic'ing a test server to see if I can get netdump to produce a crash dump under any circumstances.
If you don't need SYSLOGADDR, take it out of the equation, although I can't really give you a good reason why it would interfere with the netdump process. As far as which kernel to run, that's your choice. Note that even if the SMP kernel was used, and it was able to create a 1GB dumpfile, there's still a possibility crucial data will be in module memory, in which case, it would still be pretty much useless.
Based on my testing, the netdump server will accept memory dumps if a system running the SMP kernel crashes, but not if one running hugemem crashes--the best I can get is a zero-length vmcore-incomplete file (as we saw in the actual crash from comment 65). Setting SYSLOGADDR appears to have no effect, as you'd expect. So we can either 1) run the SMP kernel, have a chance at a crash dump, but possibly have to wait many weeks for a corruption event, or 2) run the hugemem kernel, have no chance at a crash dump, but hopefully have to wait less time in between corruption events. It no longer matters to us because we stopped utilizing hugemem's extended process address space early on in this bug's lifetime, since that kernel had proven too unstable for production use...and at this point we're not going to start using those features again until we know we have a fix. So we can run either kernel; it's your call, based on which one you think will give you a better chance of tracking down this bug.
Then by all means, use the SMP kernel. Getting a dump is of prime importance here.
Ok, this morning we've had several completely unambiguous instances of the memory corruption bug on our production database server, detected by Oracle throwing block corruption errors. So I can say for certain that this bug is not resolved. The server in question is on the 2.4.21-27.0.2.EL.ernie.kcore.1hugemem kernel, but a memory dump isn't even an issue since in this case the corruption isn't causing a system crash (as it usually doesn't); it's just silently corrupting user data.
Created attachment 110582 [details] "log" output from another system crash

This is the "log" file generated by another system crash--this time on our primary development server (so no VCS or Oracle), running the 2.4.21-27.0.2.EL.ernie.kcore.1smp kernel. There was also a zero-length vmcore-incomplete file--sorry. I at least consider it progress that we're even *getting* vmcore-incomplete files from netdump now, since they weren't showing up in the past at all.
FYI, we had another instance of memory corruption on a database server (detected by Oracle) while running the 2.4.21-27.0.2.EL.ernie.kcore.1smp kernel. The server has since been rebooted. This is mainly news since it shows that the corruption bug still exists both in the hugemem and SMP kernels. I take it the "log" info from comment 76 didn't help much?
This is the same panic scenario as seen in duplicate #141905, where Larry indicated:

> BTW, this appears to be memory corruption in the mem_map (array of
> page structs).
>
> The crash in page_referenced was caused by a bad page->pte.chain
> value.
> ----------------------------------------------------------------
> int page_referenced(struct page * page, int * rsslimit)
> ...
>     for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) {
> ...
>         chain_ptep_t pte_paddr = pc->ptes[i];
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> The assembler code for this is:
> 0xc015ff38 <page_referenced+0x2f8>: mov 0x4(%esi,%ebx,4),%eax
>
> where esi: -> bcb64118
> ---------------------------------------------------------------
>
> This esi value can never be less than 0xc0000000!

The difference in this case is that esi is 0xa5bf27c. It almost looks as if it were a pte_addr_t, but the page flags values wouldn't make sense.
Corruption again on one of the database servers, detected by tripwire reporting only MD5/CRC32 changes, in the file /lib/modules/2.4.21-27.0.2.EL.ernie.kcore.1hugemem/kernel/drivers/block/cpqarray.o. Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /lib/modules/2.4.21-27.0.2.EL.ernie.kcore.1hugemem/kernel/drivers/block/cpqarray.o cpqarray.o.CORRUPT
22713 103 0
22714 157 0
22715 155 0
22716 160 0

Again, just 4 bytes have changed, and on an 8-byte boundary. And this is with the full complement of RHEL3 U4 RPMs.
FYI: another corruption instance on one of our production database servers, detected and reported by Oracle. This is the first corruption instance since the last one I reported, on 2/15/2005 (comment 79)...thus illustrating what I'd mentioned about the SMP kernel often taking a very long time to manifest this bug. The server in question had been running without any apparent corruption instances for 25 days (on 2.4.21-27.0.2.EL.ernie.kcore.1smp).
FYI: another corruption instance on one of our production database servers, detected and reported by Oracle. Still on 2.4.21-27.0.2.EL.ernie.kcore.1smp.
Thanks for the updates, John. Incidentally, the official RHEL U5 external beta period is now scheduled to start on Monday, 3/21. The kernel version is 2.4.21-31.EL, and it contains the /proc/kcore fix as well as lots of other critical fixes. We do not expect that the problem you've encountered has been resolved, but it would be a good idea to try the U5 beta kernel at some point (preferably in non-production use first).
Thanks, we'll do that (and let you know if we see any changes in this bug's behavior as a result).
An unexplained crash on one of the systems where we see the corruption bug, so my assumption is that that was the cause. As usual, netdump did not capture a core dump (vmcore-incomplete was present, but 0-length). I'll include the "log" file from netdump separately.
Created attachment 112152 [details] "log" output from yet another crash
In this case, an nfs_file_write() operation needed a new page, so it called __alloc_pages(), which was in the act of reclaiming a page from the inactive_clean list. That particular page on the list had an inode mapping, but the list_head that should link it into the inode mapping's page list had a NULL page->list.prev pointer (at least). Hard to glean anything from that without a dump.

Questions arising from internal discussions re: this case:

- Can this bug be reproduced with system call auditing enabled?
- Are both servers writing to the same disk array?
- Is the SCSI bus configured with parity enabled? Are any parity errors logged?
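Given that the reclaimed page's list.prev was NULL, the kind of linkage check sketched below is what would catch such corruption before the dereference. This is a userspace illustration only, not RHEL3 kernel code (later kernels perform a similar check under their list-debugging option):

/* Sanity-check a doubly linked list entry before using it: both links
 * must be non-NULL and the neighbours must point back at the entry. */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static int list_entry_sane(const struct list_head *e)
{
        if (e->prev == NULL || e->next == NULL)
                return 0;                               /* corrupted linkage */
        if (e->prev->next != e || e->next->prev != e)
                return 0;                               /* neighbours disagree */
        return 1;
}

int main(void)
{
        struct list_head a, b, c;

        /* a <-> b <-> c, circular, like an inode mapping's page list */
        a.next = &b; b.next = &c; c.next = &a;
        a.prev = &c; b.prev = &a; c.prev = &b;

        printf("b sane: %d\n", list_entry_sane(&b));    /* 1 */

        b.prev = NULL;                                  /* the corruption seen here */
        printf("b sane: %d\n", list_entry_sane(&b));    /* 0: caught before unlink */
        return 0;
}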
Answers:

- If you want us to enable specific kinds of auditing, sure, we can do that (the systems use the default audit settings at present). As to whether or not the bug can be reproduced in that scenario, dunno, since we have no way to reproduce the bug at all other than waiting for it to happen. If you do want us to enable specific kinds of auditing, we'll need to know the expected performance impact as well.

- There are three servers involved: two database servers (a redundant pair, only one of which runs the database at any given time) and a development server. The database servers do not share any disk arrays--the database lives on NFS (on a Netapp box). The development server doesn't use Oracle at all. And again: the corruption has shown up in NFS-based files, local (ext3-based) files, and in-memory kernel structures that aren't associated with any disks.

- I don't see any way to check or change the parity settings for the SCSI bus on these servers (and no, we haven't seen any errors logged).
Actually, we'd prefer that auditing be turned completely off just to rule it out. And can you verify that you've turned off SYSLOGADDR in the client's /etc/sysconfig/netdump file? Also, just for sanity's sake, can you look at the /var/log/messages file on the netdump-server, and tell us what netdump-server error messages are there? There should be some kind of time-out error message. Does the client reboot, or just sit there after the "performing handshake" line?
Ok, we can turn off auditing. When you say this, are you talking about just killing off /sbin/auditd? We haven't commented out SYSLOGADDR on these clients, because based on our testing SYSLOGADDR appears to have no effect on whether or not a client can generate a successful crash dump (as per comment 73), and it's desirable to have it on since our syslog server and netdump server are two different servers. We can turn it off if you think it's worthwhile anyway, though. For this particular event I do see some messages in /var/log/messages regarding the netdump:

Mar 19 14:53:37 server netdump[1003]: No space for dump image
Mar 19 14:53:38 server netdump[1003]: Got too many timeouts waiting for SHOW_STATUS for client 0x0a141e28, rebooting it
Mar 19 14:53:42 server netdump[1003]: Got unexpected packet type 12 from ip 0x0a141e28

I think the "no space for dump image" may be a red herring because 1) the server in this instance had about 4GB free in /var/crash and 2) I've successfully forced a crash dump from a server with twice as much memory as the one in this incident (16GB vs. 8GB) to a netdump server with about the same amount of disk space available in /var/crash as the one in this incident. However, if you think 4GB free is insufficient, I'll try symlinking /var/crash to a larger partition on this particular netdump server. How much space does netdump actually require to dump a system with 16GB of RAM, or one with 8GB? And does this match the amount of space required for netdump to internally pass the "No space for dump image" test? As I recall from my testing, the dumps were always less than 1GB regardless of the memory size of the crashed system.
Oh, and the client *does* generally reboot. It did in this case, anyway, and I don't believe we've ever had a client hang while trying to generate a crash dump on a netdump server after crashing due to the memory corruption bug. Also I forgot to mention the other reason I think the "no space for dump image" message may be a red herring: because on the netdump client/server pair where we've successfully forced memory dumps to occur (in forced panic tests), we've also ended with zero-length vmcore-incomplete files when the system crashed on its own (as a result of the memory corruption bug). I didn't check /var/log/messages in those cases, though...but the point is that it seems we can get memory dumps when we force a crash, but not "in the wild" (or at least not with crashes associated with the corruption bug).
The vmcore files should be the size of memory plus a page for the prepended ELF header. The "No space" error message should not be considered a red herring (at least yet). That is especially true considering that the client reboots, because that means that the client received a reboot request, which would be the case if the no-space situation was perceived (rightly or wrongly) by the netdump server. If there was no communication coming back from the netdump server to the client, the client would hang indefinitely. In any case, the partition containing /var/crash should always be capable of holding a full memory-sized dump.

Re: audit, to temporarily turn the daemon off, just do a:

$ service audit stop

on a running system. To prevent it from being started on a subsequent reboot:

$ chkconfig audit off

And to restore it at some time in the future:

$ chkconfig audit on
Ok, I've symlink'ed /var/crash to a directory that should have sufficient space for a full memory image for the clients (though it's a very tight shave for the database servers). We'll see if that changes anything. I think I'd gotten the impression that the dump wouldn't include all of the memory image from your statement in comment 70 that "getting at least 1GB saved into the vmcore-incomplete may be enough to work with" (and "a vmcore-incomplete of a hugemem kernel would, in most cases, need to be at least 4GB in size" didn't help either... :-). I've forced a crash on one of our production database servers, and verified that it does in fact create a full vmcore file. My question about auditing was mainly to see if you had anything in mind other than the audit service itself. As of yesterday, I've turned off the audit service and unset SYSLOGADDR in the netdump config files on the servers where we've seen the corruption incidents. So now we just have to wait for the bug to strike again.
Hello, John. Another kernel bug has been discovered (in handling invalid or possibly unusual ELF library formats) that could possibly lead to arbitrary data corruption, which might occur in user data pages as well as in kernel data structures. This will be fixed in a U5 respin. I have also developed a patch to detect, report, and fix this problem as it occurs, but I'm still working on validating this test patch (which I hope to complete on Monday). Once the test patch is ready, I will make a test kernel available to you in order to verify whether the previously problematic code path is being triggered on your system. I'll update this BZ on Monday. Cheers. -ernie
Hello, John. I've completed validation of my test patch for detection, reporting, and recovery from a kernel bug in uselib() syscall handling. I've built test kernels based on the latest U5 beta (-31.EL) with the kernel version "2.4.21-31.EL.ernie.uselib.1" and have made several RPMs available on my "people" page. If this kernel detects a strange ELF library (that would have triggered the bug), the following set of messages will appear on the console:

load_elf_library: prevented corruption from weird ELF file
load_elf_library: dentry name 'badlib', inode num 33685
load_elf_library: kalloc() for 64 bytes returned 0xf796aa80
load_elf_library: kfree() was about to be passed 0xf796aaa0

where 'badlib' is the final pathname component of the strange ELF library with such-and-such inode number (which would allow you to locate the file). The test logic will then avoid the data corruption and allow the kernel to continue running normally. Please install one of the following RPMs on a system on which you're willing to run a U5 beta kernel from these URLs:

http://people.redhat.com/~petrides/.uselib/kernel-2.4.21-31.EL.ernie.uselib.1.i686.rpm
http://people.redhat.com/~petrides/.uselib/kernel-smp-2.4.21-31.EL.ernie.uselib.1.i686.rpm
http://people.redhat.com/~petrides/.uselib/kernel-hugemem-2.4.21-31.EL.ernie.uselib.1.i686.rpm
http://people.redhat.com/~petrides/.uselib/kernel-source-2.4.21-31.EL.ernie.uselib.1.i386.rpm

If your test system incurs another data corruption or crash without those console messages appearing, then we'd know that the uselib() handling bug is not related to the problem you're seeing. Please let me know how it goes. Thanks. -ernie
We've set up two of our three systems with your new kernel, and we'll move the third over to it during a downtime period. Does this patch also send output to syslog, or only to the console? And do you have a test executable I could use to trigger the error?
Thanks, John. I'll put this BZ back into NEEDINFO state until you let us know whether you see any of the "load_elf_library" diagnostics or whether you get another crash or corruption without the diagnostics. The diagnostics from the test patch simply use printk() without any log level designator. Thus, they'd appear both on the console and in /var/log/messages (via the syslog mechanism). I do have a reproducer, but I'd rather not make it available because of the obvious security ramifications. We'll be making another U5 beta respin available soon containing a fix, but I've created the test kernel with diagnostics specifically for you so that we could definitively resolve whether your systems were hitting this bug or not.
FYI, we're now running the 2.4.21-31.EL.ernie.uselib.1 kernel on all three of the systems on which we typically see the corruption issue, and we're running with the hugemem kernel on two of those in order to maximize the chance of hitting the bug. We'll let you know as soon as we see another corruption incident (handled or unhandled).
Thanks, John. Reverting BZ to NEEDINFO again.
Hi, John. Any news?
Nothing yet. You will of course be the second to know if we see another corruption incident (handled or unhandled).
Created attachment 113275 [details] "log" output from still another system crash

Another crash just a few hours after my last comment--this time on a system where we've not seen the memory corruption bug before (though it does share the same hardware configuration as the database servers where we often see the bug). And it was running the 2.4.21-31.EL.ernie.uselib.1hugemem kernel. It did produce a "log" file, but unfortunately no crash dump (another zero-length vmcore-incomplete file). So assuming that this was indeed the memory corruption bug again, it looks like the uselib patch didn't fix it for us.
Rats. Looks like we're back to square 1 ... no leads. Thanks for the update, John.
EIP is at clear_inode [kernel] 0xb5 (2.4.21-31.EL.ernie.uselib.1hugemem/i686)
eax: 0004086a   ebx: cd42e180   ecx: 00000000   edx: 1b0c0000
esi: cd42e180   edi: 1b0c1f88   ebp: 00000000   esp: 1b0c1f58
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 19, stackpage=1b0c1000)
Stack: cd42e180 cd42e188 021811fc cd42e180 d2950068 1b300068 ccf22488 ccf22688
       ccf22680 021814dc 1b0c1f88 00001bf2 c6d06d88 dd5cb688 023a1b00 0006eb39
       00000001 00000040 02181724 00001bf2 00000001 021573f8 00000006 000001d0
Call Trace:   [<021811fc>] dispose_list [kernel] 0x3c (0x1b0c1f60)
  [<021814dc>] prune_icache [kernel] 0x8c (0x1b0c1f7c)
  [<02181724>] shrink_icache_memory [kernel] 0x24 (0x1b0c1fa0)
  [<021573f8>] do_try_to_free_pages_kswapd [kernel] 0x168 (0x1b0c1fac)
  [<021575a8>] kswapd [kernel] 0x68 (0x1b0c1fd0)
  [<02157540>] kswapd [kernel] 0x0 (0x1b0c1fe4)
  [<021095ad>] kernel_thread_helper [kernel] 0x5 (0x1b0c1ff0)

As best I can understand it, the failure is here:

void clear_inode(struct inode *inode)
{
	invalidate_inode_buffers(inode);

	if (inode->i_data.nrpages)
		BUG();
	if (!(inode->i_state & I_FREEING))
		BUG();
	if (inode->i_state & I_CLEAR)
		BUG();
	wait_on_inode(inode);
	DQUOT_DROP(inode);
===>	if (inode->i_sb && inode->i_sb->s_op && inode->i_sb->s_op->clear_inode)
		inode->i_sb->s_op->clear_inode(inode);
	if (inode->i_bdev)
		bd_forget(inode);
	else if (inode->i_cdev) {
		cdput(inode->i_cdev);
		inode->i_cdev = NULL;
	}
	inode->i_state = I_CLEAR;
}

...where the super_block's s_op field contains the 4086a value. It would be nice to be able to see the rest of the super_block.

Given a zero-length vmcore file again, what was the error message on the netdump-server this time?
It was "No space for dump image" again (just tracked it down--I didn't realize this was a daemon message rather than a kernel message). This server hadn't ever crashed before, so its dump server hadn't been set up to take the enormous memory dump it would produce. I've changed that now, and we're also actively running the same process that killed it on Saturday, in the hopes of producing another crash.
I have good news about Dave's analysis in comment #107: that crash was caused by the problem reported in bug 124600, which was fixed in the recent U5 respin (for kernel version 2.4.21-32.EL). We can ignore the new crash and expect it to be fixed in U5. Since bug 124600 is not known to be able to cause user-space or file buffer corruption, please continue to use the 2.4.21-31.EL.ernie.uselib.1 kernels (at least until the latest U5 respin is available in the RHN beta channels). Thanks. -ernie
I have bad news: we ran into the memory corruption bug on the database server from comment 105, this time detected by Oracle (so the system didn't actually crash). The kernel didn't notice or prevent the corruption instance. So apparently the patch in the 2.4.21-31.EL.ernie.uselib.1 kernels doesn't fix our bug. The good news is that this is on a system where we have a way to try forcing the bug to occur that's more likely to be successful. We're using the system for other testing right now, but once that's over we'll start hitting it around the clock to try to force it to crash again. Would it be of any use to you for us to force a system panic when we know there's memory corruption in the system? Or do you need a panic that was actually caused (somehow) by the corruption?
> Would it be of any use to you for us to force a system panic when we know > there's memory corruption in the system? Or do you need a panic that was > actually caused (somehow) by the corruption? We wouldn't know what to look at with a forced crash. With a panic caused by the corruption, we at least have some kernel evidence sitting in front of us.
We are particularly interested in the resolution of this bug for 8-way SMP machines with 4 GB (3G/1G) of memory used as SNFS [ADIC] IO servers, which serve 25-50 TB of storage. So far we have tried to expose the problem on the -15 kernel by reducing memory with mem=512M, but without success. The local IO stress tests on the IO servers run just fine, including tar-ing of /proc/kcore, etc.

My current plan is to use all hardware resources and populate the machine with the full 4 GB of memory, while restricting only the default -15 kernel's usable memory to less than 512M. To achieve this we would need to write and load a kernel module that poison-allocates more than 256M and monitors it for infringements (see the sketch below). Hopefully this will give us an early warning of memory corruption without triggering a kernel crash, i.e. without slab debugging turned on. This kind of kernel memory "reduction" would probably need to be refined later, along with more sophisticated IO stressing and monitoring programs, to exacerbate and analyze the problem.

We can offer the necessary hardware and software resources to test the solution of this bug using the above testing methodology. We'd appreciate any suggestions and cooperation... Rgds, Vangel
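[Editorial sketch] Purely as an illustration of the kind of poison-allocation watchdog described above, here is a minimal sketch for a 2.4-era kernel. The module name, allocation size, poison pattern, and scan interval are all invented; a real version would need care about where the memory comes from (a single vmalloc() of hundreds of MB will not fit in the default i386 vmalloc space) and should do the scan in process context rather than in a timer:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/vmalloc.h>
#include <linux/timer.h>

#define POISON_BYTES   (64UL << 20)    /* 64M for the sketch; the plan above calls for >256M */
#define POISON_VALUE   0x5a
#define SCAN_INTERVAL  (60 * HZ)       /* rescan once a minute */

static unsigned char *poison_area;
static struct timer_list scan_timer;

static void scan_poison(unsigned long unused)
{
	unsigned long i;

	/* NOTE: a full scan really belongs in a kernel thread, not a timer
	 * handler; it is done here only to keep the sketch short. */
	for (i = 0; i < POISON_BYTES; i++) {
		if (poison_area[i] != POISON_VALUE) {
			printk(KERN_ERR "poisonmon: corruption at offset %lu "
			       "(found 0x%02x, expected 0x%02x)\n",
			       i, poison_area[i], POISON_VALUE);
			poison_area[i] = POISON_VALUE;  /* re-poison and keep watching */
		}
	}
	mod_timer(&scan_timer, jiffies + SCAN_INTERVAL);
}

int init_module(void)
{
	poison_area = vmalloc(POISON_BYTES);
	if (!poison_area)
		return -ENOMEM;
	memset(poison_area, POISON_VALUE, POISON_BYTES);

	init_timer(&scan_timer);
	scan_timer.function = scan_poison;
	scan_timer.expires = jiffies + SCAN_INTERVAL;
	add_timer(&scan_timer);
	return 0;
}

void cleanup_module(void)
{
	del_timer_sync(&scan_timer);
	vfree(poison_area);
}

MODULE_LICENSE("GPL");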
John, we've just recently discovered another subtle data corruption bug that can affect x86 systems with > 4GB of memory (assuming they're running either the smp or hugemem configs). My next U6 build will include a fix for this problem. Although we are still near the beginning of the U6 development cycle (the next build will be #4), I will make kernels available to you either tomorrow or Monday for testing. It is likely that we will issue a post-U5 erratum with this fix plus a few security fixes, and so it would be very valuable to us if you could determine whether this new bug fix resolves the problems you've seen. I'll update this BZ again when the kernels are available. Thanks. -ernie
Hi, John. We have finally resolved the problem of a very obscure PTE race condition that could cause arbitrary memory corruption (in either user or kernel data) on SMP x86 systems with greater than 4G of memory (with either the smp or hugemem configs). We have already verified definitively that our fix corrects two open bugzilla reports (151865 and 156023), and we are very interested in whether this fix also resolves the data corruption problems you have encountered. The fix has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.4.EL). Although this is just an interim Engineering build for U6 development, I have made the following kernel RPMs available on my "people" page for you to test:

http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-source-2.4.21-32.4.EL.i386.rpm

We intend to incorporate this bug fix in our next post-U5 security errata release (which has not yet been built, but which is likely to enter formal Q/A as soon as U5 is shipped on Wednesday, May 18th). In the meantime, please test this interim U6 build as soon as is practical so that we can determine whether the fix addresses this bugzilla (and BZ 141905). Thanks again for your patience and invaluable assistance throughout this investigation. Cheers. -ernie
The fix (referred to above) for a data corruption problem has also just been committed to the RHEL3 E6 patch pool (in kernel version 2.4.21-32.0.1.EL). I'm tentatively moving this BZ to MODIFIED state and will list it in the associated security advisory that we intend to release post-U5. Please still follow up with test results as requested in comment #125. Thanks. -ernie
Thanks, this does seem hopeful. We'll get this kernel installed on our candidate systems as soon as possible. Does moving the bug to MODIFIED state and listing it in the security advisory mean that you consider it resolved, though? We've been down this route before (e.g. comment 63)... there's no way to tell that this bug is actually fixed except to wait a long, long time. We've often gone 4-6 weeks on the SMP kernel without corruption incidents, and in fact the last corruption incident was 4 weeks ago, on April 20th. So it will take a solid month or more before we can say with any confidence that this fix addresses the type of corruption incidents we've been seeing. BTW, if you have a more detailed description of the fix, that would be nice (since this bug is very high-profile here, a lot of people will ask about it).
Hi, John. In response to your question about this bug being in MODIFIED state, yes, it means that I'm (optimistically) considering this to be resolved. I remember last time, but hopefully, I'll be right this time. :-) If you incur another memory corruption before we release the 2.4.21-32.0.1.EL kernel (which we're targeting for late next week), then I'll promptly change this bug back into ASSIGNED state and remove it (and bug 141905) from the advisory (which will be RHSA-2005:472). I understand that in only one week, we cannot be 100% confident that this bug is resolved. But I'd rather that this bug be appropriately associated with the advisory that contains the fix. (We won't be able to retroactively update the advisory after it's been released on RHN.) I'll attach the patch that fixes this problem along with a more thorough explanation shortly.
Created attachment 114542 [details] data corruption fix committed to RHEL3 U6/E6 kernel patch pools

The critical part of this fix is in establish_pte(), where the call to pte_clear() has been replaced with a call to ptep_get_and_clear(). On x86 smp and hugemem configs, the PTE being operated on is 64 bits wide. The higher-order half holds the part of the physical page frame number that is non-zero only on systems with more than 4GB of memory. The PTE can be accessed concurrently on different cpus only when it is being used by a multi-threaded application.

Consider the case of a multi-threaded app (e.g., Java) in which one thread is executing a fork() syscall and another is about to modify an already writable page whose mapping is not yet in that cpu's TLB. Before the bug fix, cpu A performing the fork might be marking the PTE as read-only via establish_pte() and get as far as clearing the higher-order half of the PTE in set_pte() via pte_clear(). (The top half is zeroed first in the include/asm-i386/pgtable-3level.h version.) Then on cpu B, the other thread stores into a data page whose physical address was supposed to be in memory above 4GB. The MMU on cpu B would load a translation from the correct lower half of the PTE but a zeroed higher half. Even though cpu A then completes the PTE update, cpu B has a wrong translation in its TLB up until the time that the flush_tlb_page() call on cpu A completes. During this window, cpu B can inappropriately modify a page in the first 4GB of memory at the physical address with the same lower-order 32 bits as the correct one.

The fix works by using ptep_get_and_clear(), which zeros the lower-order part of the PTE first (additionally using a read-modify-write cycle to memory). The same fix has been applied to unmap_hugepage_range(), although we do not believe that this latter code path caused any of the corruptions you experienced.

Thus, to incur the corruption, the following things were required:

1) x86 architecture
2) PAE mode (smp or hugemem configs)
3) more than one cpu in the system
4) multi-threaded application execution
5) (probably) a fork() syscall from the application
6) extremely unlucky timing between establish_pte() and another cpu's MMU

As you can see from this explanation, it isn't feasible to catch this sequence of events with a special diagnostic kernel, because one of the critical items (MMU operation) is invisible to the kernel. I suppose you'd need a multi-cpu logic analyzer. Hope this helps. Cheers. -ernie
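[Editorial sketch] To make the ordering argument concrete, here is a rough paraphrase of the PAE page-table helpers involved; the code follows the mainline include/asm-i386/pgtable-3level.h of that era and is not a verbatim copy of the RHEL3 source. The dangerous window exists because pte_clear() goes through set_pte(), which writes the high half of the 64-bit entry before the low half, while ptep_get_and_clear() knocks out the low half (and with it the present bit) first:

/* Approximation of the 2.4-era PAE helpers; not the exact RHEL3 code. */

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	ptep->pte_high = pte.pte_high;	/* upper PFN bits (>4GB) written first */
	smp_wmb();
	ptep->pte_low = pte.pte_low;	/* present bit + lower PFN bits written last */
}

#define pte_clear(xp)	set_pte((xp), __pte(0))
/* Window: after pte_high is zeroed but before pte_low is, the entry still
 * looks present to another cpu's MMU, yet its frame number now points at
 * memory below 4GB -- exactly the corruption scenario described above. */

static inline pte_t ptep_get_and_clear(pte_t *ptep)
{
	pte_t res;

	res.pte_low = xchg(&ptep->pte_low, 0);	/* present bit disappears first */
	res.pte_high = ptep->pte_high;
	ptep->pte_high = 0;
	return res;
}

With ptep_get_and_clear(), another cpu's MMU either sees the complete old entry or a non-present one, never a half-updated translation.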
As a follow-up to my last comment, Larry Woodman and I have worked out a scenario where the establish_pte() call from handle_pte_fault() could allow the data corruption race condition to occur in the absence of any fork() system call from the multi-threaded application. Thus, item 5 in my list above is not a requirement.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-472.html
*** Bug 163699 has been marked as a duplicate of this bug. ***
*** Bug 158328 has been marked as a duplicate of this bug. ***