Bug 25749 - zero page corruption in 2.4.*
Product: Red Hat Linux
Classification: Retired
Component: kernel
i386 Linux
Priority: high  Severity: high
Assigned To: Ben LaHaise
Brock Organ
Reported: 2001-02-02 14:35 EST by Christopher Blizzard
Modified: 2005-10-31 17:00 EST
Doc Type: Bug Fix
Last Closed: 2001-02-13 17:19:44 EST

Description Christopher Blizzard 2001-02-02 14:35:10 EST
Here's some setup email from a good friend of mine:


I've got a Linux question - hoped you could point me at someone who might
be able to help.  In my latest DE testing I've been able to cause a
2.4.{0,1} production system to get into a state where it is unusable -
almost every command you type exits with SEGV - I can't even su - to shut
it down!  It happens after pounding the box running the deltaedge software
(runs as unprivileged user on a stock kernel, so I don't think there is
any way I could have been responsible for mucking the kernel memory).  It
runs OK for a little bit and then the server pukes with a SEGV, and then
the machine is hosed.  When the machine gets into this state, a simple
program like:

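#include <stdio.h>

static void *foo[10000];

int main() {
  int i;

  /* foo has static storage, so every element must start out NULL */
  for (i = 0; i < 10000; i++)
    if (foo[i] != NULL)
      /* loop body is cut off in the report; printing the bad slot is an assumption */
      printf("foo[%d] = %p\n", i, foo[i]);

  return 0;
}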

Will start showing error - after i >= 640...really weird considering the C
spec says that the entire memory occupied by foo MUST be 0!


The application that he's talking about is a big proxy server.  He says
that he knows that this problem doesn't show up in test9 and he's doing a
binary search to figure out exactly where the problem started showing up.

Some stats:

gcc 2.91.66, roughly a 6.2 system with a possibly hacked libc (he'll have to give
more information here).

The machine is a VA box with two PIII 750 MHz CPUs and 1 GB of RAM.

The application makes heavy use of threads, mmap and raw I/O.  He's going
to see about getting us a solid test case.

He'll be added to this bug as soon as I let him know the ID.
Comment 1 Christopher Blizzard 2001-02-02 14:46:08 EST
Client side test case software (before I forget):

Comment 2 Need Real Name 2001-02-02 15:22:37 EST
I'm looking for the kernel rev where the problem appears to have been introduced.
As the problem doesn't occur at exactly the same place every time, it might be a
while before I can track it down.

So far, 2.4.0-test10 APPEARS to be OK.
Comment 3 Need Real Name 2001-02-02 16:46:14 EST
2.4.0-test11 and 2.4.0-test12 APPEAR OK too.  

Something changed between 2.4.0-test12 and 2.4.0-prerelease that seems to be
affecting the performance of the application.  I suspect (based on timing
information tracked in the app) that the cost of mmap() has increased in
situations where a process maintains a large number of active mappings (~5000+).

Disk I/O also appears to have slowed significantly - perhaps related to the
above.

This behaviour is also shown in production 2.4.0.

I'll start looking at the 2.4.1-testX kernels - so far, I can only reproduce
regularly on 2.4.1.
Comment 4 Need Real Name 2001-02-05 10:55:51 EST
It appears that the change that causes the behavior went in with 2.4.1-pre1.  As
noted previously, 2.4.0-prerelease and 2.4.0 both performed horribly compared to

2.4.1-pre1 crashes as initially described and the system becomes unusable - even
simple commands such as "ls" and "sync" die with SEGV.
Comment 5 Ben LaHaise 2001-02-05 18:32:23 EST
There are two things I'm curious about: could you try booting the kernel with
the nofxsr option?  Also, does the corruption still occur if you run the machine
with no swap?
Comment 6 Christopher Blizzard 2001-02-05 19:55:21 EST
add dmgrime to the cc list
Comment 7 Need Real Name 2001-02-06 15:03:28 EST
Tests run under both 2.4.1-pre1 AND 2.4.1:

nofxsr && noswap: performance problem as described above, no crash
nofxsr          : performance problem as described above, crash
noswap          : performance problem as described above, no crash

So, seems like the crash can be prevented by disabling swap, but the performance
problem seems to persist from 2.4.0-prerelease through 2.4.1 production.

The "performance problem" I keep referring to I will try to dig into - my first
instinct points at something with the raw device I/O.  I suspect it has to do
with concurrent raw requests to multiple physical devices.  I'm going to rerun
some tests with only 1 spindle - the application serializes requests per
spindle, so this will rule out a concurrency race.
Comment 8 Ben LaHaise 2001-02-06 19:12:30 EST
Can you please test again with swap after applying the following patchball:
http://www.kvack.org/~blah/fix-v2.4.1-A.tar.gz  Unpack the tarball and apply the
patches with:

  for i in fix-v2.4.1-A/*.diff ; do patch -p1 -s -N -E -d linux/ <$i ; done

This has the kiobuf fixes from Stephen, a patch for zeropage COW based on Linus'
ideas, and Jens' block fixes.  I'm also curious to know which of the patches
makes a difference (I expect that the zeropage patch is the one that matters).

Comment 9 Need Real Name 2001-02-07 09:51:55 EST
Patches downloaded and applied against stock 2.4.1.  The crash symptoms appear
to be gone - but the "performance" issue remains.

Did something change from 2.4.0-test12 to 2.4.0-prerelease that would affect
performance of an application with MANY ( >5000 ) active mmap() segments?

There appears to be quite a bit of activity in mm/mmap.c - in particular the
removal of "merge_segments()"; perhaps related?

I'm going to try stock 2.4.1 with only the zeropage patch next to check stability
- update coming soon.
Comment 10 Need Real Name 2001-02-07 11:24:48 EST

Stock 2.4.1 + 05-zeropage.diff is stable - crash symptoms gone.  Performance
problems remain.  Please see previous note regarding mm/mmap.c.
Comment 11 Michael K. Johnson 2001-02-08 16:23:46 EST
Ben, I'm assigning this bug to you directly since you are working on it.
Comment 12 Ben LaHaise 2001-02-13 17:19:33 EST
Here's a quick update: I was pretty much out of commission last week, but I'm
back now and putting together a patch based on the suggestion that the removal
of segment merging in the kernel is the source of the problem.  I should have it
for you later on today, and will update this entry then.
Comment 13 Ben LaHaise 2001-08-13 11:54:23 EDT
This was fixed for 7.1 final.
