Bug 25749 - zero page corruption in 2.4.*
Summary: zero page corruption in 2.4.*
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.0
Hardware: i386
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ben LaHaise
QA Contact: Brock Organ
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2001-02-02 19:35 UTC by Christopher Blizzard
Modified: 2005-10-31 22:00 UTC
Last Closed: 2001-02-13 22:19:44 UTC


Description Christopher Blizzard 2001-02-02 19:35:10 UTC
Here's some setup email from a good friend of mine:

---------

I've got a Linux question - hoped you could point me at someone who might
be able to help.  In my latest DE testing I've been able to cause a
2.4.{0,1} production system to get into a state where it is unusable -
almost every command you type exits with SEGV - I can't even su - to shut
it down!  It happens after pounding the box running the deltaedge software
(runs as an unprivileged user on a stock kernel, so I don't think there is
any way I could have been responsible for mucking with kernel memory).  It
runs OK for a little bit and then the server pukes with a SEGV, and then
the machine is hosed.  When the machine gets into this state, a simple
program like:

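#include <stdio.h>

/* Static storage: C guarantees every element starts out as NULL. */
static void *foo[10000];

int main(void) {
  int i;

  /* Any non-NULL element means the zero-filled memory was corrupted. */
  for (i = 0; i < 10000; i++)
    if (foo[i] != NULL)
       printf("ERROR\n");
  return 0;
}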

will start printing ERROR once i >= 640 - really weird considering the C
spec says that the entire memory occupied by foo MUST be 0!

--------

The application that he's talking about is a big proxy server.  He says
that he knows that this problem doesn't show up in test9 and he's doing a
binary search to figure out exactly where the problem started showing up.

Some stats:

gcc 2.91.66, roughly 6.2 with a possibly hacked libc (he'll have to give more
information here).

The machine is a VA box with 2 PIII 750 MHz CPUs and a gig of RAM.

The application makes heavy use of threads, mmap and raw I/O.  He's going
to see about getting us a solid test case.

He'll be added to this bug as soon as I let him know the ID.

Comment 1 Christopher Blizzard 2001-02-02 19:46:08 UTC
Client side test case software (before I forget):

http://polygraph.ircache.net/

Comment 2 Need Real Name 2001-02-02 20:22:37 UTC
I'm looking for the kernel rev where the problem appears to be introduced - as
the problem doesn't occur at exactly the same place every time, it might be a
while before I can track it down.

So far, 2.4.0-test10 APPEARS to be OK.


Comment 3 Need Real Name 2001-02-02 21:46:14 UTC
2.4.0-test11 and 2.4.0-test12 APPEAR OK too.  

There is something going on between 2.4.0-test12 and 2.4.0-prerelease which
seems to be affecting the performance of the application.  I suspect (based on
timing information tracked in the app) that the cost of mmap() has increased in
situations where a process maintains a large number of active mappings (~5000+).

It also appears that disk I/O has slowed significantly - perhaps related
to the above.

This behaviour is also shown in production 2.4.0.

I'll start looking at the 2.4.1-testX kernels - so far, I can only reproduce
regularly on 2.4.1.

Comment 4 Need Real Name 2001-02-05 15:55:51 UTC
It appears that the change that causes the behavior went in with 2.4.1-pre1.  As
noted previously, 2.4.0-prerelease and 2.4.0 both performed horribly compared to
2.4.0-test{9,10,11,12}.

2.4.1-pre1 crashes as initially described and the system becomes unusable - even
simple commands such as "ls" and "sync" die with SEGV.


Comment 5 Ben LaHaise 2001-02-05 23:32:23 UTC
There are two things I'm curious about: could you try booting the kernel with
the nofxsr option?  Also, does the corruption still occur if you run the machine
with no swap?


Comment 6 Christopher Blizzard 2001-02-06 00:55:21 UTC
add dmgrime to the cc list

Comment 7 Need Real Name 2001-02-06 20:03:28 UTC
Tests run under both 2.4.1-pre1 AND 2.4.1:

nofxsr && noswap: performance problem as described above, no crash
nofxsr          : performance problem as described above, crash
noswap          : performance problem as described above, no crash

So, seems like the crash can be prevented by disabling swap, but the performance
problem seems to persist from 2.4.0-prerelease through 2.4.1 production.

I will try to dig into the "performance problem" I keep referring to - my first
instinct points at something with the raw device I/O.  I suspect it has to do
with concurrent raw requests to multiple physical devices.  I'm going to rerun
some tests with only 1 spindle - the application serializes requests per
spindle, so this will rule out a concurrency race.

Comment 8 Ben LaHaise 2001-02-07 00:12:30 UTC
Can you please test again with swap after applying the following patchball:
http://www.kvack.org/~blah/fix-v2.4.1-A.tar.gz  Unpack the tarball and apply the
patches with:

    for i in fix-v2.4.1-A/*.diff ; do patch -p1 -s -N -E -d linux/ <$i ; done

This has the kiobuf fixes from Stephen, a patch for zeropage COW based on Linus'
ideas, and Jens' block fixes.  I'm also curious to know which of the patches
make a difference (I expect that the zeropage fix is the culprit).

		-ben

Comment 9 Need Real Name 2001-02-07 14:51:55 UTC
Patches downloaded and applied against stock 2.4.1.  The crash symptoms appear
to be gone - but the "performance" issue remains.

Did something change from 2.4.0-test12 to 2.4.0-prerelease that would affect
performance of an application with MANY ( >5000 ) active mmap() segments?

There appears to be quite a bit of activity in mm/mmap.c - in particular the
removal of "merge_segments()"; perhaps related?

I'm going to try stock 2.4.1 with only the zeropage patch next to check stability
- update coming soon.

Comment 10 Need Real Name 2001-02-07 16:24:48 UTC
Update: 

Stock 2.4.1 + 05-zeropage.diff is stable - crash symptoms gone.  Performance
problems remain.  Please see previous note regarding mm/mmap.c.


Comment 11 Michael K. Johnson 2001-02-08 21:23:46 UTC
Ben, I'm assigning this bug to you directly since you are working on it.

Comment 12 Ben LaHaise 2001-02-13 22:19:33 UTC
Here's a quick update: I was pretty much out of commission last week, but I'm
back now and putting together a patch based on the suggestion that the removal
of segment merging in the kernel is the source of the problem.  I should have it
for you later on today, and will update this entry then.

Comment 13 Ben LaHaise 2001-08-13 15:54:23 UTC
This was fixed for 7.1 final.

