Bug 84448 - kernel-summit-2.4.9-e.12 crashes on low memory conditions (mem=1024m)
Summary: kernel-summit-2.4.9-e.12 crashes on low memory conditions (mem=1024m)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Larry Woodman
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-02-17 15:05 UTC by Nils Philippsen
Modified: 2007-11-30 22:06 UTC
CC List: 0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-07-22 11:49:38 UTC
Target Upstream Version:
Embargoed:


Attachments
`top -d 5 -b` output bzip2ed (899.09 KB, application/octet-stream)
2003-02-17 15:07 UTC, Nils Philippsen
`vmstat 5` output bzip2ed (38.78 KB, application/octet-stream)
2003-02-17 15:15 UTC, Nils Philippsen
profile log (every 5 seconds, count reset each time) bzip2ed, but for e.11 (seems I forgot this with e.12) (64.45 KB, application/octet-stream)
2003-02-17 15:17 UTC, Nils Philippsen
profile log (every 5 seconds, count reset each time) bzip2ed (87.17 KB, application/octet-stream)
2003-02-27 12:27 UTC, Nils Philippsen
`vmstat 5` output bzip2ed (67.24 KB, application/octet-stream)
2003-02-27 12:42 UTC, Nils Philippsen
`top -d 5 -b` output (abridged) bzip2ed (703.42 KB, application/octet-stream)
2003-02-27 12:53 UTC, Nils Philippsen

Description Nils Philippsen 2003-02-17 15:05:21 UTC
Description of problem:

kernel-summit-2.4.9-e.12 crashes on low memory conditions (mem=1024m) when
running the SAP "kernel benchmark" stress tests.

Version-Release number of selected component (if applicable):

kernel-summit-2.4.9-e.12

How reproducible:

Within some hours of running the tests.

Steps to Reproduce:
1. Boot said kernel with "mem=1024m"
2. Run the SAP "kernel benchmark", which is basically an R/3 system (SAPDB
database backend) with 200 simulated users performing business transactions

Actual results:

Machine crashes after some hours.

Expected results:

Machine "crawls" through the benchmark but finishes it (1G really is too little
physical memory for this kind of test).

Additional info:

Shows some symptoms of aberrant kswapd behaviour: an increased number of blocked
processes (in this case the SAPDB kernel processes) and the machine "freezing"
after a while, but no high amount of CPU time spent in kswapd.
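
One way to check the "increased number of blocked processes" symptom against a `vmstat 5` log is to scan the `b` column. The following is a minimal sketch, not the method used for this report; the function name, the threshold and the sample rows are made up for illustration, and it assumes the log keeps vmstat's column-name header line.

```python
def blocked_samples(vmstat_text, threshold=4):
    """Return (sample_number, blocked_count) for rows whose 'b' column exceeds threshold."""
    columns = None
    hits = []
    sample = 0
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "r" in fields and "b" in fields:
            # Column-name header row; vmstat repeats it periodically.
            columns = fields
            continue
        if columns is None or len(fields) != len(columns):
            continue  # group header ("procs memory ...") or truncated row
        if not all(f.isdigit() for f in fields):
            continue
        sample += 1
        blocked = int(fields[columns.index("b")])
        if blocked > threshold:
            hits.append((sample, blocked))
    return hits


# Made-up sample data, only to show the expected input layout.
SAMPLE = """\
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0 512000  20000 300000   0   0    10    20  200   400   5   2  93
 0 12  1  80000   4000   1000  50000 900 800  1500  1700  900  2500   3  40  57
"""

print(blocked_samples(SAMPLE))
```

Because the column is looked up by name, the sketch does not depend on the exact set of columns a given vmstat version prints.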

Comment 1 Nils Philippsen 2003-02-17 15:07:48 UTC
Created attachment 90131
`top -d 5 -b` output bzip2ed

Comment 2 Nils Philippsen 2003-02-17 15:15:59 UTC
Created attachment 90132
`vmstat 5` output bzip2ed

Comment 3 Nils Philippsen 2003-02-17 15:17:26 UTC
Created attachment 90134
profile log (every 5 seconds, count reset each time) bzip2ed, but for e.11 (seems I forgot this with e.12)

Comment 4 Larry Woodman 2003-02-18 16:52:25 UTC
Based on the attachments here there is no evidence that there are any
problems in kswapd.  Both the top and vmstat logs show that there is
8GB of memory to start with and it never drops below ~1.7GB free.  The
profile output shows that there is never > 1/10% of total time spent in
kswapd or any of the routines it calls.  During the peak system time we
see in top, about 4 of the 16 CPUs are in huft_build, which is used to
compress and inflate images.

Are you sure you attached the correct log files?
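
The share of profile ticks spent in given functions can be computed from a profile log along these lines. This is a sketch under assumptions: it expects lines of the form "ticks name load" with a closing "total" line (the layout readprofile is assumed to print), the function name `tick_share` is hypothetical, and the sample numbers are invented.

```python
def tick_share(profile_text, names):
    """Fraction of all profile ticks attributed to the given function names."""
    ticks = {}
    for line in profile_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # Accumulate, so a log holding several concatenated samples also works.
        ticks[fields[1]] = ticks.get(fields[1], 0) + int(fields[0])
    total = ticks.pop("total", None)
    if total is None:
        total = sum(ticks.values())
    if total == 0:
        return 0.0
    return sum(ticks.get(name, 0) for name in names) / total


# Invented numbers, only to show the expected input layout.
SAMPLE = """\
  9970 default_idle                             191.7308
    20 do_page_fault                              0.0169
    10 kswapd                                     0.0125
 10000 total                                      0.0071
"""

print(tick_share(SAMPLE, ["kswapd"]))
```

Passing kswapd together with the routines it calls in `names` gives the combined share for that group.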

Thanks, Larry Woodman


Comment 5 Dave Anderson 2003-02-18 17:02:04 UTC
I took a look at the attached files to see whether their data points to the
kswapd issue.

The profile.log was apparently created against the wrong System.map file.
Besides showing "huft_build" as the overwhelmingly dominant function
(which most likely is supposed to be default_idle), it has references
to 2.4.18-era VM functions that don't even exist in AS2.1's 2.4.9 kernel
(kswapd_balance_pgdat and check_classzone_need_balance, for example).
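
A quick sanity check for this kind of mismatch is to look up the function names from the profile in the System.map of the kernel that was actually running. The sketch below assumes System.map lines of the form "address type name"; the helper names and the sample addresses are made up.

```python
def map_symbols(system_map_text):
    """Set of symbol names from System.map lines of the form 'address type name'."""
    names = set()
    for line in system_map_text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            names.add(fields[2])
    return names


def unknown_symbols(profiled_names, system_map_text):
    """Profiled function names that the given System.map does not define."""
    known = map_symbols(system_map_text)
    return sorted(n for n in profiled_names if n != "total" and n not in known)


# Made-up addresses, only to show the expected input layout.
SAMPLE_MAP = """\
c0105000 T default_idle
c012f000 t kswapd
"""

profiled = ["default_idle", "kswapd_balance_pgdat", "check_classzone_need_balance"]
print(unknown_symbols(profiled, SAMPLE_MAP))
```

A non-empty result suggests the profile was resolved against a different kernel's map.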

The top output shows nothing at all that points at the kswapd issue; in fact
the kswapd, krefilld, bdflush and kupdated daemons are hardly running at all.
It would be helpful to get an Alt-Sysrq-M output, though, to see what the
per-zone memory distribution is; a /proc/meminfo dump would be a bit less
helpful but potentially useful.  A dump of /proc/slabinfo might show whether
there's an aberrant amount of some resource consuming low memory.  But, with the data
we have in hand, there's no indication that there's any low memory pressure;
if there were, you'd expect to see the VM daemons cranked up doing their thing.
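
For the /proc/meminfo dump, a small parser is enough to see how much low memory is free. This is a sketch only; it assumes the dump has "Key: value kB" lines including LowTotal and LowFree (present on highmem kernels), the function names are hypothetical, and the sample values are invented.

```python
def parse_meminfo(text):
    """Map 'Key: <number> kB' lines to integers; other lines are ignored."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 2 and fields[0].isdigit() and fields[1] == "kB":
            info[key.strip()] = int(fields[0])
    return info


def low_free_fraction(info):
    """Free low memory as a fraction of total low memory."""
    return info["LowFree"] / info["LowTotal"]


# Invented values, only to show the expected input layout.
SAMPLE = """\
Mem:  1055760384 1043472384 12288000 0 1024000 51200000
MemTotal:      1031016 kB
MemFree:         12000 kB
HighTotal:      130992 kB
HighFree:         2000 kB
LowTotal:       900024 kB
LowFree:         10000 kB
"""

info = parse_meminfo(SAMPLE)
print(round(low_free_fraction(info), 4))
```

The multi-number "Mem:" summary row is skipped on purpose, since only the per-key kB lines are needed here.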

But the crux of the matter is to find out what the "dw.sapBEN_DVEBM" and
"kernel" processes (the ones that are typically in an uninterruptible state
when the load is high) are contending for.  An Alt-Sysrq-T would be useful to
get an idea of what kernel paths they've taken.  

Another run of "top" against the correct System.map may also yield 
some more clues.

Dave Anderson


Comment 6 Nils Philippsen 2003-02-18 17:46:39 UTC
I have an explanation for this: as I already stated in the thread on the mailing
list (but forgot to put here as well), all userlevel stuff on that machine is
SuSE SLES7 (only machine of that kind here at SAP, not enough space to install
AS on it, wiping the SuSE installation is not an option).

Being spoiled by the RHL initscripts setting the symlink System.map ->
System.map.2.4.x.y.z at boot time, it didn't occur to me that this is not the
case on that system.

Unfortunately I have a strained ankle ATM and can't drive to Walldorf because of it.
I might be able to get there tomorrow or the day after; it depends on my foot
getting better. I also asked the colleagues at SAP LinuxLab to restart the tests
for me.

Comment 7 Nils Philippsen 2003-02-27 12:27:56 UTC
Created attachment 90397
profile log (every 5 seconds, count reset each time) bzip2ed

Comment 8 Nils Philippsen 2003-02-27 12:42:33 UTC
Created attachment 90399
`vmstat 5` output bzip2ed

Comment 9 Nils Philippsen 2003-02-27 12:53:25 UTC
Created attachment 90400
`top -d 5 -b` output (abridged) bzip2ed

Comment 10 Nils Philippsen 2003-02-27 12:54:38 UTC
Logs from the new test run are attached. The top log is abridged because bugzilla
doesn't let me attach non-patch files larger than 1024 KB.

Comment 11 Larry Woodman 2003-02-27 19:46:05 UTC
OK, I don't see any evidence of a crash in any of the attachments.
Like you said, the machine is pretty much crawling with only 1GB
of memory because it's not enough to run this set of applications.
But, it is running and kswapd/krefilld are reclaiming memory and
apps are making some progress on the other 15 cpus.  Am I missing
something here?

Larry Woodman


Comment 12 Nils Philippsen 2003-02-28 07:15:05 UTC
Hmm, the machine is set up with "panic=30" and will by default boot to another
kernel. When I returned in the morning, it was running that other kernel, and at
least for me the top log is cut off "somewhere in the middle", probably while
it was being written -- I see a few hundred non-printable characters at the end.

Unfortunately I have never set up netdump before, or I would have done it this time.

Comment 13 Nils Philippsen 2003-03-24 16:29:38 UTC
Update: I did the tests now with "mem=exactmap mem=...@... mem=...@..." instead
of "mem=1024m" and the machine did run for one or two days, but then kind of
semi-froze (I have no better word for it): Some processes definitely worked
while others didn't (I could telnet onto the machine to see the login: prompt,
then type in username and password, then nothing happened anymore; logging in at
a VC showed a similar picture; pinging worked; ssh didn't work).

My gut feeling is that some processes were swapped out and couldn't be swapped
in again, or something similar (but I have nothing to back this up). I think
netdump wouldn't help in this case (since the machine is neither panicking nor
oopsing).

I can attach the logs of this run on request, but I can't see much difference to
the older ones.

Any thoughts on how we might find the culprit of this problem?

Comment 14 Larry Woodman 2003-06-23 18:09:12 UTC
Is this still a problem with the latest AS2.1 kernel erratum (e.24)?
Lots more lowmem changes have been made since this bug was opened.

Larry Woodman


Comment 15 Nils Philippsen 2003-07-08 23:36:03 UTC
The problem still persists, but we suspect that it arises from us artificially
trying to limit the memory used via "mem=...". As soon as the IBM guy has the
time, we will try it with physical RAM removed.

Comment 16 Nils Philippsen 2003-07-22 11:49:38 UTC
Both IBM and Fujitsu-Siemens got confirmation that their respective Summit
machines aren't supported with less than 2GB per cage = 4GB. In fact, using less
memory is apparently bound to show strange behaviour according to IBM engineers.

