Description of problem:
kernel-summit-2.4.9-e.12 crashes on low memory conditions (mem=1024m) when
running the SAP "kernel benchmark" stress tests.
Version-Release number of selected component (if applicable):
kernel-summit-2.4.9-e.12
How reproducible:
Within some hours of running the tests.
Steps to Reproduce:
1. Boot said kernel with "mem=1024m"
2. Run the SAP "kernel benchmark" which is basically an R/3 system (SAPDB
database backend) with 200 simulated users who try to perform business transactions
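For reference, the memory clamp in step 1 is just a kernel command-line parameter; with LILO (typical for this era) the boot entry might look like the fragment below. The image path and label are illustrative, not taken from the actual machine; "panic=30" matches the setup described later in this report.

```
# /etc/lilo.conf fragment (illustrative image path and label)
image=/boot/vmlinuz-2.4.9-e.12summit
    label=summit-lowmem
    append="mem=1024m panic=30"
    read-only
```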
Machine crashes after some hours.
Machine "crawls" through the benchmark but finishes it (1GB really is too little
physical memory for this kind of test).
Shows some symptoms of aberrant kswapd behaviour: increased number of blocked
processes (in this case the SAPDB kernel processes), machine "freezing" after a
while, but not a large amount of CPU time wasted in kswapd.
Created attachment 90131 [details]
`top -d 5 -b` output bzip2ed
Created attachment 90132 [details]
`vmstat 5` output bzip2ed
Created attachment 90134 [details]
profile log (every 5 seconds, count reset each time) bzip2ed, but for e.11 (seems I forgot this with e.12)
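A rolling profile log like the one attached can be collected with a small loop. This is only a sketch of how it might have been done (it assumes the kernel was booted with the "profile=" parameter, that util-linux's readprofile is installed, and the function name is mine):

```shell
# Hedged sketch: dump the kernel profiling counters every 5 seconds and
# reset them each time, matching the attachment's description.
# Requires root and a kernel booted with e.g. "profile=2".
collect_profile() {
    map=${1:-/boot/System.map}      # path to the System.map for the RUNNING kernel
    while true; do
        date                        # timestamp each snapshot
        readprofile -m "$map"       # print current counters
        readprofile -r              # reset the counters (root only)
        sleep 5
    done
}
# Usage (as root): collect_profile /boot/System.map-2.4.9-e.12 > profile.log
```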
Based on the attachments here there is no evidence that there are any
problems in kswapd. Both the top and vmstat logs show that there is
8GB of memory to start with and it never drops below ~1.7GB free. The
profile output shows that there is never more than 0.1% of total time spent in
kswapd or any of the routines it calls. During the maximum system time we
see in top, about 4 of the 16 CPUs are in huft_build, which is used to
inflate (decompress) compressed images.
Are you sure you attached the correct log files???
Thanks, Larry Woodman
I took a look at the attached files to see whether their data points to the
suspected kswapd problem.
The profile.log was apparently created against the wrong System.map file.
Besides showing "huft_build" (which most likely is really default_idle) as the
overwhelmingly dominant function, it has references
to 2.4.18-era VM functions that don't even exist in AS2.1's 2.4.9 kernel.
(kswapd_balance_pgdat and check_classzone_need_balance for examples).
The top output shows nothing at all that points at the kswapd issue; in fact
the kswapd, krefilld, bdflush and kupdated daemons are hardly running at all.
It would be helpful to get an Alt-Sysrq-M output, though, to see what the
per-zone memory distribution is; a /proc/meminfo dump would be a bit less helpful
but potentially useful. A dump of /proc/slabinfo might show whether there's
an aberrant amount of some resource consuming low memory. But, with the data
we have in hand, there's no indication that there's any low memory pressure;
if there were, you'd expect to see the VM daemons cranked up doing their thing.
But the crux of the matter is to find out what the "dw.sapBEN_DVEBM" and
"kernel" processes (the ones that are typically in an uninterruptible state
when the load is high) are contending for. An Alt-Sysrq-T would be useful to
get an idea of what kernel paths they've taken.
Another profiling run against the correct System.map may also yield
some more clues.
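The data asked for above can be grabbed in one go. A rough sketch (run as root on the affected machine; the function name is mine, and Alt-SysRq-M can also be triggered from software via /proc/sysrq-trigger):

```shell
# Rough sketch of collecting the requested low-memory diagnostics.
gather_mem_diagnostics() {
    echo "=== /proc/meminfo ==="
    cat /proc/meminfo
    echo "=== /proc/slabinfo ==="
    cat /proc/slabinfo 2>/dev/null || echo "(need root to read slabinfo)"
    # Emit the Alt-SysRq-M per-zone memory report into the kernel log,
    # if we have permission to trigger it.
    if [ -w /proc/sysrq-trigger ]; then
        echo m > /proc/sysrq-trigger
        dmesg | tail -n 40
    else
        echo "(no write access to /proc/sysrq-trigger; press Alt-SysRq-M on the console)"
    fi
}
gather_mem_diagnostics
```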
I have an explanation for this: as I already stated in the thread on the mailing
list (but forgot to mention here as well), all userlevel stuff on that machine is
SuSE SLES7 (it is the only machine of that kind here at SAP; there is not enough
space to install AS on it, and wiping the SuSE installation is not an option).
Being spoiled by RHL initscripts setting the symlink System.map ->
System.map-2.4.x.y.z at boot time, it didn't occur to me that this is not the
case on that system.
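A quick check like the following before profiling would have caught the mismatch. This is purely illustrative (paths follow the usual conventions, which the distro in question evidently does not):

```shell
# Sanity check: is there a System.map that matches the running kernel?
check_system_map() {
    ver=$(uname -r)
    map="/boot/System.map-$ver"
    if [ -r "$map" ]; then
        echo "using $map for kernel $ver"
    else
        echo "no System.map for kernel $ver found; pass the right file via readprofile -m"
    fi
}
check_system_map
```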
Unfortunately I have a strained ankle at the moment and can't drive to Walldorf
because of it. I might be able to get there tomorrow or the day after, depending
on my foot getting better. I also asked the colleagues at SAP LinuxLab to
restart the tests.
Created attachment 90397 [details]
profile log (every 5 seconds, count reset each time) bzip2ed
Created attachment 90399 [details]
`vmstat 5` output bzip2ed
Created attachment 90400 [details]
`top -d 5 -b` output (abridged) bzip2ed
Logs from new test run attached. Top log is abridged because bugzilla doesn't
let me attach non-patch files > 1024kb.
OK, I don't see any evidence of a crash in any of the attachments.
Like you said, the machine is pretty much crawling with only 1GB
of memory because it's not enough to run this set of applications.
But it is running, kswapd/krefilld are reclaiming memory, and the
apps are making some progress on the other 15 CPUs. Am I missing
something?
Hmm, the machine is set up with "panic=30" and will by default boot another
kernel. When I returned in the morning, it was running that other kernel, and
(at least for me) the top log is cut off "somewhere in the middle", probably
while it was being written -- I see a few hundred non-printable characters at
the end.
Unfortunately I have never set up netdump before or I would have done it this time.
Update: I did the tests now with "mem=exactmap mem=...@... mem=...@..." instead
of "mem=1024m" and the machine did run for one or two days, but then kind of
semi-froze (I have no better word for it): Some processes definitely worked
while others didn't (I could telnet onto the machine to see the login: prompt,
then type in username and password, then nothing happened anymore; logging in at
a VC showed a similar picture; pinging worked; ssh didn't work).
My gut feeling is that some processes were swapped out and couldn't be swapped
in, or something similar (but I have nothing to back this up). I think netdump
wouldn't help in this case (with the machine not panicking or oopsing).
I can attach the logs of this run on request, but I can't see much difference
from the older ones.
Any thoughts how we might find the culprit of this problem?
Is this still a problem with the latest AS2.1 kernel errata(e.24)?
Lots more lowmem changes have been made since this bug was opened.
The problem still persists, but we suspect that it arises from us artificially
limiting the memory via "mem=...". As soon as the IBM guy has the time, we will
try it with physical RAM actually removed.
Both IBM and Fujitsu-Siemens got confirmation that their respective Summit
machines aren't supported with less than 2GB per cage = 4GB. In fact, using less
memory is apparently bound to show strange behaviour according to IBM engineers.