Red Hat Bugzilla – Bug 119580
System nearly freezes under high shared memory load
Last modified: 2007-11-30 17:06:54 EST
With 16GB RAM/12GB swap, the system nearly freezes under the workload
described below.
We ran "vmstat 1" and observed that system CPU time stays between 60%
and 100%, user time is at most 1%, and idle time makes up the remainder.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start an SAP system
2. run SAP reports which do the following
- one report runs 14 threads, each of which allocates 1GB of shared
memory in /dev/shm and loops over 60% of it (so that it doesn't stay
swapped out); these threads end themselves after about 5 minutes
- one report runs 10 threads, each of which continually writes about
50MB of data into the database without committing it, to produce I/O load
- System load of above 50%, user load max 1%
- When adding the IO, the system nearly freezes:
- running commands (e.g. stopping the SAP system) takes ages to complete
- when the reports that loop over the shared memory end, there is no
more load on the allocated shared memory areas, but still no
improvement until the I/O reports get stopped (either through SAP
means or by shutting down the SAP app server completely)
- Everything back to normal when additional IO stops or the SAP system
gets stopped (stopping the IO and releasing the allocated shared memory)
High load during run of the shared memory and IO reports, getting back
to normal afterwards.
Running the same tests on a machine which only has 8GB RAM/16GB swap
resulted in heavy swapping but it recovered after the shared memory
reports were finished.
Created attachment 99005 [details]
vmstat output during observation of the problem
The problem can be seen between lines 540 and 1825 of the log file.
FYI: This problem was originally observed on SuSE SLES8 on machines
with huge amounts of RAM, with their respective kernels 2.4.19 through
We know that the VM is completely different in 2.4.9-e.38, but as a
starting pointer, here is the patch that fixed it in those kernels:
Andrea Arcangeli: Patch: 2.4.23aa2 (bugfixes and important VM
improvements for the high end)
The interesting part is titled "05_vm_28-shmem-big-iron-1".
I suspect this may not be fixable in AS2.1 without breaking binary
compatibility and/or upsetting VM balancing for all the other
workloads. I've seen the object based reverse mapping used in UL 1.5
and it is a rather intrusive change...
I'm not convinced it would be responsible to take that kind of risk in AS2.1.
Btw, how does RHEL3 work in the same test ?
We provided the link to the patch only for "illustrative" purposes,
i.e. as in "this is what Andrea had to change in his VM to work
around the problem", not as in "let's use this and set the "EasyFix" keyword".
We were doing the tests on RHEL2.1 until yesterday and can begin tests
on a RHEL3 kernel shortly (2.4.21-9.0.1.EL, as that is the
certification candidate for SAP). I'll keep you updated on that.
Please get several "AltSysrq M" outputs while the system is in this
state and attach them to this bug so we can determine the exact state
of the system.
Thanks, Larry Woodman
I suspect that alt-sysrq-p and/or -t may be useful too, or even some
kernel profiling to see what's going on during the bad periods.
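If no serial console is handy, the same SysRq reports can be requested from a remote shell via /proc (a sketch, assuming magic SysRq support is compiled into the kernel and the shell is root; the output lands in the kernel log, not on stdout):

```shell
# Enable magic SysRq, then request the memory (m), registers (p) and
# task (t) reports; capture the kernel log afterwards for attaching.
echo 1 > /proc/sys/kernel/sysrq
for key in m p t; do
    echo $key > /proc/sysrq-trigger    # equivalent of AltSysrq-M/-P/-T
    sleep 5
done
dmesg > sysrq-output.txt
```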
Created attachment 99109 [details]
log of kern.* while issuing sysrq-m, -p, -t
some sysrq-m, -p, -t separated by sysrq help lines
Unfortunately RHEL3 testing isn't possible until after Easter, because
the machine is only accessible remotely for us and we didn't want to
mess up the RHEL2.1 installation with a RHEL3 kernel (and all the
needed tools), so we'll have to wait for a real RHEL3 installation to
be done by someone who has physical access.
Nils, can you get a top output while the system is in this state so we
can see which kernel threads are running?
Created attachment 99397 [details]
top output of system long (days) after the problem
Top output taken days after the problem occurred. Notice the oracle process
that still consumes ~100% CPU and is responsible for 100% system load on one CPU.
Created attachment 99398 [details]
strace output of the oracle process in question
Strace output of "strace -tt -p 20239", i.e. an strace on the Oracle process in
question. Notice the long delay (several seconds) on readv()/pread() on fd 15,
which is one of the Oracle data files:
[root@walsrv09 root]# lsof -p 20239|grep 15u
oracle 20239 root 15u REG 104,11 524296192 9093189
NB: Currently the system is idle, i.e. nothing is going on as far as
we can tell.
NB^2: Now the process (which ps calls "or_smon_HP1", BTW) is idle;
apparently it could finish whatever it was doing in the meantime.
Logging in and starting the ABAP development environment provoked
similar behavior to what was described initially, hangs on reading
from fds 14,15 which are other Oracle data files:
[root@walsrv09 root]# lsof -p 20389 | egrep '14u|15u'
oracle 20389 orahp1 14u REG 104,11 2097160192 2408517
oracle 20389 orahp1 15u REG 104,11 2097160192 1458216
NB: This time it's a real Oracle worker process, not a monitor:
[root@walsrv09 root]# ps 20389
PID TTY STAT TIME COMMAND
20389 ? R 441:49 oracleHP1 (DESCRIPTION=(LOCAL=NO)(SDU=32768))
After looking more closely at the vmstat, AltSysrq and top outputs I
am wondering if the problem is in a different area than we first
thought. Please try to get a kernel profile when the system is
chewing up 100% of a cpu in system mode and attach the output. Here
is a cookbook approach to collecting the necessary stats.
1. Enable kernel profiling by turning on nmi_watchdog and allocating
the kernel profile buffer.
For example, add the following two items to the "kernel" line of
/boot/grub/grub.conf, as in the following example:
kernel /vmlinuz-2.4.9-e.9smp ro profile=2 nmi_watchdog=1 root=0805
2. Create a shell script containing the following lines:
while /bin/true; do
    /usr/sbin/readprofile -v | sort -nr +2 | head -15
    sleep 5
done
3. Make the system demonstrate the aberrant kswapd behavior.
4. Run the following three commands simultaneously:
Execute the readprofile shell script above, redirecting its output
to a file.
Execute "vmstat 5" and redirect its output to a second file
Execute "top -d5" and redirect its output to a third file
5. Attach the three output files (preferably as a gzip'd tar file) to
the appropriate bugzilla or issue tracker entry.
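Steps 2 through 5 can be wrapped into a single collection script along these lines (a sketch; the output file names are made up, and on 2.4-era systems top needs -b so its output can be redirected):

```shell
#!/bin/sh
# Collect profile, vmstat and top data in parallel, then bundle them.
( while /bin/true; do
      /usr/sbin/readprofile -v | sort -nr +2 | head -15
      sleep 5
  done ) > profile.out &
vmstat 5 > vmstat.out &
top -b -d5 > top.out &
sleep 300            # capture a few minutes of the bad period
kill $(jobs -p)
tar czf profile-data.tar.gz profile.out vmstat.out top.out
```

The resulting profile-data.tar.gz is then the single attachment for step 5.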
This bug is filed against RHEL2.1, which is in its maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
For more information on the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.