Bug 119580 - System nearly freezes under high shared memory load
Summary: System nearly freezes under high shared memory load
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Larry Woodman
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2004-03-31 15:47 UTC by SAP LinuxLab
Modified: 2007-11-30 22:06 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 19:22:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vmstat output during observation of the problem (156.61 KB, text/plain)
2004-03-31 15:51 UTC, SAP LinuxLab
no flags Details
log of kern.* while issuing sysrq-m, -p, -t (213.22 KB, text/plain)
2004-04-05 12:38 UTC, Nils Philippsen
no flags Details
top output of system long (days) after the problem (14.57 KB, text/plain)
2004-04-14 11:43 UTC, Nils Philippsen
no flags Details
strace output of the oracle process in question (6.54 KB, text/plain)
2004-04-14 11:50 UTC, Nils Philippsen
no flags Details

Description SAP LinuxLab 2004-03-31 15:47:18 UTC
With 16GB RAM/12GB swap, the system nearly freezes when we do the
following (described below).

We ran "vmstat 1" and observed that the system load is always between
60% and 100% while user is 1% max and idle something between.

Version-Release number of selected component (if applicable):

kernel-enterprise-2.4.9-e.38

How reproducible:

Always

Steps to Reproduce:

1. Start an SAP system
2. run SAP reports which do the following
   - one report runs 14 threads each of which allocate 1GB of shared
memory in /dev/shm and loop over 60% of it (so that it doesn't stay
swapped out), these threads should end themselves after about 5 minutes
   - one report runs 10 threads each of which continually write about
50MB of data into the database, but don't commit it to produce IO load
  
Actual results:

- System load of above 50%, user load max 1%
- When adding the IO, the system nearly freezes:
  - running commands (e.g. stopping the SAP system) takes ages to complete
  - the reports that loop over the shared memory end -> no load on the
allocated shared memory areas, but still no improvement until the IO
reports get stopped (either if you manage to stop them with SAP means
or by shutting down the SAP app server completely)
- Everything back to normal when additional IO stops or the SAP system
gets stopped (stopping the IO and releasing the allocated shared memory)

Expected results:

High load during run of the shared memory and IO reports, getting back
to normal afterwards.

Additional info:

Running the same tests on a machine which only has 8GB RAM/16GB swap
resulted in heavy swapping but it recovered after the shared memory
reports were finished.

Comment 1 SAP LinuxLab 2004-03-31 15:51:37 UTC
Created attachment 99005 [details]
vmstat output during observation of the problem

The problem can be seen between lines 540 and 1825 of the log file.

Comment 2 SAP LinuxLab 2004-03-31 16:00:17 UTC
FYI: This problem was originally observed on SuSE SLES8 on machines
with huge amounts of RAM, with their respective kernels 2.4.19 through
2.4.21.

We know that the VM is completely different in 2.4.9-e.38, but for a
starting pointer here is the patch that fixed it in these kernels:

Andrea Arcangeli: Patch: 2.4.23aa2 (bugfixes and important VM
improvements for the high end)

http://lwn.net/Articles/73367/

The interesting part is titled "05_vm_28-shmem-big-iron-1".

Comment 3 Rik van Riel 2004-03-31 18:49:31 UTC
I suspect this may not be fixable in AS2.1 without breaking binary
compatibility and/or upsetting VM balancing for all the other
workloads. I've seen the object based reverse mapping used in UL 1.5
and it is a rather intrusive change...

I'm not convinced it would be responsible to take that kind of risk in
AS2.1.

Btw, how does RHEL3 work in the same test ?

Comment 4 Nils Philippsen 2004-04-01 07:02:41 UTC
We provided the link to the patch only for "illustrative" purposes,
i.e. like in "this is what Andrea hat to change in their VM to work
around the problem", not in "let's use this" and setting the "EasyFix"
keyword ;-).

We were doing the tests on RHEL2.1 unti yesterday and can begin tests
on a RHEL3 kernel shortly (2.4.21-9.0.1.EL as that is the
certification candidate for SAP). I'll keep you updated on that right
here.

Comment 5 Larry Woodman 2004-04-01 15:10:32 UTC
Please get several "AltSysrq M" outputs when the system in is this
state and attach them to this bug so we can determine the exact state
of the system.

Thanks, Larry Woodman

Comment 6 Rik van Riel 2004-04-01 16:19:03 UTC
I suspect that alt-sysrq-p and/or -t may be useful too, or even some
kernel profiling to see what's going on during the bad periods.

Comment 7 Nils Philippsen 2004-04-05 12:38:18 UTC
Created attachment 99109 [details]
log of kern.* while issuing sysrq-m, -p, -t

some sysrq-m, -p, -t separated by sysrq help lines

Comment 8 Nils Philippsen 2004-04-05 12:42:38 UTC
Unfortunately RHEL3 testing isn't possible until after easter, because
the machine is only accessible remotely for us and we didn't want to
mess up the RHEL2.1 installation with a RHEL3 kernel (and all the
needed tools), so we'll have to wait for a real RHEL3 installation to
be done by someone who has physical access.

Comment 9 Larry Woodman 2004-04-12 16:21:46 UTC
Nils, can you get a top output while the system is in this state so we
can see which kernel theads are running?

Larry


Comment 10 Nils Philippsen 2004-04-14 11:43:46 UTC
Created attachment 99397 [details]
top output of system long (days) after the problem

Top output of days after the problem. Notice the oracle process that still
consumes ~100% cpu and is responsible for 100% system load on one CPU.

Comment 11 Nils Philippsen 2004-04-14 11:50:01 UTC
Created attachment 99398 [details]
strace output of the oracle process in question

Strace output of "strace -tt -p 20239", i.e. an strace on the Oracle process in
question. Notice the long delay (several seconds) on readv()/pread() on fd 15,
which is one of the Oracle data files:

[root@walsrv09 root]# lsof -p 20239|grep 15u
oracle	20239 root   15u   REG	  104,11  524296192  9093189
/data/oracle.hp1/sapdata1/system_1/system.data1
[root@walsrv09 root]#

Comment 12 Nils Philippsen 2004-04-14 11:52:45 UTC
NB: Currently the system is idle, i.e. nothing is going on to our
knowledge. 

Comment 13 Nils Philippsen 2004-04-14 11:54:38 UTC
NB^2: Now the process (which ps calls "or_smon_HP1" BTW) is idle,
apparently it could finish whatever it was doing in the meantime.

Comment 14 Nils Philippsen 2004-04-14 12:00:32 UTC
Logging in and starting the ABAP development environment provoked
similar behavior to what was described initially, hangs on reading
from fds 14,15 which are other Oracle data files:

[root@walsrv09 root]# lsof -p 20389 | egrep '14u|15u'
oracle  20389 orahp1   14u   REG     104,11 2097160192   2408517
/data/oracle.hp1/sapdata3/es46ci_1/es46ci.data1
oracle  20389 orahp1   15u   REG     104,11 2097160192   1458216
/data/oracle.hp1/sapdata2/es46cd_1/es46cd.data1
[root@walsrv09 root]#


Comment 15 Nils Philippsen 2004-04-14 12:02:05 UTC
NB: This time it's a real Oracle worker process, not a monitor:

[root@walsrv09 root]# ps 20389
  PID TTY      STAT   TIME COMMAND
20389 ?        R    441:49 oracleHP1 (DESCRIPTION=(LOCAL=NO)(SDU=32768))
[root@walsrv09 root]#


Comment 16 Larry Woodman 2004-04-14 15:39:24 UTC
After looking more closely at the vmstat, AltSysrq and top outputs I
am wondering if the problem is in a different area than we first
thought.  Please try to get a kernel profile when the system is
chewing up 100% of a cpu in system mode and attach the output.  Here
is a cookbook approach to collecting the necessary stats.



1. Enable kernel profiling by turning on nmi_watchdog and allocating
the kernel profile buffer.
   For example, add the following two items to the "kernel" line of
/boot/grub/grub.conf:

     profile=2 nmi_watchdog=1

   as in the following example:

     kernel /vmlinuz-2.4.9-e.9smp ro profile=2 nmi_watchdog=1 root=0805
                                     ^^^^^^^^^ ^^^^^^^^^^^^^^
   Then reboot.

2. Create a shell script containing the following lines:

#!/bin/sh
while /bin/true; do
echo;date
/usr/sbin/readprofile -v | sort -nr +2 | head -15
/usr/sbin/readprofile -r
sleep 5
done

3. Make the system demonstrate the aberrant kswapd behavior.

4. Run the following three commands simultaneously:

    Execute the readprofile shell script above, redirecting its output
to a file.
    Execute "vmstat 5" and redirect its output to a second file
    Execute "top -d5" and redirect its output to a third file

5.  Attach the three output files (preferably in gzip'd tar file
format) to the
    appropriate bugzilla or issue tracker entry.


Comment 17 RHEL Program Management 2007-10-19 19:22:54 UTC
This bug is filed against RHEL2.1, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products.  Since
this bug does not meet that criteria, it is now being closed.

For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your
support representative.  You may be asked to provide detailed
information on how this bug is affecting you.


Note You need to log in before you can comment on or make changes to this bug.