Red Hat Bugzilla – Bug 119580
System nearly freezes under high shared memory load
Last modified: 2007-11-30 17:06:54 EST
With 16GB RAM/12GB swap, the system nearly freezes under the workload
described below.
We ran "vmstat 1" and observed that system CPU time stays between 60%
and 100%, user time is at most 1%, and idle time makes up the remainder.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start an SAP system
2. run SAP reports which do the following
- one report runs 14 threads, each of which allocates 1GB of shared
memory in /dev/shm and loops over 60% of it (so that it doesn't stay
swapped out); these threads end themselves after about 5 minutes
- one report runs 10 threads, each of which continually writes about
50MB of data into the database without committing it, to produce I/O load
- System load of above 50%, user load max 1%
- When adding the IO, the system nearly freezes:
- running commands (e.g. stopping the SAP system) takes ages to complete
- when the reports that loop over the shared memory end, there is no
more load on the allocated shared memory areas, but still no
improvement until the I/O reports get stopped (either through SAP
means or by shutting down the SAP app server completely)
- Everything back to normal when additional IO stops or the SAP system
gets stopped (stopping the IO and releasing the allocated shared memory)
High load during run of the shared memory and IO reports, getting back
to normal afterwards.
Running the same tests on a machine which only has 8GB RAM/16GB swap
resulted in heavy swapping but it recovered after the shared memory
reports were finished.
Created attachment 99005 [details]
vmstat output during observation of the problem
The problem can be seen between lines 540 and 1825 of the log file.
FYI: This problem was originally observed on SuSE SLES8 on machines
with huge amounts of RAM, with their respective kernels 2.4.19 through
We know that the VM is completely different in 2.4.9-e.38, but as a
starting pointer, here is the patch that fixed it in those kernels:
Andrea Arcangeli: Patch: 2.4.23aa2 (bugfixes and important VM
improvements for the high end)
The interesting part is titled "05_vm_28-shmem-big-iron-1".
I suspect this may not be fixable in AS2.1 without breaking binary
compatibility and/or upsetting VM balancing for all the other
workloads. I've seen the object based reverse mapping used in UL 1.5
and it is a rather intrusive change...
I'm not convinced it would be responsible to take that kind of risk in AS2.1.
Btw, how does RHEL3 work in the same test ?
We provided the link to the patch only for "illustrative" purposes,
i.e. as in "this is what Andrea had to change in his VM to work
around the problem", not as in "let's use this and set the "EasyFix" keyword".
We were doing the tests on RHEL2.1 until yesterday and can begin tests
on a RHEL3 kernel shortly (2.4.21-9.0.1.EL, as that is the
certification candidate for SAP). I'll keep you updated on that.
Please get several "AltSysrq M" outputs while the system is in this
state and attach them to this bug so we can determine the exact state
of the system.
Thanks, Larry Woodman
I suspect that alt-sysrq-p and/or -t may be useful too, or even some
kernel profiling to see what's going on during the bad periods.
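If no serial console is handy, the same SysRq reports can be requested from a remote shell via /proc (a sketch, assuming magic SysRq support is compiled into the kernel and the shell is root; the output lands in the kernel log, not on stdout):

```shell
# Enable magic SysRq, then request the memory (m), registers (p) and
# task (t) reports; capture the kernel log afterwards for attaching.
echo 1 > /proc/sys/kernel/sysrq
for key in m p t; do
    echo $key > /proc/sysrq-trigger    # equivalent of AltSysrq-M/-P/-T
    sleep 5
done
dmesg > sysrq-output.txt
```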
Created attachment 99109 [details]
log of kern.* while issuing sysrq-m, -p, -t
some sysrq-m, -p, -t separated by sysrq help lines
Unfortunately RHEL3 testing isn't possible until after Easter, because
the machine is only accessible remotely for us and we didn't want to
mess up the RHEL2.1 installation with a RHEL3 kernel (and all the
needed tools), so we'll have to wait for a real RHEL3 installation to
be done by someone who has physical access.
Nils, can you get a top output while the system is in this state so we
can see which kernel threads are running?
Created attachment 99397 [details]
top output of system long (days) after the problem
Top output taken days after the problem occurred. Notice the oracle process
that still consumes ~100% CPU and is responsible for 100% system load on one CPU.
Created attachment 99398 [details]
strace output of the oracle process in question
Strace output of "strace -tt -p 20239", i.e. an strace on the Oracle process in
question. Notice the long delay (several seconds) on readv()/pread() on fd 15,
which is one of the Oracle data files:
[root@walsrv09 root]# lsof -p 20239|grep 15u
oracle 20239 root 15u REG 104,11 524296192 9093189
NB: Currently the system is idle, i.e. nothing is going on as far as
we can tell.
NB^2: Now the process (which ps calls "or_smon_HP1", BTW) is idle;
apparently it could finish whatever it was doing in the meantime.
Logging in and starting the ABAP development environment provoked
similar behavior to what was described initially, hangs on reading
from fds 14,15 which are other Oracle data files:
[root@walsrv09 root]# lsof -p 20389 | egrep '14u|15u'
oracle 20389 orahp1 14u REG 104,11 2097160192 2408517
oracle 20389 orahp1 15u REG 104,11 2097160192 1458216
NB: This time it's a real Oracle worker process, not a monitor:
[root@walsrv09 root]# ps 20389
PID TTY STAT TIME COMMAND
20389 ? R 441:49 oracleHP1 (DESCRIPTION=(LOCAL=NO)(SDU=32768))
After looking more closely at the vmstat, AltSysrq and top outputs I
am wondering if the problem is in a different area than we first
thought. Please try to get a kernel profile when the system is
chewing up 100% of a cpu in system mode and attach the output. Here
is a cookbook approach to collecting the necessary stats.
1. Enable kernel profiling by turning on nmi_watchdog and allocating
the kernel profile buffer.
For example, add the following two items to the "kernel" line of
/boot/grub/grub.conf, as in the following example:
kernel /vmlinuz-2.4.9-e.9smp ro profile=2 nmi_watchdog=1 root=0805
2. Create a shell script containing the following lines:
while /bin/true; do
    /usr/sbin/readprofile -v | sort -nr +2 | head -15
    sleep 5
done
3. Make the system demonstrate the aberrant kswapd behavior.
4. Run the following three commands simultaneously:
Execute the readprofile shell script above, redirecting its output
to a file.
Execute "vmstat 5" and redirect its output to a second file
Execute "top -d5" and redirect its output to a third file
5. Attach the three output files (preferably as a gzip'd tar file) to
the appropriate bugzilla or issue tracker entry.
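Steps 2 through 5 can be wrapped into a single collection script along these lines (a sketch; the output file names are made up, and on 2.4-era systems top needs -b so its output can be redirected):

```shell
#!/bin/sh
# Collect profile, vmstat and top data in parallel, then bundle them.
( while /bin/true; do
      /usr/sbin/readprofile -v | sort -nr +2 | head -15
      sleep 5
  done ) > profile.out &
vmstat 5 > vmstat.out &
top -b -d5 > top.out &
sleep 300            # capture a few minutes of the bad period
kill $(jobs -p)
tar czf profile-data.tar.gz profile.out vmstat.out top.out
```

The resulting profile-data.tar.gz is then the single attachment for step 5.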
This bug is filed against RHEL2.1, which is in its maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
For more information on the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.