Bug 106794
Summary: LTC4829 - RHEL 3 hangs under heavy stress load
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: IBM Bug Proxy <bugproxy>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: anderson, bjohnson, mingo, petrides, riel, tburke
Doc Type: Bug Fix
Last Closed: 2004-12-03 02:02:48 UTC
Bug Blocks: 106472
Description
IBM Bug Proxy
2003-10-10 17:18:36 UTC
------ Additional Comments From yvchan.com 2003-10-10 13:30 -------
More notes: caps lock and num lock don't show any response when I tried them after the last hang.

------ Additional Comments From yvchan.com 2003-12-10 10:35 -------
Our second machine has hung in the same way as the first machine. It hung at approximately 10pm last night. top output also shows something taking up 100% CPU time, but the process doesn't appear in the last incarnation of the top output.

22:35:52 up 1 day, 6:59, 4 users, load average: 6.04, 2.64, 1.86
494 processes: 488 sleeping, 5 running, 1 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   61.4%    0.0%   38.5%   0.0%     0.0%    0.0%    0.0%
           cpu00   91.8%    0.0%    8.1%   0.0%     0.0%    0.0%    0.0%
           cpu01    0.0%    0.0%  100.0%   0.0%     0.0%    0.0%    0.0%
Mem: 1028452k av, 1005416k used, 23036k free, 0k shrd, 25228k buff
     723244k actv, 51880k in_d, 13876k in_c
Swap: 2048248k av, 853856k used, 1194392k free  242356k cached

The 91% user time appears to be a db2sysc process, but it doesn't appear to be stuck. I had run several backtraces on it, and it appeared to be processing data as it is supposed to be.

------ Additional Comments From dmosby.com 2003-12-10 14:53 -------
I have observed three potential error conditions on this system:
1) System is hung. No response on console, unable to telnet/ssh to the system, and the caps lock key does not activate the caps lock light, indicating that interrupts are not being processed.
2) Process "db2inst1" is constantly on a processor. Able to ssh to the box. The process is in kernel space 100%. An strace of the process will hang, but can be killed from another connection.
3) Process "db2inst1" is constantly on a processor. Able to ssh to the box. The process returns to user space, executing msgsnd in a loop, constantly moving to kernel space to service the system call.
It is possible to move from state 2 to state 1. My belief at this time is that conditions 1 and 2 above are the same kernel problem.
I need to do some additional investigation into this and that investigation is in progress.

Conditions 2 and 3 can be differentiated by running strace on the process. Execute the command:

    strace -p <process pid> | less

In state 2 the strace command will hang. Kill the strace using another connection. In state 3 the strace will show calls to msgsnd as:

    msgsnd(1769483, {1, "p#"}, 2, 0) = 0
    msgsnd(1802252, {1, "p#"}, 2, 0) = 0
    msgsnd(1835021, {1, "p#"}, 2, 0) = 0
    msgsnd(1769483, {1, "p#"}, 2, 0) = 0
    msgsnd(1802252, {1, "p#"}, 2, 0) = 0

In the condition observed, the command "ipcs -q" indicated that the data was being read, and another process was located that was consuming the data, executing msgrcv. Whether or not this activity is an application error I cannot say, but there does not appear to be any operating system error in this third state. Work to demonstrate the cause of the errors shown in states 1 & 2 is in progress.

------ Additional Comments From yvchan.com 2003-14-10 10:14 -------
Small corrections to Dale's append earlier:
state 2 has db2ipccm on the processor
state 3 has db2agent on the processor
We are currently investigating state 3 internally here.

------ Additional Comments From dmosby.com 2003-14-10 11:24 -------
In e-mail 14-Oct-03 Mark Wisner wrote:
> Dale, any luck?
> If not, do you have a feeling where the problem is?
> Is it the distro or the application or both?
> In your last append to the bug, you say "conditions 1 and 2 above are
> the same kernel problem" then "Whether or not this activity is an
> application error I cannot say but there does not appear to be any
> operating system error in this third state."
>
> From what I gather, we have two problems:
> 1 & 2 appear to be a kernel problem and 3
> seems to be an application problem.
>
> RH wants to GA on the 22nd, so things are
> getting close. Is there anything you need?

My response:

Problem: Yes, I think there are two conditions that give the appearance of problems.
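As an aside for readers unfamiliar with the calls in the strace output above: the msgsnd() loop can be reproduced outside DB2. The following is a minimal illustrative sketch (not part of the original report) that drives the same SysV message-queue calls through libc via ctypes; it assumes Linux and the usual Linux values for the IPC constants, and uses a private queue that it removes afterwards.

```python
import ctypes

# Load libc from the current process namespace (Linux).
libc = ctypes.CDLL(None, use_errno=True)

# Usual Linux values for these constants (an assumption of this sketch).
IPC_PRIVATE = 0
IPC_CREAT = 0o1000
IPC_RMID = 0

class MsgBuf(ctypes.Structure):
    # struct msgbuf { long mtype; char mtext[2]; } -- 2 bytes, like the trace.
    _fields_ = [("mtype", ctypes.c_long), ("mtext", ctypes.c_char * 2)]

libc.msgget.argtypes = [ctypes.c_int, ctypes.c_int]
libc.msgsnd.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.msgrcv.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t,
                        ctypes.c_long, ctypes.c_int]
libc.msgrcv.restype = ctypes.c_ssize_t

def send_and_receive():
    """Send the 2-byte message {1, "p#"} and read it back, as in the trace."""
    qid = libc.msgget(IPC_PRIVATE, IPC_CREAT | 0o600)
    assert qid >= 0
    out = MsgBuf(1, b"p#")
    rc = libc.msgsnd(qid, ctypes.byref(out), 2, 0)  # msgsnd(qid, {1, "p#"}, 2, 0)
    inb = MsgBuf()
    n = libc.msgrcv(qid, ctypes.byref(inb), 2, 0, 0)
    libc.msgctl(qid, IPC_RMID, None)                # remove the private queue
    return rc, n, inb.mtext

print(send_and_receive())
```

A process stuck in state 3 is essentially the msgsnd call above executing in a tight loop, with a peer draining the queue via msgrcv, which is why "ipcs -q" showed the data being consumed.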
What I described as "state 1" (complete system hang) and "state 2" (process spinning in kernel) I think are the same problem. What I described as "state 3" (process continually sending 2-byte messages) may or may not be an application problem; I know nothing about the application. This third state does not appear to be a kernel problem; as far as I can tell the operating system is working properly when this is observed.

I am unable to do any debugging in "state 1". My debugging in "state 2" is based on looking at the process task stack and digging through memory in a single instance of the hang. Before coming forward with a theory on the error I would like to gather more information using a more robust debugging technique.

I had hoped to create a module to assist in debugging. This would let me insert the module into the running kernel, allowing observation of some data structures of interest. This would facilitate easy modifications to the module, allowing observation of more data without a kernel reboot. I found that the particular data of interest is local to a single source file, and creating a loadable module able to access this would be a tedious process and require more time than I desire to invest in a task lacking guaranteed completion time.

I changed tactics and used the less elegant approach of simply modifying a source file to create a new /proc file to provide debug information. I made the file modification on my personal system, a single processor box running RH9. I did pull the RHEL 3.0 sources down to my box but am unable to compile that kernel here. It may be that I need a RHEL 3.0 gcc, I do not know.

What I need:
1) Have a kernel locally configured and built.
2) Allow me to modify the file on one or both (I suggest both) systems in order to create a new /proc entry.
3) Re-run the test attempting to put the system in "state 2" (process spinning in kernel) and observe the data structure.
4) Further debugging from this point may require additional kernel modifications.
5) Provide IP address and password for the second system.

------ Additional Comments From yvchan.com 2003-15-10 10:13 -------
Several updates (sent via e-mail, and reposted here):

From Dale (Oct 15, 1:01 am):

Status of machine "valrhel3rc2":

Modified the file "ipc/shm.c" such that it creates /proc/sysvipc/shmlks. This /proc entry is similar to /proc/sysvipc/shm with the following changes:
1) Does not grab any locks when obtaining data
2) Displays the value of the global semaphore for the list and the value of each spinlock
There is a bit of other info for the data structure displayed as well.

The impact of not grabbing the lock means that there is a very small possibility that data will be modified while I am printing it. However, if the problem observed the other day is actually a held or botched lock then we can't try and grab it. I immediately hung the box the other day by trying to look at /proc/sysvipc/shm, so it seems that this is the case.

Built a new kernel and installed it as /boot/linux-ipcdebug1. Modified boot/grub.conf to make this the default boot kernel by setting "default=3". Started the DB2 test. Created shell script "monitor.sh" in directory /home/db2inst1/absolut/absolut. This script is running. It will output the contents of /proc/sysvipc/shmlks once every minute into file "lock.log" in this directory.

I checked on the system before sending this and find that process 2910 is spinning on cpu1. It never seems to come back into user state - I cannot strace it - unless it does come back out of the kernel in some manner that strace cannot follow. I could output /proc/sysvipc/shm and see if the system hangs, to test this idea of a botched lock, but that could hang the system so I will wait until tomorrow. My output does not show a held/trashed lock, so I don't know what went wrong. I will have to look at this more tomorrow. Thanks for your help on this. We will talk more Wednesday.

My response (Oct 15, 8:52 am):

Interesting ... process 2910 is not the db2ipccm.
It's an application process that is stuck in kernel space. This is one of the bh clients.

Dale's response (Oct 15, 10:04 am):

This morning the system still has process 2190 tying up one cpu, and the shared memory data structures look "ok"; at least the locking info looks fine. I'm going to go have a cup of coffee and think about this some more. Actually I thought about it too much during the night. I'm glad this is a dual processor system so that I can still get on with one spinning process, though I wonder if we would hit this on a single CPU system.

------ Additional Comments From dmosby.com 2003-16-10 14:46 -------
I have had a chance to look at two systems in the state in which a process is spinning on a processor. In both cases attempting to run strace on the process would hang with no output, indicating that the process was never coming back out to user space.

In both of these investigations, use of /proc/sysrq-trigger was made to generate stack information. [I will send a separate note to Greg & Mark that this has the potential to be very useful for debugging. I would very much like to see this feature made to work fully. It does not work to dump process registers, and if it did it would only show registers for the CPU handling the request (see sysrq_handle_showregs in sysrq.c). I would like to have this send an inter-processor interrupt to every CPU and get registers from all processors. I would like to see this passed along to RH. Will pursue off-line.]

In looking at the first stack we see the spinning process in sys_shmat. The process was spinning trying to grab a lock. In the second case we were in sys_semtimedop. Both of these are called from sys_ipc.

The test we are using can fail in a mode where one process spins on a processor while the other processor continues to execute processes. We can also fail in a mode where the system simply hangs.
I was able to hang the system - moving from the state of a process spinning on a processor to completely hung - in one case (the only time I tried) by trying to cat /proc/sysvipc/shm. The code in routine sysvipc_shm_read_proc does try and lock shm_ids, so if the lock problem is here this would explain the hang at this time.

We are testing this using a DB2 test. We are using a dual processor system. In a test in which the system was booted with only one cpu (maxcpus=1) we have in excess of 16 hours of run time with no problems. Normally we would expect a failure in an hour or less.

My belief is that there is a bug in the locking code associated with struct ipc_ids (see ipc/util.h). In the RHEL 2.1 code ipc_lock would grab spinlock ary. In the RHEL 3.0 code this uses br_read_lock. The idea is that we use a read lock with the expectation that in the normal case we will have multiple processes able to go in and grab a spinlock for an individual element of ipc_id* entries. There could be a failure in either the br_read_lock or a spinlock within the array. In my debugging, in which I printed the spinlock for each ipc_id entry, I did not observe any held, so right now I think the problem is with the br_read_lock, but I have not really investigated enough to say for sure. The debugging I have done, printing spin locks and looking at code changes, makes me favor this as the problem.

------ Additional Comments From dmosby.com 2003-17-10 09:54 -------
My preliminary investigation leads me to believe that a process is spinning in the ipc_lock code. As this is internal to the kernel, it is my belief that this is strong evidence that a kernel bug exists. Further investigation would be required to provide additional evidence. For additional details of the problem see the previous Bugzilla entry.

------ Additional Comments From yvchan.com 2003-17-10 10:26 -------
I know most people know this, but to reiterate: this bug will prevent DB2 from supporting RHEL 3; hence the show-stopper priority.
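For readers unfamiliar with the kernel entry points named in these comments: sys_shmat is the kernel side of the shmat() library call. The following is an illustrative sketch (assumed, not from the report) of the user-space sequence that enters sys_shmat - shmget, shmat, shmdt - in Python via ctypes on Linux, with the usual Linux constant values assumed.

```python
import ctypes

# Load libc from the current process namespace (Linux).
libc = ctypes.CDLL(None, use_errno=True)

# Usual Linux values for these constants (an assumption of this sketch).
IPC_PRIVATE = 0
IPC_CREAT = 0o1000
IPC_RMID = 0

libc.shmget.argtypes = [ctypes.c_int, ctypes.c_size_t, ctypes.c_int]
libc.shmat.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
libc.shmat.restype = ctypes.c_void_p   # shmat returns a pointer, not an int
libc.shmdt.argtypes = [ctypes.c_void_p]

def shm_round_trip():
    """shmget/shmat/shmdt round trip: write a byte through the mapping."""
    shmid = libc.shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0o600)
    assert shmid >= 0
    addr = libc.shmat(shmid, None, 0)    # this enters the sys_shmat path
    assert addr not in (None, ctypes.c_void_p(-1).value)
    ctypes.memmove(addr, b"A", 1)        # touch the shared segment
    byte = ctypes.string_at(addr, 1)
    libc.shmdt(addr)
    libc.shmctl(shmid, IPC_RMID, None)   # remove the private segment
    return byte
```

On the affected kernel, a process executing the shmat step above could end up spinning inside the kernel on the ipc_ids locking described in the comment; on a healthy kernel the call simply returns a mapping address.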
------ Additional Comments From dmosby.com 2003-17-10 12:58 -------
Some notes about the code involved. (This may appear a bit complex, but is reasonably obvious when you are looking at the source code.)

In code base 2.4.21 the routine ipc_lock uses a spin lock as the overall lock for struct ipc_ids. See ipc/util.h. The code in RHEL 2.1 is essentially the same as linux-2.4.21.

In code base RHEL 3.0 the ipc_lock code now uses br_read_lock(BR_SEMAPHORE_LOCK) for this function. The idea is that we allow multiple readers to go after individual spin locks in the array.

In code base linux-2.6.0-test7 the code uses RCU (Read Copy Update) and read_barrier_depends for this function.

The point here is that it appears that, moving from the 2.4.21 code base, both the RHEL 3.0 and 2.6.0 code have tried for a performance improvement for the SMP case of ipc_lock. They have not used identical code to implement this.

We have removed the questionable BR_SEMAPHORE_LOCK patch from the RHEL3 kernel; please re-try the test with the appropriate kernel located in:
http://people.redhat.com/~lwoodman/.for_ibm/
BTW, this kernel is also loaded up with other debug code but that shouldn't have any effect on your tests. This kernel is just to verify whether or not this patch is causing the problem you are seeing; let us know ASAP.

Larry Woodman

------ Additional Comments From dmosby.com 2003-17-10 21:24 -------
Per request, I am placing a status report in Bugzilla. We are running kernel 2.4.21-3.EL.sock.kswapd.io.debug.20.1smp on system valrehel3rc2. We are coming up on the 2 hour mark of running the DB2 test with no process spinning or system hang. This is good. Previous testing seemed to hit this in under an hour. This kernel has debugging as listed in comment 18. Will continue to monitor the system.

------ Additional Comments From yvchan.com 2003-18-10 09:46 -------
Well, the test ran overnight and we haven't hit the same problem.
I ran out of log space at 1-ish EDT and had to increase it this morning, but this is long after the problem should have appeared (approximately 4 hours after).

The next step I would like to see is one of the original kernels with only this patch taken out. I'm not comfortable with the amount of debug code in this kernel, as I feel it may have an impact on the timing, which in theory could mask the problem. If you can send the original kernel with only this patch taken out, it would be appreciated.

btw, on a side note, may I ask what the diff is between the -3 and the -4 kernels? Or is this just a way to tell between RC3 and RC4? Thanks, yvonne.

The only difference between -3 and -4 is ppc specific, for a scsi time out issue. I'll see if Larry agrees with your comment about debug code. His earlier comment when he posted the kernel was that it should not affect it. Not sure if we can get you another kernel before Monday, but will try.

No problem, sorry for the delay - I've been goofing off this saturday :)

The kernel without the patch or the debug code is in:
http://people.redhat.com/~lwoodman/.for_ibm
Let us know how it goes ASAP.

Larry

------ Additional Comments From yvchan.com 2003-19-10 01:10 -------
New data point with the original -4 kernel on AMD64: we reproduced the problem with the db2ipccm spinning forever in kernel space on an AMD64 machine.
I can't log into the machine anymore, but it is still up and running with the following information from top:

00:47:57 up 1 day, 6:58, 2 users, load average: 2.02, 2.01, 2.00
461 processes: 457 sleeping, 4 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.9%    0.0%   81.1%  11.3%     0.0%    0.1%    6.3%
           cpu00    0.0%    0.0%   77.3%  22.6%     0.0%    0.0%    0.0%
           cpu01    1.9%    0.0%   84.9%   0.0%     0.0%    0.3%   12.6%
Mem: 2036592k av, 1785040k used, 251552k free, 0k shrd, 116160k buff
     813188k actv, 368468k in_d, 32188k in_c
Swap: 2048276k av, 4300k used, 2043976k free  1073852k cached

  PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME CPU COMMAND
    4 root      25  0     0    0     0 RW   41.6  0.0 1551m   1 keventd
29421 db2inst1  19  0 33820  33M 31492 R    38.6  1.6 656:01  0 db2ipccm

The machine isn't responding to commands well, but this is most likely due to the 2 cpus being "engaged" with something. Any chance someone can say whether or not the keventd is behaving normally?

At any rate, I've just restarted the testing on the first machine with the new kernel. I'll check it in the morning.

------ Additional Comments From yvchan.com 2003-19-10 12:04 -------
We're looking really, really good right now :)

We have 96 of the 100 clients still running (the other 4 encountered log full conditions and exited... perfectly normal). The other 96 have been steadily getting new connections and running apps, according to the timestamps that I see.

As a side note, the other issue we had seen, with a db2sysc spinning in userspace with calls to msgsnd, has not appeared either. (This would have happened almost at the same time as, or just before, the db2ipccm spinning in system space.)

We are at the 10 hour mark at the moment, and I'll let this go for the rest of the day and check in on it later today. I think we are pretty safe to say that the patch removed was the culprit. Great job folks!
:)

------ Additional Comments From yvchan.com 2003-19-10 12:16 -------
Recommendation from me (in case my last note wasn't clear): Red Hat, please remove this patch from your kernel before you release. DB2, RH, and our joint customers would save a lot of time by simply being able to work "out of the box".

------ Additional Comments From yvchan.com 2003-20-10 09:20 -------
Monday morning (the run has been going for 1 day, 8 hours, and about 15 minutes). 82 of the clients are still running, and getting jobs :) I think we're pretty safe to say we found the offending patch.

Does this occur only under heavy load? And if so, please identify heavy load. Thanks.

------ Additional Comments From yvchan.com 2003-20-10 10:26 -------
For a normal validation run (as defined in the validation kit), we expect 300 clients for at least 24 hours. In the case with the original RHEL 3 kernel, we couldn't get 100 clients for 1 hour. This appears to occur even with light to moderate load on the system. The load average on the system when this occurs is < 5 in most cases. There are several examples of top output in this bugzilla that show the load average on the system when this happens. yvonne.

This was originally identified as heavy load on the 10th. Now, just a few days before ship, it is identified as basically any load. Your version of DB2 does not ship until November 8th; I suggest we look at this for an errata, not delay the GA. Our concern is the following - we know we fix this particular issue with the lock, but what else do we break at this time by taking the patch out? This is our only middleware that has run into this. So did the patch help? We don't know.

------ Additional Comments From yvchan.com 2003-20-10 11:06 -------
I assume by this, you mean a critical update errata available immediately after your GA, via RHN?

------ Additional Comments From yvchan.com 2003-20-10 11:23 -------
This is our normal load testing.
It is designed to attain a heavy load and to max out system resources. In this instance we did not get to the point of a heavy load nor max out the resources. We believe that any application using ipc resources with SMP will hit this issue.

------ Additional Comments From khoa.com 2003-20-10 11:29 -------
If the offending patch is only to improve performance, then it would be safe to take it out - especially if you run into a functional problem like this.

------ Additional Comments From dmosby.com 2003-20-10 11:42 -------
It does not require heavy stress to hit this bug. Two processes on an otherwise idle multi-processor system could hang the OS if they both entered ipc code at the same time. Running an ipc-intensive application simply increases the odds of hitting the bug.

We feel confident that the two kernels recently tested are far less likely to hit the problem. Until we understand what the problem was, and the code change made, we cannot really say if the problem has been solved or has now become far more difficult to hit.

The problem observed should not be considered a DB2 issue. Any software using ipc could potentially hang the OS. The DB2 test simply makes a good ipc stress test. I am not familiar with the kernel acceptance testing done on an SMP system in the area of ipc testing. It may be that testing in this area should be increased.

From Dale and Yvonne at IBM.

Can we get some backtraces, e.g. with nmi_watchdog enabled and/or sysrq-t?

------ Additional Comments From dmosby.com 2003-20-10 15:31 -------
Per request, I am placing some of my notes regarding investigation of the bug in the bugzilla entry. These notes come from two debugging sessions on a system running the DB2 tests in the condition where one process was "spinning" in the kernel. I was able to use the second processor to do some debugging. The systems were not set up to take a crash dump. I made use of /proc/sysrq-trigger to obtain task stacks and then examined the spinning process.
Some of the kernel output is rather wide, and may be difficult to read in Bugzilla. I am going to try and format it a bit by trimming some of the data. For example, a typical trace line in dmesg:

Oct 9 20:16:09 rhel3rc2 kernel: [<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xf1da1f08)

I will shorten to:

[<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xf1da1f08)

The following is likely to only be of interest to those familiar with the kernel. For a conversion to text see comment number 14.

---------------------
Debugging session October 9.

uname -a information:
Linux rhel3rc2 2.4.21-3.ELsmp #1 SMP Fri Sep 19 14:06:12 EDT 2003 i686 i686 i386 GNU/Linux

Spinning process task call trace (trimmed text "Oct 9 20:16:09 rhel3rc2 kernel:"):

db2sysc R F1DA1F14 1792 3325 3320 3335 3326 (NOTLB)
Call Trace:
[<c011cf1d>] smp_apic_timer_interrupt [kernel] 0x11d (0xf1da1ef0)
[<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xf1da1f08)
[<c01a0068>] isofs_read_super [kernel] 0x128 (0xf1da1f34)
[<c01aab1c>] .text.lock.shm [kernel] 0x131 (0xf1da1f3c)
[<c0113251>] sys_ipc [kernel] 0x1b1 (0xf1da1f90)

Current return address on stack is: c0113251. This is where we will return.
The address above is the return from a call (must be): 0xc01aa040

From vmlinuz-2.4.21-3.ELsmp:
c0113248: 89 54 24 04       mov %edx,0x4(%esp,1)
c011324c: e8 ef 6d 09 00    call 0xc01aa040 <---------
c0113251: 85 c0             test %eax,%eax
c0113253: 89 c2             mov %eax,%edx

From System.map:
c01a9600 t shm_get_stat
c01a96d0 T sys_shmctl
c01aa040 T sys_shmat <--------
c01aa480 T sys_shmdt
c01aa540 t sysvipc_shm_read_proc

We are in sys_shmat. The only places we call in sys_shmat are:
c01aa18b: e8 e0 a6 ff ff call 0xc01a4870    (c01a4870 T ipcperms)
c01aa281: e8 2a 90 f9 ff call 0xc01432b0    (c01432b0 T do_mmap_pgoff)
c01aa3a0: e8 5b eb ff ff call 0xc01a8f00    (c01a8f00 t shm_destroy)
c01aa3e3: e8 48 97 f9 ff call 0xc0143b30    (c0143b30 T find_vma)

Routine bounds are: 0xc01aa040 to 0xc01aa480

Jump out of routine:
c01aa0fe: f0 fe 08             lock decb (%eax)
c01aa101: 0f 88 15 0a 00 00    js 0xc01aab1c <-- addr shown on stack

-----------------------------------
Debugging session 15-Oct-03 follows.

Output from "top":

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    3.4%    0.0%  104.0%   0.0%     0.0%   92.2%    0.0%
           cpu00    3.5%    0.0%    4.1%   0.0%     0.0%   92.2%    0.0%
           cpu01    0.0%    0.0%  100.0%   0.0%     0.0%    0.0%    0.0%
Mem: 1028452k av, 1005712k used, 22740k free, 0k shrd, 25284k buff
     568392k actv, 72752k in_d, 13848k in_c
Swap: 2040244k av, 879248k used, 1160996k free  241588k cached

 PID USER     PRI NI SIZE  RSS SHARE WCHAN LC STAT %CPU %MEM  TIME CPU COMMAND
2910 db2inst1  19  0 7792 3652  3648        1 R    99.9  0.3 1013m   1 /home/db2inst1/absolut/absolut/APPS/am...

Process pid 2910 appears to be spinning on CPU 1. An strace of pid 2910 hangs, producing no output.
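The by-hand lookup above (mapping a return address to the routine that contains it via System.map) is a sorted-table search, the same thing tools like ksymoops automate. A small illustrative sketch, using the addresses from the System.map excerpt above:

```python
import bisect

# Symbols from the System.map excerpt above: (address, name), sorted by address.
SYMBOLS = [
    (0xc01a9600, "shm_get_stat"),
    (0xc01a96d0, "sys_shmctl"),
    (0xc01aa040, "sys_shmat"),
    (0xc01aa480, "sys_shmdt"),
    (0xc01aa540, "sysvipc_shm_read_proc"),
]

def resolve(addr):
    """Return (name, offset) of the symbol containing addr, or None."""
    addrs = [a for a, _ in SYMBOLS]
    # Find the last symbol whose start address is <= addr.
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = SYMBOLS[i]
    return name, addr - base

# The "js 0xc01aab1c" at c01aa101 falls inside sys_shmat:
print(resolve(0xc01aa101))
```

Run against the addresses in the trace, this places both the call target 0xc01aa040 and the lock-spin branch at 0xc01aa101 inside sys_shmat, matching the conclusion drawn by hand above.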
Trace from messages, using sysrq-trigger:

am119c R 00000001 3908 2910 2909 (NOTLB)
Call Trace:
[<c011cf1d>] smp_apic_timer_interrupt [kernel] 0x11d (0xc5d29d2c)
[<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xc5d29d44)
[<c01a8dc0>] .text.lock.sem [kernel] 0xca (0xc5d29d78)
[<f887ca27>] ext3_do_update_inode [ext3] 0x177 (0xc5d29e18)
[<c013d7e5>] vm_set_pte [kernel] 0x75 (0xc5d29e58)
[<c013fe20>] do_wp_page [kernel] 0x6c0 (0xc5d29e84)
[<c0141397>] handle_mm_fault [kernel] 0x217 (0xc5d29ec0)
[<c011f78c>] do_page_fault [kernel] 0x14c (0xc5d29ef4)
[<c014956f>] generic_file_write [kernel] 0x13f (0xc5d29f48)
[<f8877149>] ext3_file_write [ext3] 0x39 (0xc5d29f74)
[<c01130f9>] sys_ipc [kernel] 0x59 (0xc5d29f90)
[<c011f640>] do_page_fault [kernel] 0x0 (0xc5d29fb0)

The last value on the stack, "c011f640", appears to be noise on the stack; the instruction before this address is not a call:

c011f633: eb c7                   jmp c011f5fc <bust_spinlocks+0x2c>
c011f635: 8d 74 26 00             lea 0x0(%esi,1),%esi
c011f639: 8d bc 27 00 00 00 00    lea 0x0(%edi,1),%edi <-- not a "call"
c011f640 <do_page_fault>:
c011f640: 55                      push %ebp
c011f641: 57                      push %edi
c011f642: 56                      push %esi

So looking at "c01130f9":

c01130e5: 8b 44 24 40       mov 0x40(%esp,1),%eax
c01130e9: 89 5c 24 08       mov %ebx,0x8(%esp,1)
c01130ed: 89 44 24 04       mov %eax,0x4(%esp,1)
c01130f1: 89 3c 24          mov %edi,(%esp,1)
c01130f4: e8 97 4d 09 00    call c01a7e90 <sys_semtimedop> <-- call
c01130f9: 89 c2             mov %eax,%edx <-- addr on stack
c01130fb: 89 d0             mov %edx,%eax
c01130fd: 8b 5c 24 20       mov 0x20(%esp,1),%ebx

From System.map:
c01a7d40 t alloc_undo
c01a7e60 T sys_semop
c01a7e90 T sys_semtimedop <-- matches call
c01a8560 T sem_exit
c01a8870 t sysvipc_sem_read_proc

Would it be possible to attach the full sysrq-t to bugzilla? Attachments don't have the size limits that normal text has.

------ Additional Comments From dmosby.com 2003-20-10 16:00 -------
As requested, task trace information.
As was requested, these are the task stack traces taken from /var/log/messages, generated by sysrq-trigger.

Created attachment 95319: TaskTraces.txt
OK, we found a potential problem and fixed it in the semaphore lock patch. We built a test kernel that we need you to try; it's in:
http://people.redhat.com/~lwoodman/.for_ibm/

------ Additional Comments From dmosby.com 2003-21-10 01:54 -------
We have about 4 hours of the DB2 test running on kernel 2.4.21-3.EL.IBM.2smp so far tonight. The system appears OK so far.

That is good news! One thing: the bug we fixed triggered ONLY when userspace was going to use an IPC primitive while at the same time freeing that same primitive (which is why this didn't show up before; typically applications stop using a resource before freeing it). I'm mentioning this because it may point at a locking bug/race in DB2. (That's no excuse for the kernel to die, but it may be something IBM might be interested in investigating.) i.e. a semop() was done over the semaphore in one process, in parallel to an IPC_RMID done by another process. Is there perhaps some forced shutdown that happens during the test?

What additional testing has IBM been able to do outside of this stress test? Or what testing are you planning to do?

Following an extensive evaluation we have concluded that there is no primitive mixup in the code (which runs identically on a large number of other UX platforms plus Windows). Right now the bug is clearly identified as related to the performance fix that was removed from the code over the weekend.

------ Additional Comments From yvchan.com 2003-21-10 13:15 -------
I find it hard to believe that the kernel without the "linux-2.4.21-semaphore-scale.patch" would work fine and then, with this patch, the same kernel would not "die" per se, but cause the exact same processes to be "stuck" in kernel space. This makes the assumption that it is a kernel problem, and not a DB2 problem. Why would we not exhibit this problem with the RHAS 2.1 kernels, or with the kernel that has this patch removed?
We run this same validation kit in many places and have yet to experience the same problem. The DB2 code for this area is not specific to Linux, and works just as well on AIX, Sun and HP.

Whoa, whoa, whoa - nowhere did I say or imply that this was not something we should fix in the kernel. In fact I said it was something we should fix. I was just trying to be helpful by saying that the thing we fixed, if it helped, could point to an application locking issue, since it has to do with use-in-parallel-with-free. We're still waiting for the final test results of our suggested fix kernel.

------- Additional Comment #47 From Dale E. Mosby 2003-10-21 15:27 -------
DB2 test status on latest kernel: 2.4.21-3.EL.IBM.2smp #1 SMP. The test was started last night and is still running, now about 18 hours of testing with no spinning process and no system hang. This kernel does not appear to hit the problem.

------ Additional Comments From dmosby.com 2003-21-10 16:01 -------
I have taken an initial look at the code changes made in the latest kernel, comparing with the original RHEL 3.0 kernel. I'll comment more on that in a moment. This mail is a bit long and technical, so let me present the short technical summary here:

There is a serious kernel bug in the RHEL 3.0 code as I saw it last week used for DB2 testing. Under no circumstances should a user application be able to halt the operating system. I consider this to be a critical error, though the level of seriousness may be open to debate. This is not an error in DB2; this is a Linux kernel error.

The information received seems to indicate that a fix was made to solve a kernel error in the case where an application "uses an IPC primitive while at the same time frees the same primitive". I am not quite sure how to interpret this. It seems that Red Hat is indicating that DB2 continues to use an IPC primitive after releasing it, or at least that this is the case their change addresses.
I interpret their comment to mean that they believe the application releases some Inter-Process Communication (IPC) element, and then attempts to use the element. This may be the case, but as our application works fine on RH Linux (with the kernel fix) and on other platforms as well, it would appear that there is no serious logic flaw in DB2. Even if DB2 happens to be using an IPC function (shared memory, semaphore, message) in a nonstandard way, there is clearly a critical kernel bug, as no user process should be able to hang the operating system in this manner. Whether an IPC element once existed or was never present, the kernel should simply return an error to the process attempting to use the IPC element.

In addition to the fact that we know DB2 can hang a system, it should be stressed that any user application can do the same. Some simple messaging application could lock up a RHEL 3.0 kernel as initially provided to us. This should be of great concern.

From the information available when looking at the spinning process on the lab system, I have no way of determining if DB2 was trying to use some IPC element which had been released - the case Red Hat indicates their latest kernel fix is addressing.

I have taken a first look at the code changes. The entire IPC area is quite complex. In fact the comment for "sys_ipc", the system entry point for IPC calls, is "This is really horribly ugly." I hate coming across code comments like that.

In going from RHEL 2.1 (or the Linux 2.4 tree) to RHEL 3.0 the IPC code had a performance improvement. Interestingly, the Linux 2.4 to 2.6 tree also made a performance improvement in this area, but used different code to accomplish this. I am curious to know how the DB2 tests execute on the Linux 2.6 kernel.

The IPC code uses a structure which contains an array of "entries" that each manage one IPC object. The exact details of the structure aren't important.
There is a "lock" for the overall structure and a lock for each entry in the structure. Something like:

    struct ipc_ids {
        lock_for_ipc_ids
        stuff...
        struct ipc_id entries[] {
            lock_for_entry
            stuff...
        }
    }

In the initial RHEL 3.0 kernel I examined (2.4.21-3.ELsmp #1 SMP) the code to access an IPC element looks something like:

1) obtain a read lock on the ipc_ids structure
2) obtain a spin lock on the individual element of the entries array

The idea here is that we use a read lock to prevent manipulation of the overall data structure, and then a spin lock to allow only single access to each element within the array of IPC entries. With read locks the idea is that you can have many readers, but only one writer. Granting access to an individual element is considered a read, meaning that multiple processes can each hold a spin lock on different elements of the entries array at the same time. Whenever there is a write lock, all readers must block until the write lock is released. A write lock is used when we grow the array.

In the new kernel we received, 2.4.21-3.EL.IBM.2smp, the code retains the same read lock implementation for most access to the structure; however, there are two places where a write lock has been used rather than the read lock (ipc/sem.c when returning an error to the process, and ipc/msg.c in sys_msgctl, the OS interface for message control operations - see "man msgctl"). The effect of the introduced write locks is to serialize processes in this area. This may be considered a performance impact, though clearly not as big a performance impact as having the OS hang.

There are three "ipc_ids" structures, like so:

    File msg.c: struct ipc_ids msg_ids;
    File sem.c: struct ipc_ids sem_ids;
    File shm.c: struct ipc_ids shm_ids;

The RHEL 3.0 code uses br_read_lock(BR_SEMAPHORE_LOCK) when locking any of these structures. It seems to me, given my current understanding, that unrelated IPC operations are locking all of the structures.
I am wondering how things would work with a code change such that br_read_lock was called with three separate locks - one unique to each structure. This would seem to be a performance improvement, and if we did this with the original kernel and got the hang, we could observe which lock was held. At this time I am not aware of any inter-dependence of the structures that makes a single lock necessary. The change would associate a lock with each structure - much as in the structure I illustrated.

A question comes to mind: is it possible that the hang was hit due to writer starvation? In other words, with a two-processor system can we have a spin on the br_write_lock call due to having a high number of readers? I haven't analyzed the code enough to know how we guarantee that we always service a writer. Someone more familiar with the code would likely know the answer; just an area for thought.

NOTES: 1) I have been focusing on the changes in the ipc directory and the code that I initially called out as being the problem area. There may be more changes; I don't claim this is the complete list. 2) I use the term "lock"; in reality there are locks and semaphores. 3) I have simplified the details while trying to cover what I felt was important for this discussion - if this is passed on to folks working on the kernel, this should be obvious.

A couple of comments. This got a bit long, because many issues were raised.

1) The "this is really horribly ugly" comment in sys_ipc() is an old comment from Linus, and he only meant the multiplexing of the IPC ops. The core SysV IPC code is pretty acceptable otherwise and has gone through several iterations of refinement over the years.

2) One actual race that you might trigger is to release a SysV semaphore via IPC_RMID while it's used by another process via semop(). A narrow race window will suffice, e.g. during process shutdown.
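A minimal user-space illustration of that use-after-remove pattern follows. The two steps run sequentially in one process for clarity; the actual race requires them to happen in parallel on separate CPUs. The helper name `use_after_rmid` is made up for this sketch; on a correctly behaving kernel the `semop()` after removal fails with EINVAL (or EIDRM) rather than hanging.

```c
/* Sketch of the use-after-IPC_RMID pattern: create a SysV semaphore
 * set, remove it, then attempt semop() on the stale id. Sequential
 * here for clarity; the bug needed these steps to race on separate
 * CPUs. Returns the errno from the failing semop(), 0 if it
 * unexpectedly succeeds, or -1 on setup failure. */
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int use_after_rmid(void)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };

    /* Create a private set with one semaphore. */
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    /* Destroy the set... */
    if (semctl(id, 0, IPC_RMID) < 0)
        return -1;

    /* ...then attempt to use it anyway. The kernel should return an
     * error to the caller, never spin or deadlock. */
    if (semop(id, &op, 1) < 0)
        return errno;
    return 0;
}
```

Running this on a fixed kernel simply yields an error code; on the broken RHEL 3.0 kernel, the parallel version of this pattern could corrupt the IPC locking state.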
Other kernels won't show this race; you'd get a safe -EINVAL on the semaphore, which, if it happens during process shutdown, can easily be ignored. Obviously it's perfectly legal to _attempt_ to use a semaphore in parallel to destroying it; it's just pointless. The kernel should of course not deadlock when an application does that. (There might be other user-level semaphore use races that could trigger the scenario, but this is the most obvious one. In any case, DB2 is within its rights to do whatever it wants with the semaphores - the kernel should not lock up. We only suggested a scenario that could trigger the lockup, a scenario which, if avoided in userspace [if it's possible to avoid it], could work around the kernel bug. And generally it might just be practical to fully understand how that race could have happened (if it indeed happened), irrespective of the kernel's lockup.)

3) The observation that the fixed behavior is caused by the removal of the semaphore patch is incorrect - in the latest kernel we fixed a rare race in the new semaphore code itself; we didn't remove the patch.

4) The 'slowdown' caused by the fix should be near zero. The only code affected is an utter slowpath: semaphore array destruction. The SysV semaphore creation/destruction path was never a target of optimizations - in fact even the 2.6 kernel completely and globally serializes all semaphore creation and destruction in the system, via the sem_ids.sem kernel mutex. The main optimization target was always semaphore use.

5) brlock readers cannot be starved on a 2-CPU system - and the debug info seems to show reader spinning. One possibility for the lockup would be memory corruption caused by semop() using a freed semaphore. And even if some form of brlock _writer_ starvation happens on, say, a 64-way system, it will most likely show up as inefficiency, not as an outright deadlock. Most starvation situations are unstable and don't show up as persistent lockups.
In particular, they don't show up as hard lockups (which the 'NumLock keys don't work' comment suggests) - the IPC brlock is never used with irqs disabled. I think it would be a good idea to also run this DB2 test on a 4-way or 8-way system; they are exponentially more likely to trigger any SMP races.

The observations about the locking changes are correct - the bug was caused by incorrect access to the ids->entries[] array. 2.6 locking is different, as it utilizes RCU to update ids->entries[]. Btw., RCU is even slower in the slow-path than the write-lock portion of brlocks. The read-path should scale similarly well. The original 2.4 uses a global, system-wide spinlock to protect ids->entries[].

------ Additional Comments From dmosby.com 2003-23-10 12:00 -------

This is a less technical description of the problem encountered. The intent is to convey the general nature of the problem to someone not having kernel source access and not interested in the internal details of the problem.

-----

The initial release version of the RHEL 3.0 kernel contains a bug in the IPC (Inter-Process Communication) code. This problem can be triggered on a multiprocessor system by multiple processes using IPC facilities (semaphores, shared memory, and messages). The problem will be seen as a process spinning in the kernel or as a system hang.

The problem arises due to a performance improvement made in this release. The kernel makes use of a data structure in the management of each type of IPC facility. Historically this structure would be protected by a global lock while a process obtained access to individual array entries held within the structure. In the RHEL 3.0 release a performance improvement was made by using a reader-writer semaphore to allow multiple processes to obtain access to separate array elements. The problem comes about due to insufficient protection of access to the data structure in some cases.
A fix involving additional write locking in specific code paths has been identified and will be released after complete testing.

Any news from IBM on their testing of the test kernel provided?

Ingo's locking fixes have been incorporated into U1. Setting the state of this bugzilla to modified. IBM crew - please verify the fix in the beta versions of U1 in addition to the test kernels you already have.

------ Additional Comments From dmosby.com 2003-18-11 12:51 -------

Status update: I have made a couple of requests for testing status, and just made another today. If I don't get a response in a week and don't receive any objections, I propose to close the bug as fixed. I don't see a bug state of "fixed awaiting testing", so I will leave the state as is for now.

------ Additional Comments From yvchan.com 2003-18-11 13:18 -------

We have not received the U1 test kernel and are not expected to receive it until early December. Please do not close this off until we have received the U1 kernel.

------ Additional Comments From dmosby.com 2003-21-11 17:14 -------

Expectation is that U1 arrives early December. Waiting to test this.

U1 is available today on RHN for partner testing.

------ Additional Comments From fvieira.com 2004-01-05 10:54 -------

We are currently testing U1 on all platforms - we will have more information on this over the next couple of days.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-017.html

changed: What |Removed |Added ---------------------------------------------------------------------------- Status|FIXEDAWAITINGTEST |ACCEPTED

------- Additional Comments From khoa.com 2005-02-20 16:58 EST -------

Can we close this bug report now?
If so, please close it. Thanks.

changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ACCEPTED |CLOSED Impact|------ |Functionality

------- Additional Comments From yvchan.com 2005-04-28 09:16 EDT -------

Definitely fixed. Closing on IBM side by yvchan@ibmca