Bug 106794

Summary: LTC4829 - RHEL 3 HANGS under heavy stress load
Product: Red Hat Enterprise Linux 3
Reporter: IBM Bug Proxy <bugproxy>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 3.0
CC: anderson, bjohnson, mingo, petrides, riel, tburke
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-12-03 02:02:48 UTC
Bug Blocks: 106472
Attachments: TaskTraces.txt

Description IBM Bug Proxy 2003-10-10 17:18:36 UTC
The following has been reported by IBM LTC:  
RHEL 3.0 HANGS under heavy stress load
Hardware Environment: x342
2P, 3 hard disks (17GB, 32GB, 34GB)

Software Environment:
RHEL 3.0
DB2 V8.1
SVT Test tool T3

Steps to Reproduce:
1. Run T3 test tool with 200+ simultaneous clients

Actual Results: System hangs - the system is hooked up to a switch box, and when it 
has been running for about 3 hrs, I cannot bring up the screen on the monitor, and 
there isn't any hard disk activity, even though when I rebooted it appeared that 
the clients were still running.

Expected Results:  Heavy disk activity as 200 simultaneous clients run against 
the machine - no hangs.

Additional Information:
We don't really have much to go on.  I restarted the computer and there doesn't 
seem to have been much information recorded.
Any help on this is greatly appreciated.

Ok, it's happened again.  We ran for approximately an hour.  I have the system 
in front of me and it's totally DEAD!!!  The monitor is frozen, mouse, 
keyboard, the works!

We had a top window open, and it shows a bit of info, but the top part got cut 
off.  What I can see is that the swap is not nearly empty.  We have approximately 
800 MB of swap free (out of 2 GB), and the file cache was at 90 MB.  We can't see 
what was in the actual RAM, however, but we did have 2 GB to start with.

The top process was kscand.

Ideas?  SysRq is not working as we can't get any signals to the machine.  A 
serial console hasn't been set up yet, but that is most likely our next step.

I can't even get to the initial console to see if there was a kernel oops. :(

yvonne.

More information which wasn't added to the bug, but was sent via e-mail:

We managed to run 100 clients overnight, but in the morning realised we had a 
single process spinning at 100% system time on one of the 2 cpus:  This is the 
information from the top window I had opened:

 16:58:10  up 1 day, 8 min, 10 users,  load average: 1.18, 1.14, 1.10
659 processes: 656 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    1.4%    0.0%   52.8%   0.0%     0.2%    0.0%   45.4%
           cpu00    2.8%    0.0%    5.7%   0.0%     0.4%    0.0%   90.9%
           cpu01    0.0%    0.0%  100.0%   0.0%     0.0%    0.0%    0.0%
Mem:  1028452k av, 1003644k used,   24808k free,       0k shrd,   56100k buff
                    579760k actv,    2940k in_d,   13940k in_c
Swap: 2040244k av,  978000k used, 1062244k free                   67872k cached

  PID USER     PRI  NI  SIZE  RSS SHARE WCHAN     LC STAT %CPU %MEM   TIME CPU 
COMMAND
 3325 db2inst1  19   0 14252 2404  1800            1 R    49.9  0.2  1418m   1 
db2ipccm           

We have tried to send a SIGURG which should activate our trap handler to dump a 
stack, but that was unsuccessful.  We tried strace, but the file generated was 
empty, and we tried gdb, but it couldn't attach to the process for some reason.

The machine is still up this time (the difference between this time and the 
last few times was that we disabled the X server: init 3 instead of init 5).

We are looking for any help at all in identifying what is causing this process 
to spin like this.  It's using system time, so I'm hoping there is something we 
can catch here.  

sysrq has been turned on.  We are using the default RHEL 3 kernel.  Rebuilding 
could change the behaviour, and we would prefer not to do that.

Per conference call, Dale will help with debug.  He is going to attach a script 
file that looks through /proc to this problem.
Mark and Greg and Glen - added you to the cc list so you can follow.

Greg / Glen:
There is no Red Hat problem opened yet.  Are you going to open one up?  This is 
stop-ship.  Thanks.

Per conference call, DB2 will also try to use the latest RC4, available Oct 9.

Meeting minutes:
Attending:
Salina Chu
Mark Wisner
Dale Mosby
Florbela Vieira

Dale has been able to start debugging without KDB and he would like to 
continue. Dale is able to debug when only one of the CPUs is hung and the 
other one is available to connect in. 
Currently it is believed the RC being run is the one earlier than the RC 
released last night. But according to Florbela, this test has failed on earlier 
release candidates. 
Dale will continue debugging.
After the call Salina verified there was no Red Hat bug opened and she would 
ask Glen Johnson to open one.

Comment 1 IBM Bug Proxy 2003-10-10 19:32:17 UTC
------ Additional Comments From yvchan.com  2003-10-10 13:30 -------
More notes:

caps lock and num lock don't show any response when I tried them after the last 
hang. 

Comment 2 IBM Bug Proxy 2003-10-13 02:00:24 UTC
------ Additional Comments From yvchan.com  2003-12-10 10:35 -------
Our second machine has hung in the same way as the first machine.  It hung at 
approximately 10pm last night.

The top output also shows something taking up 100% CPU time, but the process 
doesn't appear in the last incarnation of the top output.

 22:35:52  up 1 day,  6:59,  4 users,  load average: 6.04, 2.64, 1.86
494 processes: 488 sleeping, 5 running, 1 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   61.4%    0.0%   38.5%   0.0%     0.0%    0.0%    0.0%
           cpu00   91.8%    0.0%    8.1%   0.0%     0.0%    0.0%    0.0%
           cpu01    0.0%    0.0%  100.0%   0.0%     0.0%    0.0%    0.0%
Mem:  1028452k av, 1005416k used,   23036k free,       0k shrd,   25228k buff
                    723244k actv,   51880k in_d,   13876k in_c
Swap: 2048248k av,  853856k used, 1194392k free                  242356k cached

The 91% user time appears to be a db2sysc process, but it doesn't appear to be 
stuck.  I had run several backtraces on it, and it appeared to be processing 
data as it is supposed to. 

Comment 3 IBM Bug Proxy 2003-10-13 02:02:12 UTC
------ Additional Comments From dmosby.com  2003-12-10 14:53 -------
I have observed three potential error conditions on this system:

1) System is hung. No response on console, unable to telnet/ssh to the system, and 
the caps lock key does not activate the caps lock light, indicating that interrupts 
are not being processed.

2) Process "db2inst1" is constantly on a processor. Able to ssh to box. The 
process is in kernel space 100%. An strace of process will hang, but can be 
killed from another connection.

3) Process "db2inst1" is constantly on a processor. Able to ssh to box. The 
process returns to user space, executes msgsnd in a loop constantly moving to 
kernel space to service the system call.

It is possible to move from state 2 to state 1. My belief at this time is 
that conditions 1 and 2 above are the same kernel problem. I need to do some 
additional investigation into this, and that investigation is in progress.

Conditions 2 and 3 can be differentiated by running strace on the process. 
Execute the command:
  strace -p <process pid> | less

In state 2 the strace command will hang. Kill the strace using another 
connection.

In state 3 the strace will show calls to msgsnd as:
msgsnd(1769483, {1, "p#"}, 2, 0)        = 0
msgsnd(1802252, {1, "p#"}, 2, 0)        = 0
msgsnd(1835021, {1, "p#"}, 2, 0)        = 0
msgsnd(1769483, {1, "p#"}, 2, 0)        = 0
msgsnd(1802252, {1, "p#"}, 2, 0)        = 0

In the condition observed the command "ipcs -q" indicated that the data was 
being read and another process was located that was consuming the data, 
executing msgrcv.
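
(For reference, the strace lines above correspond to SysV message sends of
roughly the following shape. This is an illustrative sketch only, not DB2
code; the mtype, the 2-byte payload, and the queue id are simply what the
trace shows.)

/* Hypothetical illustration of the msgsnd() calls seen in the strace above:
 * a 2-byte message with mtype 1 sent to an existing queue (not DB2 code). */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct small_msg {
    long mtype;                  /* the "1" in the strace output */
    char mtext[2];               /* the "p#" in the strace output */
};

int send_ping(int msqid)         /* msqid: one of the queue ids in the trace */
{
    struct small_msg m = { 1, { 'p', '#' } };
    return msgsnd(msqid, &m, sizeof m.mtext, 0);   /* matches "..., 2, 0" */
}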

Whether or not this activity is an application error I cannot say, but there 
does not appear to be any operating system error in this third state.

Work to demonstrate the cause of the error shown in states 1 & 2 is in progress. 

Comment 4 IBM Bug Proxy 2003-10-14 14:34:16 UTC
------ Additional Comments From yvchan.com  2003-14-10 10:14 -------
small corrections to Dale's append earlier.

state 2 has db2ipccm on the processor

state 3 has db2agent on the processor

We are currently investigating state 3 internally here. 

Comment 5 IBM Bug Proxy 2003-10-14 15:27:18 UTC
------ Additional Comments From dmosby.com  2003-14-10 11:24 -------
In e-mail 14-Oct-03 Mark Wisner wrote:

> Dale, Any luck?
> If not, do you have a feeling where the problem is?
> Is it the distro or the application or both?
> In your last append to the bug, you say "conditions 1 and 2 above are
> the same kernel problem" then "Whether or not this activity is a an
> application error I cannot say but there does not appear to be any
> operating system error in this third state."
>
> From what I gather, we have two problems,
> 1 & 2 appear to be a kernel problem and 3
> seems to be an application problem.
>
> RH wants to GA on the 22nd, so things are
> getting close. Is there anything you need?

My response:

Problem:

Yes, I think there are two conditions that give the appearance of problems.
What I described as "state 1" (complete system hang), and "state 2" (process 
spinning in kernel) I think are the same problem.
What I described as "state 3" (process continually sending 2 byte messages) 
may or may not be an application problem. I know nothing about the application. 
This third state does not appear to be a kernel problem; as far as I can tell 
the operating system is working properly when this is observed.

I am unable to do any debugging in "state 1". My debugging in "state 2" is 
based on looking at the process task stack and digging through memory in a 
single instance of the hang. Before coming forward with a theory on the error 
I would like to gather more information using a more robust debugging technique.

I had hoped to create a module to assist in debugging. This would let me 
insert the module into the running kernel, allowing observation of some data 
structures of interest. This would facilitate easy modifications to the 
module, allowing observation of more data without a kernel reboot. I found 
that the particular data of interest is local to a single source file and 
creating a loadable module able to access this will be a tedious process and 
require more time than I desire to invest in a task lacking guaranteed 
completion time.  I changed tactics and used the less elegant approach of 
simply modifying a source file to create a new /proc file to provide debug 
information.

I made the file modification on my personal system, a single processor box 
running RH9. I did pull the RHEL 3.0 sources down to my box but am unable to 
compile that kernel here. It may be that I need a RHEL 3.0 gcc, I do not know.

What I need:
1) Have a kernel locally configured and built.
2) Allow me to modify the file on one or both (I suggest both) systems in
   order to create a new /proc entry.
3) Re-run the test, attempting to put the system in "state 2" (process spinning
   in kernel), and observe the data structure.
4) Further debugging from this point may require additional kernel
   modifications.
5) Provide IP address and password for second system. 

Comment 6 IBM Bug Proxy 2003-10-15 14:17:03 UTC
------ Additional Comments From yvchan.com  2003-15-10 10:13 -------
Several updates (sent via e-mail, and reposted here):

From Dale (Oct 15; 1:01 am)
Status of machine "valrhel3rc2":

Modified the file "ipc/shm.c" such that it creates /proc/sysvipc/shmlks
This /proc entry is similar to /proc/sysvipc/shm with the following changes:
  1) Does not grab any locks when obtaining data
  2) Displays value of global semaphore for list and value of each spinlock

There is a bit of other info for the data structure displayed as well. The impact 
of not grabbing the lock means that there is a very small possibility that data 
will be modified while I am printing it.  However, if the problem observed the 
other day is actually a held or botched lock then we can't try to grab it.  I 
immediately hung the box the other day by trying to look at /proc/sysvipc/shm, 
so it seems that this is the case.
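
(For context, the 2.4 procfs interface allows simple read-only entries via
create_proc_read_entry(). A minimal sketch of the kind of change described,
an extra entry in ipc/shm.c that takes no locks, might look like the code
below. The fields printed here are assumptions; the actual shmlks patch is
not attached to this report, and per the description above it also printed
the global semaphore and each per-entry spinlock value.)

/* Hypothetical sketch only -- NOT the actual shmlks patch.  Added to
 * ipc/shm.c, it dumps shm_ids bookkeeping without taking any IPC locks,
 * so it stays readable even while one of those locks is stuck. */
static int sysvipc_shmlks_read_proc(char *page, char **start, off_t off,
                                    int count, int *eof, void *data)
{
    int i, len = 0;

    len += sprintf(page + len, "size=%d in_use=%d max_id=%d\n",
                   shm_ids.size, shm_ids.in_use, shm_ids.max_id);

    /* one line per slot; keep well under a single page of output */
    for (i = 0; i <= shm_ids.max_id && len < count - 80; i++)
        len += sprintf(page + len, "id %d  p=%p\n", i, shm_ids.entries[i].p);

    *eof = 1;
    return len;
}

/* Registered next to the existing /proc/sysvipc/shm entry in shm_init():
 *   create_proc_read_entry("sysvipc/shmlks", 0, 0,
 *                          sysvipc_shmlks_read_proc, NULL);
 */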

Built a new kernel and installed as /boot/linux-ipcdebug1

Modified boot/grub.conf to make this the default boot kernel by 
setting "default=3".

Started the DB2 test.

Created shell script "monitor.sh" in directory /home/db2inst1/absolut/absolut. 
This script is running. It will output the contents of /proc/sysvipc/shmlks 
once every minute into file "lock.log" in this directory.

I checked on the system before sending this and find that process 2910 is 
spinning on cpu1. It never seems to come back into user state - I cannot strace 
it. Unless it does come back off the kernel in some manner that strace cannot 
follow.

I could output /proc/sysvipc/shm and see if the system hangs, to test this idea 
of a botched lock, but that could hang the system so I will wait until tomorrow.

My output does not show a held/trashed lock. So I don't know what went wrong. I 
will have to look at this more tomorrow.

Thanks for your help on this. We will talk more Wednesday.


My response (Oct 15, 8:52 am)
Interesting ...  process 2910 is not the db2ipccm.  It's an application process 
that is stuck in kernel space.  This is one of the bh clients.

Dale's response (Oct 15, 10:04 am)
This morning the system still has process 2910 tying up one cpu and the shared 
memory data structures look "ok", at least the locking info looks fine.  I'm 
going to go have a cup of coffee and think about this some more. Actually I 
thought about it too much during the night. I'm glad this is a dual processor 
system so that I can still get on with one spinning process, though I wonder if 
we would hit this on a single CPU system. 

Comment 7 IBM Bug Proxy 2003-10-16 18:52:56 UTC
------ Additional Comments From dmosby.com  2003-16-10 14:46 -------
I have had a chance to look at two systems in the state in which a process
is spinning on a processor. In both cases attempting to run strace on the
process would hang with no output indicating that the process was never
coming back out to user space.

In both of these investigations use of /proc/sysrq-trigger was made to
generate stack information. [I will send a separate note to Greg & Mark that
this has the potential to be very useful for debugging. I would very much
like to see this feature made to work fully. It does not work to dump process
registers, and if it did it would only show registers for the CPU handling the
request (see sysrq_handle_showregs in sysrq.c). I would like to have this
send an inter-processor interrupt to every CPU and get registers from all
processors. I would like to see this passed along to RH. Will pursue off-line.]

In looking at the first stack we see the spinning process in sys_shmat. The
process was spinning trying to grab a lock.

In the second case we were in sys_semtimedop.

Both of these are called from sys_ipc.

The test we are using can fail in a mode where one process spins on a
processor while the other processor continues to execute processes.
We can also fail in a mode where the system simply hangs. I was able to
hang the system - moving from the state of a process spinning on a processor
to completely hung - in one case (the only time I tried) by trying to
cat /proc/sysvipc/shm. The code in routine sysvipc_shm_read_proc does try
to lock shm_ids, so if the lock problem is here this would explain the hang
at this time.

We are testing this using a DB2 test. We are using a dual processor system.
In a test in which the system was booted with only one cpu (maxcpus=1) we
have in excess of 16 hours of run time with no problems. Normally we would
expect a failure in an hour or less.

My belief is that there is a bug in the locking code associated with struct
ipc_ids (see ipc/util.h). In the RHEL 2.1 code ipc_lock would grab the spinlock
ary. In the RHEL 3.0 code this uses br_read_lock.  The idea is that we use
a read lock with the expectation that in the normal case we will have
multiple processes able to go in and grab a spinlock for an individual
element of the ipc_id entries.

There could be a failure in either the br_read_lock or a spinlock within
the array. In my debugging, in which I printed the spinlock for each ipc_id
entry, I did not observe any held, so right now I think the problem is with
the br_read_lock, but I have not really investigated enough to say for sure.
The debugging I have done, printing spin locks and looking at code changes,
makes me favor this as the problem. 
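
(To make the suspected change concrete, the following is a simplified sketch
of the two locking schemes being compared. It follows the description in this
comment rather than the literal RHEL source; in particular the per-entry lock
member name, p->lock, is an assumption.)

/* RHEL 2.1 / stock 2.4.21: one spinlock (ids->ary) guards the whole array,
 * so all IPC operations on that facility are serialized through it. */
struct kern_ipc_perm *ipc_lock_old(struct ipc_ids *ids, int id)
{
    int lid = id % SEQ_MULTIPLIER;

    if (lid >= ids->size)
        return NULL;
    spin_lock(&ids->ary);
    if (ids->entries[lid].p == NULL)
        spin_unlock(&ids->ary);          /* empty slot: drop the lock */
    return ids->entries[lid].p;          /* lock stays held on success */
}

/* RHEL 3.0 scalability patch, as described: a global reader lock keeps out
 * writers (array growth), while a per-entry spinlock lets unrelated IPC
 * objects be used concurrently. */
struct kern_ipc_perm *ipc_lock_new(struct ipc_ids *ids, int id)
{
    struct kern_ipc_perm *p;
    int lid = id % SEQ_MULTIPLIER;

    br_read_lock(BR_SEMAPHORE_LOCK);
    if (lid >= ids->size || (p = ids->entries[lid].p) == NULL) {
        br_read_unlock(BR_SEMAPHORE_LOCK);
        return NULL;
    }
    spin_lock(&p->lock);                 /* assumed per-entry lock */
    return p;
}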

Comment 8 IBM Bug Proxy 2003-10-17 15:06:39 UTC
------ Additional Comments From dmosby.com  2003-17-10 09:54 -------
My preliminary investigation leads me to believe that a process is spinning in 
the ipc_lock code. As this is internal to the kernel, it is my belief that this 
is strong evidence that a kernel bug exists. Further investigation would be 
required to provide additional evidence. For additional details of the problem 
see the previous Bugzilla entry. 

Comment 9 IBM Bug Proxy 2003-10-17 15:07:51 UTC
------ Additional Comments From yvchan.com  2003-17-10 10:26 -------
I know most people know this, but to reiterate:  This bug will prevent DB2 
from supporting RHEL 3; hence the show-stopper priority. 

Comment 10 IBM Bug Proxy 2003-10-17 17:04:15 UTC
------ Additional Comments From dmosby.com  2003-17-10 12:58 -------
Some notes about the code involved.
(This may appear a bit complex, but is reasonably obvious when you
are looking at the source code.)

In code base 2.4.21 the routine ipc_lock uses a spin lock as the overall
lock for struct ipc_ids.  See ipc/util.h.

The code in RHEL 2.1 is essentially the same as linux-2.4.21.

In code base RHEL 3.0 the ipc_lock code now uses
br_read_lock(BR_SEMAPHORE_LOCK) for this function.

The idea is that we allow multiple readers to go after individual spin
locks in the array.

In code base linux-2.6.0-test7 the code uses rcu (Read Copy Update)
and read_barrier_depends for this function.

The point here is that it appears that, moving from the 2.4.21 code base,
both the RHEL 3.0 and 2.6.0 code have tried for a performance improvement for
the SMP case of ipc_lock. They have not used identical code to implement this. 
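
(For comparison, a rough sketch of the 2.6-style reader path mentioned above.
It is simplified; the real linux-2.6.0-test7 code also rechecks a "deleted"
flag and pairs its barriers with the array-growth path in grow_ary().)

/* 2.6-era reader path: RCU instead of a reader lock, plus the same
 * per-entry spinlock.  A simplified sketch, not a verbatim copy. */
struct kern_ipc_perm *ipc_lock_26(struct ipc_ids *ids, int id)
{
    struct kern_ipc_perm *p;
    struct ipc_id *entries;
    int lid = id % SEQ_MULTIPLIER;

    rcu_read_lock();
    if (lid >= ids->size) {
        rcu_read_unlock();
        return NULL;
    }
    entries = ids->entries;
    read_barrier_depends();              /* entries[] may be swapped by growth */
    p = entries[lid].p;
    if (p == NULL) {
        rcu_read_unlock();
        return NULL;
    }
    spin_lock(&p->lock);
    return p;
}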

Comment 11 Larry Woodman 2003-10-17 19:49:42 UTC
We have removed the questionable BR_SEMAPHORE_LOCK patch
from the RHEL3 kernel, please re-try the test with the appropriate
kernel located in:

http://people.redhat.com/~lwoodman/.for_ibm/


BTW, this kernel is also loaded up with other debug code, but that shouldn't
have any effect on your tests.  This kernel is just to verify whether or 
not this patch is causing the problem you are seeing; let us know ASAP.

Larry Woodman


Comment 12 IBM Bug Proxy 2003-10-18 01:27:41 UTC
------ Additional Comments From dmosby.com  2003-17-10 21:24 -------
Per request, I am placing a status report in Bugzilla. We are running kernel
2.4.21-3.EL.sock.kswapd.io.debug.20.1smp on system valrhel3rc2.
We are coming up on the 2 hour mark of running the DB2 test and no problem
with process spinning or system hang. This is good. Previous testing seemed
to hit this in under an hour. This kernel has debugging as listed in comment
18. Will continue to monitor the system. 

Comment 13 IBM Bug Proxy 2003-10-18 13:52:46 UTC
------ Additional Comments From yvchan.com  2003-18-10 09:46 -------
Well, the test ran overnight and we haven't hit the same problem.  I ran 
out of log space at 1-ish EDT and had to increase it this morning, but this is 
long after the problem should have appeared (approximately 4 hours after).

The next step I would like to see is one of the original kernels with only this 
patch taken out.  I'm not comfortable with the amount of debug code in this 
kernel, as I feel it may have an impact on the timing, which in theory could 
mask the problem.

If you can send the original kernel with only this patch taken out, it 
would be appreciated.

btw, on a side note, may I ask what the diff is between the -3 and the -4 
kernels?  Or is this just a way to tell between RC3 and RC4?

Thanks,
yvonne. 

Comment 14 Bob Johnson 2003-10-18 15:31:16 UTC
The only difference between -3 and -4 is a ppc-specific fix for a SCSI timeout
issue.  I'll see if Larry agrees with your comment about the debug code.  His
earlier comment when he posted the kernel was that it should not affect it.

Not sure if we can get you another kernel before Monday, but will try.

Comment 15 Larry Woodman 2003-10-18 22:15:43 UTC
No problem, sorry for the delay I've been goofing off this saturday :)

The kernel without the patch or the debug code is in:
http://people.redhat.com/~lwoodman/.for_ibm

Let us know how it goes ASAP.


Larry


Comment 16 IBM Bug Proxy 2003-10-19 05:12:18 UTC
------ Additional Comments From yvchan.com  2003-19-10 01:10 -------
New data point with the original -4 kernel on AMD64:  We reproduced the problem 
with the db2ipccm spinning forever in kernel space on an AMD 64 machine.

I can't log into the machine anymore, but it is still up and running with the 
following information from top: 

 00:47:57  up 1 day,  6:58,  2 users,  load average: 2.02, 2.01, 2.00
461 processes: 457 sleeping, 4 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.9%    0.0%   81.1%  11.3%     0.0%    0.1%    6.3%
           cpu00    0.0%    0.0%   77.3%  22.6%     0.0%    0.0%    0.0%
           cpu01    1.9%    0.0%   84.9%   0.0%     0.0%    0.3%   12.6%
Mem:  2036592k av, 1785040k used,  251552k free,       0k shrd,  116160k buff
                    813188k actv,  368468k in_d,   32188k in_c
Swap: 2048276k av,    4300k used, 2043976k free                 1073852k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    4 root      25   0     0    0     0 RW   41.6  0.0  1551m   1 keventd
29421 db2inst1  19   0 33820  33M 31492 R    38.6  1.6 656:01   0 
db2ipccm         

The machine isn't responding to commands well, but this is most likely due to 
the 2 CPUs being "engaged" with something.  Any chance someone can say 
whether or not keventd is behaving normally?

At any rate, I've just restarted the testing on the first machine with the new 
kernel.  I'll check it in the morning. 

Comment 17 IBM Bug Proxy 2003-10-19 16:07:16 UTC
------ Additional Comments From yvchan.com  2003-19-10 12:04 -------
We're looking really, really good right now :)

We have 96 of the 100 clients still running (the other 4 encountered log-full 
conditions and exited...  perfectly normal).  The other 96 have been steadily 
getting new connections and running apps, according to the timestamps that I 
see.  

As a side note, the other issue we had seen, with a db2sysc spinning in 
userspace with calls to msgsnd, has not appeared either.  (This would have 
happened almost at the same time as, or just before, the db2ipccm spinning in 
system space.)  We are at the 10 hour mark at the moment, and I'll let this go 
for the rest of the day and check in on it later today.

I think we are pretty safe to say that the patch removed was the culprit.

Great Job folks!  :) 

Comment 18 IBM Bug Proxy 2003-10-19 16:22:22 UTC
------ Additional Comments From yvchan.com  2003-19-10 12:16 -------
Recommendation from me (in case my last note wasn't clear):  Red Hat, please 
remove this patch from your kernel before you release.  DB2, RH, and our joint 
customers would save a lot of time by simply being able to work "out of the box". 

Comment 19 IBM Bug Proxy 2003-10-20 13:23:53 UTC
------ Additional Comments From yvchan.com  2003-20-10 09:20 -------
Monday morning: the run has been going for 1 day, 8 hours, and about 15 
minutes.  82 of the clients are still running, and getting jobs :)

I think we're pretty safe to say we found the offending patch. 

Comment 20 Bob Johnson 2003-10-20 13:43:43 UTC
Does this occur only under heavy load ?
And if so, please identify heavy load.
Thanks.

Comment 21 IBM Bug Proxy 2003-10-20 14:32:38 UTC
------ Additional Comments From yvchan.com  2003-20-10 10:26 -------
For a normal validation run (as defined in the validation kit), we expect 300 
clients for at least 24 hours.  In the case with the original RHEL 3 kernel, we 
couldn't get 100 clients for 1 hour.

This appears to occur even with light to moderate load on the system.  The load 
average on the system when this occurs is < 5 in most cases.  There are several 
examples of top output in this bugzilla that show the load average on the 
system when this happens.

yvonne. 

Comment 22 Bob Johnson 2003-10-20 14:49:10 UTC
This was originally identified as heavy load on the 10th.
Now just a few days before ship it is identified as basically any load.

Your version of DB2 does not ship until November 8th, I suggest we look at this
for an errata, not delay the GA.


Comment 23 Bob Johnson 2003-10-20 15:08:09 UTC
Our concern is the following - we know we fix this particular issue with the lock,
but what else do we break at this time by taking the patch out?  This is our
only middleware that has run into this, so did the patch help?  We don't know.


Comment 24 IBM Bug Proxy 2003-10-20 15:12:54 UTC
------ Additional Comments From yvchan.com  2003-20-10 11:06 -------
I assume by this, you mean critical update errata available immediately after 
your GA available via RHN? 

Comment 25 IBM Bug Proxy 2003-10-20 15:28:06 UTC
------ Additional Comments From yvchan.com  2003-20-10 11:23 -------
This is our normal load testing. It is designed to attain a heavy load and to 
max out system resources. In this instance we did not get to the point of a 
heavy load nor max out the resources.

We believe that any application using IPC resources with SMP will hit this 
issue. 

Comment 26 IBM Bug Proxy 2003-10-20 15:32:55 UTC
------ Additional Comments From khoa.com  2003-20-10 11:29 -------
If the offending patch is only to improve performance, then it would be
safe to take it out - especially if you run into a functional problem
like this. 

Comment 27 IBM Bug Proxy 2003-10-20 15:47:53 UTC
------ Additional Comments From dmosby.com  2003-20-10 11:42 -------
It does not require heavy stress to hit this bug. Two processes on an 
otherwise idle multi-processor system could hang the OS if they both entered 
the ipc code at the same time. Running an ipc-intensive application simply 
increases the odds of hitting the bug.

We feel confident that the two kernels recently tested are far less likely to 
hit the problem. Until we understand what the problem was, and the code change 
made we cannot really say if the problem has been solved or has now become far 
more difficult to hit.

The problem observed should not be considered a DB2 issue. Any software using 
ipc could potentially hang the OS. The DB2 test simply makes a good ipc stress 
test. I am not familiar with the kernel acceptance testing done on an SMP 
system in the area of ipc testing. It may be that testing in this area should 
be increased. 

Comment 30 Bob Johnson 2003-10-20 17:50:59 UTC
For Dale and Yvonne at IBM:

Can we get some backtraces, e.g. with nmi_watchdog enabled and/or sysrq-t?

Comment 31 IBM Bug Proxy 2003-10-20 19:37:32 UTC
------ Additional Comments From dmosby.com  2003-20-10 15:31 -------
Per request, I am placing some of my notes regarding investigation of the
bug in the bugzilla entry. These notes come from two debugging sessions on
a system running the DB2 tests in the condition where one process
was "spinning" in the kernel. I was able to use the second processor to
do some debugging.

The systems were not set up to take a crash dump. I made use of
/proc/sysrq-trigger to obtain task stacks and then examined the spinning
process.

Some of the kernel output is rather wide, and may be difficult to read in
Bugzilla. I am going to try to format it a bit by trimming some of the
information. For example, a typical trace line in dmesg:
Oct  9 20:16:09 rhel3rc2 kernel: [<c011ce00>] smp_apic_timer_interrupt 
[kernel] 0x0 (0xf1da1f08)
I will shorten to:
[<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xf1da1f08)

The following is likely to only be of interest to those familiar with the
kernel. For a conversion to text see comment number 14.

---------------------

Debugging session October 9. uname -a information:
Linux rhel3rc2 2.4.21-3.ELsmp #1 SMP Fri Sep 19 14:06:12 EDT 2003 i686 i686 
i386 GNU/Linux

Spinning process task call trace:
Trimmed text "Oct  9 20:16:09 rhel3rc2 kernel:".

 db2sysc       R F1DA1F14  1792  3325   3320  3335          3326 (NOTLB)
 Call Trace:   [<c011cf1d>] smp_apic_timer_interrupt [kernel] 0x11d 
(0xf1da1ef0)
 [<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xf1da1f08)
 [<c01a0068>] isofs_read_super [kernel] 0x128 (0xf1da1f34)
 [<c01aab1c>] .text.lock.shm [kernel] 0x131 (0xf1da1f3c)
 [<c0113251>] sys_ipc [kernel] 0x1b1 (0xf1da1f90)

Current return address on stack is:  c0113251
This is where we will return.
Address above is a call (must be): 0xc01aa040

From vmlinuz-2.4.21-3.ELsmp

c0113248:   89 54 24 04           mov    %edx,0x4(%esp,1)
c011324c:   e8 ef 6d 09 00        call   0xc01aa040           <---------
c0113251:   85 c0                 test   %eax,%eax
c0113253:   89 c2                 mov    %eax,%edx

From System.map

c01a9600 t shm_get_stat
c01a96d0 T sys_shmctl
c01aa040 T sys_shmat           <--------
c01aa480 T sys_shmdt
c01aa540 t sysvipc_shm_read_proc

We are in sys_shmat
Only places we call in sys_shmat are:

c01aa18b:   e8 e0 a6 ff ff      call   0xc01a4870      c01a4870 T ipcperms
c01aa281:   e8 2a 90 f9 ff      call   0xc01432b0      c01432b0 T do_mmap_pgoff
c01aa3a0:   e8 5b eb ff ff      call   0xc01a8f00      c01a8f00 t shm_destroy
c01aa3e3:   e8 48 97 f9 ff      call   0xc0143b30      c0143b30 T find_vma

Routine bounds are: 0xc01aa040 to 0xc01aa480
Jump out of routine:

c01aa0fe:    f0 fe 08             lock decb (%eax)
c01aa101:    0f 88 15 0a 00 00    js     0xc01aab1c   <-- addr shown on stack

-----------------------------------
Debugging session 15-Oct-03 follows.

Output from "top"
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    3.4%    0.0%  104.0%   0.0%     0.0%   92.2%    0.0%
           cpu00    3.5%    0.0%    4.1%   0.0%     0.0%   92.2%    0.0%
           cpu01    0.0%    0.0%  100.0%   0.0%     0.0%    0.0%    0.0%
Mem:  1028452k av, 1005712k used,   22740k free,       0k shrd,   25284k buff
                    568392k actv,   72752k in_d,   13848k in_c
Swap: 2040244k av,  879248k used, 1160996k free                  241588k cached

  PID USER     PRI  NI  SIZE  RSS SHARE WCHAN     LC STAT %CPU %MEM   TIME CPU 
COMMAND
 2910 db2inst1  19   0  7792 3652  3648            1 R    99.9  0.3  1013m   
1 /home/db2inst1/absolut/absolut/APPS/am...

Process pid 2910 appears to be spinning on CPU 1. An strace of pid 2910 hangs,
producing no output.

Trace from messages, using sysrq-trigger:

 am119c        R 00000001  3908  2910   2909                     (NOTLB)
 Call Trace:   [<c011cf1d>] smp_apic_timer_interrupt [kernel] 0x11d 
(0xc5d29d2c)
 [<c011ce00>] smp_apic_timer_interrupt [kernel] 0x0 (0xc5d29d44)
 [<c01a8dc0>] .text.lock.sem [kernel] 0xca (0xc5d29d78)
 [<f887ca27>] ext3_do_update_inode [ext3] 0x177 (0xc5d29e18)
 [<c013d7e5>] vm_set_pte [kernel] 0x75 (0xc5d29e58)
 [<c013fe20>] do_wp_page [kernel] 0x6c0 (0xc5d29e84)
 [<c0141397>] handle_mm_fault [kernel] 0x217 (0xc5d29ec0)
 [<c011f78c>] do_page_fault [kernel] 0x14c (0xc5d29ef4)
 [<c014956f>] generic_file_write [kernel] 0x13f (0xc5d29f48)
 [<f8877149>] ext3_file_write [ext3] 0x39 (0xc5d29f74)
 [<c01130f9>] sys_ipc [kernel] 0x59 (0xc5d29f90)
 [<c011f640>] do_page_fault [kernel] 0x0 (0xc5d29fb0)

Last value on stack "c011f640" appears to be "noise on the stack".
The instruction before this is not a call.

c011f633:   eb c7                 jmp    c011f5fc <bust_spinlocks+0x2c>
c011f635:   8d 74 26 00           lea    0x0(%esi,1),%esi
c011f639:   8d bc 27 00 00 00 00  lea    0x0(%edi,1),%edi   <-- not a "call"
c011f640 <do_page_fault>:
c011f640:   55                    push   %ebp
c011f641:   57                    push   %edi
c011f642:   56                    push   %esi

So looking at "c01130f9".

c01130e5:   8b 44 24 40         mov    0x40(%esp,1),%eax
c01130e9:   89 5c 24 08         mov    %ebx,0x8(%esp,1)
c01130ed:   89 44 24 04         mov    %eax,0x4(%esp,1)
c01130f1:   89 3c 24            mov    %edi,(%esp,1)
c01130f4:   e8 97 4d 09 00      call   c01a7e90 <sys_semtimedop>  <-- call
c01130f9:   89 c2               mov    %eax,%edx                  <-- addr on 
stack
c01130fb:   89 d0               mov    %edx,%eax
c01130fd:   8b 5c 24 20         mov    0x20(%esp,1),%ebx

From System.map:

c01a7d40 t alloc_undo
c01a7e60 T sys_semop
c01a7e90 T sys_semtimedop            <-- matches call
c01a8560 T sem_exit
c01a8870 t sysvipc_sem_read_proc 

Comment 32 Arjan van de Ven 2003-10-20 19:40:52 UTC
would it be possible to attach the full sysrq-t to bugzilla? attachments don't
have the size limits that normal text has

Comment 33 IBM Bug Proxy 2003-10-20 20:10:38 UTC
------ Additional Comments From dmosby.com  2003-20-10 16:00 -------
 
As requested, here is the task trace information: the task stack traces taken 
from /var/log/messages, generated by sysrq-trigger. 

Comment 34 IBM Bug Proxy 2003-10-20 20:11:14 UTC
Created attachment 95319 [details]
TaskTraces.txt

Comment 35 Larry Woodman 2003-10-20 21:34:53 UTC
OK, we found a potential problem and fixed it in the semaphore lock patch.  
We built a test kernel that we need you to try, its in:

http://people.redhat.com/~lwoodman/.for_ibm/




Comment 36 IBM Bug Proxy 2003-10-21 05:57:35 UTC
------ Additional Comments From dmosby.com  2003-21-10 01:54 -------
We have about 4 hours of the DB2 test running on kernel
2.4.21-3.EL.IBM.2smp so far tonight. System appears OK so far. 

Comment 37 Arjan van de Ven 2003-10-21 07:20:45 UTC
That is good news!

One thing: the bug we fixed triggered ONLY when userspace was going to use an IPC
primitive while at the same time freeing that same primitive (which is why this
didn't show up before; typically applications stop using a resource before
freeing it). I'm mentioning this because it may point at a locking bug/race in DB2. 
(That's no excuse for the kernel to die, but it may be something IBM might be
interested in investigating.)

Comment 38 Ingo Molnar 2003-10-21 08:26:51 UTC
i.e. a semop() was done on the semaphore in one process, in parallel to an
IPC_RMID done by another process.

Is there perhaps some forced shutdown that happens during the test?
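
(For illustration, the race described here can be exercised from userspace
with a sketch like the one below; it is a hypothetical reproducer, not taken
from DB2 or from this report. One process loops on semop() while another
removes the same semaphore set with IPC_RMID; on a correct kernel the semop()
caller simply starts getting EINVAL or EIDRM and can never wedge the kernel.)

/* Hypothetical reproducer sketch (not from the bug report): the parent loops
 * on semop() while the child destroys the same semaphore set via IPC_RMID.
 * Correct behaviour is that semop() starts failing with EINVAL or EIDRM. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    /* +1 then -1 in a single semop() call, so the semaphore value stays 0 */
    struct sembuf ops[2] = { { 0, 1, 0 }, { 0, -1, 0 } };
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (semid < 0) {
        perror("semget");
        return 1;
    }

    if (fork() == 0) {                   /* child: remove the set shortly */
        usleep(1000);
        semctl(semid, 0, IPC_RMID);
        _exit(0);
    }

    for (;;) {                           /* parent: keep using the set */
        if (semop(semid, ops, 2) < 0) {
            perror("semop");             /* expected once the set is gone */
            break;
        }
    }
    return 0;
}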

Comment 39 Bob Johnson 2003-10-21 13:27:19 UTC
What additional testing has IBM been able to do outside of this stress test ?
Or what testing are you planning to do ? 

Comment 40 Greg Kelleher 2003-10-21 16:09:16 UTC
Following an extensive evaluation we have concluded that there is no primitive 
mixup in the code (which runs identically on a large number of other UNIX platforms 
plus Windows). Right now the bug is clearly identified as being 
related to the performance fix that was removed from the code over the weekend. 

Comment 41 IBM Bug Proxy 2003-10-21 17:17:35 UTC
------ Additional Comments From yvchan.com  2003-21-10 13:15 -------
I find it hard to believe that the kernel without the "linux-2.4.21-semaphore-
scale.patch" would work fine and then, with this patch, the same kernel would 
not "die" per se, but cause the exact same processes to be "stuck" in kernel 
space.  This makes the assumption that it is a kernel problem, and not a DB2 
problem.  Why would we not exhibit this problem with the RHAS 2.1 kernels, or 
with the kernel that has this patch removed?  We run this same validation kit 
in many places and have yet to experience the same problem.  The DB2 code for 
this area is not specific to Linux, and works just as well on AIX, Sun and HP. 

Comment 42 Arjan van de Ven 2003-10-21 17:38:35 UTC
wo wo wo
Nowhere did I say or imply that this was not something we should fix in the
kernel. In fact I said it was something we should fix. I was just trying to be
helpful by saying that the thing we fixed, if it helped, could point to an
application locking issue since it has to do with use-in-parallel-with-free.

We're still waiting for the final test results of our suggested fix kernel.


Comment 43 IBM Bug Proxy 2003-10-21 19:40:39 UTC
------- Additional Comment #47 From Dale E. Mosby  2003-10-21 15:27 -------

DB2 test status on latest kernel: 2.4.21-3.EL.IBM.2smp #1 SMP 
The test was started last night and is still running, now about 18
hours of testing and no spinning process, no system hang.
This kernel does not appear to hit the problem.


Comment 44 IBM Bug Proxy 2003-10-21 20:04:28 UTC
------ Additional Comments From dmosby.com  2003-21-10 16:01 -------
I have taken an initial look at the code changes made in the latest kernel,
comparing with the original RHEL 3.0 kernel. I'll comment more on that in a
moment.

This mail is a bit long and technical.  So let me present the short technical
summary here:
  There is a serious kernel bug in the RHEL 3.0 code as I saw it last week
  used for DB2 testing. Under no circumstances should a user application be
  able to halt the operating system. I consider this to be a critical error,
  though the level of seriousness may be open to debate. This is not an error
  in DB2, this is a Linux kernel error.

The information received seems to indicate that a fix was made to solve a
kernel error in the case where an application "uses an IPC primitive while
at the same time frees the same primitive".  I am not quite sure how to
interpret this. It seems that Red Hat is indicating that DB2 continues to
use an IPC primitive after releasing it, or at least this is the case their
change addresses.

I interpret their comment to mean that they believe the application releases
some Inter-Process Communication (IPC) element, and then attempts to use
the element. This may be the case, but as our application works fine on
RH Linux (with the kernel fix) and other platforms as well, it would appear
that there is no serious logic flaw in DB2.

Even if DB2 happens to be using an IPC function (shared memory, semaphore,
message) in a nonstandard way, there is clearly a critical kernel bug as no
user process should be able to hang the operating system in this manner.
Whether or not an IPC element did once exist or was never present, the kernel
should simply return an error to the process attempting to use the IPC
element. In addition to the fact that we know DB2 can hang a system, it should
be stressed that any user application can do the same. Some simple messaging
application could lock up a RHEL 3.0 kernel as initially provided to us. This
should be of great concern. 

From the information available when looking at the spinning process on the
lab system I have no way of determining if DB2 was trying to use some IPC
element which had been released - as Red Hat indicates their latest kernel
fix is addressing.

I have taken a first look at the code changes. The entire IPC area is quite
complex. In fact the comment for "sys_ipc", the system entry point for IPC
calls, is "This is really horribly ugly.". I hate coming across code comments
like that.

In going from RHEL 2.1 (or the Linux 2.4 tree) to RHEL 3.0 the Inter-Process
Communication (IPC) code had a performance improvement.  Interestingly, the
Linux 2.4 to 2.6 tree also made a performance improvement in this area, but
used different code to accomplish this. I am curious to know how the DB2
tests execute on the Linux 2.6 kernel.

The IPC code uses a structure which contains an array of "entries" that each
manage one IPC object. The exact details of the structure aren't important.
There is a "lock" for the overall structure and a lock for each entry in the
structure.  Something like:
    struct ipc_ids {
        lock_for_ipc_ids
        stuff...
        struct ipc_id entries[] {
            lock_for_entry
            stuff...
        }
    }
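
(In slightly more concrete, though still simplified, C terms this corresponds
roughly to the sketch below. Field names follow 2.4's ipc/util.h where known;
the per-entry lock is the scalability patch's addition, not part of the stock
structures.)

    struct kern_ipc_perm {
        spinlock_t lock;          /* per-entry lock (added by the patch) */
        key_t      key;
        /* uid, gid, mode, seq, ... */
    };

    struct ipc_id {
        struct kern_ipc_perm *p;  /* NULL if the slot is unused */
    };

    struct ipc_ids {
        int size;                 /* current size of entries[] */
        int in_use;
        int max_id;
        /* overall lock: spinlock "ary" in stock 2.4; the global
         * br_read_lock/br_write_lock pair in the RHEL 3.0 patch */
        struct ipc_id *entries;   /* grown (reallocated) as needed */
    };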

In the initial RHEL 3.0 kernel I examined (2.4.21-3.ELsmp #1 SMP ) the code
to  access an IPC element looks something like:
    1) obtain read lock on ipc_ids structure
    2) obtain spin lock on individual element of array entries

The idea here is that we use a read lock to prevent manipulation of the
overall data structure, and then a spin lock to allow only single access
to each element within the array of IPC entries.

With read locks the idea is that you can have lots of readers, but only one
writer. Granting access to an individual element is considered a read - meaning
that you can have multiple processes at the same time, each having a spin
lock on different elements of the array entries. Whenever there is a write 
lock, all readers must block until the write lock is released.

A write lock is used when we grow the array.

In the new kernel we received, 2.4.21-3.EL.IBM.2smp, the code retains the
same read lock implementation for most accesses to the structure; however,
there are two places where a write lock has been used rather than the read
lock (in ipc/sem.c when returning an error to the process, and in ipc/msg.c in
sys_msgctl, the OS interface for message control operations - see "man
msgctl").

The effect of the introduced write locks is to serialize processes in this
area. This may be considered a performance impact, though clearly not as big
a performance impact as having the OS hang.

There are three structures "ipc_ids" like so:
 File msg.c:  struct ipc_ids msg_ids;
 File sem.c:  struct ipc_ids sem_ids;
 File shm.c:  struct ipc_ids shm_ids;

The RHEL 3.0 code uses br_read_lock(BR_SEMAPHORE_LOCK) when locking any of
these structures. It seems to me, given my current understanding, that
unrelated IPC operations are locking all of the structures. I am
wondering how things would work with a code change such that br_read_lock
was called with three separate locks - one unique to each structure. This
would seem to be a performance improvement, and if we did this with the
original kernel and got the hang we could observe which lock was held.
At this time I am not aware of any inter-dependence of the structures that
makes it necessary to have a single lock. The change would associate a lock
with each structure - much as in the structure I illustrated.

A question comes to mind: Is it possible that the hang was hit due to writer
starvation? In other words, with a two processor system can we have a spin
on the br_write_lock call due to having a high number of readers? I haven't
analyzed the code enough to know how we guarantee that we always service a
writer. Someone more familiar with the code would likely know the answer,
just an area for thought.

NOTES:  1) I have been focusing on the changes in ipc directory and code
that I initially called out as being the problem area. There may be more
changes, I don't claim this is the complete list, 2) I use the term "lock",
in reality there are locks and semaphores, 3) I have simplified the details
while trying to cover what I felt was important for this discussion  - if
this is passed on to folks working on the kernel this should be obvious. 

Comment 45 Ingo Molnar 2003-10-21 20:34:34 UTC
a couple of comments. It got a bit long, because many issues were raised.

1)

the "this is really horribly ugly" comment in sys_ipc() is an old comment from
Linus, and he only meant the multiplexing of the IPC ops. The core SysV IPC code
is pretty acceptable otherwise and has gone through several iterations of
refinement during the years.

2)

one actual race that you might trigger is to release a SysV semaphore via
IPC_RMID while it's used by another process via semop(). A narrow race window
will suffice, e.g. during process shutdown. Other kernels won't show this race;
you'd get a safe -EINVAL on the semaphore, which if it happens during process
shutdown can get ignored easily. Obviously it's perfectly legal to _attempt_ to
use a semaphore in parallel to destroying it, it's just pointless. The kernel
should of course not deadlock when an application does that.

(there might be other user-level semaphore use races that might trigger the
scenario, but this is the most obvious one. In any case, DB2 is within its
rights to do whatever it wants to with the semaphores - the kernel should not
lock up. We only suggested a scenario that could trigger the lockup, which
scenario, if avoided in userspace [if it's possible to avoid it], could work
around the kernel bug. And generally it might just be practical to fully
understand how that race could have happened (if it indeed happened),
irrespective of the kernel's lockup.)

3)

the observation that the fixed behavior is caused by the removal of the
semaphore patch is incorrect - in the latest kernel we fixed a rare race in the
new semaphore code itself, we didn't remove the patch.

4)

the 'slowdown' caused by the fix should be near zero. The only code affected is
an utter slowpath: semaphore array destruction. The SysV semaphore
creation/destruction path was never a target of optimizations - in fact even the
2.6 kernel completely and globally serializes all semaphore creation and
destruction in the system, via the sem_ids.sem kernel mutex. The main
optimization target was always semaphore use.

5)

brlock readers cannot be starved on a 2-CPU system - and the debug info seems to
show reader spinning. One possibility for the lockup would be memory corruption
caused by semop() using a freed semaphore.

And even if some form of brlock _writer_ starvation happens on, say, a 64-way
system, it will most likely show up as inefficiency, not as an outright
deadlock. Most starvation situations are unstable and don't show up as persistent
lockups. In particular they don't show up as hard lockups (which the 'NumLock keys
don't work' comment suggests) - the IPC brlock is never used with irqs disabled.


Comment 46 Ingo Molnar 2003-10-22 06:36:02 UTC
I think it would be a good idea to also run this DB2 test on a 4-way or 8-way
system as well - they are exponentially more likely to trigger any SMP races.

the observations about the locking changes are correct - the bug was caused by
incorrect access to the ids->entries[] array. 2.6 locking is different as it
utilizes RCU to update ids->entries[]. Btw., RCU is even slower in the slow-path
than the write-lock portion of brlocks. The read-path should scale similarly
well. Original 2.4 uses a global, system-wide spinlock to protect ids->entries[].

Comment 47 IBM Bug Proxy 2003-10-23 16:02:38 UTC
------ Additional Comments From dmosby.com  2003-23-10 12:00 -------
This is a less technical description of the problem encountered. The intent
is to convey the general nature of the problem to someone not having kernel
source access and not interested in the internal details of the problem.
-----

The initial release version of the RHEL 3.0 kernel contains a bug in the
IPC (Inter-Process Communication) code. This problem can be triggered on
a multiprocessor system by multiple processes using IPC facilities
(semaphores, shared memory, and messages). The problem will be seen as
a process spinning in the kernel or by a system hang.

The problem arises due to a performance improvement made in this release.
The kernel makes use of a data structure in the management of each type
of IPC facility. Historically this structure would be protected by a global
lock while a process obtained access to individual array entries held within
the structure. In the RHEL 3.0 release a performance improvement is made by
using a reader-writer semaphore to allow multiple processes to obtain access
to separate array elements.

The problem comes about due to insufficient protection of access to the data
structure in some cases. A fix involving additional write locking in specific
code paths has been identified and will be released after complete testing. 

Comment 48 Tim Burke 2003-11-11 15:56:37 UTC
Any news from IBM on their testing of the test kernel provided? 
Ingo's locking fixes have been incorporated into U1.  Setting the
state of this bugzilla to modified. IBM crew - please verify the fix
in the beta versions of U1 in addition to the test kernels you already
have.

Comment 49 IBM Bug Proxy 2003-11-18 23:50:01 UTC
------ Additional Comments From dmosby.com  2003-18-11 12:51 -------
Status update: I have made a couple requests for testing status. Just made 
another today. If I don't get a response in a week and don't receive any 
objections I propose to close the bug as fixed. I don't see a bug state 
of "fixed awaiting testing" so I will leave the state as is for now. 

Comment 50 IBM Bug Proxy 2003-11-18 23:50:39 UTC
------ Additional Comments From yvchan.com  2003-18-11 13:18 -------
We have not received the U1 test kernel and are not expected to receive it 
until early Dec.

Please do not close this off until we have received the U1 kernel. 

Comment 51 mark wisner 2003-11-22 21:24:05 UTC
------ Additional Comments From dmosby.com  2003-21-11 17:14 -------
Expectation is that U1 arrives early December. Waiting to test this. 

Comment 52 Bob Johnson 2003-12-15 20:11:42 UTC
U1 available today on rhn for partner testing.

Comment 53 IBM Bug Proxy 2004-01-05 16:11:26 UTC
------ Additional Comments From fvieira.com  2004-01-05 10:54 -------
We are currently testing U1 on all platforms - we will have more information 
on this over the next couple of days. 

Comment 54 Ernie Petrides 2004-12-03 02:02:48 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-017.html


Comment 55 IBM Bug Proxy 2005-02-20 22:01:36 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|FIXEDAWAITINGTEST           |ACCEPTED




------- Additional Comments From khoa.com  2005-02-20 16:58 EST -------
Can we close this bug report now ?  If so, please close it.  Thanks. 

Comment 56 IBM Bug Proxy 2005-04-28 13:23:35 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED
             Impact|------                      |Functionality




------- Additional Comments From yvchan.com  2005-04-28 09:16 EDT -------
Definitely fixed.  Closing on the IBM side by yvchan@ibmca