Bug 84452

Summary: RHEL AS2.1 QU3 errata: System hangs with 2.1 AS (timer.c)
Product: Red Hat Enterprise Linux 2.1
Reporter: Larry Troan <ltroan>
Component: kernel
Assignee: Jason Baron <jbaron>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Docs Contact:
Priority: medium
Version: 2.1
CC: brian.b, hfuchi, ichute, knoel, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2003-12-19 19:25:55 UTC
Bug Depends On:    
Bug Blocks: 87937, 90549    
Attachments:
Description                     Flags
timer.c.patch                   none
timer_fix.patch (later patch)   none
rsana.1518.tar.gz               none
timer.txt (sysrq info)          none
newsysrqinfo                    none
sysrqw_may02                    none

Description Larry Troan 2003-02-17 16:37:21 UTC
We have found that the timers in the OS are getting into a loop and are not
getting updated. 

If we hammer the system with network I/O, the system eventually goes into the
weeds.  With more debugging, we have found a bug in a timer in the kernel.  We
have fixed the bug and are rerunning the test.
----------
Action by: Bryan.Leopard
Issue Registered
----------
Action by: ltroan


ltroan assigned to issue for HP-ProLiant.

Category set to: Kernel
Status set to: Waiting on Client

----------
Action by: Bryan.Leopard
Larry,

We talked with Sue about this yesterday.  We were trying to get this into the
e.12 errata before it went out the door... Here is the information from our
engineers and attached is the patch that we feel fixes it.  L Woodman has it
already via email.

Bryan

This has been reproduced on several SMP systems here, such as the DL580 G2 (4
processors), with 4 Gbit NICs running either the tg3 or bcm5700 driver. The ftp
test copies files between the test systems. It usually fails in 2-8 hours but
may take as long as 16 hours. 

The patch has run on two systems for over 6 hours and is still running while on
another system it ran for just under 2 hours before the system locked up. On
this last system, we didn't have the BUG check code. We're adding the bug check
code in __run_timers so that we're sure we are seeing the same problem.

File uploaded: timer.c.patch

----------
Action by: Bryan.Leopard


Status set to: Waiting on Tech

----------
Action by: ltroan
This will be fixed in AS2.1Q2 errata (April-beta, May-ship).



----------
Action by: arjanv
this looks like a nice workaround for a buggy use of timers, on first sight.


ISSUE TRACKER 15751 opened as sev 1

Comment 1 Larry Troan 2003-02-17 16:38:23 UTC
Created attachment 90135 [details]
timer.c.patch

Comment 2 Larry Troan 2003-03-06 23:03:27 UTC
NEEDED FOR AS 2.1 Q2 ERRATA.

Comment 3 Larry Troan 2003-03-14 03:16:57 UTC
lwoodman says will be fixed in Q2 errata.

Comment 4 Larry Troan 2003-04-11 01:37:53 UTC
FROM ISSUE TRACKER...
Event posted 04-10-2003 02:53pm by Bryan.Leopard with duration of 0.00
Retesting with rhas21 errata QU2 beta, which has the 2.4.9-e.17 kernel.  Using latest
broadcom 6.0.2a, but am getting errors 'do_IRQ:  stack overflow' logged in
dmesg.  Will restart with bcm5700 6.0.2b

Comment 5 Larry Troan 2003-04-11 01:50:03 UTC
If the error is associated with the Broadcom driver as indicated above, Red Hat
asks that you reproduce the error with the tg3 driver in the errata -- it
supports all bcm chip sets except the 5705. If the problem can not be reproduced
with the open sourced tg3 driver, Red Hat will not be able to address this problem.

Comment 6 Larry Troan 2003-04-23 18:52:46 UTC
Created attachment 91257 [details]
timer_fix.patch (later patch)

Comment 7 Larry Troan 2003-04-23 18:55:49 UTC
FROM ISSUE TRACKER>>>>>>>>>>
Event posted 04-16-2003 01:40pm by brian.b with duration of 0.00
timer_fix.patch
I have asked QA to tell me whether this is broken with the tg3 driver.  While
waiting on that, here is the patch that should be used (not the one earlier in
the thread).  The do_IRQ error is a separate issue and should probably be
ignored for this issue's purposes.

File uploaded: timer_fix.patch
----------------------

Event posted 04-16-2003 02:35pm by brian.b with duration of 0.00
This issue occurs also with the tg3 driver.  Please use the patch above to
correct the problem.
----------------------

Event posted 04-22-2003 04:03pm by brian.b with duration of 0.00
Please ignore the do_IRQ problem.  This is not a Red Hat problem and is separate
from this issue.  The system does fail without the timer_fix.patch regardless of
whether you use tg3 or broadcom.  This patch needs to be included in the QU2
release.

Comment 9 Larry Troan 2003-04-24 14:21:58 UTC
Comment on attachment 90135 [details]
timer.c.patch

superseded by April patch

Comment 10 Larry Troan 2003-04-25 13:24:40 UTC
THIS IS ALSO ISSUE TRACKER 19520.... (PRESALES ISSUE FROM BRISTOL WEST).

Comment 11 Larry Troan 2003-04-28 21:00:58 UTC
Created attachment 91362 [details]
rsana.1518.tar.gz

Comment 13 Larry Troan 2003-04-28 21:38:52 UTC
Created attachment 91371 [details]
timer.txt (sysrq info)

Comment 14 Jason Baron 2003-04-30 20:37:26 UTC
The attached patch basically modifies add_timer to act like mod_timer. This
shouldn't be necessary. It seems that we do not fully understand the root cause
of the issue. 2.5 has similar timer code and doesn't need modify semantics for
add. It feels to me like a driver might not be using the semantics that the
timer.c code expects. For example, a driver might call add_timer, then set the
timer list fields to NULL, and then do a mod_timer.

Comment 16 Larry Troan 2003-05-05 14:24:17 UTC
Additional Information from Issue Tracker......
Event posted 05-02-2003 02:57pm by brian.b with duration of 0.00 
  
As for the new kernel image with the additional sysrq support, we can't get to
porkchop...
  
It will be interesting to see the additional processor's states but I don't
think we'll get any new information. When we originally debugged this problem we
were using an ITP. After the system locked up, we could stop and examine the
state of all the CPUs. The primary hang was in __run_timers() always followed by
a different CPU attempting to del_timer_sync() on that same timer. The rest of
the CPUs weren't doing anything interesting. By the time the kernel falls into
the __run_timers() hole, the damage has been done long before.

------------
Event posted 05-02-2003 03:21pm by ltroan with duration of 0.00
Sorry for the confusion here. Porkchop is an internal user. The discussion this
morning was to get Jeff to push the image to a people page at redhat where you
could access/download the image. Will pursue this.

--------------------
Event posted 05-02-2003 03:27pm by jneedle with duration of 0.00

http://people.redhat.com/jneedle/.private/.hp/<filename>

kernel-2.4.9-e.18.3.i686.rpm             kernel-smp-2.4.9-e.18.3.i686.rpm
kernel-debug-2.4.9-e.18.3.i686.rpm       kernel-summit-2.4.9-e.18.3.i686.rpm
kernel-enterprise-2.4.9-e.18.3.i686.rpm

Larry, I e-mailed you the username and password for access to that directory.

------------------
Event posted 05-02-2003 04:11pm by ltroan with duration of 0.50
The -e18.3 kernel is for internal HP use ONLY !!!
And not to be given to any customer including Bristol West FOR ANY REASON !!!

You can download it as follows......
---------------------------------------------
Jeff has put the following files in

people.redhat.com:/jneedle/.private/.hp/

kernel-2.4.9-e.18.3.i686.rpm            
kernel-smp-2.4.9-e.18.3.i686.rpm
kernel-debug-2.4.9-e.18.3.i686.rpm      
kernel-summit-2.4.9-e.18.3.i686.rpm
kernel-enterprise-2.4.9-e.18.3.i686.rpm

This is not a browseable directory.  (edited)
They will need to tack on the exact file name.

The directory is ID and password protected. You can obtain these from me (Larry
Troan). Do not append them to this Issue Tracker or to the HP tracking tool.

Status set to: Waiting on Client

--------------------
Event posted 05-02-2003 05:04pm by tom.rhodes with duration of 0.00
We have downloaded the -e18.3 kernel and will run it over the weekend.

--------------------------
Event posted 05-02-2003 05:43pm by tom.rhodes with duration of 0.00
newsysrqinfo
New sysrq output from -e18 kernel attached showing more CPU status:
CPU 0-3 and 6-7 are idle
CPU 4 is in __run_timers or send_sig_info (called from __run_timers)
CPU 5 was never displayed and presumably is stuck in del_timer_sync trying to
delete the timer running on CPU 4

WRT comments from jbaron above, this doesn't look like a driver issue. In both
traces we've captured so far, the timer in question was an itimer being used by
the X server. In both traces the X server process is stuck in del_timer_sync
trying to delete the timer that __run_timers is stuck on.

File uploaded: newsysrqinfo

----------
Event posted 05-02-2003 05:47pm by tom.rhodes with duration of 0.00
Status set to: Waiting on Tech

----------
Event posted 05-05-2003 06:57am by arjanv with duration of 0.00

> For this reason, add_timer has no protection against another
> add_timer/del_timer/del_timer_sync/mod_timer running in parallel. By adding the
> same locking sequence that is used in mod_timer(), this problem is avoided.

exactly. And it doesn't need to! If the code doesn't provide that protection
itself it should use mod_timer not add_timer.

Comment 17 Larry Troan 2003-05-05 14:28:46 UTC
Created attachment 91498 [details]
newsysrqinfo

Comment 18 Larry Troan 2003-05-06 13:17:09 UTC
Created attachment 91515 [details]
sysrqw_may02

Comment 19 Larry Troan 2003-05-06 13:18:51 UTC
FROM ISSUE TRACKER............
Event posted 05-05-2003 10:58am by tom.rhodes with duration of 0.00
Notes : Attached results from 18.3 kernel and alt-sysrq-w output. Disappointing
results: only cpu 4 and 7 were reported, cpu4 was stuck in run_timers and cpu7
was idle. The alt-sysrq-t output shows that the X server process is stuck in
del_timer_sync.

File uploaded: sysrqw_may02

Comment 21 Larry Troan 2003-05-09 15:51:09 UTC
COMMENT FROM FEATUREZILLA 90549
------- Additional Comment #1 From Arjan van de Ven on 2003-05-09 11:45 -------
         
Action by: Bryan.Leopard
Retesting with rhas21 errata QU2 beta, which has the 2.4.9-e.17 kernel.  Using latest
broadcom 6.0.2a, but am getting errors 'do_IRQ:  stack overflow' logged in
dmesg.  Will restart with bcm5700 6.0.2b

we don't care. that driver is borken.
Please try to reproduce this without adding *ANY* drivers we don't ship.


Comment 22 Larry Troan 2003-05-15 22:51:32 UTC
FROM ISSUE TRACKER ....
Event posted 05-15-2003 12:10pm by brian.b with duration of 0.00       
This also occurs with the tg3 driver as noted in the following events:

Event posted 02-14-2003 09:40am by Bryan.Leopard
Event posted 04-16-2003 02:35pm by brian.b
Event posted 04-22-2003 04:03pm by brian.b

Status set to: Waiting on Tech

Comment 23 Larry Troan 2003-07-11 13:07:58 UTC
COMMENT FROM FEATUREZILLA 90549....
------- Additional Comment #7 From Larry Troan on 2003-07-11 09:05 -------          
While agreeing that this is likely a bug, we are waiting for Laurie at HP to
find out if this is a customer bug other than Bristol West. It was originally
reported as a Bristol West Sev 1 but I've been working with BW and HP,  and gave
the customer QU2 which has fixed all but an HP PSP (system monitor) problem that
apparently causes a kernel panic. HP L3 is investigating the PSP problem.

From BW/HP/RH teleconference minutes: RHES 2.1 now a supported operating system.
PSP v 6.40 is available for download from the web. URL sent in email during
conference call.
 
See Issue Tracker 21355 for complete Bristol West details.

Comment 24 Larry Troan 2003-07-15 11:44:58 UTC
FROM ISSUE TRACKER....
Event posted 07-15-2003 07:43am by ltroan with duration of 0.20       
FROM BRIAN BAKER..........
Hi guys,

I have confirmed with Laurie that there is not a known customer with the
add_timer issue.

Thanks, Brian.



Comment 25 Larry Troan 2003-07-15 11:47:14 UTC
Brian, this is being carried as a sev 1 in Issue Tracker. Suggest we lower its
severity to a sev 2. I've dropped the priority in Bugzilla from HIGH to NORMAL
and would like to drop the severity as well.

Comment 27 Larry Troan 2003-08-01 03:17:07 UTC
FROM ISSUE TRACKER...
Event posted 07-31-2003 09:17am by brian.b with duration of 0.00       
A little more info from our engineers:

This add_timer issue has shown up on the kernel mailing list:

http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&selm=eUvE.1U2.19%40gated-at.bofh.it

Andrea says below what we have been saying all along: "it's del_timer_sync
against add_timer". See my note on 4/29/03. Specifically del_timer_sync (due to
setitimer from the X server) running in parallel with an add_timer.

> On Wed, 30 Jul 2003, Andrea Arcangeli wrote:
> > > The thing triggered simply by running setitimer in one function, while
> > the it_real_fn was running in the other cpu. I don't see how 2.6 can
> > have fixed this, it_real_fun can still trivially call add_timer while
> > you run inside do_setitimer in 2.6 too. [...]
>
> This is not a race that can happen. itimer does this:
>
> del_timer_sync();
> add_timer();
>
> how can the add_timer() still happen while it_real_fn is still running on
> another CPU?
it's not add_timer against add_timer in this case, it's del_timer_sync against
add_timer.

cpu0                    cpu1
------------            --------------------
do_setitimer
                        it_real_fn
del_timer_sync          add_timer -> crash

Andrea

-------------------------------------------------------------------
Event posted 07-31-2003 10:51am by jneedle with duration of 0.00 
Yes, there are active discussions on this in LKML. Ingo and Andrea continue to
try and hash out the very complex interactions of the various timer routines and
a patch has been created for 2.4-based kernels that we are testing.  The entire
thread can be found here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=105949344419681&w=2

Hopefully this will be resolved shortly.  It is a showstopper for QU3.

Comment 28 Jason Baron 2003-08-20 21:06:35 UTC
i've committed Ingo's solution...performance testing is needed

Comment 29 Larry Troan 2003-09-05 13:44:26 UTC
FROM ISSUE TRACKER 
 Event posted 09-01-2003 08:56pm by brian.b with duration of 0.00
Can we get an early look at the solution?


Comment 37 Larry Troan 2003-09-18 13:54:38 UTC
Test Case.....run against Jason's people page code of 9/05 with failure in 1-2
hours...
JASON, IS QA CODE BEYOND THIS FAILURE ??? 
--------------------
The system locks up on an SMP box when setitimer() is called with an invalid
argument. This has been reproduced on several SMP systems. It usually fails in 1
- 6 hours but may take as long as 12 hours.

The problem can be reproduced as follows.

1. gcc test.c
2. while : ; do ./a.out ; done
3. netstat -c (run on another console)

The system locks up after 1 - 6 hours.

--------------- test.c ------------------

#include <stdio.h>
#include <sys/time.h>
#include <signal.h>
#include <unistd.h>

#define LOOP 100000000
/* Deliberately invalid timer values: negative seconds and microseconds. */
#define DELAY_SEC -1
#define DELAY_USEC -1

volatile int count = 0;
struct timeval tv[LOOP];
struct timezone tz;

/* Records one timestamp per call. Note that main() never installs this
 * function as a signal handler, so this test does not invoke it. */
void sig_action()
{
    gettimeofday(tv + count, &tz);
    count ++;
}

int main()
{
    struct itimerval value, ovalue;
    int i;

    /* Repeatedly arm ITIMER_REAL with the invalid values defined above. */
    for(i=273; i<LOOP; i++){
        value.it_value.tv_usec = DELAY_USEC;
        value.it_value.tv_sec = DELAY_SEC;
        value.it_interval.tv_usec = DELAY_USEC;
        value.it_interval.tv_sec = DELAY_SEC;
        setitimer(ITIMER_REAL, &value, &ovalue);
        printf("%3d : \n", i);
    }
    return 0;
}

--------------- test.c ------------------

Comment 38 Jason Baron 2003-09-18 15:39:52 UTC
unfortunately, the latest timer fixes do not address this test case yet.

Comment 39 Larry Troan 2003-09-26 13:12:40 UTC
When we have a fix that accommodates the above test case, HP would like to test
it in parallel with us. Can we get them a kernel when it's available?

Comment 41 Jason Baron 2003-09-26 15:15:11 UTC
agreed, state changing to assigned, and yes we will make the kernel available as
soon as we track this down.



Comment 42 Jeff Needle 2003-10-02 14:31:26 UTC
*** Bug 104297 has been marked as a duplicate of this bug. ***

Comment 50 Jeff Needle 2003-10-14 19:41:59 UTC
90549 technically is a dup of this.  90549 is a featurezilla and this one is a
bugzilla and I believe 90549 was put in because of the procedures that TAMs had
in place back when QU2 was being created.  90549 seems superfluous at this point.


Comment 51 Larry Troan 2003-11-07 14:19:10 UTC
HP requests early code on this if possible so they can begin testing
our latest fix. This is their #1 RHEL2.1 MUSTFIX bug.

Comment 53 Larry Troan 2003-11-21 14:25:07 UTC
FROM ISSUE TRACKER
Event posted 11-20-2003 05:40pm by brian.b with duration of 0.00     
  Re-tested with "2.4.9-e.27.28.test" kernel. (Timer tests) . System 
lockup observed within 5 min after starting the tests. Lock-up
reproduced three times with this kernel.



Comment 55 Tim Burke 2003-12-02 17:00:59 UTC
There were numerous timer.c related fixes incorporated into U3, which
address the majority of related issues.  This particular bugzilla
entry refers to a corner case which was not addressed.  Substantial
effort was expended to address this case, but it was concluded that a
non-intrusive resolution is not viable within the bounds of
compatibility constraints.  We do have some good ideas to address the
issue, but that would require us to break kernel
compatibility.  The outcome of our analysis is that this particular
bug represents an unlikely corner case which does not justify breaking
kernel compatibility.

Comment 56 Larry Troan 2003-12-14 22:12:24 UTC
HP-ProLiant requesting this be nominated to the Update4 MUSTFIX list.
However, RH Engineering feels the initial bug is fixed in U3 and that the
timer.c program is a non-realistic edge case. RH has asked that HP
recreate the original problem documented in Issue Tracker 15751.

I am therefore refraining from nominating this issue to the U4
list until it is recreated via the original test scenario.

Comment 57 John Flanagan 2003-12-19 19:25:55 UTC
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2003-408.html