Bug 84452
Summary: | RHEL AS2.1 QU3 errata: System hangs with 2.1 AS (timer.c)
---|---
Product: | Red Hat Enterprise Linux 2.1
Reporter: | Larry Troan <ltroan>
Component: | kernel
Assignee: | Jason Baron <jbaron>
Status: | CLOSED ERRATA
QA Contact: | Brian Brock <bbrock>
Severity: | high
Priority: | medium
Version: | 2.1
CC: | brian.b, hfuchi, ichute, knoel, tao
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2003-12-19 19:25:55 UTC
Bug Blocks: | 87937, 90549
Description
Larry Troan
2003-02-17 16:37:21 UTC
Created attachment 90135 [details]
timer.c.patch
NEEDED FOR AS 2.1 Q2 ERRATA. lwoodman says this will be fixed in the Q2 errata.

FROM ISSUE TRACKER...
Event posted 04-10-2003 02:53pm by Bryan.Leopard with duration of 0.00
Retesting with rhas21 errata QU2 beta, which has the 2.4.9-e.17 kernel. Using latest broadcom 6.0.2a, but am getting errors 'do_IRQ: stack overflow' logged in dmesg. Will restart with bcm5700 6.0.2b.

If the error is associated with the Broadcom driver as indicated above, Red Hat asks that you reproduce the error with the tg3 driver in the errata -- it supports all bcm chip sets except the 5705. If the problem cannot be reproduced with the open-source tg3 driver, Red Hat will not be able to address this problem.

Created attachment 91257 [details]
timer_fix.patch (later patch)
FROM ISSUE TRACKER>>>>>>>>>>
Event posted 04-16-2003 01:40pm by brian.b with duration of 0.00
timer_fix.patch
I have asked QA to tell me whether this is broken with the tg3 driver. While waiting on that, here is the patch that should be used (not the one earlier in the thread). The do_IRQ error is a separate issue and should probably be ignored for this issue's purposes.
File uploaded: timer_fix.patch
----------------------
Event posted 04-16-2003 02:35pm by brian.b with duration of 0.00
This issue occurs also with the tg3 driver. Please use the patch above to correct the problem.
----------------------
Event posted 04-22-2003 04:03pm by brian.b with duration of 0.00
Please ignore the do_IRQ problem. This is not a Red Hat problem and is separate from this issue. The system does fail without the timer_fix.patch regardless of whether you use tg3 or broadcom. This patch needs to be included in the QU2 release.

Comment on attachment 90135 [details]
timer.c.patch
superseded by April patch
THIS IS ALSO ISSUE TRACKER 19520.... (PRESALES ISSUE FROM BRISTOL WEST).
Created attachment 91362 [details]
rsana.1518.tar.gz
Created attachment 91371 [details]
timer.txt (sysrq info)
The attached patch basically modifies add_timer to act like mod_timer... this shouldn't be necessary. It seems that we do not fully understand the root cause of the issue. 2.5 has similar timer code and doesn't need modify semantics for add. It feels to me like a driver might not be using the semantics that the timer.c code expects. For example, a driver might call add_timer, then set the timer list fields to NULL, and then do a mod_timer.

Additional Information from Issue Tracker......
Event posted 05-02-2003 02:57pm by brian.b with duration of 0.00
As for the new kernel image with the additional sysrq support, we can't get to porkchop... It will be interesting to see the additional processors' states but I don't think we'll get any new information. When we originally debugged this problem we were using an ITP. After the system locked up, we could stop and examine the state of all the CPUs. The primary hang was in __run_timers(), always followed by a different CPU attempting to del_timer_sync() on that same timer. The rest of the CPUs weren't doing anything interesting. By the time the kernel falls into the __run_timers() hole, the damage has been done long before.
------------
Event posted 05-02-2003 03:21pm by ltroan with duration of 0.00
Sorry for the confusion here. Porkchop is an internal user. The discussion this morning was to get Jeff to push the image to a people page at redhat where you could access/download the image. Will pursue this.
--------------------
Event posted 05-02-2003 03:27pm by jneedle with duration of 0.00
Send Notifications
http://people.redhat.com/jneedle/.private/.hp/<filename>
kernel-2.4.9-e.18.3.i686.rpm
kernel-smp-2.4.9-e.18.3.i686.rpm
kernel-debug-2.4.9-e.18.3.i686.rpm
kernel-summit-2.4.9-e.18.3.i686.rpm
kernel-enterprise-2.4.9-e.18.3.i686.rpm
Larry, I e-mailed you the username and password for access to that directory.
------------------
Event posted 05-02-2003 04:11pm by ltroan with duration of 0.50
The -e18.3 kernel is for internal HP use ONLY !!! And not to be given to any customer including Bristol West FOR ANY REASON !!! You can download it as follows......
---------------------------------------------
Jeff has put the following files in people.redhat.com:/jneedle/.private/.hp/
kernel-2.4.9-e.18.3.i686.rpm
kernel-smp-2.4.9-e.18.3.i686.rpm
kernel-debug-2.4.9-e.18.3.i686.rpm
kernel-summit-2.4.9-e.18.3.i686.rpm
kernel-enterprise-2.4.9-e.18.3.i686.rpm
This is not a browseable directory. (edited) They will need to tack on the exact file name. The directory is ID and password protected. You can obtain these from me (Larry Troan). Do not append them to this Issue Tracker or to the HP tracking tool.
Status set to: Waiting on Client
--------------------
Event posted 05-02-2003 05:04pm by tom.rhodes with duration of 0.00
We have downloaded the -e18.3 kernel and will run it over the weekend.
--------------------------
Event posted 05-02-2003 05:43pm by tom.rhodes with duration of 0.00
newsysrqinfo
New sysrq output from -e18 kernel attached showing more CPU status:
CPU 0-3 and 6-7 are idle
CPU 4 is in __run_timers or send_sig_info (called from __run_timers)
CPU 5 was never displayed and presumably is stuck in del_timer_sync trying to delete the timer running on CPU 4
WRT comments from jbaron above, this doesn't look like a driver issue. In both traces we've captured so far, the timer in question was an itimer being used by the X server. In both traces the X server process is stuck in del_timer_sync trying to delete the timer that __run_timers is stuck on.
File uploaded: newsysrqinfo
----------
Event posted 05-02-2003 05:47pm by tom.rhodes with duration of 0.00
Status set to: Waiting on Tech
----------
Event posted 05-05-2003 06:57am by arjanv with duration of 0.00
Send Notifications
> For this reason, add_timer has no protection against another add_timer/del_timer/del_timer_sync/mod_timer running in parallel. By adding the same locking sequence that is used in mod_timer(), this problem is avoided.
Exactly. And it doesn't need to! If the code doesn't provide that protection itself it should use mod_timer, not add_timer.

Created attachment 91498 [details]
newsysrqinfo
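The distinction drawn in the comments above (an add-style call that relies on the caller to guarantee the timer is not already queued, versus a mod-style call that copes with a pending timer itself) can be sketched in userspace. This is only a toy model under stated assumptions: `toy_timer`, `toy_add_timer`, `toy_mod_timer`, and `toy_count` are hypothetical names, the list and mutex stand in for the kernel's timer list and its lock, and nothing here is taken from timer.c or from either attached patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace toy model only; all names are hypothetical, not kernel APIs. */
struct toy_timer {
    struct toy_timer *next, *prev;   /* both NULL means "not pending" */
    unsigned long expires;
};

/* Circular list head and the lock that protects the list. */
static struct toy_timer toy_head = { &toy_head, &toy_head, 0 };
static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;

static int toy_pending(const struct toy_timer *t)
{
    return t->next != NULL;
}

/* Caller must hold toy_lock. Links t right after the head. */
static void toy_insert(struct toy_timer *t)
{
    t->next = toy_head.next;
    t->prev = &toy_head;
    toy_head.next->prev = t;
    toy_head.next = t;
}

/* Caller must hold toy_lock. Returns 1 if t was pending and got unlinked. */
static int toy_detach(struct toy_timer *t)
{
    if (!toy_pending(t))
        return 0;
    t->prev->next = t->next;
    t->next->prev = t->prev;
    t->next = t->prev = NULL;
    return 1;
}

/* add-style: the caller promises the timer is not pending. */
void toy_add_timer(struct toy_timer *t, unsigned long expires)
{
    pthread_mutex_lock(&toy_lock);
    assert(!toy_pending(t));         /* contract check, model only */
    t->expires = expires;
    toy_insert(t);
    pthread_mutex_unlock(&toy_lock);
}

/* mod-style: detaches a pending timer first, then re-queues it. */
int toy_mod_timer(struct toy_timer *t, unsigned long expires)
{
    int was_pending;

    pthread_mutex_lock(&toy_lock);
    was_pending = toy_detach(t);
    t->expires = expires;
    toy_insert(t);
    pthread_mutex_unlock(&toy_lock);
    return was_pending;
}

/* Counts queued timers, giving up after `limit` steps. */
int toy_count(int limit)
{
    int n = 0;
    struct toy_timer *p;

    pthread_mutex_lock(&toy_lock);
    for (p = toy_head.next; p != &toy_head && n < limit; p = p->next)
        n++;
    pthread_mutex_unlock(&toy_lock);
    return n;
}
```

In this model, calling `toy_mod_timer` on a timer that is already queued leaves exactly one entry on the list, whereas `toy_add_timer` treats that situation as a caller error. Whether the reported hang comes from this kind of double queueing is not established in this report.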
Created attachment 91515 [details]
sysrqw_may02
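The traces reported so far show one CPU inside the timer expiry code while another CPU waits in del_timer_sync on the same itimer. As a rough illustration of why a synchronous delete has to account for a handler that is still running and may queue the timer again, the following single-threaded sketch steps through one such interleaving by hand. It is a toy state machine, not kernel code: `toy_itimer` and every function name are hypothetical, and the assumption that the handler re-queues the timer for its next period is part of the model rather than something taken from the traces.

```c
#include <assert.h>

/* Hypothetical toy state; not the kernel's struct timer_list. */
struct toy_itimer {
    int pending;   /* queued and waiting to expire */
    int running;   /* handler currently executing on the "other CPU" */
};

/* Expiry side: dequeue the timer and start running its handler. */
void toy_expire_begin(struct toy_itimer *t)
{
    assert(t->pending);
    t->pending = 0;
    t->running = 1;
}

/* Expiry side: the handler queues the timer again for the next period. */
void toy_handler_rearm(struct toy_itimer *t)
{
    assert(t->running && !t->pending);
    t->pending = 1;
}

/* Expiry side: the handler returns. */
void toy_expire_end(struct toy_itimer *t)
{
    t->running = 0;
}

/*
 * Deleting side: one pass of a synchronous delete.
 * Dequeues the timer if it is pending and counts that in *removed.
 * Returns 1 when the handler is not running (delete is complete),
 * or 0 when the caller has to retry after the handler finishes.
 */
int toy_del_sync_pass(struct toy_itimer *t, int *removed)
{
    if (t->pending) {
        t->pending = 0;
        (*removed)++;
    }
    return !t->running;
}
```

Stepping through expire-begin, a first delete pass, handler re-arm, expire-end, and a second delete pass shows the first pass reporting that a retry is needed and the second pass removing the re-queued timer. The sketch says nothing about where the real code goes wrong; it only makes the ordering problem concrete.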
FROM ISSUE TRACKER............
Event posted 05-05-2003 10:58am by tom.rhodes with duration of 0.00
Notes : Attached results from 18.3 kernel and alt-sysrq-w output. Disappointing results: only cpu 4 and 7 were reported, cpu4 was stuck in run_timers and cpu7 was idle. The alt-sysrq-t output shows that the X server process is stuck in del_timer_sync.
File uploaded: sysrqw_may02

COMMENT FROM FEATUREZILLA 90549
------- Additional Comment #1 From Arjan van de Ven on 2003-05-09 11:45 -------
> Action by: Bryan.Leopard
> Retesting with rhas21 errata QU2 beta, which has the 2.4.9-e.17 kernel. Using latest broadcom 6.0.2a, but am getting errors 'do_IRQ: stack overflow' logged in dmesg. Will restart with bcm5700 6.0.2b
we don't care. that driver is borken. Please try to reproduce this without adding *ANY* drivers we don't ship.

FROM ISSUE TRACKER ....
Event posted 05-15-2003 12:10pm by brian.b with duration of 0.00
This also occurs with the tg3 driver as noted in the following events:
Event posted 02-14-2003 09:40am by Bryan.Leopard
Event posted 04-16-2003 02:35pm by brian.b
Event posted 04-22-2003 04:03pm by brian.b
Status set to: Waiting on Tech

COMMENT FROM FEATUREZILLA 90549....
------- Additional Comment #7 From Larry Troan on 2003-07-11 09:05 -------
While agreeing that this is likely a bug, we are waiting for Laurie at HP to find out if this is a customer bug other than Bristol West. It was originally reported as a Bristol West Sev 1 but I've been working with BW and HP, and gave the customer QU2 which has fixed all but an HP PSP (system monitor) problem that apparently causes a kernel panic. HP L3 is investigating the PSP problem.
From BW/HP/RH teleconference minutes: RHES 2.1 now a supported operating system. PSP v 6.40 is available for download from the web. URL sent in email during conference call. See Issue Tracker 21355 for complete Bristol West details.

FROM ISSUE TRACKER....
Event posted 07-15-2003 07:43am by ltroan with duration of 0.20
FROM BRIAN BAKER..........
Hi guys, I have confirmed with Laurie that there is not a known customer with the add_timer issue. Thanks, Brian.

Brian, this is being carried as a sev 1 in Issue Tracker. Suggest we lower its severity to a sev 2. I've dropped the priority in Bugzilla from HIGH to NORMAL and would like to drop the severity as well.

FROM ISSUE TRACKER...
Event posted 07-31-2003 09:17am by brian.b with duration of 0.00
A little more info from our engineers: This add_timer issue has shown up on the kernel mailing list:
http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&selm=eUvE.1U2.19%40gated-at.bofh.it
Andrea says below what we have been saying all along: "it's del_timer_sync against add_timer". See my note on 4/29/03. Specifically del_timer_sync (due to setitimer from the X server) running in parallel with an add_timer.

> On Wed, 30 Jul 2003, Andrea Arcangeli wrote:
>
> > The thing triggered simply by running setitimer in one function, while
> > the it_real_fn was running in the other cpu. I don't see how 2.6 can
> > have fixed this, it_real_fn can still trivially call add_timer while
> > you run inside do_setitimer in 2.6 too. [...]
>
> This is not a race that can happen. itimer does this:
>
> del_timer_sync();
> add_timer();
>
> how can the add_timer() still happen while it_real_fn is still running on
> another CPU?

it's not add_timer against add_timer in this case, it's del_timer_sync against add_timer.

cpu0              cpu1
------------      --------------------
do_setitimer
                  it_real_fn
del_timer_sync    add_timer -> crash

Andrea
-------------------------------------------------------------------
Event posted 07-31-2003 10:51am by jneedle with duration of 0.00
Yes, there are active discussions on this in LKML. Ingo and Andrea continue to try and hash out the very complex interactions of the various timer routines and a patch has been created for 2.4-based kernels that we are testing.
The entire thread can be found here: http://marc.theaimsgroup.com/?l=linux-kernel&m=105949344419681&w=2
Hopefully this will be resolved shortly. It is a showstopper for QU3.

I've committed Ingo's solution... performance testing is needed.

FROM ISSUE TRACKER
Event posted 09-01-2003 08:56pm by brian.b with duration of 0.00
Can we get an early look at the solution?

Test Case..... run against Jason's people page code of 9/05 with failure in 1-2 hours...
JASON, IS QA CODE BEYOND THIS FAILURE ???
--------------------
The system locks up on SMP boxes when setitimer() is called with an invalid argument. This has been reproduced on several SMP systems. It usually fails in 1 - 6 hours but may take as long as 12 hours. It can be reproduced as follows.
1. gcc test.c
2. while : ; do ./a.out ; done
3. netstat -c (run on another console)
The system locks up after 1 - 6 hours.
--------------- test.c ------------------
#include <stdio.h>
#include <sys/time.h>
#include <signal.h>
#include <unistd.h>

#define LOOP 100000000
#define DELAY_SEC -1
#define DELAY_USEC -1

volatile int count = 0;
struct timeval tv[LOOP];
struct timezone tz;

void sig_action()
{
    gettimeofday(tv + count, &tz);
    count ++;
}

int main()
{
    struct itimerval value, ovalue;
    int i;

    for(i=273; i<LOOP; i++){
        value.it_value.tv_usec = DELAY_USEC;
        value.it_value.tv_sec = DELAY_SEC;
        value.it_interval.tv_usec = DELAY_USEC;
        value.it_interval.tv_sec = DELAY_SEC;
        setitimer(ITIMER_REAL, &value, &ovalue);
        printf("%3d : \n", i);
    }
    return 0;
}
--------------- test.c ------------------
Unfortunately, the latest timer fixes do not address this test case yet. When we have a fix that accommodates the above test case, HP would like to test it in parallel with us. Can we get them a kernel when it's available?

Agreed, state changing to assigned, and yes we will make the kernel available as soon as we track this down.

*** Bug 104297 has been marked as a duplicate of this bug. ***

90549 technically is a dup of this.
90549 is a featurezilla and this one is a bugzilla, and I believe 90549 was put in because of the procedures that TAMs had in place back when QU2 was being created. 90549 seems superfluous at this point.

HP requests early code on this if possible so they can begin testing our latest fix. This is their #1 RHEL2.1 MUSTFIX bug.

FROM ISSUE TRACKER
Event posted 11-20-2003 05:40pm by brian.b with duration of 0.00
Re-tested with "2.4.9-e.27.28.test" kernel. (Timer tests). System lockup observed within 5 min after starting the tests. Lock-up reproduced three times with this kernel.

There were numerous timer.c related fixes incorporated into U3, which address the majority of related issues. This particular bugzilla entry refers to a corner case which was not addressed. Substantial effort was expended to address this case, but it was concluded that a non-intrusive resolution is not viable within the bounds of compatibility constraints. We do have some good ideas to address the issue, but that would require us to break kernel compatibility. The outcome of our analysis is that this particular bug represents an unlikely corner case which does not justify breaking kernel compatibility.

HP-ProLiant is requesting this be nominated to the Update4 MUSTFIX list. However, RH Engineering feels the initial bug is fixed in U3 and that the timer.c program is a non-realistic edge case. RH has asked that HP recreate the original problem documented in Issue Tracker 15751. I am therefore refraining from nominating this issue to the U4 list until it is recreated via the original test scenario.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2003-408.html
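The test.c reproducer in this report deliberately passes -1 for both tv_sec and tv_usec. POSIX specifies that setitimer() should fail with EINVAL when tv_usec is outside 0 to 999999, but this report does not say how the affected kernels treated such values. For applications that want to avoid depending on kernel-side validation, a userspace guard along the following lines is one option. This is a minimal sketch: `timeval_is_valid` and `checked_setitimer` are hypothetical helper names, not part of any library, and rejecting a negative tv_sec is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: returns 1 if tv holds non-negative seconds and
 * a microsecond count in the range 0..999999, otherwise 0. */
static int timeval_is_valid(const struct timeval *tv)
{
    assert(tv != NULL);
    return tv->tv_sec >= 0 && tv->tv_usec >= 0 && tv->tv_usec < 1000000;
}

/* Hypothetical wrapper: rejects out-of-range values with EINVAL before
 * they reach the kernel, otherwise forwards to setitimer(). */
int checked_setitimer(int which, const struct itimerval *value,
                      struct itimerval *ovalue)
{
    if (value == NULL ||
        !timeval_is_valid(&value->it_value) ||
        !timeval_is_valid(&value->it_interval)) {
        errno = EINVAL;
        return -1;
    }
    return setitimer(which, value, ovalue);
}
```

With this wrapper, the values used by test.c would be rejected in userspace and never reach do_setitimer. It is a defensive measure for callers only and does not address the underlying timer race.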