Bug 123774 - System keeps lockup after running certain days
Summary: System keeps lockup after running certain days
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
medium
high
Target Milestone: ---
Assignee: Tom Coughlan
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-05-20 14:52 UTC by Gary Feng
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-12-21 22:34:15 UTC
Target Upstream Version:
Embargoed:


Description Gary Feng 2004-05-20 14:52:11 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
Running a Dell PE2650 with dual 2.4 GHz CPUs, 4 GB of RAM, and a PERCRAID Mirror (Rev: V1.0).

1. Initially the system ran 2.4.21-4.ELsmp #1 SMP. After about 10 days the system locked up. Powering the machine off was the only way to get it running again.
2. Updated the system to 2.4.21-9.0.1.ELsmp #1 SMP. The problem still exists.
3. Updated the system to 2.4.21-9.0.3.ELsmp #1 SMP two weeks ago, and the system locked up again this morning after running for 15 days.

Version-Release number of selected component (if applicable):
2.4.21-9.0.3.ELsmp #1 SMP Tue Apr 20 19:49:13 EDT 2004 i686 i686 i386 GNU/Linux

How reproducible:
Didn't try


Additional info:

Comment 1 Rik van Riel 2004-05-20 18:58:19 UTC
Did you get any error messages, or was this a completely silent lockup?

If there were no error messages, does the U2 kernel lock up in the
same way too, or was this bug fixed in the latest version?

What workload are you running that triggers this lockup?

Comment 2 Gary Feng 2004-05-20 19:34:42 UTC
1. This is a completely silent lockup without any error messages.
2. The lockups behave the same way across the different kernels.
3. Two Oracle 9.2.0.4 database instances are running on the server, but most of the time the load average is 1 and the CPU is 95-99% idle.
4. The following is a snapshot taken last Wednesday (5/18):
 12:14:27  up 13 days,  5:38,  1 user,  load average: 1.02, 1.01, 1.00
125 processes: 124 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.0%    0.0%    0.2%   0.0%     0.0%    0.0%   99.7%
           cpu00    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu01    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu02    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu03    0.0%    0.0%    0.9%   0.0%     0.0%    0.0%   99.0%
Mem:  3869576k av, 3847816k used,   21760k free,       0k shrd,  167004k buff
                   2936440k actv,  725944k in_d,   18432k in_c
Swap: 3269216k av,  352020k used, 2917196k free                 3464972k cached


Comment 3 Rik van Riel 2004-05-20 19:45:07 UTC
Gary, could you please try enabling the NMI watchdog (boot argument
nmi_watchdog=1) ?

This way the kernel will print out a backtrace of what the CPUs are
doing, if the system gets stuck with interrupts blocked.

Also, could you please try running the most recent RHEL3 kernel, to
see if the bugfixes in that kernel make the hang go away?

I'm sorry, but without any error messages it'll be hard to track down
what's going on...

Tom, do you happen to know of any issues with the PERCRAID driver in
RHEL3 GA/U1?
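
After rebooting with nmi_watchdog=1, it is worth confirming that the option took effect. The following is a minimal Python sketch, not an official tool: it assumes the boot arguments can be read from /proc/cmdline and that an active watchdog shows up as a growing "NMI:" row in /proc/interrupts. The function names are hypothetical, chosen only for illustration.

```python
import re

def nmi_watchdog_setting(cmdline):
    """Return the nmi_watchdog value from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)nmi_watchdog=(\S+)", cmdline)
    return match.group(1) if match else None

def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts text.

    Returns an empty list if there is no "NMI:" row.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only the numeric columns; newer kernels append a
            # textual description after the counts.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []
```

On a live system the inputs would be the contents of /proc/cmdline and /proc/interrupts. Reading the NMI counts twice, a few seconds apart, and seeing them increase suggests the watchdog is running.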

Comment 4 Gary Feng 2004-05-20 20:18:12 UTC
Since this is a production server, I guess we have to wait for a 
maintenance window to enable the NMI watchdog and reboot the server.

By the way, which version is the most recent RHEL 3 kernel release? We
applied kernel 2.4.21-9.0.3.EL just two weeks ago. Is there a newer
kernel available?

Comment 5 Rik van Riel 2004-05-20 20:50:38 UTC
RHEL3 Update 2 was released last week. I believe this is kernel
2.4.21-15.EL (or maybe 2.4.21-14.EL).
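
To check whether a given kernel release is newer than the one installed, the release strings can be compared field by field. This is a simplified Python sketch that assumes comparing the integer fields in order is good enough for the releases named in this report. It is not the comparison algorithm that rpm itself uses, and release_key is a hypothetical helper.

```python
import re

def release_key(kernel_release):
    """Split a release string into a tuple of its integer fields."""
    return tuple(int(n) for n in re.findall(r"\d+", kernel_release))

installed = "2.4.21-9.0.3.EL"   # kernel currently on the affected server
candidate = "2.4.21-15.EL"      # kernel mentioned in the comment above

# Tuples compare element by element, so 15 is weighed against 9.
print(release_key(candidate) > release_key(installed))  # True
```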

Comment 6 Tom Coughlan 2004-05-20 21:13:56 UTC
Does this PERC controller use megaraid or aacraid? If megaraid, are
you using the default megaraid driver, or megaraid2?

The megaraid2 driver was updated with some bug fixes in U2. I don't
know of a problem that matches this description, but I can check
around if you let me know which driver you are using.  



Comment 7 Gary Feng 2004-05-27 20:19:11 UTC
1. Here is the info regarding the controller:
Red Hat/Adaptec aacraid driver (1.1.2 Apr 22 2004 00:25:36)
AAC0: kernel 2.8.4 build 6082
AAC0: monitor 2.8.4 build 6082
AAC0: bios 2.8.0 build 6082
AAC0: serial d63481d3fafaf001
scsi0 : percraid
blk: queue c62f1218, I/O limit 4095Mb (mask 0xffffffff)
  Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
blk: queue c62f1018, I/O limit 4095Mb (mask 0xffffffff)
Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 286716672 512-byte hdwr sectors (146799 MB)

2. We have set up a spare server with a configuration identical to the one having problems and loaded the latest Red Hat update, 2.4.21-15.EL, to simulate the workload and see if we can reproduce the lockup problem.
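
The boot messages in item 1 answer the megaraid-versus-aacraid question: the driver banner names aacraid. A minimal Python sketch of that check follows, assuming the driver name appears somewhere in the boot log text. The raid_driver function is hypothetical, and it does not distinguish megaraid from megaraid2.

```python
def raid_driver(boot_log_text):
    """Guess the RAID driver family from boot-log text."""
    text = boot_log_text.lower()
    if "aacraid" in text:
        return "aacraid"
    if "megaraid" in text:
        # Matches both megaraid and megaraid2.
        return "megaraid"
    return "unknown"

# Lines copied from the boot messages quoted in item 1.
boot_log = """Red Hat/Adaptec aacraid driver (1.1.2 Apr 22 2004 00:25:36)
scsi0 : percraid
  Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0"""

print(raid_driver(boot_log))  # aacraid
```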

Comment 8 Gary Feng 2004-06-09 13:41:33 UTC
1. The spare server has been running for more than 12 days without problems.
2. The troubled production server has been updated to 2.4.21-15.EL and has been running smoothly for 6 days.
3. Other noticeable changes with the new update: the load average has dropped from 1.00 to 0.00, and the CPU time for kswapd has dropped from 10:19 to 0:48.



Comment 9 Gary Feng 2004-07-07 21:54:00 UTC
I believe that the latest kernel, 2.4.21-15.EL, has fixed the lockup
issue. We have not experienced any lockups since applying the
update.

Comment 10 Ernie Petrides 2004-07-07 23:28:20 UTC
Thanks for the update, Gary.

Tom, do you believe that the megaraid2 update in RHEL3 U2 is
what fixed this problem?  If so, I'll close this bug with the
appropriate errata references.


Comment 11 Tom Coughlan 2004-12-21 22:34:15 UTC
The updated kernel fixed the problem, though the megaraid2 update was
apparently not involved, since this is an aacraid system.  Closing the BZ.

Comment 12 Ernie Petrides 2004-12-21 23:25:28 UTC
This was fixed in RHEL3 U2 (advisory RHSA-2004:188), but obviously
one should upgrade to RHEL3 U4 (advisory RHBA-2004:550), which was
just released yesterday.


