Red Hat Bugzilla – Bug 123774
System keeps lockup after running certain days
Last modified: 2007-11-30 17:07:01 EST
Description of problem:
Running a Dell PE2650 with dual 2.4 GHz CPUs, 4 GB of RAM, and a PERCRAID Mirror
1. Initially the system ran 2.4.21-4.ELsmp #1 SMP. After about 10
days the system locked up. Only powering off the machine would get it back
2. Updated the system to 2.4.21-9.0.1.ELsmp #1 SMP. The problem still
3. Updated the system to 2.4.21-9.0.3.ELsmp #1 SMP two weeks ago, and
the system locked up again this morning after running for 15 days.
Version-Release number of selected component (if applicable):
2.4.21-9.0.3.ELsmp #1 SMP Tue Apr 20 19:49:13 EDT 2004 i686 i686 i386
Did you get any error messages, or was this a completely silent lockup?
If there were no error messages, does the U2 kernel lock up in the
same way too, or was this bug fixed in the latest version?
What workload are you running that triggers this lockup?
1. This is a completely silent lockup without any error messages.
2. All of the lockups show the same behavior across the different
3. Two database instances of Oracle 18.104.22.168 are running on the
server, but most of the time the load average is 1 and the CPUs are 95-99% idle.
4. The following is a snapshot taken last Wednesday (5/18):
12:14:27 up 13 days, 5:38, 1 user, load average: 1.02, 1.01, 1.00
125 processes: 124 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system     irq  softirq  iowait
           total    0.0%    0.0%    0.2%    0.0%     0.0%    0.0%
           cpu00    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%
           cpu01    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%
           cpu02    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%
           cpu03    0.0%    0.0%    0.9%    0.0%     0.0%    0.0%
Mem: 3869576k av, 3847816k used, 21760k free, 0k shrd,
2936440k actv, 725944k in_d, 18432k in_c
Swap: 3269216k av, 352020k used, 2917196k free
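Because the lockup is silent, a periodic on-disk log of snapshots like the one above may help show what the system looked like shortly before a hang. The following is a minimal sketch in Python, not something used in this report: the function names `parse_load_average` and `append_sample` are hypothetical, and it assumes the input line follows the `uptime`/`top` header format shown in the snapshot.

```python
import os
import re
from typing import Optional, Tuple

# Matches the "load average: 1.02, 1.01, 1.00" tail of an uptime/top header line.
_LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_average(line: str) -> Optional[Tuple[float, float, float]]:
    """Return the 1, 5 and 15 minute load averages, or None if not found."""
    match = _LOAD_RE.search(line)
    if match is None:
        return None
    one, five, fifteen = (float(value) for value in match.groups())
    return (one, five, fifteen)


def append_sample(path: str, line: str) -> None:
    """Append one snapshot line and force it to disk.

    The fsync matters here: after a hard lockup, data that only reached
    the page cache would be lost on power-off.
    """
    with open(path, "a") as log:
        log.write(line.rstrip("\n") + "\n")
        log.flush()
        os.fsync(log.fileno())
```

One way to use it would be to call `append_sample` from a cron job every minute with the output of `uptime`, then inspect the last entries after the next forced power cycle.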
Gary, could you please try enabling the NMI watchdog (boot argument
This way the kernel will print out a backtrace of what the CPUs are
doing, if the system gets stuck with interrupts blocked.
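After the reboot, it may be worth confirming that the watchdog is actually active. The sketch below is a hedged illustration in Python: it assumes the watchdog is controlled by the kernel's `nmi_watchdog=` boot parameter (the exact argument is not shown in this thread) and that `/proc/interrupts` contains an `NMI:` row with one counter per CPU. The function names are hypothetical.

```python
from typing import List, Optional


def nmi_watchdog_setting(cmdline: str) -> Optional[str]:
    """Return the value of nmi_watchdog= from a kernel command line, if present."""
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            return token.split("=", 1)[1]
    return None


def nmi_counts(interrupts_text: str) -> List[int]:
    """Return the per-CPU NMI counters from /proc/interrupts-style text.

    Counters that keep increasing between two reads suggest the
    watchdog is ticking; an empty list means no NMI row was found.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # trailing description text, not a counter
                counts.append(int(field))
            return counts
    return []
```

On a live system the inputs would come from reading `/proc/cmdline` and `/proc/interrupts`; taking two readings of the NMI counters a few seconds apart shows whether they advance.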
Also, could you please try running the most recent RHEL3 kernel, to
see if the bugfixes in that kernel make the hang go away?
I'm sorry, but without any error messages it'll be hard to track down
what's going on...
Tom, do you happen to know of any issues with the PERCRAID driver in
Since this is a production server, I guess we have to wait for a
maintenance window to enable the NMI watchdog and reboot the server.
By the way, which version is the most recent RHEL 3 kernel release?
We applied kernel 2.4.21-9.0.3.EL just two weeks ago. Is
there any newer kernel available?
RHEL3 Update 3 was released last week. I believe this is kernel
2.4.21-15.EL (or maybe 2.4.21-14.EL).
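Whether one kernel release is newer than another can be checked roughly by comparing the numeric fields of the release strings. This is only a naive sketch in Python, not RPM's actual version comparison; the helper name `release_key` is hypothetical, and it assumes the releases differ only in their numeric components.

```python
import re
from typing import Tuple


def release_key(release: str) -> Tuple[int, ...]:
    """Turn a release string such as '2.4.21-9.0.3.ELsmp' into (2, 4, 21, 9, 0, 3).

    Non-numeric parts such as 'EL' or 'smp' are ignored, so this is
    only meaningful within one product line.
    """
    return tuple(int(number) for number in re.findall(r"\d+", release))


def is_newer(candidate: str, installed: str) -> bool:
    """Return True if candidate sorts after installed by numeric fields."""
    return release_key(candidate) > release_key(installed)
```

For the kernels mentioned in this thread, `is_newer("2.4.21-15.EL", "2.4.21-9.0.3.EL")` evaluates to `True`, because the tuples first differ at 15 versus 9.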
Does this PERC controller use megaraid or aacraid? If megaraid, are
you using the default megaraid driver, or megaraid2?
The megaraid2 driver was updated with some bug fixes in U2. I don't
know of a problem that matches this description, but I can check
around if you let me know which driver you are using.
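One way to answer the driver question is to search the boot messages for the driver names mentioned above. The following Python sketch is illustrative only: the function name `raid_driver` is hypothetical, and it assumes each driver prints its own name somewhere in the `dmesg` output, which may not hold for every driver version.

```python
from typing import Optional


def raid_driver(dmesg_text: str) -> Optional[str]:
    """Guess which RAID driver is loaded from dmesg-style text.

    'megaraid2' is checked before 'megaraid' because the shorter name
    is a substring of the longer one.
    """
    text = dmesg_text.lower()
    for name in ("aacraid", "megaraid2", "megaraid"):
        if name in text:
            return name
    return None
```

The input would typically be the captured output of `dmesg` or the contents of a saved boot log.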
1. Here is the info regarding the controller:
Red Hat/Adaptec aacraid driver (1.1.2 Apr 22 2004 00:25:36)
AAC0: kernel 2.8.4 build 6082
AAC0: monitor 2.8.4 build 6082
AAC0: bios 2.8.0 build 6082
AAC0: serial d63481d3fafaf001
scsi0 : percraid
blk: queue c62f1218, I/O limit 4095Mb (mask 0xffffffff)
Vendor: DELL Model: PERCRAID Mirror Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
blk: queue c62f1018, I/O limit 4095Mb (mask 0xffffffff)
Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 286716672 512-byte hdwr sectors (146799 MB)
2. We have set up a spare server with a configuration identical to
the one having problems and loaded the latest Red Hat update, 2.4.21-15.EL,
to simulate the workload and see if we could reproduce the
1. The spare server has been running for more than 12 days without
2. The troubled production server has been updated to 2.4.21-15.EL
and has been up and running smoothly for 6 days.
3. Other noticeable changes with the new update: the load
average has dropped from 1.00 to 0.00, and the CPU time for
kswapd has dropped from 10:19 to 0:48.
I believe that the latest kernel, 2.4.21-15.EL, has fixed the lockup
issues. We have not experienced any lockup problems since applying the
Thanks for the update, Gary.
Tom, do you believe that the megaraid2 update in RHEL3 U2 is
what fixed this problem? If so, I'll close this bug with the
appropriate errata references.
The updated kernel fixed the problem, though the megaraid2 update was
apparently not involved, since this is an aacraid system. Closing the BZ.
This was fixed in RHEL3 U2 (advisory RHSA-2004:188), but obviously
one should upgrade to RHEL3 U4 (advisory RHBA-2004:550), which was
just released yesterday.