From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
Running a Dell PE2650 with dual 2.4 CPUs, 4 GB of memory, and PERCRAID Mirror Rev: V1.0.

1. Initially the system ran 2.4.21-4.ELsmp #1 SMP. After about 10 days the system locked up. Powering the machine off was the only way to get it running again.
2. Updated the system to 2.4.21-9.0.1.ELsmp #1 SMP. The problem still exists.
3. Updated the system to 2.4.21-9.0.3.ELsmp #1 SMP two weeks ago, and the system locked up again this morning after running for 15 days.

Version-Release number of selected component (if applicable):
2.4.21-9.0.3.ELsmp #1 SMP Tue Apr 20 19:49:13 EDT 2004 i686 i686 i386 GNU/Linux

How reproducible:
Didn't try

Additional info:
Did you get any error messages, or was this a completely silent lockup? If there were no error messages, does the U2 kernel lock up in the same way too, or was this bug fixed in the latest version? What workload are you running that triggers this lockup?
1. This is a completely silent lockup without any error messages.
2. The lockups behave the same way on all of the different kernels.
3. Two database instances of Oracle 9.2.0.4 are running on the server, but most of the time the load average is 1 and the CPUs are 95-99% idle.
4. The following is a snapshot taken last Wednesday (5/18):

   12:14:27  up 13 days,  5:38,  1 user,  load average: 1.02, 1.01, 1.00
   125 processes: 124 sleeping, 1 running, 0 zombie, 0 stopped
   CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
              total    0.0%    0.0%    0.2%    0.0%     0.0%    0.0%   99.7%
              cpu00    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%  100.0%
              cpu01    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%  100.0%
              cpu02    0.0%    0.0%    0.0%    0.0%     0.0%    0.0%  100.0%
              cpu03    0.0%    0.0%    0.9%    0.0%     0.0%    0.0%   99.0%
   Mem:  3869576k av, 3847816k used,   21760k free,       0k shrd,  167004k buff
         2936440k actv,  725944k in_d,   18432k in_c
   Swap: 3269216k av,  352020k used, 2917196k free                 3464972k cached
Gary, could you please try enabling the NMI watchdog (boot argument nmi_watchdog=1)? This way the kernel will print out a backtrace of what the CPUs are doing, if the system gets stuck with interrupts blocked.

Also, could you please try running the most recent RHEL3 kernel, to see if the bugfixes in that kernel make the hang go away? I'm sorry, but without any error messages it'll be hard to track down what's going on...

Tom, do you happen to know of any issues with the PERCRAID driver in RHEL3 GA/U1?
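Once the server has been rebooted with the new boot argument, it may be worth confirming that the running kernel actually received it. The following is only a minimal sketch, not an official tool: the helper name nmi_watchdog_setting is hypothetical, and it assumes the boot arguments are readable from /proc/cmdline, as on a typical Linux system.

```python
from pathlib import Path


def nmi_watchdog_setting(cmdline):
    """Return the value given to nmi_watchdog= on a kernel command line.

    Returns None if the argument is absent. If the argument is repeated,
    the last occurrence is reported (a choice made for this sketch).
    """
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            value = token.split("=", 1)[1]
    return value


if __name__ == "__main__":
    # Assumption: /proc/cmdline holds the boot arguments of the running kernel.
    path = Path("/proc/cmdline")
    if path.exists():
        setting = nmi_watchdog_setting(path.read_text())
        if setting is None:
            print("nmi_watchdog was not passed on the kernel command line")
        else:
            print("nmi_watchdog=" + setting)
```

If the argument shows up as expected, a rising NMI count in /proc/interrupts is a further hint that the watchdog is ticking.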
Since this is a production server, I guess we have to wait for a maintenance window to enable the NMI watchdog and reboot the server. By the way, which version is the most recent RHEL 3 kernel release, because we just applied the kernel 2.4.21-9.0.3.EL two weeks ago. Is there any newer kernel available?
RHEL3 Update 2 was released last week. I believe this is kernel 2.4.21-15.EL (or maybe 2.4.21-14.EL).
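For anyone comparing the kernel releases mentioned in this report by hand, a rough ordering can be sketched as below. This is a simplified illustration that compares only the numeric fields of the release string; it is not the comparison algorithm rpm itself uses, and the helper name release_key is hypothetical.

```python
import re


def release_key(release):
    """Map a release such as '2.4.21-9.0.3.ELsmp' to a tuple of integers.

    Only the numeric fields are kept, so suffixes like 'EL' or 'ELsmp'
    are ignored. This is a simplification, not rpm's version comparison.
    """
    return tuple(int(field) for field in re.findall(r"\d+", release))


# Kernel releases mentioned in this bug report.
releases = [
    "2.4.21-9.0.1.ELsmp",
    "2.4.21-15.EL",
    "2.4.21-4.ELsmp",
    "2.4.21-9.0.3.ELsmp",
]

ordered = sorted(releases, key=release_key)
newest = ordered[-1]
print("oldest to newest:", ordered)
print("newest:", newest)
```

The numeric comparison matters because a plain string sort would place 2.4.21-15.EL before 2.4.21-4.ELsmp.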
Does this PERC controller use megaraid or aacraid? If megaraid, are you using the default megaraid driver, or megaraid2? The megaraid2 driver was updated with some bug fixes in U2. I don't know of a problem that matches this description, but I can check around if you let me know which driver you are using.
1. Here is the info regarding the controller:

   Red Hat/Adaptec aacraid driver (1.1.2 Apr 22 2004 00:25:36)
   AAC0: kernel 2.8.4 build 6082
   AAC0: monitor 2.8.4 build 6082
   AAC0: bios 2.8.0 build 6082
   AAC0: serial d63481d3fafaf001
   scsi0 : percraid
   blk: queue c62f1218, I/O limit 4095Mb (mask 0xffffffff)
     Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
     Type:   Direct-Access                      ANSI SCSI revision: 02
   blk: queue c62f1018, I/O limit 4095Mb (mask 0xffffffff)
   Attached scsi removable disk sda at scsi0, channel 0, id 0, lun 0
   SCSI device sda: 286716672 512-byte hdwr sectors (146799 MB)

2. We have set up a spare server with a configuration identical to the one having problems and loaded the latest Red Hat update, 2.4.21-15.EL, to simulate the workload and see if we can reproduce the lockup problem.
1. The spare server has been running for more than 12 days without problems.
2. The troubled production server has been updated to 2.4.21-15.EL and has been running smoothly for 6 days.
3. Other noticeable changes with the new update: the load average has dropped from 1.00 to 0.00, and the CPU time for kswapd is down from 10:19 to 0:48.
I believe that the latest kernel, 2.4.21-15.EL, has fixed the lockup issue. We have not experienced any lockups since applying the update.
Thanks for the update, Gary. Tom, do you believe that the megaraid2 update in RHEL3 U2 is what fixed this problem? If so, I'll close this bug with the appropriate errata references.
The updated kernel fixed the problem, though the megaraid2 update was apparently not involved, since this is an aacraid system. Closing the BZ.
This was fixed in RHEL3 U2 (advisory RHSA-2004:188), but obviously one should upgrade to RHEL3 U4 (advisory RHBA-2004:550), which was just released yesterday.