From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
I have two identical Compaq Proliant 5500s, each with 2 PIII Xeon 500MHz processors. Both machines are exhibiting identical behaviour, which leads me to believe that it is not a hardware problem. I have installed RedHat 7.3 and all errata, including kernel-2.4.18-10smp. When I was using kernel 2.4.18-3smp, the machines would unexpectedly reboot every few minutes at random. Now that I have upgraded to kernel 2.4.18-10smp it happens only a couple of times a day. It seems to happen when there is a lot of disk access, e.g. when trying to compile a kernel or copy a few hundred files. If I convert the filesystems to ext2, the problem goes away. Or, if I boot into the uniprocessor kernel, the problem goes away. I understand that this may be very difficult for you to reproduce on different hardware. I am currently trying out a vanilla kernel 2.4.19 to see if that cures the problem.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Install 7.3 on twin processor Proliant 5500, use ext3 filesystem
2. Boot into smp kernel
3. Start a large compile or copy operation.

Actual Results:
System suddenly reboots without warning or crash dump.

Additional info:
UPDATE: A newly compiled 2.4.19 kernel from www.kernel.org does NOT fix the problem.
This sounds like a hardware problem offhand, but it might be due to driver interactions. It is quite possible that ext3 is simply causing a higher, or different, load on the hardware due to the different disk access patterns caused by journal accesses. It could be something as basic as the power supply not being up to the job of powering both CPUs at full tilt plus a heavily loaded disk drive; kernel compiles do tend to stress both of those components simultaneously. If you set up a serial console, can you capture any diagnostics from the kernel?
Also, can you give the lsmod output? (Just so we can see what hardware and drivers are in use.)
[root@Beavis root]# lsmod
Module                  Size  Used by    Not tainted
binfmt_misc             7684   1
autofs                 12612   0  (autoclean) (unused)
3c59x                  29288   0  (unused)
eepro100               21040   1
ext3                   70944   2
jbd                    53792   2  [ext3]
sym53c8xx              63204   0  (unused)
cpqarray               22624   3
sd_mod                 12992   0  (unused)
scsi_mod              113284   2  [sym53c8xx sd_mod]

I'll set up a serial console and let you know if I get any output.
I set up a serial console as suggested and tested that it worked. (All the bootup messages got displayed on the terminal as expected.) Then I reproduced the problem. I can reproduce it reliably now: I cd to "/usr/src/linux-2.4" and type "make dep bzImage modules". The machine will then reboot in a matter of seconds. No additional output appeared on the serial console when the system went down. I don't think it's a power problem either. These machines each have two large power supplies and can support up to 4 processors and 10 hard disks. We are using only two processors and two disks in each machine.
Ah... 2 PIII Xeon 500MHz. I had a problem a while back with a 4-way PIII Xeon 500MHz SMP machine which rebooted spontaneously under certain loads with recent kernels, and after a long period of debugging it turned out to be a known CPU microcode fault fixed with Intel's current microcode errata. The kernel-utils package in Red Hat 7.3 has errata updates for the microcode which can be applied at runtime. Try installing kernel-utils and running

    /sbin/service microcode_ctl start

and see if you can still reproduce the problem. If that fixes things, you can do a

    /sbin/chkconfig microcode_ctl on

to get the system to update the microcode automatically on each boot. My own test box has been rock-solid since doing this.
I was really hoping that microcode thing was the answer, but no. I can still reproduce the problem. It looks like my CPUs are already at the latest version.

Sep 2 16:10:43 Beavis kernel: microcode: CPU0 already at revision 56 (current=56)
Sep 2 16:10:43 Beavis kernel: microcode: CPU1 already at revision 56 (current=56)
Sep 2 16:10:43 Beavis kernel: microcode: freed 4096 bytes
Sep 2 16:10:28 Beavis microcode_ctl: microcode_ctl startup succeeded

I have noticed the following message at boot time on both machines. Could it be related?

Sep 2 16:16:04 Beavis kernel: mtrr: your CPUs had inconsistent fixed MTRR settings
Sep 2 16:16:04 Beavis kernel: mtrr: probably your BIOS does not setup all CPUs
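For anyone checking the same thing on more than a couple of CPUs, here is a minimal sketch that pulls the two revision numbers out of each "already at revision" line and checks that they agree. The function names are made up for illustration, the pattern only covers the message form quoted in this report, and the sketch does not try to interpret what each number means beyond comparing them.

```python
import re

# Log lines copied from the microcode output quoted in this report.
LOG_LINES = [
    "Sep 2 16:10:43 Beavis kernel: microcode: CPU0 already at revision 56 (current=56)",
    "Sep 2 16:10:43 Beavis kernel: microcode: CPU1 already at revision 56 (current=56)",
    "Sep 2 16:10:43 Beavis kernel: microcode: freed 4096 bytes",
]

# Assumption: only the "already at revision N (current=M)" form is handled;
# any other message the microcode driver may print is ignored.
PATTERN = re.compile(
    r"microcode: CPU(\d+) already at revision (\d+) \(current=(\d+)\)"
)


def microcode_revisions(lines):
    """Return {cpu_number: (reported_revision, current_revision)}."""
    revisions = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            cpu, reported, current = (int(g) for g in match.groups())
            revisions[cpu] = (reported, current)
    return revisions


def cpus_consistent(revisions):
    """True if at least one CPU was found, both numbers agree on every CPU,
    and all CPUs report the same current revision."""
    if not revisions:
        return False
    currents = {current for _, current in revisions.values()}
    pairs_agree = all(rep == cur for rep, cur in revisions.values())
    return pairs_agree and len(currents) == 1


if __name__ == "__main__":
    revs = microcode_revisions(LOG_LINES)
    print(revs)
    print(cpus_consistent(revs))
```

In practice the lines would be read from the system log rather than a hard-coded list.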
Well, if CPU 1 isn't getting properly set up by the BIOS then that would certainly cause unpredictable behaviour. You could try booting with maxcpus=1 to see if that helps; if so, you probably want to look for a BIOS update from your vendor (actually, given the symptoms, that wouldn't hurt in any case.)
I've read up on the mtrr message. It's purely informational and a common occurrence on SMP machines. We have an identical machine running 7.2 with the 2.4.7-10smp kernel and ext3 which doesn't suffer from the rebooting problem. Only the newer kernels, supplied with 7.3, exhibit this behaviour. Setting maxcpus=1, as you suggested, prevents the problem from occurring, but it's not really the solution I was looking for. I think there is still a problem with smp and ext3 in the newer kernels.
It is almost completely impossible for ext3 to be the cause here. Spontaneous reboots really require a triple-page-fault, which is not something I have ever seen caused by *any* filesystem code. The problem I had previously with my own Xeon system was also not visible on 2.4.7 systems, but that turned out to be due to an updated scheduler in Red Hat's 2.4.18-based kernels which happened to reorder instructions in just the right way to trigger the CPU fault. Also, the biggest change in ext3 between our 2.4.7-10 and 2.4.18-10 kernels is not in 2.4.19 (Marcelo didn't merge it until the 2.4.20-pre series), so if the problem is seen with 2.4.19 but not with 2.4.7-10, I really don't think it's likely to be an ext3 problem. This really smells to me like a subtle CPU, PCI or chipset bug. I'd definitely look for a BIOS update as the next course of action. After that, we probably need to start narrowing down to see which kernel the bad behaviour started at before we can get anywhere.
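The narrowing-down step could be organised as a bisection over an ordered list of kernels. The sketch below is only an illustration: `first_bad`, `KERNELS` and `reported_bad` are hypothetical names, the list contains only the kernels named in this report (any intermediate builds that are available would be added to it), and the "good" result for 2.4.7-10smp comes from a different, supposedly identical machine. The real test is manual: boot the kernel, run the compile, and see whether the box reboots.

```python
def first_bad(kernels, is_bad):
    """Return the earliest kernel in `kernels` for which is_bad() is true.

    Assumes `kernels` is ordered oldest to newest, the first entry is known
    good, the last entry is known bad, and the behaviour never flips back
    from bad to good in between.
    """
    lo, hi = 0, len(kernels) - 1  # invariant: kernels[lo] good, kernels[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(kernels[mid]):
            hi = mid
        else:
            lo = mid
    return kernels[hi]


# Kernels named in this report, oldest first.
KERNELS = ["2.4.7-10smp", "2.4.18-3smp", "2.4.18-10smp", "2.4.19"]

# Stand-in for the manual reboot test, filled in from what the reporter
# has described so far.
REPORTED_BAD = {"2.4.18-3smp", "2.4.18-10smp", "2.4.19"}


def reported_bad(kernel):
    """Hypothetical predicate: True if this kernel rebooted under load."""
    return kernel in REPORTED_BAD


if __name__ == "__main__":
    print(first_bad(KERNELS, reported_bad))
```

Each step halves the range, so even a long list of intermediate builds needs only a handful of boot-and-compile runs.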
OK. I trust your judgement on ext3 not being the issue. Both machines are using the latest BIOS update. I've just installed RH7.2 on one of the machines and the problem is still occurring. I find this very odd because I have an identical machine running 7.2 without any problems. I'm going to close this bug and spend some time trying to figure out why that machine runs fine, but these two apparently identical ones don't. Thanks for all your help and advice.