Red Hat Bugzilla – Bug 73077
ext3 and 2.4.18-10smp still cause problems
Last modified: 2007-04-18 12:46:12 EDT
Description of problem:
I have two identical Compaq ProLiant 5500s, each with two PIII Xeon 500 MHz
processors. Both machines exhibit identical behaviour, which leads me to
believe that it is not a hardware problem. I have installed Red Hat 7.3 and all
errata, including kernel-2.4.18-10smp.
When I was using kernel 2.4.18-3smp, the machines would unexpectedly reboot
every few minutes at random. Now that I have upgraded to kernel 2.4.18-10smp it
happens only a couple of times a day. It seems to happen when there is a lot of
disk access, e.g. when compiling a kernel or copying a few hundred files.
If I convert the filesystems to ext2 then the problem goes away. Or, if I boot
into the uniprocessor kernel, the problem goes away.
I understand that this may be very difficult for you to reproduce on different
hardware. I am currently trying out a vanilla kernel 2.4.19 to see if that
cures the problem.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install 7.3 on a dual-processor ProLiant 5500, using the ext3 filesystem
2. Boot into smp kernel
3. Start a large compile or copy operation.
Actual Results: System suddenly reboots without warning or crash dump.
Newly compiled kernel 2.4.19 from www.kernel.org does NOT fix the problem.
This sounds like a hardware problem offhand, but it might be due to driver
interactions. It is quite possible that ext3 is simply causing a higher, or
different, load on the hardware due to the different disk access patterns caused
by journal accesses. It could be something as basic as the power supply not
being up to the job of powering both CPUs at full tilt plus a heavily loaded
disk drive; kernel compiles do tend to stress both of those components.
If you set up a serial console, can you capture any diagnostics from the kernel?
Also, can you give lsmod output?
(just so we can see what hw and drivers are in use)
[root@Beavis root]# lsmod
Module                  Size  Used by    Not tainted
binfmt_misc             7684   1
autofs                 12612   0  (autoclean) (unused)
3c59x                  29288   0  (unused)
eepro100               21040   1
ext3                   70944   2
jbd                    53792   2  [ext3]
sym53c8xx              63204   0  (unused)
cpqarray               22624   3
sd_mod                 12992   0  (unused)
scsi_mod              113284   2  [sym53c8xx sd_mod]
I'll set up a serial console and let you know if I get any output.
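If it helps anyone comparing driver sets between the working and failing machines, the lsmod listing above can be turned into records and diffed. This is only a rough Python sketch, assuming the 2.4-style lsmod layout pasted in this comment; `parse_lsmod` is a hypothetical helper name and the sample is a subset of the listing.

```python
import re


def parse_lsmod(text):
    """Parse 2.4-style lsmod output into a list of module records.

    Assumption: each module line is "name size use_count [flags] [users]",
    as in the listing pasted in this report. Other lines are skipped.
    """
    modules = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the shell prompt and the "Module Size Used by" header.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        users = re.search(r"\[([^\]]*)\]", line)
        modules.append({
            "name": fields[0],
            "size": int(fields[1]),
            "use_count": int(fields[2]),
            "used_by": users.group(1).split() if users else [],
        })
    return modules


sample = """\
[root@Beavis root]# lsmod
Module Size Used by Not tainted
ext3 70944 2
jbd 53792 2 [ext3]
cpqarray 22624 3
scsi_mod 113284 2 [sym53c8xx sd_mod]
"""

for module in parse_lsmod(sample):
    print(module["name"], module["use_count"], module["used_by"])
```

Running the same function over the listing from each machine and comparing the module names would show whether the failing boxes load any driver the working one does not.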
I set up a serial console as suggested and tested that it worked. (All the
bootup messages were displayed on the terminal as expected.) Then I reproduced
the problem. I can reproduce it reliably now: I cd to "/usr/src/linux-2.4" and
type "make dep bzImage modules". The machine will then reboot in a matter of
seconds. No additional output appeared on the serial console when the system
rebooted.
I don't think it's a power problem either. These machines each have two large
power supplies and can support up to 4 processors and 10 hard disks. We are
using only two processors and two disks in each machine.
Ah... 2 PIII Xeon 500Mhz
I had a problem a while back with a 4-way PIII Xeon 500MHz SMP machine which
rebooted spontaneously under certain loads with recent kernels, and after a long
period of debugging it turned out to be a known CPU microcode fault fixed with
Intel's current microcode errata. The kernel-utils package in Red Hat 7.3 has
errata updates for the microcode which can be applied at runtime. Try
installing kernel-utils and running
/sbin/service microcode_ctl start
and see if you can still reproduce the problem. If that fixes things, you can do a
/sbin/chkconfig microcode_ctl on
to get the system to update the microcode automatically on each boot.
My own test box has been rock-solid since doing this.
I was really hoping that microcode thing was the answer, but no. I can still
reproduce the problem. It looks like my CPUs are already at the latest version.
Sep 2 16:10:43 Beavis kernel: microcode: CPU0 already at revision 56
Sep 2 16:10:43 Beavis kernel: microcode: CPU1 already at revision 56
Sep 2 16:10:43 Beavis kernel: microcode: freed 4096 bytes
Sep 2 16:10:28 Beavis microcode_ctl: microcode_ctl startup succeeded
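For anyone checking this on more than a couple of CPUs, the revision lines above can be compared mechanically. This is a minimal Python sketch, assuming only the "already at revision" syslog format quoted in this comment; `microcode_revisions` and `revisions_consistent` are hypothetical helper names, not part of microcode_ctl.

```python
import re

# Matches kernel lines such as:
#   "... kernel: microcode: CPU0 already at revision 56"
# The pattern is an assumption based only on the lines quoted in this report;
# other microcode messages are not handled.
_REVISION_RE = re.compile(r"microcode: CPU(\d+) already at revision (\d+)")


def microcode_revisions(log_text):
    """Return {cpu_index: revision} for every matching line in log_text."""
    revisions = {}
    for line in log_text.splitlines():
        match = _REVISION_RE.search(line)
        if match:
            revisions[int(match.group(1))] = int(match.group(2))
    return revisions


def revisions_consistent(revisions):
    """True when at least one CPU was seen and all report the same revision."""
    return bool(revisions) and len(set(revisions.values())) == 1


sample = """\
Sep 2 16:10:43 Beavis kernel: microcode: CPU0 already at revision 56
Sep 2 16:10:43 Beavis kernel: microcode: CPU1 already at revision 56
Sep 2 16:10:43 Beavis kernel: microcode: freed 4096 bytes
"""

revisions = microcode_revisions(sample)
print(revisions, revisions_consistent(revisions))
```

A mismatch between CPUs would be worth reporting here; matching revisions, as in this log, only rule out a skewed update and say nothing about whether the revision itself is the latest.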
I have noticed the following message at boot time on both machines. Could it be
related?
Sep 2 16:16:04 Beavis kernel: mtrr: your CPUs had inconsistent fixed MTRR setti
Sep 2 16:16:04 Beavis kernel: mtrr: probably your BIOS does not setup all CPUs
Well, if CPU 1 isn't getting properly set up by the BIOS then that would
certainly cause unpredictable behaviour. You could try booting with maxcpus=1
to see if that helps; if so, you probably want to look for a BIOS update from
your vendor (actually, given the symptoms, that wouldn't hurt in any case.)
I've read up on the mtrr message. It's purely informational and a common
occurrence on SMP machines. We have an identical machine running 7.2 with the
2.4.7-10smp kernel and ext3 which doesn't suffer from the rebooting problem.
Only the newer kernels, supplied with 7.3, exhibit this behaviour.
Setting maxcpus=1, as you suggested, prevents the problem from occurring - but
it's not really the solution I was looking for. I think there is still a
problem with smp and ext3 in the newer kernels.
It is almost completely impossible for ext3 to be the cause here. Spontaneous
reboots really require a triple-page-fault, which is not something I have ever
seen caused by *any* filesystem code.
The problem I had previously with my own Xeon system was also not visible on
2.4.7 systems, but that turned out to be due to an updated scheduler in Red
Hat's 2.4.18-based kernels which happened to reorder instructions in just the
right way to trigger the CPU fault.
Also, the biggest change in ext3 between our 2.4.7-10 and 2.4.18-10 kernels is
not in 2.4.19 (Marcelo didn't merge it until the 2.4.20-pre series), so if the
problem is seen with 2.4.19 but not with 2.4.7-10, I really don't think it's
likely to be an ext3 problem.
This really smells to me like a subtle CPU, PCI or chipset bug. I'd definitely
look for a BIOS update as the next course of action. After that, we probably
need to start narrowing down to see which kernel the bad behaviour started at
before we can get anywhere.
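The narrowing-down step could be organised as a binary search over an ordered list of kernels, so that each test boot halves the remaining range. This is a minimal Python sketch, assuming the problem persists in every kernel after the one that introduced it; `first_bad_index` is a hypothetical helper, and the example results are hard-coded from what has been reported in this bug so far rather than from a fresh test run.

```python
def first_bad_index(versions, is_bad):
    """Return the index of the first version for which is_bad() is True.

    Assumptions: versions are ordered oldest to newest, and once the
    problem appears it stays in all later versions. Returns None when
    no version is bad.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid        # first bad version is at mid or earlier
        else:
            lo = mid + 1    # first bad version is after mid
    return lo if lo < len(versions) else None


# Kernels named in this report, oldest first.
candidates = ["2.4.7-10smp", "2.4.18-3smp", "2.4.18-10smp"]

# Illustrative stand-in for "boot this kernel and run the compile":
# results as reported in this bug up to this comment.
reported_bad = {"2.4.18-3smp", "2.4.18-10smp"}

index = first_bad_index(candidates, lambda v: v in reported_bad)
print(candidates[index] if index is not None else "no bad kernel found")
```

In practice `is_bad` would mean booting the kernel on the affected machine and running the "make dep bzImage modules" reproducer, and the candidate list would include the intermediate errata kernels.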
OK. I trust your judgement on ext3 not being the issue.
Both machines are using the latest BIOS update.
I've just installed RH7.2 on one of the machines and the problem is still
occurring. I find this very odd because I have an identical machine running 7.2
without any problems.
I'm going to close this bug and spend some time trying to figure out why that
machine runs fine, but these two apparently identical ones don't.
Thanks for all your help and advice.