73077 – ext3 and 2.4.18-10smp still cause problems

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-08-30 13:04 UTC by phil skuse
Modified:	2007-04-18 16:46 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2002-09-02 17:58:25 UTC
Embargoed:

Description phil skuse 2002-08-30 13:04:38 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
I have two identical Compaq ProLiant 5500s, each with 2 PIII Xeon 500MHz
processors. Both machines exhibit identical behaviour, which leads me to
believe that it is not a hardware problem. I have installed Red Hat 7.3 and all
errata, including kernel-2.4.18-10smp.

When I was using kernel 2.4.18-3smp, the machines would unexpectedly reboot
every few minutes at random. Now that I have upgraded to kernel 2.4.18-10smp, it
happens only a couple of times a day. It seems to happen when there is a lot of
disk access, e.g. when compiling a kernel or copying a few hundred files.
If I convert the filesystems to ext2, the problem goes away. If I boot
into the uniprocessor kernel, the problem also goes away.

I understand that this may be very difficult for you to reproduce on different 
hardware. I am currently trying out a vanilla kernel 2.4.19 to see if that 
cures the problem. 

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Install 7.3 on twin processor Proliant 5500, use ext3 filesystem
2. Boot into smp kernel
3. Start a large compile or copy operation.
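
The copy operation in step 3 could be approximated with a small script like the one below. This is only a sketch under assumptions: the function name copy_load, the 64 KiB file size and the default of 300 files are illustrative and not taken from the report. The files are created under the directory that mktemp picks (TMPDIR or /tmp), which would need to sit on the ext3 filesystem under test.

```shell
#!/bin/sh
# Sketch of a copy workload: create a number of small files, copy them all
# to a second directory, and report how many arrived.
# copy_load, the file size and the default count are illustrative assumptions.
copy_load() {
    count=${1:-300}
    work=$(mktemp -d) || return 1
    mkdir "$work/src" "$work/dst"

    i=1
    while [ "$i" -le "$count" ]; do
        # 64 blocks of 1024 bytes of zeros per file
        dd if=/dev/zero of="$work/src/file$i" bs=1024 count=64 2>/dev/null
        i=$(( i + 1 ))
    done

    cp "$work"/src/* "$work/dst/"

    # Count the copies; the unquoted echo strips any padding from wc.
    copied=$(ls "$work/dst" | wc -l)
    rm -rf "$work"
    echo $copied
}

# Copy a few hundred files, as in the description above.
copy_load 300
```

The script prints the number of files that were copied, so a run that survives can be told apart from one that was cut short by a reboot.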
	

Actual Results:  System suddenly reboots without warning or crash dump.

Additional info:

Comment 1 phil skuse 2002-08-30 13:08:17 UTC

UPDATE

Newly compiled kernel 2.4.19 from www.kernel.org does NOT fix the problem.

Comment 2 Stephen Tweedie 2002-08-30 13:17:25 UTC

This sounds like a hardware problem offhand, but it might be due to driver
interactions.  It is quite possible that ext3 is simply causing a higher, or
different, load on the hardware due to the different disk access patterns caused
by journal accesses.  It could be something as basic as the power supply not
being up to the job of powering both CPUs at full tilt plus a heavily loaded
disk drive; kernel compiles do tend to stress both of those components
simultaneously.

If you set up a serial console, can you capture any diagnostics from the kernel?

Comment 3 Arjan van de Ven 2002-08-30 13:21:00 UTC

Also, can you give the lsmod output?
(just so we can see what hw and drivers are in use)

Comment 4 phil skuse 2002-09-02 13:14:46 UTC

[root@Beavis root]# lsmod
Module                  Size  Used by    Not tainted
binfmt_misc             7684   1 
autofs                 12612   0  (autoclean) (unused)
3c59x                  29288   0  (unused)
eepro100               21040   1 
ext3                   70944   2 
jbd                    53792   2  [ext3]
sym53c8xx              63204   0  (unused)
cpqarray               22624   3 
sd_mod                 12992   0  (unused)
scsi_mod              113284   2  [sym53c8xx sd_mod]

I'll set up a serial console and let you know if I get any output.

Comment 5 phil skuse 2002-09-02 14:08:32 UTC

I set up a serial console as suggested and tested that it worked (all the
bootup messages were displayed on the terminal as expected). Then I reproduced
the problem. I can reproduce it reliably now: I cd to "/usr/src/linux-2.4" and
type "make dep bzImage modules", and the machine reboots in a matter of
seconds. No additional output appeared on the serial console when the system
went down.

I don't think it's a power problem either. These machines each have two large
power supplies and can support up to 4 processors and 10 hard disks. We are
using only two processors and two disks in each machine.

Comment 6 Stephen Tweedie 2002-09-02 14:39:37 UTC

Ah... 2 PIII Xeon 500MHz.

I had a problem a while back with a 4-way PIII Xeon 500MHz SMP machine which
rebooted spontaneously under certain loads with recent kernels, and after a long
period of debugging it turned out to be a known CPU microcode fault fixed with
Intel's current microcode errata.  The kernel-utils package in Red Hat 7.3 has
errata updates for the microcode which can be applied at runtime.  Try
installing kernel-utils and running

/sbin/service microcode_ctl start

and see if you can still reproduce the problem.  If that fixes things, you can do a

/sbin/chkconfig microcode_ctl on

to get the system to update the microcode automatically on each boot.

My own test box has been rock-solid since doing this.

Comment 7 phil skuse 2002-09-02 15:20:36 UTC

I was really hoping that microcode thing was the answer, but no. I can still 
reproduce the problem. It looks like my CPUs are already at the latest version.

Sep  2 16:10:43 Beavis kernel: microcode: CPU0 already at revision 56 (current=56)
Sep  2 16:10:43 Beavis kernel: microcode: CPU1 already at revision 56 (current=56)
Sep  2 16:10:43 Beavis kernel: microcode: freed 4096 bytes
Sep  2 16:10:28 Beavis microcode_ctl: microcode_ctl startup succeeded 

I have noticed the following message at boot time on both machines. Could it be
related?

Sep  2 16:16:04 Beavis kernel: mtrr: your CPUs had inconsistent fixed MTRR settings
Sep  2 16:16:04 Beavis kernel: mtrr: probably your BIOS does not setup all CPUs

Comment 8 Stephen Tweedie 2002-09-02 15:43:10 UTC

Well, if CPU 1 isn't getting properly set up by the BIOS then that would
certainly cause unpredictable behaviour.  You could try booting with maxcpus=1
to see if that helps; if so, you probably want to look for a BIOS update from
your vendor (actually, given the symptoms, that wouldn't hurt in any case.)
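
One way to confirm that a maxcpus=1 boot actually took effect is to check the kernel command line and the number of processors the kernel reports. The following is a minimal sketch, assuming /proc/cmdline and /proc/cpuinfo are available on the running kernel; the helper names has_param and count_cpus are hypothetical.

```shell
#!/bin/sh
# has_param PARAM [FILE]: succeed if PARAM appears as a whole word on the
# kernel command line (FILE defaults to /proc/cmdline).
has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# count_cpus [FILE]: print the number of "processor" entries
# (FILE defaults to /proc/cpuinfo).
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

if has_param maxcpus=1; then
    echo "maxcpus=1 is set; kernel reports $(count_cpus) processor(s)"
else
    echo "maxcpus=1 is not set; kernel reports $(count_cpus) processor(s)"
fi
```

If the parameter is set but more than one processor is still reported, the test was not really run on a single CPU.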

Comment 9 phil skuse 2002-09-02 16:40:45 UTC

I've read up on the mtrr message. It's purely informational and a common 
occurrence on SMP machines. We have an identical machine running 7.2 with the 
2.4.7-10smp kernel and ext3 which doesn't suffer from the rebooting problem. 
Only the newer kernels, supplied with 7.3, exhibit this behaviour.

Setting maxcpus=1, as you suggested, prevents the problem from occurring, but 
it's not really the solution I was looking for. I think there is still a 
problem with SMP and ext3 in the newer kernels.

Comment 10 Stephen Tweedie 2002-09-02 17:58:18 UTC

It is almost completely impossible for ext3 to be the cause here.  Spontaneous
reboots really require a triple fault, which is not something I have ever
seen caused by *any* filesystem code.  

The problem I had previously with my own Xeon system was also not visible on
2.4.7 systems, but that turned out to be due to an updated scheduler in Red
Hat's 2.4.18-based kernels which happened to reorder instructions in just the
right way to trigger the CPU fault.

Also, the biggest change in ext3 between our 2.4.7-10 and 2.4.18-10 kernels is
not in 2.4.19 (Marcelo didn't merge it until the 2.4.20-pre series), so if the
problem is seen with 2.4.19 but not with 2.4.7-10, I really don't think it's
likely to be an ext3 problem.

This really smells to me like a subtle CPU, PCI or chipset bug.  I'd definitely
look for a BIOS update as the next course of action.  After that, we probably
need to start narrowing down to see which kernel the bad behaviour started at
before we can get anywhere.
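
The narrowing-down step could be organised as a binary search over an ordered list of kernel builds. The sketch below is only an illustration, under the assumption that every build after the first bad one is also bad. The names first_bad and is_bad are hypothetical, and is_bad is a stub standing in for a manual boot-and-test of each kernel; its answers simply mirror what has been reported in this bug so far.

```shell
#!/bin/sh
# is_bad BUILD: hypothetical predicate, true if BUILD reboots under load.
# In practice this is a manual boot-and-test step; it is stubbed here so
# the sketch can run anywhere.
is_bad() {
    case "$1" in
        2.4.18-3smp|2.4.18-10smp) return 0 ;;
        *) return 1 ;;
    esac
}

# first_bad BUILD...: print the earliest build (oldest first) for which
# is_bad succeeds, or print nothing if no build is bad.
first_bad() {
    lo=1
    hi=$#
    found=""
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if is_bad "$candidate"; then
            # Bad: remember it and look for an even earlier bad build.
            found=$candidate
            hi=$(( mid - 1 ))
        else
            # Good: the first bad build must be later in the list.
            lo=$(( mid + 1 ))
        fi
    done
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
    fi
}

first_bad 2.4.7-10smp 2.4.18-3smp 2.4.18-10smp
```

With more intermediate builds in the list, each test boot halves the remaining range, which keeps the number of reboots manageable.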

Comment 11 phil skuse 2002-09-09 15:21:07 UTC

OK. I trust your judgement on ext3 not being the issue. 
Both machines are using the latest BIOS update.

I've just installed RH7.2 on one of the machines and the problem is still 
occurring. I find this very odd because I have an identical machine running 7.2 
without any problems.

I'm going to close this bug and spend some time trying to figure out why that 
machine runs fine, but these two apparently identical ones don't.

Thanks for all your help and advice.
