Bug 53768 - Broken debug code in io_apic.c
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brock Organ
Reported: 2001-09-18 01:44 UTC by Sam Varshavchik
Modified: 2007-04-18 16:37 UTC (History)
Last Closed: 2002-10-26 05:48:59 UTC
Attachments
The hardware. (2.61 KB, text/plain)
2001-09-18 01:45 UTC, Sam Varshavchik
2.4.3-12smp configuration with both SCSI adapters - stable configuration. (1.47 KB, text/plain)
2001-09-20 02:33 UTC, Sam Varshavchik
2.4.7-10smp with both SCSI adapters - unstable, crashes after 3-4 minutes of heavy use. (1.47 KB, text/plain)
2001-09-20 02:34 UTC, Sam Varshavchik
2.4.7-10smp with the PCI SCSI card taken out - stable so far. (1.36 KB, text/plain)
2001-09-20 02:35 UTC, Sam Varshavchik
Brown paper bag bug fix. (906 bytes, patch)
2001-10-10 01:57 UTC, Sam Varshavchik

Description Sam Varshavchik 2001-09-18 01:44:55 UTC
For tracking purposes.

Upgraded an SMP workstation to Red Hat 7.2.  Logged in under X, started the latest
mozilla nightly build (2001091712 RPM).  Entered "http://cnn.com" into
Mozilla.  Hard lock up.  No keyboard activity.  The system is dead, can
only do a hardware reset.

Rebooted, ext3 recovered, and I re-fscked my partitions.  Did this again, and
once again the machine locked up.  Reproducible.

Downgraded all partitions to ext2, booted 2.4.7-10smp with ext2 partitions,
repeated these steps, the system locked up.

Installed the errata 2.4.3-12smp, booted 7.2 with this kernel, the system
does not lock up when I perform these steps.

Analysis of UP kernels will follow.  This is a Supermicro dual-PIII box
with 256MB RAM and 512MB swap, and original vintage soundblaster 16 ISA
card (sound effects are enabled in Gnome).  There is an onboard AIC7xxx
controller, there's also a second Adaptec SCSI controller that drives an
external scanner (not used).

Comment 1 Sam Varshavchik 2001-09-18 01:45:50 UTC
Created attachment 31958 [details]
The hardware.

Comment 2 Sam Varshavchik 2001-09-18 05:00:25 UTC
Cannot reproduce with a UP kernel.  Repeatedly switching between UP and SMP, I
can get SMP to crash every time.  Sometimes I need to hit refresh a couple of
times, before it locks up.  Unable to reproduce the crash with UP.



Comment 3 Arjan van de Ven 2001-09-18 08:41:21 UTC
Can you ping the machine from another machine ?
What you describe can also be an X lockup.....

Comment 4 Sam Varshavchik 2001-09-18 12:29:37 UTC
The box stops responding to pings when it dies.  It's time to build a serial
console...




Comment 5 Sam Varshavchik 2001-09-19 00:43:10 UTC
It's pretty bad.

No OOPS logged to the serial console.  nmi_watchdog=1 is also silent (I verified
that the NMI interrupts were being generated, as per kernel docs).  Magic SysRq
is also silent.
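
One way to make the NMI check above concrete is to take two snapshots of /proc/interrupts a few seconds apart and confirm that the total on the NMI line went up.  The sketch below is only an illustration of that idea: the function name nmi_total is made up, and it works on a snapshot already read into memory rather than on the file itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on the "NMI:" line of a /proc/interrupts
 * snapshot held in memory.  Returns -1 if there is no such line.
 * (Hypothetical helper, written for illustration only.)
 */
long nmi_total(const char *snapshot)
{
    const char *line = snapshot;

    assert(snapshot != NULL);
    while (line != NULL && *line != '\0') {
        const char *p = line;

        /* Skip leading blanks before the row label. */
        while (*p == ' ' || *p == '\t')
            p++;

        if (strncmp(p, "NMI:", 4) == 0) {
            long total = 0;

            p += 4;
            for (;;) {
                char *end;

                /* Stay on this line: stop at anything that is not a digit. */
                while (*p == ' ' || *p == '\t')
                    p++;
                if (*p < '0' || *p > '9')
                    break;
                total += strtol(p, &end, 10);
                p = end;
            }
            return total;
        }

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

If nmi_total() returns a larger value for the second snapshot than for the first, the watchdog NMIs are being delivered; if it stays flat or returns -1, they are not.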

I pulled out the ISA soundblaster.  2.4.7-10smp continues to lock up without the
ISA soundblaster.  Makes no difference.

UP kernel is fine, and never locks up.

The lock up appears to correlate with the amount of system uptime or CPU
activity.  One time I foolishly rebooted into 2.4.7-10smp after a lock up. 
After a lengthy fsck, the kernel froze before initscripts concluded.  I repeated
the experiment, and once again got a hard lockup before the initscripts completed.
With a repaired filesystem, I can log in and do some stuff in Mozilla before it
locks up.

I've just recovered the system after losing the kudzu database, /etc/inittab,
and a bunch of other base package files.  I've decided to take a break, and
stick with 2.4.3smp, for now.  Suggestions welcome.




Comment 6 Arjan van de Ven 2001-09-19 09:27:32 UTC
I assume you're not using the NVidia binary only driver ?

Comment 7 Sam Varshavchik 2001-09-19 11:23:22 UTC
Correct.  Just a straightforward 7.2 install.

Comment 8 Sam Varshavchik 2001-09-20 00:12:27 UTC
I pulled out the Adaptec 2940 PCI SCSI adapter, and booted 2.4.7-10smp.  I could
not reproduce the crash.  The 2940 card only has an external scanner hooked up
to it, and I have not been using the scanner at all.  The on-motherboard
2940U2/W adapter is the one that has a bunch of disks and a CD-RW hanging off it.

After I pulled out the 2940 PCI SCSI, I could not make the kernel crash.  I
dropped the card back in (in case the card was not socketed properly on the
motherboard), reattached the scanner, and rebooted.  The machine froze the 3rd
time I reloaded cnn.com in mozilla.

I've now pulled out the card again, and I'm now running 2.4.7-10smp again.  I
will update this bug tomorrow to indicate whether I've had any crashes.  If I
haven't, this would be a fairly good indication that the factors are:

1) A 2.4.7-10smp kernel

2) Two (possibly different models) Adaptec SCSI adapters.  Whether they have to
be different models is not known

The 2.4.3-12smp kernel never crashed with this combination.

For reference, the current on-board 2940U2/W has the following stuff hanging off it:

Attached devices: 
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: M1606S-512       Rev: 6234
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: YAMAHA   Model: CRW8424S         Rev: 1.0d
  Type:   CD-ROM                           ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: Seagate  Model: STT8000N         Rev: 3.22
  Type:   Sequential-Access                ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 08 Lun: 00
  Vendor: SEAGATE  Model: ST39140LW        Rev: 1483
  Type:   Direct-Access                    ANSI SCSI revision: 02
The card that I pulled out had a UMAX scanner on it.  That's it.



Comment 9 Sam Varshavchik 2001-09-20 02:32:13 UTC
The BIOS rev on the PCI SCSI card is 1.23

The motherboard SCSI BIOS rev is 2.11

Attaching /proc/interrupts and /proc/ioports...



Comment 10 Sam Varshavchik 2001-09-20 02:33:16 UTC
Created attachment 32160 [details]
2.4.3-12smp configuration with both SCSI adapters - stable configuration.

Comment 11 Sam Varshavchik 2001-09-20 02:34:10 UTC
Created attachment 32161 [details]
2.4.7-10smp with both SCSI adapters - unstable, crashes after 3-4 minutes of heavy use.

Comment 12 Sam Varshavchik 2001-09-20 02:35:22 UTC
Created attachment 32162 [details]
2.4.7-10smp with the PCI SCSI card taken out - stable so far.

Comment 13 Sam Varshavchik 2001-09-20 21:53:17 UTC
2.4.7-10smp is yet to crash after a full day, with a single SCSI adapter.

Comment 14 Sam Varshavchik 2001-09-24 03:16:11 UTC
It's not aic7xxx_old.  I rebuilt 2.4.7-10smp reverting the three changes to
aic7xxx_old between 2.4.3 and 2.4.7 - the machine still locks up.



Comment 15 Sam Varshavchik 2001-10-03 16:47:19 UTC
2.4.9-0.18 still locks up.


Comment 16 Sam Varshavchik 2001-10-08 01:33:44 UTC
I am unable to crash the kernel if I boot 2.4.7-10smp with 'noapic'.  I'll still
run with 'noapic' for a little while longer, just to be sure.

I've also tried building 2.4.7-10smp without most of the patches.  I was able to
finally build 2.4.7-10smp only with -ac, tux, ext3 and the minimum fixup patches
needed to compile the kernel.  That was the minimum configuration that I managed
to build with.  That build proved to be even more unstable, so I can't use it as
a working baseline.

If the noapic boot continues to run, would it make sense to take
arch/i386/kernel/io_apic.c from 2.4.3-12smp, which works, stick it into
2.4.7-10smp, and see what happens?



Comment 17 Sam Varshavchik 2001-10-09 00:43:22 UTC
Rebuilt 2.4.7-10smp with io_apic.c from 2.4.3-12smp.

SO FAR SO GOOD.

Looks like there's only one line's worth of substantive changes between the two
versions.  I have no idea what it means, but so far I'm unable to crash
2.4.7-10smp with the reverted io_apic.c.  Will continue to test.



Comment 18 Sam Varshavchik 2001-10-10 01:56:43 UTC
The real bug here is some broken debug code that was added in kernel 2.4.6.

Looks like it can affect any SMP motherboard. With certain combinations of APIC
devices, and PCI IRQ pin mappings, you're going to go into an infinite loop
while holding an ioapic spinlock.

Furrfu....
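
The comment above does not quote the offending code, so what follows is only a generic illustration of the failure class it describes, not the actual io_apic.c code or the attached patch.  It assumes a per-IRQ chain of pin entries, and every name in it (pin_entry, count_pins_bounded) is hypothetical.  If a chain entry ends up pointing back at itself, an unbounded walk never terminates; done while holding a spinlock, that looks exactly like a silent hard lockup.  The bounded walk below reports the broken chain instead of spinning.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a per-IRQ chain of (apic, pin) entries.
 * pin == -1 marks an unused slot; next == 0 ends the chain.
 */
struct pin_entry {
    int apic;
    int pin;
    int next;   /* index of the next entry in the table, 0 = end */
};

/*
 * Walk the chain that starts at table[irq] and count its pins.
 * Returns the count, or -1 if the chain is malformed: a next index
 * outside the table, or more than max_steps entries (a walk that,
 * without the bound, would loop forever).
 */
int count_pins_bounded(const struct pin_entry *table, size_t len,
                       int irq, int max_steps)
{
    int count = 0;
    int idx = irq;

    assert(table != NULL && irq >= 0 && (size_t)irq < len);

    for (int step = 0; step < max_steps; step++) {
        const struct pin_entry *e = &table[idx];

        if (e->pin == -1)       /* unused slot: nothing routed here */
            return count;
        count++;
        if (e->next == 0)       /* normal end of chain */
            return count;
        if (e->next < 0 || (size_t)e->next >= len)
            return -1;          /* next index points outside the table */
        idx = e->next;
    }
    return -1;                  /* chain never ended within max_steps */
}
```

For example, with a four-entry table in which entry 1 chains to entry 3 and entry 2 chains to itself, the walk from IRQ 1 returns 2, while the walk from IRQ 2 returns -1 rather than hanging.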





Comment 19 Sam Varshavchik 2001-10-10 01:57:16 UTC
Created attachment 33701 [details]
Brown paper bag bug fix.

Comment 20 Sam Varshavchik 2001-10-14 16:37:02 UTC
Tested this patch on ASUS-P2DS and ABIT-BP6 boards.



Comment 21 Jim Wright 2002-10-26 05:48:51 UTC
looks like this final patch is in 2.4.18-17.7.x source.  close bug?

Comment 22 Sam Varshavchik 2002-10-26 06:04:44 UTC
Yeah, this patch went into -ac, then into the linus tree about a year ago.


