For tracking purposes. Upgraded an SMP workstation to Red Hat 7.2. Logged in to X, started the latest Mozilla nightly build (2001091712 RPM), and entered "http://cnn.com" into Mozilla. Hard lockup. No keyboard activity. The system is dead; the only option is a hardware reset. Rebooted, ext3 recovered, and I re-fscked my partitions. Did this again, and once again the machine locked up. Reproducible. Downgraded all partitions to ext2, booted 2.4.7-10smp with ext2 partitions, repeated these steps, and the system locked up. Installed the errata 2.4.3-12smp kernel and booted 7.2 with it; the system does not lock up when I perform these steps. Analysis of UP kernels will follow. This is a Supermicro dual-PIII box with 256MB RAM and 512MB swap, and an original-vintage Sound Blaster 16 ISA card (sound effects are enabled in GNOME). There is an onboard AIC7xxx controller, plus a second Adaptec SCSI controller that drives an external scanner (not used).
Created attachment 31958: The hardware.
Cannot reproduce with a UP kernel. Switching repeatedly between UP and SMP, I can get SMP to crash every time, although sometimes I need to hit refresh a couple of times before it locks up. UP never crashes.
Can you ping the machine from another machine? What you describe could also be an X lockup.
The box stops responding to pings when it dies. It's time to build a serial console...
It's pretty bad. No OOPS logged to the serial console. nmi_watchdog=1 is also silent (I verified that the NMI interrupts were being generated, as per the kernel docs). Magic SysRq is also silent. I pulled out the ISA Sound Blaster; it makes no difference, and 2.4.7-10smp continues to lock up without it. The UP kernel is fine and never locks up. The lockup appears to correlate with the amount of system uptime or CPU activity. One time I foolishly rebooted into 2.4.7-10smp after a lockup. After a lengthy fsck, the kernel froze before the initscripts finished. I repeated the experiment and once again got a hard lockup before the initscripts completed. With a repaired filesystem, I can log in and do some stuff in Mozilla before it locks up. I've just recovered the system after losing the kudzu database, /etc/inittab, and a bunch of other base package files. I've decided to take a break and stick with 2.4.3smp for now. Suggestions welcome.
I assume you're not using the NVidia binary-only driver?
Correct. Just a straightforward 7.2 install.
I pulled out the Adaptec 2940 PCI SCSI adapter and booted 2.4.7-10smp. I could not reproduce the crash. The PCI 2940 card only has an external scanner hooked up to it, which I have not been using at all. The on-motherboard 2940U2/W adapter is the one that has a bunch of disks and a CD-RW hanging off it.
I then dropped the card back in (in case the card was not socketed properly on the motherboard), reattached the scanner, and rebooted. The machine froze the 3rd time I reloaded cnn.com in Mozilla. I've now pulled out the card again and am running 2.4.7-10smp again. I will update this bug tomorrow to indicate whether I've had any crashes. If I haven't, this would be a fairly good indication that the factors are:
1) A 2.4.7-10smp kernel
2) Two (possibly different models) Adaptec SCSI adapters. Whether they have to be different models is not known.
The 2.4.3-12smp kernel never crashed with this combination.
For reference, the on-board 2940U2/W currently has the following devices hanging off it:
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: M1606S-512  Rev: 6234
  Type:   Direct-Access  ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: YAMAHA  Model: CRW8424S  Rev: 1.0d
  Type:   CD-ROM  ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: Seagate  Model: STT8000N  Rev: 3.22
  Type:   Sequential-Access  ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 08 Lun: 00
  Vendor: SEAGATE  Model: ST39140LW  Rev: 1483
  Type:   Direct-Access  ANSI SCSI revision: 02
The card that I pulled out had a UMAX scanner on it. That's it.
The BIOS rev on the PCI SCSI card is 1.23. The motherboard SCSI BIOS rev is 2.11. Attaching /proc/interrupts and /proc/ioports...
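For anyone who wants to compare /proc/interrupts captures like the ones attached to this bug, here is a rough Python sketch of how the comparison could be scripted. The sample text in it is invented for illustration and is not copied from the attachments, and the function names (parse_interrupts, shared_irqs, diff_routing) are hypothetical helpers, not part of any existing tool.

```python
# Sketch: parse /proc/interrupts text and compare two captures.
# SAMPLE_A and SAMPLE_B are made-up illustrations, NOT the attached files.

def parse_interrupts(text):
    """Map each IRQ label to its per-CPU counts, controller type, and devices."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        # The first ncpu numeric fields are the per-CPU interrupt counts.
        counts = [int(f) for f in fields[:ncpu] if f.isdigit()]
        tail = fields[len(counts):]
        controller = tail[0] if tail else ""
        devices = [d.strip() for d in " ".join(tail[1:]).split(",") if d.strip()]
        table[label.strip()] = {
            "counts": counts,
            "controller": controller,
            "devices": devices,
        }
    return table

def shared_irqs(table):
    """Return the IRQ labels that have more than one device attached."""
    return [irq for irq, entry in table.items() if len(entry["devices"]) > 1]

def diff_routing(a, b):
    """Return labels (sorted as strings) whose controller or devices differ."""
    changed = []
    for irq in sorted(set(a) | set(b)):
        ea, eb = a.get(irq), b.get(irq)
        if ea is None or eb is None:
            changed.append(irq)
        elif (ea["controller"], ea["devices"]) != (eb["controller"], eb["devices"]):
            changed.append(irq)
    return changed

SAMPLE_A = """
           CPU0       CPU1
  0:      40000      40100    IO-APIC-edge  timer
 10:       1200       1300   IO-APIC-level  aic7xxx
 11:       5000       5100   IO-APIC-level  aic7xxx, eth0
NMI:          0          0
"""

SAMPLE_B = """
           CPU0       CPU1
  0:      80100          0          XT-PIC  timer
 10:       2500          0          XT-PIC  aic7xxx
 11:      10100          0          XT-PIC  aic7xxx, eth0
NMI:          0          0
"""

if __name__ == "__main__":
    a = parse_interrupts(SAMPLE_A)
    b = parse_interrupts(SAMPLE_B)
    print("shared IRQs in A:", shared_irqs(a))
    print("routing differs on:", diff_routing(a, b))
```

To use it on real data, read each saved capture into a string and pass it to parse_interrupts in place of the samples.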
Created attachment 32160: 2.4.3-12smp configuration with both SCSI adapters - stable configuration.
Created attachment 32161: 2.4.7-10smp with both SCSI adapters - unstable, crashes after 3-4 minutes of heavy use.
Created attachment 32162: 2.4.7-10smp with the PCI SCSI card taken out - stable so far.
2.4.7-10smp has yet to crash after a full day with a single SCSI adapter.
It's not aic7xxx_old. I rebuilt 2.4.7-10smp reverting the three changes to aic7xxx_old between 2.4.3 and 2.4.7 - the machine still locks up.
2.4.9-0.18 still locks up.
I am unable to crash the kernel if I boot 2.4.7-10smp with 'noapic'. I'll keep running with 'noapic' for a little while longer, just to be sure. I've also tried building 2.4.7-10smp without most of the patches. The minimum configuration I managed to build was 2.4.7-10smp with only -ac, tux, ext3, and the minimum fixup patches needed to compile the kernel. That build proved to be even more unstable, so I can't use it as a working baseline. If the noapic boot continues to run, would it make sense to take arch/i386/kernel/io_apic.c from 2.4.3-12smp, which works, drop it into 2.4.7-10smp, and see what happens?
Rebuilt 2.4.7-10smp with io_apic.c from 2.4.3-12smp. SO FAR SO GOOD. Looks like there's only one line's worth of substantive changes between the two versions. I have no idea what it means, but so far I'm unable to crash 2.4.7-10smp with the reverted io_apic.c. Will continue to test.
The real bug here is some broken debug code that was added in kernel 2.4.6. Looks like it can affect any SMP motherboard. With certain combinations of APIC devices and PCI IRQ pin mappings, the kernel goes into an infinite loop while holding an ioapic spinlock. Furrfu....
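To illustrate the general class of failure (this is a toy model only, not the kernel code; the real change is in the attached patch), here is a minimal Python sketch of a loop that walks a chain of pin entries and never exits when the chain happens to form a cycle. The data layout, the name walk_chain, and the step limit are all assumptions made for the illustration. The kernel has no such step limit, so a CPU in that state would simply spin with the lock held, which would be consistent with the dead keyboard and unanswered pings reported earlier in this bug.

```python
# Hypothetical model of a per-IRQ pin chain; NOT the kernel's actual code.
# Each entry is (pin, next_index); next_index == 0 marks the end of a chain.

def walk_chain(entries, start, max_steps=64):
    """Visit each pin in the chain; raise if the walk does not terminate.

    max_steps exists only so this model can report the problem. A loop
    without it would run forever on a cyclic chain.
    """
    visited = []
    idx = start
    for _ in range(max_steps):
        pin, next_index = entries[idx]
        visited.append(pin)
        if next_index == 0:
            return visited
        idx = next_index
    raise RuntimeError("chain never terminated")

# A well-formed chain: entry 1 -> entry 2 -> end.
GOOD = {1: (16, 2), 2: (19, 0)}

# A malformed chain: entry 2 points back at entry 1, so the walk cycles.
BAD = {1: (16, 2), 2: (19, 1)}

if __name__ == "__main__":
    print(walk_chain(GOOD, 1))
    try:
        walk_chain(BAD, 1)
    except RuntimeError as exc:
        print("detected:", exc)
```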
Created attachment 33701: Brown paper bag bug fix.
Tested this patch on ASUS-P2DS and ABIT-BP6 boards.
Looks like this final patch is in the 2.4.18-17.7.x source. Close bug?
Yeah, this patch went into -ac, then into the Linus tree about a year ago.