Bug 28145

Summary:	IBM Netfinity hangs with 2.4.1-0.1.9smp
Product:	[Retired] Red Hat Linux	Reporter:	Brian Brock <bbrock>
Component:	kernel	Assignee:	Michael K. Johnson <johnsonm>
Status:	CLOSED RAWHIDE	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.1	CC:	mdrew, mingo, rlandry, zaitcev
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2001-03-21 18:33:06 UTC	Type:	---

Description Brian Brock 2001-02-17 15:52:07 UTC

An IBM Netfinity 5500 hangs with 2.4.1-0.1.9smp after an arbitrary interval
with little or no apparent ongoing activity (in one example, within 14
minutes).  2.4.1-0.1.9 (non-smp) does not appear to cause the same hang.
The following message is noted in dmesg output:

WARNING:  MP Table in the EBDA can be UNSAFE, contact
linux-smp.org if you experience SMP problems.

Machine configuration:
IBM Netfinity 5500
2x PII-450 processors
ServeRAID BIOS 3.00.12
IBM PCI ServeRAID 4.50.05 on mb
pcnet32 / Fast 79C791 ethernet controller
on-board IDE used only for CD-ROM

Please let me know what other details are required; I'll provide them as
soon as possible.

Comment 1 Glen Foster 2001-02-17 23:29:47 UTC

This defect is considered MUST-FIX for Florence Gold release

Comment 2 Michael K. Johnson 2001-02-20 16:53:06 UTC

Matt, is this the same thing you are seeing?

Comment 3 Ingo Molnar 2001-02-20 17:05:01 UTC

If there is any NMI oops then please include the (symbolic) oops as well.

Comment 4 Brian Brock 2001-02-21 16:06:01 UTC

There's no Oops generated.

When the problem occurs, the machine is frozen and will not accept any input. 
Looking through the logs has not yet revealed any hints to activity immediately
before the crash.

Comment 5 Ingo Molnar 2001-02-21 16:57:08 UTC

Please check whether the same happens if you add "nmi_watchdog=0" to the kernel boot options. I had one similar report (not in bugzilla) which was solved by this.
Yet another question: whenever the lockup happens, was X running by any chance? In that case, if an NMI watchdog oops is generated it won't be displayed in X and it might not make it to disk.
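
[Editorial sketch] One way to confirm which nmi_watchdog setting a given boot actually used is to inspect the kernel command line. The Python below is a minimal sketch, not something from this report: the helper names are hypothetical, it assumes the command line is readable at /proc/cmdline, and it assumes that when the option appears more than once the last occurrence is the one that counts.

```python
def nmi_watchdog_setting(cmdline):
    """Return the value of the last nmi_watchdog=N token, or None if absent.

    Assumption: if the option is repeated, the last occurrence wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "nmi_watchdog" and sep:
            value = val
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's boot command line (path is an assumption)."""
    with open(path) as f:
        return f.read().strip()


# Example use on the machine under test:
#   print(nmi_watchdog_setting(read_cmdline()))
```

A result of None means the option was not passed at all, so the kernel's built-in default applies.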

Comment 6 Brian Brock 2001-02-21 17:26:08 UTC

X has been running, the machine has been in runlevel 5 each time.

I'll set the default runlevel to 3 and not start X, as well as add the flag
specified.  If that's resolving the problem, I'll try to narrow down which
permutations promote the failure.

Comment 7 Ingo Molnar 2001-02-21 17:32:43 UTC

Sorry, I was not specific enough because I assumed that the silent lockup happened in textmode as well. Please first go to runlevel 3 (or console textmode) but do not disable the NMI watchdog. If the 'silent lockup' still occurs (i.e. absolutely no oops on the text console) then please try the nmi_watchdog=0 thing.
By doing both things at once we risk masking a lockup that the NMI watchdog would be capable of detecting.

Comment 8 Ingo Molnar 2001-02-21 17:35:10 UTC

One more thing to watch: once the lockup happens in textmode, please make sure it's not a 'soft lockup'. A soft lockup is one where keyboard still works partially (eg. numlock works and you can probably switch between text consoles), but no keyboard input is accepted and the system is locked up otherwise. The NMI oopser only detects 'hard lockups', ie. situations where a CPU in the system does not respond to interrupts for more than 5 seconds.
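
[Editorial sketch] The hard-lockup rule described in this comment can be modelled in a few lines. The Python below is a toy illustration only, not the kernel's code: the class name, the one-tick-per-second rate, and the idea of feeding it per-CPU interrupt counts are assumptions made for the example. A CPU is flagged once its interrupt count has stayed unchanged for about 5 seconds of watchdog ticks.

```python
class HardLockupDetector:
    """Toy model: flag a CPU whose interrupt count stops advancing."""

    def __init__(self, n_cpus, ticks_per_second=1, timeout_seconds=5):
        # Number of consecutive unchanged ticks that counts as a lockup.
        self.threshold = ticks_per_second * timeout_seconds
        self.last_counts = [None] * n_cpus
        self.stalled_ticks = [0] * n_cpus

    def tick(self, cpu, irq_count):
        """Record one watchdog tick; return True if the CPU looks hard-locked."""
        if irq_count == self.last_counts[cpu]:
            # No interrupts were serviced since the previous tick.
            self.stalled_ticks[cpu] += 1
        else:
            # Progress was made, so restart the stall counter.
            self.last_counts[cpu] = irq_count
            self.stalled_ticks[cpu] = 0
        return self.stalled_ticks[cpu] >= self.threshold
```

A soft lockup would not trip this model, because interrupts (keyboard, timer) are still being serviced and the count keeps advancing.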

Comment 9 Brian Brock 2001-02-21 18:49:14 UTC

The lockup is still occurring in runlevel 3, without the nmi_watchdog flag set.

The lockup is hard, with no keyboard input accepted.  Keyboard is unresponsive
to caps-lock, num-lock, switching virtual consoles, or unblanking the screen.  I
had two ssh logins to the machine at the time. Last uptime reported by `watch
uptime` was 27 min.  Syslog.conf was configured to log *.debug to
/var/log/debug, and `tail -f /var/log/debug` reported no log messages within 2
minutes of the lockup (i.e. there was nothing to log).  Each ssh login is frozen
(not accepting any input and the connection was not detected as dropped by ssh).
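
[Editorial sketch] The `watch uptime` approach above depends on an ssh session being open when the hang occurs. An alternative is a small on-disk heartbeat whose last entry can be read after the reboot. The Python below is a hedged sketch under assumptions: the function names and the file format are hypothetical, and it presumes the filesystem survives the hang well enough for the last synced line to be readable.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp line and force it to disk before returning."""
    stamp = time.time() if now is None else now
    with open(path, "a") as f:
        f.write(f"{stamp:.0f}\n")
        f.flush()
        os.fsync(f.fileno())


def last_heartbeat(path):
    """Return the last recorded timestamp as a float, or None if the file is empty."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return float(lines[-1]) if lines else None


# Example use (hypothetical path): call write_heartbeat("/var/log/heartbeat")
# once a minute from a loop or cron job, then read last_heartbeat(...) after
# the next boot to estimate when the machine froze.
```

The fsync call matters here: without it the final entries could still be sitting in the page cache when the machine locks up.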

Is there a suggested workaround to disable screen blanking, or should the screen
automatically unblank when an oops is generated?

Rebooting with nmi_watchdog=0

Comment 10 Michael K. Johnson 2001-03-01 04:15:57 UTC

Ingo, is it possible that the ioapic changes in 2.4.2-0.1.16 will fix
this or change it in any way?

Comment 11 Ingo Molnar 2001-03-01 09:53:30 UTC

There is a small chance that the latest kernel will fix this system too. To unblank the screen permanently, do something like this: "setterm -blank 0 -powersave off -powerdown 0". But I think we unblank the screen on an oops ...

Comment 12 Michael K. Johnson 2001-03-02 02:11:07 UTC

IBM said to report that this is a problem on machines with a
"service processor"

Comment 13 Ingo Molnar 2001-03-02 08:38:46 UTC

Linus is investigating this problem too. Right now our best guesses are: it's either an NMI <-> SMM-handler interaction, or it's caused by the 0x61 port access in do_nmi(). There are ways to write an SMM handler which accidentally enables NMIs while the SMM handler is still executing. Linus has seen such cases; it happens if the SMM handler, e.g., calls a BIOS routine that executes an 'IRET' instruction, which enables NMIs as a side-effect. Could we get IBM folks to comment on this theory? The sporadic timing of the lockup definitely points in this direction. Unless we find a solution, the probable 'official' fix for 2.4.3 is going to be that the NMI watchdog is disabled by default and needs to be enabled via nmi_watchdog=1.

Comment 14 Bob Matthews 2001-03-04 18:21:24 UTC

I was finally able to reproduce this.  It is a quick, painless death.  Trying
0.1.19 with Ingo's nmi-lockup-workaround patch.

Comment 15 Bob Matthews 2001-03-04 18:22:09 UTC

Adding Rob Landry to cc: list as support contact.

Comment 16 Bob Matthews 2001-03-05 15:27:08 UTC

This is with kernel 2.4.2-0.1.19enterprise.  The machine has been up for 24 hours
with no problems noted.  It passed a minimal cerberus run (i.e. as much of
cerberus as will run in 128 MB.)

Comment 17 Bob Matthews 2001-03-05 15:28:06 UTC

This is on a Netfinity 5100.

Comment 18 Brian Brock 2001-03-06 19:32:40 UTC

Verifying now on an IBM Netfinity 5100 (possibly the same machine, and it's zero
work for me).

How was the bug reproduced?  I've not been able to lock the box up at-will yet.

Comment 19 Brian Brock 2001-03-08 22:39:06 UTC

Same machine is eliciting bug #31092 (SCSI problems, severe FS corruption in
under 10 minutes of a Cerberus run).

Comment 20 Wendy Hung 2001-03-19 21:59:08 UTC

nmi_watchdog timer has been disabled by default in RH 7.1 RC2 (2.4.2-0.1.19smp kernel)
User must recompile the kernel in order to enable the timer.

Comment 21 Arjan van de Ven 2001-03-19 22:16:51 UTC

So basically this issue is fixed then by disabling the timer?

Comment 22 Brian Brock 2001-03-19 22:25:47 UTC

Please let me know if/when rapid testing is required for this again; the IBM
Netfinity 5500 that I originally produced this bug on now has a dead drive,
and won't attempt to boot until I've replaced the drive.  If changes come
through rapidly, then I'll need to make a minor adjustment to my working
priorities.

Comment 23 Arjan van de Ven 2001-03-21 18:33:01 UTC

"The machine has been up for 24 hours with no problems noted" -> fixed

Comment 24 Brian Brock 2001-05-01 15:49:15 UTC

closing, bug no longer applicable.