An IBM Netfinity 5500 hangs with 2.4.1-0.1.9smp after an arbitrary interval
with little or no apparent ongoing activity (in one example, within 14
minutes). 2.4.1-0.1.9 (non-smp) does not appear to cause the same hang.
The following message is noted in dmesg output:
WARNING: MP Table in the EBDA can be UNSAFE, contact
firstname.lastname@example.org if you experience SMP problems.
IBM Netfinity 5500
2x PII-450 processors
ServeRAID BIOS 3.00.12
IBM PCI ServeRAID 4.50.05 on mb
pcnet32 / Fast 79C791 ethernet controller
on-board IDE used only for CD-ROM
Please let me know what other details are required; I'll provide them as
soon as possible.
This defect is considered MUST-FIX for Florence Gold release
Matt, is this the same thing you are seeing?
If there is any NMI oops then please include the (symbolic) oops as well.
There's no Oops generated.
When the problem occurs, the machine is frozen and will not accept any input.
Looking through the logs has not yet revealed any hints of activity immediately
before the crash.
Please check whether the same happens if you add "nmi_watchdog=0" to the kernel boot options. I had one similar report (not in bugzilla) which was solved by this.
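For reference, a sketch of how that boot option might be passed, assuming LILO is the boot loader on this box (the image path, label, and root device below are placeholders, not taken from this report):

```
# /etc/lilo.conf fragment -- image path, label and root device are hypothetical
image=/boot/vmlinuz-2.4.1-0.1.9smp
    label=smp-nonmi
    read-only
    root=/dev/sda1                  # placeholder root device
    append="nmi_watchdog=0"         # disable the NMI watchdog at boot
```

Remember to re-run /sbin/lilo after editing so the new stanza takes effect.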
Yet another question: when the lockup happens, was X running by any chance? In that case, if an NMI watchdog oops is generated it won't be displayed under X and it might not make it to disk.
X has been running, the machine has been in runlevel 5 each time.
I'll set the default runlevel to 3 and not start X, as well as add the flag
specified. If that's resolving the problem, I'll try to narrow down which
permutations promote the failure.
Sorry, I was not specific enough because I assumed that the silent lockup happened in text mode as well. Please first go to runlevel 3 (or console text mode) but do not disable the NMI watchdog. If the 'silent lockup' still occurs (i.e. absolutely no oops on the text console) then please try the nmi_watchdog=0 thing.
By doing both things at once we risk masking a lockup that the NMI watchdog would be capable of detecting.
One more thing to watch: once the lockup happens in text mode, please make sure it's not a 'soft lockup'. A soft lockup is one where the keyboard still partially works (e.g. NumLock toggles and you can probably switch between text consoles), but no other keyboard input is accepted and the system is otherwise locked up. The NMI oopser only detects 'hard lockups', i.e. situations where a CPU in the system does not respond to interrupts for more than 5 seconds.
The lockup is still occurring in runlevel 3, without the nmi_watchdog flag set.
The lockup is hard, with no keyboard input accepted. Keyboard is unresponsive
to caps-lock, num-lock, switching virtual consoles, or unblanking the screen. I
had two ssh logins to the machine at the time. The last uptime reported by
`watch uptime` was 27 min. syslog.conf was configured to log *.debug to
/var/log/debug, and `tail -f /var/log/debug` reported no log messages within 2
minutes of the lockup (i.e. there was nothing to log). Each ssh login was frozen
(not accepting any input, and the connection was not detected as dropped by ssh).
Is there a suggested workaround to disable screen blanking, or should the screen
automatically unblank when an oops is generated?
Rebooting with nmi_watchdog=0
Ingo, is it possible that the ioapic changes in 2.4.2-0.1.16 will fix
this or change it in any way?
There is a small chance that the latest kernel will fix this system too. To unblank the screen permanently, do something like this: "setterm -blank 0 -powersave off -powerdown 0". But I think on oops we unblank the screen ...
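To keep that setting across reboots while testing, the command can go in a boot script; a sketch, assuming an rc.local-style init setup (the exact path varies per installation):

```
# Appended to /etc/rc.d/rc.local (path is an assumption) so the console
# never blanks and any oops stays visible on the text console:
setterm -blank 0 -powersave off -powerdown 0 > /dev/console
```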
IBM said to report that this is a problem on machines with a
Linus is investigating this problem too. Right now our best guesses are: it's either an NMI <-> SMM-handler interaction, or it's caused by the 0x61 port access in do_nmi(). There are ways to write an SMM handler which accidentally enables NMIs while the SMM handler is still executing. Linus has seen such cases, it happens if the SMM handler eg. calls a BIOS routine that does an 'IRET' instruction - which enables NMIs as a side-effect. Could we get IBM folks to comment on this theory? The sporadic timing of the lockup definitely points in this direction. Unless we find a solution the probable 'official' fix for 2.4.3 is going to be that the NMI watchdog is disabled by default, and needs to be enabled via nmi_watchdog=1.
I was finally able to reproduce this. It is a quick, painless death. Trying
0.1.19 with Ingo's nmi-lockup-workaround patch.
Adding Rob Landry to cc: list as support contact.
This is with kernel 2.4.2-0.1.19enterprise. The machine has been up for 24 hours
with no problems noted. It passed a minimal cerberus run (i.e. as much of
cerberus as will run in 128 MB.)
This is on a Netfinity 5100.
Verifying now on an IBM Netfinity 5100 (possibly the same machine, and it's zero
work for me).
How was the bug reproduced? I've not been able to lock the box up at-will yet.
Same machine is eliciting bug #31092 (SCSI problems, severe FS corruption in <
10 minutes of a Cerberus run).
nmi_watchdog timer has been disabled by default in RH 7.1 RC2 (2.4.2-0.1.19smp kernel)
User must recompile the kernel in order to enable the timer.
So basically this issue is fixed then by disabling the timer?
Please let me know if/when rapid testing is required for this again; the IBM
Netfinity 5500 that I originally produced this bug(s) on now has a dead drive,
and won't attempt to boot until I've replaced the drive. If changes come
through rapidly, then I'll need to make a minor adjustment to my working
"The machine has been up for 24 hours with no problems noted" -> fixed
closing, bug no longer applicable.