Bug 28145
Summary: | IBM Netfinity hangs with 2.4.1-0.1.9smp | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Brian Brock <bbrock> |
Component: | kernel | Assignee: | Michael K. Johnson <johnsonm> |
Status: | CLOSED RAWHIDE | QA Contact: | Brock Organ <borgan> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.1 | CC: | mdrew, mingo, rlandry, zaitcev |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2001-03-21 18:33:06 UTC | Type: | --- |
Description
Brian Brock
2001-02-17 15:52:07 UTC
This defect is considered MUST-FIX for Florence Gold release.

Matt, is this the same thing you are seeing?

If there is any NMI oops then please include the (symbolic) oops as well.

There's no Oops generated. When the problem occurs, the machine is frozen and will not accept any input. Looking through the logs has not yet revealed any hints of activity immediately before the crash.

Please check whether the same happens if you add "nmi_watchdog=0" to the kernel boot options. I had one similar report (not in bugzilla) which was solved by this.

Yet another question: whenever the lockup happens, was X running by any chance? In that case, if an NMI watchdog oops is generated it won't be displayed in X and it might not make it to disk.

X has been running; the machine has been in runlevel 5 each time. I'll set the default runlevel to 3 and not start X, as well as add the flag specified. If that resolves the problem, I'll try to narrow down which permutations promote the failure.

Sorry, I was not specific enough because I assumed that the silent lockup happened in textmode as well. Please first go to runlevel 3 (or console textmode) but do not disable the NMI watchdog. If the 'silent lockup' still occurs (i.e. absolutely no oops on the text console) then please try the nmi_watchdog=0 thing. By doing both things at once we risk masking a lockup that the NMI watchdog would be capable of detecting.

One more thing to watch: once the lockup happens in textmode, please make sure it's not a 'soft lockup'. A soft lockup is one where the keyboard still works partially (e.g. numlock works and you can probably switch between text consoles), but no keyboard input is accepted and the system is locked up otherwise. The NMI oopser only detects 'hard lockups', i.e. situations where a CPU in the system does not respond to interrupts for more than 5 seconds.

The lockup is still occurring in runlevel 3, without the nmi_watchdog flag set. The lockup is hard, with no keyboard input accepted.
Keyboard is unresponsive to caps-lock, num-lock, switching virtual consoles, or unblanking the screen. I had two ssh logins to the machine at the time. Last uptime reported by `watch uptime` was 27 min. Syslog.conf was configured to log *.debug to /var/log/debug, and `tail -f /var/log/debug` reported no log messages within 2 minutes of the lockup (i.e. there was nothing to log). Each ssh login is frozen (not accepting any input, and the connection was not detected as dropped by ssh). Is there a suggested workaround to disable screen blanking, or should the screen automatically unblank when an oops is generated?

Rebooting with nmi_watchdog=0.

Ingo, is it possible that the ioapic changes in 2.4.2-0.1.16 will fix this or change it in any way?

There is a small chance that the latest kernel will fix this system too. To unblank the screen permanently, do something like this: "setterm -blank 0 -powersave off -powerdown 0". But I think on oops we unblank the screen ...

IBM said to report that this is a problem on machines with a "service processor".

Linus is investigating this problem too. Right now our best guesses are: it's either an NMI <-> SMM-handler interaction, or it's caused by the 0x61 port access in do_nmi(). There are ways to write an SMM handler which accidentally enables NMIs while the SMM handler is still executing. Linus has seen such cases; it happens if the SMM handler e.g. calls a BIOS routine that does an 'IRET' instruction, which enables NMIs as a side-effect. Could we get IBM folks to comment on this theory? The sporadic timing of the lockup definitely points in this direction. Unless we find a solution, the probable 'official' fix for 2.4.3 is going to be that the NMI watchdog is disabled by default, and needs to be enabled via nmi_watchdog=1.

I was finally able to reproduce this. It is a quick, painless death.

Trying 0.1.19 with Ingo's nmi-lockup-workaround patch.

Adding Rob Landry to cc: list as support contact.

This is with kernel 2.4.2-0.1.19enterprise.
The machine has been up for 24 hours with no problems noted. It passed a minimal cerberus run (i.e. as much of cerberus as will run in 128 MB). This is on a Netfinity 5100.

Verifying now on an IBM Netfinity 5100 (possibly the same machine, and it's zero work for me). How was the bug reproduced? I've not been able to lock the box up at will yet.

Same machine is eliciting bug # 31092 (SCSI problems, severe FS corruption in < 10 minutes of a Cerberus run).

nmi_watchdog timer has been disabled by default in RH 7.1 RC2 (2.4.2-0.1.19smp kernel). User must recompile the kernel in order to enable the timer.

So basically this issue is fixed then by disabling the timer?

Please let me know if/when rapid testing is required for this again; the IBM Netfinity 5500 that I originally produced this bug(s) on now has a dead drive, and won't attempt to boot until I've replaced the drive. If changes come through rapidly, then I'll need to make a minor adjustment to my working priorities.

"The machine has been up for 24 hours with no problems noted" -> fixed

closing, bug no longer applicable.
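
For anyone retesting on similar hardware, it can help to confirm whether the NMI watchdog is actually on or off before waiting for a lockup. The following is only a rough sketch, not something used in this report: it assumes /proc/cmdline is available and that /proc/interrupts has an "NMI:" row with one count per CPU. The function names are made up for illustration. A rising NMI count suggests, but does not prove, that the watchdog is ticking.

```python
import time
from typing import List, Optional


def nmi_watchdog_setting(cmdline: str) -> Optional[str]:
    """Return the value of nmi_watchdog= on a kernel command line, or None if absent."""
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            return token.split("=", 1)[1]
    return None


def parse_nmi_counts(interrupts_text: str) -> List[int]:
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Stop at the first non-numeric field (a trailing description).
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI row found


def watchdog_is_ticking(before: List[int], after: List[int]) -> bool:
    """True if any CPU's NMI count increased between the two samples."""
    return any(b < a for b, a in zip(before, after))


def check_live_system(interval: float = 2.0) -> None:
    """Sample the running machine twice and print what was found (assumed paths)."""
    with open("/proc/cmdline") as f:
        print("nmi_watchdog option:", nmi_watchdog_setting(f.read()))
    with open("/proc/interrupts") as f:
        first = parse_nmi_counts(f.read())
    time.sleep(interval)
    with open("/proc/interrupts") as f:
        second = parse_nmi_counts(f.read())
    print("NMI counts increasing:", watchdog_is_ticking(first, second))
```

Calling check_live_system() on the affected box after booting with or without nmi_watchdog=0 should show whether the option took effect; the parsing helpers can also be fed saved copies of the two files.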