Bug 166722
Summary: | Kernel panic during system shutdown | |
---|---|---|---
Product: | Red Hat Enterprise Linux 3 | Reporter: | Rigoberto Corujo <rigoberto.corujo>
Component: | kernel | Assignee: | David Miller <davem>
Status: | CLOSED NOTABUG | QA Contact: | Brian Brock <bbrock>
Severity: | high | Priority: | medium
Version: | 3.0 | CC: | petrides
Hardware: | x86_64 | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2005-11-18 20:18:47 UTC
Description
Rigoberto Corujo
2005-08-24 21:53:54 UTC
Created attachment 118095
Photograph of kernel panic
Could you please try to reproduce this on the latest released kernel (version 2.4.21-32.0.1.EL), which was released 3 months ago? There have been many important (and potentially relevant) fixes since U4. Also, please try to capture the full console oops output (with serial console if necessary). We at least need to see if the kernel is tainted and what the module list looks like. Thanks in advance.

Hello Ernie, I'm sorry for taking so long to respond to this case as I've been completely swamped. We'll follow your suggestion of moving to the 2.4.21-32 kernel and see if that fixes the problem for this particular customer. As far as capturing console logs via the serial port, that might not be easy to do. The customer has a 288 node cluster and it is difficult to know which nodes are going to crash. These nodes do have a management port that we use to power them on/off, which should also have console redirection capability, but we need to figure out what the BIOS recipe is to enable console redirection to the management port. Anyway, I think you can close this case as we need to first try the 2.4.21-32 kernel before attempting to further troubleshoot this problem. Should the problem persist even after upgrading, I will file a new Bugzilla. Thank you very much for your support. BTW, your name sounds very familiar. Were you a former kernel developer with DEC? Rigoberto

Rigoberto, yes, I used to be a contractor for many years there, and was involved heavily with OSF/1 -> Digital UNIX -> Compaq Tru64 UNIX kernel development.

Reverting state to NEEDINFO.

After receiving a second report by a different customer, we were able to obtain enough information to determine that the kernel panic was being caused by the Infiniband drivers. The scenario that leads up to the panic is as follows:

1) In a cluster, node "A" is exporting a filesystem, say "/scratch".
2) Node "B" is NFS mounting "/scratch" from node "A" with the "tcp" mount option over the Infiniband interconnect. It is important to note that the problem doesn't occur with "udp".
3) Node "B" runs an application that writes to files in "/scratch".
4) A cluster-wide shutdown command is issued and all the nodes begin to stop their services.
5) Node "A's" nfs service is stopped during the shutdown and, therefore, is no longer exporting "/scratch".
6) Node "B" unloads its Infiniband driver that was being used to mount "/scratch" from node "A".
7) Shortly afterwards, node "B" panics.

It should be noted that if node "A" doesn't shut down its nfs service before node "B" shuts down, then the panic does not occur.

We reported the incident to Voltaire, who provides the Infiniband drivers, and they provided the following explanation:

----
Linux holds a reference counter on network devices; the counter is increased / decreased during traffic. There is a kernel implementation related problem that causes the counter to stay non-zero for a very long time (possibly forever). In this case the device un-registration will cause the machine to wait forever. This usually happens during shutdown / reboot during heavy traffic.

During server shutdown / reboot all services are being stopped and all processes are being killed. Voltaire IBHOST is a registered service and is therefore stopped during the shutdown / reboot event. This causes the removal of the IPoIB interface and also the removal of the IPoIB kernel module, which calls the unregister device command (from the kernel).

This issue can also happen in Ethernet drivers; the main difference is that Ethernet drivers are not removed during shutdown / reboot (only the interface is brought down) and therefore don't call the unregister_device.
----

Voltaire has provided a patch for this problem. This case can be considered closed. Thank you for your assistance. Rigoberto
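
For readers who want a concrete picture of the behaviour Voltaire describes, here is a minimal toy model in Python. It is not kernel code and not Voltaire's patch. The class `NetDevice`, its methods, and the device name `ib0` are hypothetical names chosen for illustration, and the timeout exists only so the demonstration returns (per the explanation above, the real un-registration would wait indefinitely). It only shows the general pattern: un-registration blocks while something still holds a reference on the device.

```python
import threading


class NetDevice:
    """Toy stand-in for a network device with a reference counter (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0
        self._cond = threading.Condition()

    def hold(self):
        # Taken while traffic is using the device.
        with self._cond:
            self.refcount += 1

    def put(self):
        # Dropped when that traffic completes.
        with self._cond:
            self.refcount -= 1
            self._cond.notify_all()

    def unregister(self, timeout):
        # Wait for the counter to reach zero. Returns True if it did,
        # False if the wait timed out (the model's version of "stuck").
        with self._cond:
            return self._cond.wait_for(lambda: self.refcount == 0, timeout=timeout)


def demo():
    dev = NetDevice("ib0")  # hypothetical interface name
    dev.hold()              # a reference that is never released in time
    stuck = not dev.unregister(timeout=0.1)
    dev.put()               # once the reference is dropped...
    clean = dev.unregister(timeout=0.1)  # ...un-registration completes
    return stuck, clean


if __name__ == "__main__":
    print(demo())
```

In this model an interface that is only brought down never calls `unregister`, which mirrors the point in the explanation that Ethernet drivers normally avoid the problem because their module is not removed at shutdown.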