We recently received a report of several (2-CPU) nodes on a cluster panicking when they were powered off. The cluster had previously been running an application called AMBER, which probably stressed the system pretty hard. The customer sent us a photograph of the console showing the panic information, which I will attach shortly.

Basically, the trace shows that the problem occurred at "neigh_destroy+216" (linux-2.4.21/net/core/neighbour.c):

Trace: neigh_destroy+216
       dst_destroy+92
       dst_run_gc+103

I've shown some of the disassembled code below. Since the trace is probably showing what the next instruction would have been, not the instruction currently being executed, I assume that we are really interested in either "neigh_destroy+209" or "neigh_destroy+211", which would be the "kfree" (shown in the code snippet below). I'm guessing that the "kfree" is the culprit only because if it were the line before, which is "atomic_dec_and_test(&hh->hh_refcnt)", then that would imply a bad "hh" and we would have blown up a few lines earlier while referencing "hh".

Relevant code of the disassembled "neigh_destroy":

0xffffffff80255921 <neigh_destroy+209>: je     0xffffffff80255928 <neigh_destroy+216>
0xffffffff80255923 <neigh_destroy+211>: callq  0xffffffff8014eb20 <kfree>
0xffffffff80255928 <neigh_destroy+216>: mov    0x68(%rbp),%rdi

Relevant code of the disassembled "dst_destroy":

0xffffffff80254a34 <dst_destroy+84>: mov    %rbx,%rdi
0xffffffff80254a37 <dst_destroy+87>: callq  0xffffffff80255850 <neigh_destroy>
0xffffffff80254a3c <dst_destroy+92>: mov    0xc0(%rbp),%rax

Relevant code of the disassembled "dst_run_gc":
0xffffffff8025471e <dst_run_gc+94>:  mov    %rax,0x0(%rbp)
0xffffffff80254722 <dst_run_gc+98>:  callq  0xffffffff802549e0 <dst_destroy>
0xffffffff80254727 <dst_run_gc+103>: test   %rax,%rax

FILE: linux-2.4.21/net/core/neighbour.c

void neigh_destroy(struct neighbour *neigh)
{
        ...
        while ((hh = neigh->hh) != NULL) {
                neigh->hh = hh->hh_next;
                hh->hh_next = NULL;
                write_lock_bh(&hh->hh_lock);
                hh->hh_output = neigh_blackhole;
                write_unlock_bh(&hh->hh_lock);
                if (atomic_dec_and_test(&hh->hh_refcnt))
                        kfree(hh);
        }
        ...
}

----------------------------------------------------

FILE: linux-2.4.21/include/net/neighbour.h

static inline void neigh_release(struct neighbour *neigh)
{
        if (atomic_dec_and_test(&neigh->refcnt))
                neigh_destroy(neigh);
}

----------------------------------------------------

FILE: linux-2.4.21/net/core/dst.c

struct dst_entry *dst_destroy(struct dst_entry *dst)
{
        ...
        neigh = dst->neighbour;
        ...
        if (neigh) {
                dst->neighbour = NULL;
                neigh_release(neigh);
        }
        ...
}

----------------------------------------------------

FILE: linux-2.4.21/net/core/dst.c

static void dst_run_gc(unsigned long dummy)
{
        ...
        if (!spin_trylock(&dst_lock)) {
                mod_timer(&dst_gc_timer, jiffies + HZ/10);
                return;
        }
        ...
        while ((dst = *dstp) != NULL) {
                if (atomic_read(&dst->__refcnt)) {
                        dstp = &dst->next;
                        delayed++;
                        continue;
                }
                *dstp = dst->next;
                dst = dst_destroy(dst);
                ...
        }
        ...
        spin_unlock(&dst_lock);
}

I'm not sure if this implies a locking problem. It would appear as if the "write_lock_bh(&hh->hh_lock)", or some other type of locking, should have been taken prior to the following line, that is, prior to actually using "hh":

        while ((hh = neigh->hh) != NULL)

These nodes have 2 CPUs each, so I don't know if this code is well protected against multiple kernel threads. The "dst_run_gc" path, which calls "dst_destroy", which calls "neigh_release", which calls "neigh_destroy", does appear to take a lock, which may be sufficient in this case.

These machines are running Linux 2.4.21-27 and the processor type is EM64T.
I am trying to see if I can get my hands on the AMBER application that the customer ran prior to the kernel panic. As I understand it, the kernel panics during shutdown only started to occur after users began running the AMBER application. So, unfortunately, I have not been able to reproduce it yet and do not have a reproducer to provide. Any help in resolving this problem would be appreciated. Thank you.

Rigoberto
Created attachment 118095 [details] Photograph of kernel panic
Could you please try to reproduce this on the latest released kernel (version 2.4.21-32.0.1.EL), which was released 3 months ago? There have been many important (and potentially relevant) fixes since U4. Also, please try to capture the full console oops output (with serial console if necessary). We at least need to see if the kernel is tainted and what the module list looks like. Thanks in advance.
Hello Ernie, I'm sorry for taking so long to respond to this case as I've been completely swamped. We'll follow your suggestion of moving to the 2.4.21-32 kernel and see if that fixes the problem for this particular customer. As far as capturing console logs via the serial port, that might not be easy to do. The customer has a 288-node cluster and it is difficult to know which nodes are going to crash. These nodes do have a management port that we use to power them on/off, which should also have console redirection capability, but we need to figure out what the BIOS recipe is to enable console redirection to the management port. Anyway, I think you can close this case as we need to first try the 2.4.21-32 kernel before attempting to further troubleshoot this problem. Should the problem persist even after upgrading, I will file a new Bugzilla. Thank you very much for your support. BTW, your name sounds very familiar. Were you a former kernel developer with DEC?

Rigoberto
Rigoberto, yes, I was a contractor there for many years and was involved heavily with OSF/1 -> Digital UNIX -> Compaq Tru64 UNIX kernel development. Reverting state to NEEDINFO.
After receiving a second report from a different customer, we were able to obtain enough information to determine that the kernel panic was being caused by the Infiniband drivers. The scenario that leads up to the panic is as follows:

1) In a cluster, node "A" is exporting a filesystem, say "/scratch".
2) Node "B" NFS-mounts "/scratch" from node "A" with the "tcp" mount option over the Infiniband interconnect. It is important to note that the problem doesn't occur with "udp".
3) Node "B" runs an application that writes to files in "/scratch".
4) A cluster-wide shutdown command is issued and all the nodes begin to stop their services.
5) Node "A's" nfs service is stopped during the shutdown and, therefore, it is no longer exporting "/scratch".
6) Node "B" unloads the Infiniband driver that was being used to mount "/scratch" from node "A".
7) Shortly afterwards, node "B" panics.

It should be noted that if node "A" doesn't shut down its nfs service before node "B" shuts down, then the panic does not occur.

We reported the incident to Voltaire, who provides the Infiniband drivers, and they provided the following explanation:

----
Linux holds a reference counter on network devices; the counter is increased/decreased during traffic. There is a kernel implementation problem that causes the counter to stay non-zero for a very long time (possibly forever). In this case, the device un-registration will cause the machine to wait forever. This usually happens during shutdown/reboot under heavy traffic.

During server shutdown/reboot, all services are stopped and all processes are killed. Voltaire IBHOST is a registered service and is therefore stopped during the shutdown/reboot event. This causes the removal of the IPoIB interface and also the removal of the IPoIB kernel module, which calls the unregister device command (from the kernel).
This issue can also happen with Ethernet drivers; the main difference is that Ethernet drivers are not removed during shutdown/reboot (only the interface is brought down) and therefore don't call unregister_device.
----

Voltaire has provided a patch for this problem. This case can be considered closed. Thank you for your assistance.

Rigoberto