Description of problem:
We analyzed a kernel panic that indicates a race condition in the n_tty line discipline driver between the n_tty_receive_buf() and n_tty_close() functions.

Version-Release number of selected component (if applicable):
2.4.9-e.34 (code remains unchanged in 2.4.9-e.40)

How reproducible:
Happened once. We have already reported a large number of system crashes on the same type of production servers (BUG 116738). It is possible, but not provable, that this is the same problem.

Steps to Reproduce:
Unknown.

Actual results:
System freezes due to an invalid address reference in an IRQ handler.

Expected results:
No freeze.

Additional info:
The crash occurs at line 730 of drivers/char/n_tty.c, in n_tty_receive_buf(), because the "read_buf" element of the controlling tty structure is NULL (see detailed analysis below). read_buf is checked in the same function at line 723, seven lines earlier, so it must have been set to NULL in the meantime. This is possible because there is no locking at least until the spin_lock_irqsave(&tty->read_lock) in l. 727. However, the only function (AFAICS) that sets read_buf to NULL, n_tty_close(), does not seem to do any locking, so even if the spin_lock_irqsave() in n_tty_receive_buf() happened earlier, it would apparently not help. Both n_tty_receive_buf() and n_tty_close() are called through function pointers from other parts of the kernel. n_tty_close() is called e.g. from do_tty_hangup() in drivers/char/tty_io.c. n_tty_receive_buf() is supposed to be called from the low-level driver when data arrives, and can thus happen asynchronously.
Created attachment 100078 [details] Analysis of panic in n_tty_receive_buf()
Created attachment 100079 [details] netdump crash log We also have a netdump vmcore file, but the core dump is incomplete. We are not sure about the reason. Probably some operator reset the machine during the dump.
n_tty_receive_buf() was called from an interrupt handler. Unfortunately I have no idea which interrupt triggered the problem. Perhaps someone at Red Hat has an idea? I know little about the inner workings of the tty layer, and specifically about where the function pointers of the line discipline may be called from.
Can you give a list of the modules loaded at the time, so that we can narrow down the code involved?
Created attachment 100121 [details] Sysreport file from server Hello, I am attaching the sysreport file, where you can find all this information.
Well, I'm not sure there is a big involvement here, but the kernel is tainted. Can you replace bcm5700 with tg3, remove the ipmi module (Intel bonding, I think that's what it's for?), and see if the problem recurs?
As I commented in comment #21 in BUG 116738, we cannot easily exchange drivers and components because these are production systems, and the problems have not been reproduced in the lab. I am just asking you to have a look at the Oops I analyzed and tell me whether you think my analysis is correct, or if not, what I got wrong. There are no signs of the bcm5700 or ipmi drivers being involved in the Oops. Please take one serious look at what we did before you reject it as tainted.

Furthermore, I am still waiting for an answer to my question (in issue tracker #38803) regarding the NMI watchdog: in RH's experience, how high is the risk of "shooting down" a running production system with the watchdog?

As a temporary workaround we have told the customer to shut off Hyper-Threading in the servers, and thus run as UP. Since then no further crashes have been reported, and the customer is currently content because HT/SMP doesn't benefit him much. However, this may change in the future, so we'd rather solve the problem instead of hoping that the customer will always run UP. He may even buy SMP servers some time in the future.

The fact that the problems are gone since we switched SMP off supports my suspicion that locking problems are involved. The tty layer seems to be a likely candidate because a) there are known locking problems there, b) the machines do a lot of serial I/O, and c) my above analysis points in this direction.
PS: the ipmi driver is part of our Servermanagement package. It talks to our BMC.
*** This bug has been marked as a duplicate of 131672 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.