Red Hat Bugzilla – Bug 122729
Race condition in tty driver
Last modified: 2013-03-06 00:56:43 EST
Description of problem:
We analyzed a kernel panic that indicates a race condition in
the n_tty line discipline driver between the n_tty_receive_buf() and
n_tty_close() functions.
Version-Release number of selected component (if applicable):
2.4.9-e.34 (code remains unchanged in 2.4.9-e.40)
Happened once. We already reported a large number of system crashes on
the same type of production servers (BUG 116738). It is possible, but
not provable, that this is the same problem.
Steps to Reproduce:
System freezes due to an invalid address reference in an IRQ handler.
The crash occurs at line 730 of drivers/char/n_tty.c in the
function n_tty_receive_buf(), because the "read_buf" element
of the controlling tty structure is NULL (see the detailed
analysis in the attachment).
The read_buf element is checked in the same function at line 723,
7 lines earlier. Thus, it must have been NULLed in the meantime.
This is possible because there is no locking at least until
the spin_lock_irqsave(&tty->read_lock) call at line 727.
However, the only function (AFAICS) that sets read_buf to NULL,
n_tty_close(), does not seem to do any locking, so even if
the spin_lock_irqsave() in n_tty_receive_buf() happened earlier,
it would apparently not help.
Both n_tty_receive_buf() and n_tty_close() are called through
function pointers from other parts of the kernel. n_tty_close()
is called e.g. from do_tty_hangup() in drivers/char/tty_io.c.
n_tty_receive_buf() is supposed to be called from the low level
driver when data arrives, and can thus happen asynchronously.
Created attachment 100078 [details]
Analysis of panic in n_tty_receive_buf()
Created attachment 100079 [details]
netdump crash log
We also have a netdump vmcore file, but the core dump is incomplete. We are not
sure about the reason; probably some operator reset the machine during the dump.
n_tty_receive_buf() was called from an interrupt handler.
Unfortunately I have no idea which interrupt it was that triggered the
problem. Perhaps someone at Red Hat has an idea? I have pretty limited
knowledge of the inner workings of the tty layer, and specifically
of where the function pointers of the line discipline may be called from.
Can you give a list of the modules loaded at the time so that we can
narrow down the code involved?
Created attachment 100121 [details]
Sysreport file from server
Hello, I have attached the sysreport file, where you can find all this information.
Well, I'm not sure there is a big involvement here, but the kernel is
tainted. Can you replace bcm5700 with tg3 and remove the ipmi module
(Intel bonding, I think that's what it's for?), and see if the problem recurs?
As I commented in comment #21 in BUG 116738, we cannot easily exchange
drivers and components because these are production systems, and the
problems have not been reproduced in the lab.
I am just asking you to have a look at the Oops I analyzed and tell me
if you think my analysis is correct, or if not, what I got wrong.
There are no signs of the bcm5700 or ipmi drivers being involved in
the Oops. Please take one serious look at what we did before you
reject it as tainted.
Furthermore, I am still waiting for an answer to my question (in issue
tracker #38803) regarding the NMI watchdog: in Red Hat's experience, how high
is the risk of "shooting down" a running production system with the watchdog?
As a temporary workaround we have told the customer to disable
Hyper-Threading in the servers, and thus run as UP. Since then no
further crashes have been reported, and the customer is currently content
because HT/SMP doesn't benefit him very much.
However, this may change in the future, so we'd rather solve the
problem instead of hoping that the customer will always run UP.
He may even buy SMP servers some time in the future.
The fact that the problems are gone since we switched SMP off supports
my suspicion that there are locking problems involved. The tty layer
seems to be a likely candidate because a) there are known locking
problems there, b) the machines do a lot of serial I/O, and c) my above
analysis points in this direction.
PS: the ipmi driver is part of our server management package. It talks
to our BMC.
*** This bug has been marked as a duplicate of 131672 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.