Bug 122729

Summary: Race condition in tty driver
Product: Red Hat Enterprise Linux 2.1
Reporter: Martin Wilck <martin.wilck>
Component: kernel
Assignee: Jason Baron <jbaron>
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Version: 2.1
CC: ernst-heinrich.klaas, knoel, raimondi, riel, tao
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-02-21 19:03:06 UTC
Attachments:
  Analysis of panic in n_tty_receive_buf() (flags: none)
  netdump crash log (flags: none)
  Sysreport file from server (flags: none)

Description Martin Wilck 2004-05-07 15:09:02 UTC
Description of problem:
We analyzed a kernel panic that indicates a race condition in
the n_tty line discipline driver between the n_tty_receive_buf() and
n_tty_close() functions.

Version-Release number of selected component (if applicable):
2.4.9-e.34 (code remains unchanged in 2.4.9-e.40)

How reproducible:
Happened once. We have already reported a large number of system crashes on
the same type of production servers (BUG 116738). It is possible, but
not provable, that this is the same problem.

Steps to Reproduce:
Unknown.
  
Actual results:
System freezes due to invalid address reference in IRQ handler.

Expected results:
No freeze.

Additional info:
The crash occurs at line 730 of drivers/char/n_tty.c in the
function n_tty_receive_buf(), because the "read_buf" element
of the controlling tty structure is NULL (see detailed
analysis below).

The read_buf element is checked in the same function at line 723,
7 lines earlier. Thus, it must have been set to NULL in the meantime.
This is possible because there is no locking, at least until
the spin_lock_irqsave(&tty->read_lock) at line 727.

However, the only function (AFAICS) that sets read_buf to NULL,
n_tty_close(), does not seem to do any locking, so even if
the spin_lock_irqsave() in n_tty_receive_buf() happened earlier,
it would apparently not help.
Both n_tty_receive_buf() and n_tty_close() are called through
function pointers from other parts of the kernel. n_tty_close()
is called e.g. from do_tty_hangup() in drivers/char/tty_io.c.
n_tty_receive_buf() is supposed to be called from the low level
driver when data arrives, and can thus happen asynchronously.
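
To make the suspected pattern concrete, here is a minimal userspace sketch of what a fix might look like in principle: the NULL check and the use of read_buf happen under the same lock, and the close path takes that lock too before clearing the pointer. This is NOT the kernel code. All names (model_tty, model_open, model_receive_buf, model_close) are hypothetical, and a pthread mutex merely stands in for tty->read_lock. Whether the real tty layer can be fixed this way is an assumption on my side.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MODEL_BUF_SIZE 4096  /* arbitrary size chosen for this model */

/* Hypothetical stand-in for the relevant parts of the tty structure. */
struct model_tty {
    pthread_mutex_t read_lock;  /* models tty->read_lock */
    char *read_buf;             /* set to NULL by the close path */
    size_t read_head;
    size_t read_cnt;
};

/* Allocate the buffer and initialise the lock; returns 0 on success. */
int model_open(struct model_tty *tty)
{
    if (pthread_mutex_init(&tty->read_lock, NULL) != 0)
        return -1;
    tty->read_buf = calloc(1, MODEL_BUF_SIZE);
    tty->read_head = 0;
    tty->read_cnt = 0;
    return tty->read_buf ? 0 : -1;
}

/*
 * Receive path: the NULL check and the buffer access are done under
 * the same lock, so the close path cannot clear read_buf in between.
 * Returns the number of bytes stored, or -1 if the buffer is gone.
 */
int model_receive_buf(struct model_tty *tty, const char *cp, size_t count)
{
    int copied = -1;

    pthread_mutex_lock(&tty->read_lock);
    if (tty->read_buf != NULL) {
        size_t space = MODEL_BUF_SIZE - tty->read_cnt;
        size_t n = count < space ? count : space;
        size_t i;

        for (i = 0; i < n; i++) {
            tty->read_buf[tty->read_head] = cp[i];
            tty->read_head = (tty->read_head + 1) % MODEL_BUF_SIZE;
            tty->read_cnt++;
        }
        copied = (int)n;
    }
    pthread_mutex_unlock(&tty->read_lock);
    return copied;
}

/*
 * Close path: clear the pointer under the lock, then free the old
 * buffer outside the critical section.
 */
void model_close(struct model_tty *tty)
{
    char *old;

    pthread_mutex_lock(&tty->read_lock);
    old = tty->read_buf;
    tty->read_buf = NULL;
    pthread_mutex_unlock(&tty->read_lock);
    free(old);
}
```

In this model a receive call that arrives after the close simply returns -1 instead of dereferencing a NULL pointer. In the code as analyzed above, the check, the lock and the access are separate steps, and the close path takes no lock at all.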

Comment 1 Martin Wilck 2004-05-07 15:10:24 UTC
Created attachment 100078 [details]
Analysis of panic in n_tty_receive_buf()

Comment 2 Martin Wilck 2004-05-07 15:12:26 UTC
Created attachment 100079 [details]
netdump crash log

We also have a netdump vmcore file, but the core dump is incomplete. We are not
sure about the reason. Probably some operator reset the machine during the
dump.

Comment 3 Martin Wilck 2004-05-07 15:18:46 UTC
n_tty_receive_buf() was called from an interrupt handler.
Unfortunately I have no idea which interrupt it was that triggered the
problem. Perhaps someone at Red Hat has an idea? I know very little
about the inner workings of the tty layer, and specifically
about where the function pointers of the line discipline may be called from.


Comment 4 Arjan van de Ven 2004-05-07 15:19:38 UTC
Can you give a list of the modules loaded at the time so that we can
narrow down the code involved?

Comment 5 Raul Pingarron 2004-05-10 10:30:18 UTC
Created attachment 100121 [details]
Sysreport file from server

Hello, I am attaching the sysreport file, where you can find all this info.

Comment 6 Neil Horman 2004-08-19 11:09:16 UTC
Well, I'm not sure there is much involvement here, but the kernel is
tainted. Can you replace bcm5700 with tg3, remove the ipmi module
(for Intel bonding, I think?), and see if the problem recurs?

Comment 7 Martin Wilck 2004-08-19 11:42:48 UTC
As I commented in comment #21 in BUG 116738, we cannot easily exchange
drivers and components because these are production systems, and the
problems have not been reproduced in the lab.

I am just asking you to have a look at the Oops I analyzed and tell me
if you think my analysis is correct, or if not, what I got wrong.
There are no signs of the bcm5700 or ipmi drivers being involved in
the Oops. Please have one serious look at what we did before you
reject it as tainted.

Furthermore, I am still waiting for an answer to my question (in issue
tracker #38803) wrt the NMI watchdog: in RH's experience, how high is the
risk of "shooting down" a running production system with the watchdog?

As a temporary workaround we have told the customer to shut off
Hyperthreading in the servers and thus run as UP. Since then no
further crashes have been reported, and the customer is currently content
because HT/SMP doesn't benefit him much.

However, this may change in the future, so we'd rather solve the
problem instead of hoping that the customer will always run UP.
He may even buy SMP servers some time in the future.

The fact that the problems have been gone since we switched SMP off supports
my suspicion that there are locking problems involved. The tty layer
seems to be a likely candidate because a) there are known locking
problems there, b) the machines do a lot of serial IO, and c) my above
analysis points in this direction.

Comment 8 Martin Wilck 2004-08-19 11:45:34 UTC
PS the ipmi driver is part of our Servermanagement package. It talks
to our BMC.



Comment 9 Neil Horman 2004-09-03 11:22:22 UTC

*** This bug has been marked as a duplicate of 131672 ***

Comment 10 Red Hat Bugzilla 2006-02-21 19:03:06 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.