Bug 131672
Summary: | CAN-2004-0814 potential race condition in RHEL 2.1/3 tty layer | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Tom "spot" Callaway <tcallawa> | ||||
Component: | kernel | Assignee: | Jason Baron <jbaron> | ||||
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 2.1 | CC: | alan, bressers, knoel, martin.wilck, mjc, riel, tao | ||||
Target Milestone: | --- | Keywords: | Security | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | impact=important,public=20040907 | ||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2005-06-14 17:26:32 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 116738, 132992 | ||||||
Attachments: |
|
Description
Tom "spot" Callaway
2004-09-03 01:03:04 UTC
Tom, please open a separate bug for RHEL 3. This one will apply only to RHEL 2.1. Thanks. -ernie *** Bug 122729 has been marked as a duplicate of this bug. *** Created attachment 103431 [details]
clean up of digi's patch to fix tty race condition
revision on digi's patch to clean up missing race points outside of added locks
This patch still has nothing to do with the real bug. You have to fix this at the ldisc level not within n_tty. (joining in from Bug 122729) So you _do_ think it is a bug in the tty/line discipline layer? This is an area of the kernel where I severely lack experience. Can you point to any resources where this is discussed e.g. in LKML? Is it fixed in EL3.0, or kernel 2.6.x? There were multiple bugs in the driver code I looked at which would cause crashes and were driver side bugs. There is also a slightly odd corner case in the case of a line discipline change occurring at the exact momemt data is delivered by interrupt to the ldisc handler. In that situation the following can occur ldisc->close is called While running or just after ldisc->receive.. is called The bug appears to be in both 2.4 and 2.6 The concept for a fix I'd propose is that we create a "null" ldisc that way we can fix the race by setting the tty ldisc to the null_ldisc before we called ->close on the previous ldisc. [We can't open the new one first without fairly major changes] I am reposting my analyis from Bug 122729 here again because it seems to fit very well what you said. I am still wondering why there seems to be no locking code whatsoever in n_tty_close(); but that may be due to it assuming that locking happens in the ldisc layer. Btw our customer also uses heavy serial multiplexing. Here is my old Oops analysis: The crash occurs at line 730 of drivers/char/n_tty.c in the function n_tty_receive_buf(), because the "read_buf" element of the controlling tty structure is NULL (see detailed analysis below). The read_buf element is checked in the same function in line 723, 7 lines earlier. Thus, it must have been NULLed in the meantime. This is possible because there is no locking at least until the spin_lock_irqsave(&tty->read_lock) in l. 727. However, the only function (AFAICS) that sets read_buf to NULL, n_tty_close(), does not seem to do any locking, so even if the spin_lock_irqsave() in n_tty_receive_buf() happened earlier, it would apparently not help. There are two cases you get this crash. One of them is a driver level bug when a driver calls ->receive in parallel with or after having notified the serial layer of a hangup. That seems to be the one people see in the real world with buggy drivers. The second is the ldisc switch case which will probably only bite people doing a lot of PPP sessions at a time on SMP machines. n_tty can't do any locking because the race is bigger than n_tty itself and requires the locking is higher up. Which would be all well and good if ldisc did that locking. Similar race effects exist in the switch to n_tty from ppp and so on also. At the point you make "->receive is never called after ->close" an invariant for ldisc then the n_tty code and the other ldiscs I've looked at work. This is already meant to be an invariant but the one case got missed. Arjan probably cares about this for 2.6... No ppp is involved in our case. It is some specific proprietary protocol talked through the serial line with some food balances. But the protocol is impemented in user space and uses the normal ttySX interface. Are you setting line discplines - if not then I don't see how this can be the problem you are seeing I'll inquire, thank you. Alan, I don't quite understand why you think we wouldn't hit this bug without switching the line discipline. The "driver level" problem you describe could just a swell happen with the default line discipline, couldn't it? Regards, Martin If drivers make random wrong calls to the tty layer they will crash. This has always been the case with operating systems. The ldisc switch race is a very very specific situation and a very narrow window when changing between tty and ppp for example. If you don't change ldisc you can't hit the race as the racy code isnt ever called. > If drivers make random wrong calls to the tty layer they will crash.
Does that mean, then, that the standard serial driver is buggy?
AFAICS, in our case, n_tty_receive_buf() was called by
the serial interupt handler.
In your situation is that occurring after a hangup triggered by the serial driver itself. It's possible I guess. Instrumenting for ldisc close before ldisc receive (ie n_tty_receive_buf) would be interesting in that situation. Ok seem to be at least three bugs kernel side 1. Some drivers call ldisc->receive_buf() directly. That's a nono but is done in the kernel with offenders including pty and rocketport. Could be out of tree offenders 2. Ldisc switching locks. I have a test fix I'm playing with in 2.6 for cases not in #1 3. Possibly raw/nonraw switching locks although I think that may purely be a symptom of #1 but this needs verification. The n_tty layer believes that if it sets TTY_NO_FLIP then it won't get input. Drivers bypassing the flip queues break this assumption. 4. TIOCSTI ioctls break the assumption too The bypassing seems to be appropriate for some drivers so we probably need a function for "try_to_queue_to_ldisc()" or some similar helper for these. Pty is trickier as it wants to queue and sleep if need be. Workarounds: 1. Don't call ldisc.receive_buf directly, if you must call it then only call it if TTY_DONT_FLIP is clear. Preferably switch to using the tty flip buffer logic as that will avoid this one 2. No workaround (RHEL fix may eventually require workaround 1 anyway) 3. Maybe workaround 1. Alan, this seems to conflict with what Ted Ts'o said back in 2000? http://marc.theaimsgroup.com/?l=linux-kernel&m=94346476622424&w=2 Also, I checked with Digi, and their driver can be operated in two modes: If the DGRP driver global variable called "rawreadok" is on, (which is on by default) it will call ldisc->receive_buf() directly. If that variable is off, the driver will fill up a flip buffer, and schedule it to be delivered later with a call to tty_schedule_flip(). *** Bug 133109 has been marked as a duplicate of this bug. *** Any news on this topic? For reference the fix that got committed upstream into 2.4.29: http://linux.bkbits.net:8080/linux-2.4/cset@41c3801dxJnuBSRCpUiMRkwItPHjWA The fix has been deemed too invasive for a RHEL2.1 update at this time. Closing as WONTFIX. |