From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.0.1) Gecko/20060313 Fedora/1.5.0.1-9 Firefox/1.5.0.1 pango-text

Description of problem:
Writes to /dev/ttyS1 frequently stall. Analysis shows that the 8250 driver incorrectly detects the device as a 16550A with the UART_BUG_TXEN bug. The bug is exposed only on SMP systems; when we boot with maxcpus=1 the problem does not occur.

When the stall occurs, the tty_struct does not have the stopped flags set and the uart_info->xmit buffer is not empty. If interrupts were occurring, the data would be sent to the UART. Because the data is not being sent, it seems a "transmitter holding register empty" interrupt (THRI) is getting lost and therefore outgoing data stops.

This kind of errant behaviour was discussed in June 2006 on LKML under the topic UART_BUG_TXEN. However, I'm not sure the patches proposed then handled all the issues. In particular, there are SMP races between the 8250 interrupt service routine (ISR) and non-ISR code when they access the IIR register of the UART.

This has a severe impact on Stratus, since the serial ports are used for management.

Version-Release number of selected component (if applicable):
2.6.18-86.el5

How reproducible:
On SMP systems, ttyS1 seems to always be detected with the false positive UART_BUG_TXEN. When the false positive occurs, the UART eventually locks up and stops outputting data.

Steps to Reproduce:
1. Boot kernel with 16550A devices present.
2. Use chat to send initialization strings repeatedly to the modem connected to ttyS1 -or- log in interactively on ttyS1.

Actual results:
Data flow out from ttyS1 hangs.

Expected results:
Data flow should not hang.

Additional info:
My analysis of this problem follows.

How UART_BUG_TXEN gets set due to a false positive on SMP systems
---
The UART is initialized by function serial8250_startup() in 8250.c.
At line 1755 the call to serial_link_irq_chain(up) connects the IRQ to the ISR in this driver. It is relevant that the ISR reads the IIR before it tries to acquire the up->port.lock spinlock. Reading the IIR clears THRI if it is the interrupt cause, which breaks the detection logic that comes a few lines later in serial8250_startup(). Line 1776 is the last step necessary for the ISR to be entered.

1776    serial8250_set_mctrl(&up->port, up->port.mctrl);
1777
1778    /*
1779     * Do a quick test to see if we receive an
1780     * interrupt when we enable the TX irq.
1781     */
1782    serial_outp(up, UART_IER, UART_IER_THRI);
1783    lsr = serial_in(up, UART_LSR);
1784    iir = serial_in(up, UART_IIR);
1785    serial_outp(up, UART_IER, 0);
1786
1787    if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
1788        if (!(up->bugs & UART_BUG_TXEN)) {
1789            up->bugs |= UART_BUG_TXEN;
1790            pr_debug("ttyS%d - enabling bad tx status workarounds\n",
1791                     port->line);
1792        }
1793    } else {
1794        up->bugs &= ~UART_BUG_TXEN;
1795    }
1796
1797    spin_unlock_irqrestore(&up->port.lock, flags);

The problem is that line 1782 causes an interrupt, the ISR is entered on another processor, and the ISR reads the IIR before the IIR is read on line 1784. The read on line 1784 then returns UART_IIR_NO_INT, so the test on line 1787 wrongly concludes that the UART failed to raise THRI and sets UART_BUG_TXEN.

How incorrectly detecting UART_BUG_TXEN causes output to stall
---
When usermode has more characters for the UART to transmit, the characters are placed into the uart_info->xmit circular buffer and serial8250_start_tx() gets called. In that function we flow through the following code path:

1148    struct uart_8250_port *up = (struct uart_8250_port *)port;
1149
1150    if (!(up->ier & UART_IER_THRI)) {
1151        up->ier |= UART_IER_THRI;
1152        serial_out(up, UART_IER, up->ier);
1153
1154        if (up->bugs & UART_BUG_TXEN) {
1155            unsigned char lsr, iir;
1156            lsr = serial_in(up, UART_LSR);
1157            iir = serial_in(up, UART_IIR);
1158            if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT)
1159                transmit_chars(up);
1160        }
1161    }

On NON-buggy UARTs line 1152 causes a THRI interrupt request.
If the IIR has not been read by the ISR by the time it is read on line 1157, the value read on line 1157 can indicate that THRI is pending; in this case, reading the IIR clears the THRI status. This IIR value does not contain the UART_IIR_NO_INT bit, so line 1159 is bypassed and no characters are sent to the transmitter in the UART, yet the interrupt cause has been cleared. Subsequent calls to this routine do nothing because (up->ier & UART_IER_THRI) is already true. This causes output to stall.

Proposed fix
---
The attached patch blocks the UART from asserting its IRQ during the quick test in serial8250_startup() discussed above. Our tests show this eliminates the problem on Stratus SMP systems. (We have not yet duplicated the original problem on a non-Stratus system.)

Version-Release number of selected component (if applicable):
5.3

How reproducible:
Sometimes
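As an illustration only, the two races analysed in this report can be reproduced in a small user-space model. This is a sketch and not driver code: struct toy_uart and every toy_* function are hypothetical names invented here, and the model assumes a healthy 16550A whose transmitter is always empty and whose THRI cause is cleared by any read of the IIR. The register bit values are the ones defined in the kernel's serial_reg.h. The "ISR on another CPU" is modelled simply as an extra IIR read placed at the point where the race would let it run.

```c
#include <stdbool.h>

/* Register bits, values as in the kernel's serial_reg.h */
#define UART_IER_THRI   0x02
#define UART_IIR_NO_INT 0x01
#define UART_IIR_THRI   0x02
#define UART_LSR_TEMT   0x40
#define UART_LSR_THRE   0x20

/* Hypothetical toy model of a healthy 16550A with an idle transmitter. */
struct toy_uart {
    unsigned char ier;   /* last value written to IER */
    bool thri_pending;   /* THRI cause latched and not yet read */
};

static void toy_write_ier(struct toy_uart *u, unsigned char val)
{
    /* Enabling THRI while the transmitter is empty raises a THRI request. */
    if ((val & UART_IER_THRI) && !(u->ier & UART_IER_THRI))
        u->thri_pending = true;
    if (!(val & UART_IER_THRI))
        u->thri_pending = false;
    u->ier = val;
}

static unsigned char toy_read_iir(struct toy_uart *u)
{
    if (u->thri_pending) {
        u->thri_pending = false;   /* reading IIR clears the THRI cause */
        return UART_IIR_THRI;
    }
    return UART_IIR_NO_INT;
}

static unsigned char toy_read_lsr(struct toy_uart *u)
{
    (void)u;                       /* transmitter is always empty here */
    return UART_LSR_TEMT | UART_LSR_THRE;
}

/*
 * Mirrors the quick test in serial8250_startup(). isr_reads_first models
 * the ISR on another CPU reading IIR between the IER write and our own
 * IIR read. Returns true if UART_BUG_TXEN would be set.
 */
static bool toy_txen_test(struct toy_uart *u, bool isr_reads_first)
{
    unsigned char lsr, iir;

    toy_write_ier(u, UART_IER_THRI);
    if (isr_reads_first)
        (void)toy_read_iir(u);
    lsr = toy_read_lsr(u);
    iir = toy_read_iir(u);
    toy_write_ier(u, 0);

    return (lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT);
}

/*
 * Mirrors serial8250_start_tx() for the case where the non-ISR code wins
 * the race and the ISR reads IIR only afterwards. Returns true if either
 * path would have called transmit_chars().
 */
static bool toy_start_tx(struct toy_uart *u, bool bug_txen)
{
    bool sent = false;

    if (!(u->ier & UART_IER_THRI)) {
        toy_write_ier(u, u->ier | UART_IER_THRI);
        if (bug_txen) {
            unsigned char lsr = toy_read_lsr(u);
            unsigned char iir = toy_read_iir(u);   /* clears THRI */
            if ((lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT))
                sent = true;
        }
    }
    /* The ISR transmits only if IIR still reports an interrupt cause. */
    if (!(toy_read_iir(u) & UART_IIR_NO_INT))
        sent = true;
    return sent;
}
```

In this model, toy_txen_test() reports the bug only when the modelled ISR read comes first, and toy_start_tx() with bug_txen set never transmits, on the first call or on any later one, because the workaround's own IIR read consumes the THRI cause and the IER bit is already set afterwards. The model says nothing about how the attached patch is implemented.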
Created attachment 309643
patch 2/2 - locking re-ordered continued from bug 440121

Lock reordering patch continued from bug 440121, already committed to the 5.3 tree.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-96.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Testing at Stratus has verified that the problem we reported was fixed in kernel-2.6.18-96.el5.
Upstream activity: There are some further problems showing up with the way the probing is done. The fixes for them will want backporting at some point (certainly for any rt patched system). The patch we have now should still go out as is, however, as it is an improvement and the further fixes are _not_ for any regressions from previous releases.
Attention Partners! RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop. Thanks, more information about Beta testing to come. - Red Hat QE Partner Management
Stratus did a 24hr reboot test to verify this fix in the 5.3 Alpha (kernel 2.6.18-118.el5) using x86_64 architecture. The system was rebooted 316 times and false positive detection of UART_BUG_TXEN never occurred. This confirms the fix is working correctly in the tested kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
*** Bug 282231 has been marked as a duplicate of this bug. ***