Bug 91307

Summary:	(SERIAL)Serial line usage sometimes freezes the system
Product:	[Retired] Red Hat Linux	Reporter:	jouni
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	9	CC:	alan, dwmw2, redhat
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-30 15:40:58 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description jouni 2003-05-21 08:20:46 UTC

Description of problem:
Sometimes during making C++ program which reads(+ sometimes writes) serial 
line causes my system to hang.  Hang is complete, nothing on logs, console 
does not respond, all tcp/ip traffic is dead.  I.e. no access or responce from 
the system.

I am 99,9% sure that serial line causes this.  Freeses has happened mostly 
when connecting serial line when the line is open (... well I know this is 
bad) or where there has been bug in my program...?  This should not hang the 
server though...

Now I got the first freeze during normal operation with no teasing of server.

I have six serial lines in use.  ttyS0 and 1 are on m/b.
Ports 4-7 are on Lava Computer's &quot;LavaPort-Quad&quot; PCI card.

-One port is used by APC UPS control program (Powerchute) 
-One port is basicly idle 57600 baud line to GSM phone.
-One port is 19200 baud to one gadget - some traffic every second.  Both reads 
and writes.
-One port is 2400 baud to one gadget.  Some traffic every 10 seconds.  Both 
reads and writes.
-Two ports are 9600 baud with about 100 bytes every minute.  Only reads.


Version-Release number of selected component (if applicable):
kernel-2.4.20-13.9

How reproducible:
Couldn't Reproduce


Additional info:

Hard to reproduce.  At most I have been experiencing this twice (three times?) 
a day.  Sometimes many days has occured without freezes.

When using only 4 first lines before my uptimes were allways weeks.  Only 
reasons for rebooting were maintenance.

I am willing trying to help debugging this down.  However this server runs 
many services which have to run constantly and which I can not moved anywhere 
else.

#lspci -v

....

02:0e.0 Serial controller: Lava Computer mfg Inc: Unknown device 0120 (prog-if 
02 [16550])
        Flags: slow devsel, IRQ 11
        I/O ports at a800 [size=8]
        I/O ports at a400 [size=8]

02:0e.1 Serial controller: Lava Computer mfg Inc: Unknown device 0121 (prog-if 
02 [16550])
        Flags: slow devsel, IRQ 11
        I/O ports at a000 [size=8]
        I/O ports at 9800 [size=8]

# cat /proc/interrupts 
           CPU0       
  0:     176367          XT-PIC  timer
  1:          4          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  3:        342          XT-PIC  serial
  4:      22929          XT-PIC  serial
  5:      28449          XT-PIC  bttv
  8:          1          XT-PIC  rtc
  9:      18929          XT-PIC  3ware Storage Controller, usb-uhci, usb-uhci, 
eth0
 10:         36          XT-PIC  ehci-hcd, HiSax
 11:      46552          XT-PIC  usb-uhci, serial
 12:         42          XT-PIC  PS/2 Mouse
 15:          0          XT-PIC  ide1
NMI:          0 
ERR:          0

Comment 1 jouni 2003-05-22 10:01:37 UTC

Now after starting to run a daemon constantly which reads the two last ports 
the server hanged 4 times in 24 hours.  This is very bad for me.

Typing:
  "echo 1 >/proc/sys/kernel/sysrq"

... [hang] ...

'alt', 'sysrq', 't'

Did not have any reaction (on unlogged terminal)

Comment 2 jouni 2003-05-28 10:13:09 UTC

After lots of testing to find out what might have caused the kernel hangs by 
adding a new running daemon I have found following additional information.

The program after which I started to notice freezes used to open two serial 
lines /dev/ttyS4 and /dev/ttys6.  The serial lines were opened with 
NONBLOCK'ing flag and program waited in select call for data to arrive.  
Then 'read' was called in loop until either '\n' is read or errno == 
EWOULDBLOCK.

I splitted program for two separate processes to make one process read only 
one serial line.  Still freezes happened (mentioned 3-5 times per 24hours).

Now I tried to open the descriptors BLOCKING and let the read calls to block  
untill data arrives.  Now I got 36 hours continouos run without a hitch.

After stopping the program again and testing again with the version using 
select and nonblocking line -> 30 minitutes and freeze again!

One extra info:  Once during the testing I noticed that 'make' stopped to 
work.  I realized that when saving the file from emacs the new contents were 
saved on disk (buffer?) but the modification time which is shown by 'ls -l' 
stopped changing any more.  Something got corrupted in kernel...  (One nigth 
the system time suddenly jumped 4295 - reported by ntpd.  I rebooted rigth 
away, but it looks like the kernel was hanged for that time but recovered 
somehow at 6:00 am)

I hope this helps you a little to find the problem.

Please note that I still have a program running which uses one NONBLOCKING 
serial line (ttyS0) and blocks in select...  But calls 'read' only once after 
select shows data is available.

Comment 3 jouni 2003-06-01 16:31:14 UTC

With above mentioned blocking reads I have had no problems lately.

However, now when I tried to restart the program (2 process shutdowns and 2 
process starts) I got an hang again.

This time I was able to capture following information from console (Just hand-
copied one screenfulls showing:

alt-sysrq-t
===========
ekowell	S D7927000 4 2765 1 2769 19142 (NOTLB)
Call Trace [<c0125e1d>] schedule timeout [kernel] 0xad (0xcfa05ee0))
[<c018782d>] read_chan [kernel] 0x28d (0xcfa05f18))
[<c0183885>] tty_read [kernel] 0xd5 (0xcfa05f74))
[<c0146f83>] sys_read [kernel] 0xa3 (0xcfa05f94))
[<c010953f>] system_call [kernel] 0x33 (0xcfa05fc0))

initlog S 00000292 8 2767 2759 2768  (NOTLB)
Call Trace: [<c0125dcd>] schedule_timeput [kernel] 0x5d (oxd89ebf64))
[<c0125d60>] process_timeout [kernel] 0x0 (0xd89ebf64))
[<c0125f03>] sys_nanosleep [kernel] 0xd3 (0xd89ebf84))
[<c010953f>] system_call [kernel] 0x33 (0xd89ebfc0))

ekowell 2 00000000 1360 2768 2767 (L-TLB)
Call Trace: [<c011fa2e>] do_exit [kernel] 0x26e (oxde943f94))
[<c011fb54>] do_group_exit [kernel] 0x54 (0xde943fb0))
[<c010953f>] system_call [kernel] 0x33 (0xde943fc0))

ekowell R CAE87FC4 0 2769 1 2770 2765 (NOTLB)
Call Trace: [<c01095ed>] reschedule [kernel] 0x5 (0xcae87fc0))

ekowell R current 0 2770 2769 (NOTLB)
Call Trace:

***********************************************************


Alt-SysRq-P
===========
Call Trace [<c0197364>] rs_interrupt [kernel] 0xc4 (0xd3c53ce8))
[<c010aac5>] handle_IRQ_event [kernel] 0x45 (0xd3c53d14))
[<c010aa64>] do_IRQ [kernel] 0x84 (0xd3c53d34))
[<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53d54))
[<c010aab0>] handle_IRQ_event [kernel] 0x30 (0xd3c53d80))
[<c010ac64>] do_IRQ [kernel] 0x84 (0xd3c53da8))
[<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53dc8))
[<c01217ee>] do_softirq [kernel] 0x3e (0xd3c53df4))
[<c010ac9e>] do_IRQ [kernel] 0xbe (0xd3c53e14))
[<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53e34))
[<c0190068>] con_clear_unimap [kernel] 0x78 (0xd3c53e58))
[<c0196b2e>] serial_out [kernel] 0x1e (0xd3c53e60))
[<c0198769>] change_speed [kernel] 0x2c9 (0xd3c53e70))
[<c019a12f>] rs_set_termios [kernel] 0x3f (0xd3c53ea4))
[<c0188be9>] change_termios [kernel] 0x1a9 (0xd3c53ec8))
[<c0188d62>] set_termios [kernel] 0x112 (0xd3c53f14))
[<c0189172>] n_tty_ioctl [kernel] 0x1e2 (0xd3c53f48))
[<c0185597>] tty_ioctl [kernel] 0x2e7 (0xd3c53f64))
[<c0156919>] sys_ioctl [kernel] 0xc9 (0xd3c53f94))
[<c010953f>] system_call [kernel] 0x33 (0xd3c53fc0))

After typing the above, the system did not respond to magic keys anymore...
Nothing on kernel log.

Comment 4 jouni 2003-06-02 15:21:24 UTC

Just to update that I got today first freeze with the blocking reads.  Seems 
that it only made hangs more infrequent.

Comment 5 jouni 2003-06-04 13:22:33 UTC

Problems exists still with kernel 2.4.20-18.9

Comment 6 Alan Cox 2003-06-05 15:45:02 UTC

Does it occur if you avoiid the USB module loading "nousb". This looks like some
kind of shared IRQ problem at first glance

Comment 7 jouni 2003-06-05 19:32:13 UTC

I rebooted the server with "nousb".  Let's see what happens.

I now have to leave town for two days - I really do hope that the server does 
not do its hang soon...

Comment 8 jouni 2003-06-06 05:11:30 UTC

Unfortunately "nousb" did not help.  Server froze again in half an hour.
Serial lines do not share interrupts with other devices any more.

Comment 9 jouni 2003-06-08 11:03:35 UTC

After unsuccessfull nousb trial I leaft server with only one process 
reading /dev/ttyS4 instead of the two separate processes for ttyS4 & ttyS6.  
When I came back server was healthy (?) and uptime over 50 hours.

I stopped the above mentioned process and started it again to read ttyS6.  
Instant freeze.  Alt-SysRq keys worked *a while* again.  I was not able to 
hand copy the information now, but Alt-SysRq-P togled between trace of 
check_modem_status, rs_interrupt and transmit_char.  The stack was similar to 
what is quoted above on Comment #3.  (But shorter on top).

alt-sysrq-t:  (Shift-PageUp does not work on this freezed state so I see only 
one screenfull).  There was again 'Ekowell' in R state and trace something 
like:
call_do_IRQ, do_softirq, do_IRQ, call_do_IRQ, con_clear_unimap, serial_out, 
change_speed, rs_set_termios, change_termios, set_termios, n_tty_ioctl,  
tty_ioctl, sys_ioctl, system_call

This looks to me like "tcsetattr" in my serial line setup during program 
initialition.

Top screen showed following end of some trace:  write_chan, do_page_fault, 
read_chan, tty_read, sys_read, system_call

(I am still "nousb":
# cat /proc/interrupts
           CPU0
  0:     141000          XT-PIC  timer
  1:          4          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  3:        267          XT-PIC  serial
  4:      18037          XT-PIC  serial
  5:      21683          XT-PIC  bttv
  8:          1          XT-PIC  rtc
  9:      16074          XT-PIC  3ware Storage Controller, eth0
 10:         34          XT-PIC  HiSax
 11:       9109          XT-PIC  serial
 12:         42          XT-PIC  PS/2 Mouse
 15:          0          XT-PIC  ide1
NMI:          0
ERR:          0

# cat /proc/tty/driver/serial
serinfo:1.0 driver:5.05c revision:2001-07-08
0: uart:16550A port:3F8 irq:4 baud:19200 tx:8233 rx:111941 fe:10 brk:3 
RTS|CTS|DTR|DSR|CD
1: uart:16550A port:2F8 irq:3 baud:2400 tx:591 rx:1274 RTS|DTR
4: uart:16550A port:A800 irq:11 baud:9600 tx:0 rx:0
5: uart:16550A port:A400 irq:11 baud:2400 tx:4796 rx:28659 CTS|DTR|DSR|CD
6: uart:16550A port:A000 irq:11 baud:9600 tx:0 rx:2689 RTS|DTR
7: uart:16550A port:9800 irq:11 baud:57600 tx:108 rx:65 RTS|CTS|DTR|DSR

(#4 now unused as mentioned)

Comment 10 jouni 2003-06-14 10:50:25 UTC

Some observations I have found when trying to find the maximum set of 
processes to keep my server stable.  Due to quite low frequency of crashes 
these obvervations migth not be statistically very significant...

-When running two proces 'X's on ttyS4 and ttyS6 I got kernel hang 3-5 times 
per 24 hours.
-When I run one proces 'X' on ttyS4 uptime was over 50 hours before I made 
changes to try something else.
-When I run one process 'X' on ttyS6 I got a hang within 12 hours.

-Now I changed serial lines so that I run the two processes 'X' on ttyS1 and 
ttyS4.  I changed previous device/proces from ttyS1 to ttyS6 which 
communicates a little less that 'X'  (Less than the 16 chars -LILO size- at 
once...)  Currently I have no crashes after 40 hours with everything running.

Also it should be mentioned that the devices which 'X' communicates with are 
connected with *about* 15 meter (50 feet.  hand-wired...) cables.  Therefore 
there could maybe be some noice or wrong signals on the line.  Maybe error-
path has some problems...

This makes me suspect a) Many serial lines having same interrupt (11 in my 
case) migth cause some problems if 1) data arrives on different ports at the 
same time  2) There is line noice at certain time during certain other 
activity  3) LILO gets full on some line with some other activity.  b)  Maybe 
the ioport A000 conflicts with something under higher usage.

Comment 11 jouni 2003-06-27 13:09:14 UTC

Just to inform that after swapping the two serial lines (Comment #10) I have 
now uptime of over 14 days with no problems observed at all.

Comment 12 Alan Cox 2003-06-27 14:28:35 UTC

Error path might explain its rarity if its noise on the carrier especially. The
traces are useful, although they dont immediately tell me what is going on, they
do suggest candidates

Comment 13 jouni 2003-07-14 12:20:23 UTC

Uptime now 31 days with no problems.  Concerning noise I have not seen any 
corrupted data coming from serial lines during this time period.  However 
please note that I do not validate the incoming data extensively (I only check 
that correct amount (26) of comma separated fields have arrived and that the 
line starts with certain two chars)

Comment 14 Bugzilla owner 2004-09-30 15:40:58 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/