Description of problem: Sometimes during making C++ program which reads(+ sometimes writes) serial line causes my system to hang. Hang is complete, nothing on logs, console does not respond, all tcp/ip traffic is dead. I.e. no access or responce from the system. I am 99,9% sure that serial line causes this. Freeses has happened mostly when connecting serial line when the line is open (... well I know this is bad) or where there has been bug in my program...? This should not hang the server though... Now I got the first freeze during normal operation with no teasing of server. I have six serial lines in use. ttyS0 and 1 are on m/b. Ports 4-7 are on Lava Computer's "LavaPort-Quad" PCI card. -One port is used by APC UPS control program (Powerchute) -One port is basicly idle 57600 baud line to GSM phone. -One port is 19200 baud to one gadget - some traffic every second. Both reads and writes. -One port is 2400 baud to one gadget. Some traffic every 10 seconds. Both reads and writes. -Two ports are 9600 baud with about 100 bytes every minute. Only reads. Version-Release number of selected component (if applicable): kernel-2.4.20-13.9 How reproducible: Couldn't Reproduce Additional info: Hard to reproduce. At most I have been experiencing this twice (three times?) a day. Sometimes many days has occured without freezes. When using only 4 first lines before my uptimes were allways weeks. Only reasons for rebooting were maintenance. I am willing trying to help debugging this down. However this server runs many services which have to run constantly and which I can not moved anywhere else. #lspci -v .... 02:0e.0 Serial controller: Lava Computer mfg Inc: Unknown device 0120 (prog-if 02 [16550]) Flags: slow devsel, IRQ 11 I/O ports at a800 [size=8] I/O ports at a400 [size=8] 02:0e.1 Serial controller: Lava Computer mfg Inc: Unknown device 0121 (prog-if 02 [16550]) Flags: slow devsel, IRQ 11 I/O ports at a000 [size=8] I/O ports at 9800 [size=8] # cat /proc/interrupts CPU0 0: 176367 XT-PIC timer 1: 4 XT-PIC keyboard 2: 0 XT-PIC cascade 3: 342 XT-PIC serial 4: 22929 XT-PIC serial 5: 28449 XT-PIC bttv 8: 1 XT-PIC rtc 9: 18929 XT-PIC 3ware Storage Controller, usb-uhci, usb-uhci, eth0 10: 36 XT-PIC ehci-hcd, HiSax 11: 46552 XT-PIC usb-uhci, serial 12: 42 XT-PIC PS/2 Mouse 15: 0 XT-PIC ide1 NMI: 0 ERR: 0
Now after starting to run a daemon constantly which reads the two last ports the server hanged 4 times in 24 hours. This is very bad for me. Typing: "echo 1 >/proc/sys/kernel/sysrq" ... [hang] ... 'alt', 'sysrq', 't' Did not have any reaction (on unlogged terminal)
After lots of testing to find out what might have caused the kernel hangs by adding a new running daemon I have found following additional information. The program after which I started to notice freezes used to open two serial lines /dev/ttyS4 and /dev/ttys6. The serial lines were opened with NONBLOCK'ing flag and program waited in select call for data to arrive. Then 'read' was called in loop until either '\n' is read or errno == EWOULDBLOCK. I splitted program for two separate processes to make one process read only one serial line. Still freezes happened (mentioned 3-5 times per 24hours). Now I tried to open the descriptors BLOCKING and let the read calls to block untill data arrives. Now I got 36 hours continouos run without a hitch. After stopping the program again and testing again with the version using select and nonblocking line -> 30 minitutes and freeze again! One extra info: Once during the testing I noticed that 'make' stopped to work. I realized that when saving the file from emacs the new contents were saved on disk (buffer?) but the modification time which is shown by 'ls -l' stopped changing any more. Something got corrupted in kernel... (One nigth the system time suddenly jumped 4295 - reported by ntpd. I rebooted rigth away, but it looks like the kernel was hanged for that time but recovered somehow at 6:00 am) I hope this helps you a little to find the problem. Please note that I still have a program running which uses one NONBLOCKING serial line (ttyS0) and blocks in select... But calls 'read' only once after select shows data is available.
With above mentioned blocking reads I have had no problems lately. However, now when I tried to restart the program (2 process shutdowns and 2 process starts) I got an hang again. This time I was able to capture following information from console (Just hand- copied one screenfulls showing: alt-sysrq-t =========== ekowell S D7927000 4 2765 1 2769 19142 (NOTLB) Call Trace [<c0125e1d>] schedule timeout [kernel] 0xad (0xcfa05ee0)) [<c018782d>] read_chan [kernel] 0x28d (0xcfa05f18)) [<c0183885>] tty_read [kernel] 0xd5 (0xcfa05f74)) [<c0146f83>] sys_read [kernel] 0xa3 (0xcfa05f94)) [<c010953f>] system_call [kernel] 0x33 (0xcfa05fc0)) initlog S 00000292 8 2767 2759 2768 (NOTLB) Call Trace: [<c0125dcd>] schedule_timeput [kernel] 0x5d (oxd89ebf64)) [<c0125d60>] process_timeout [kernel] 0x0 (0xd89ebf64)) [<c0125f03>] sys_nanosleep [kernel] 0xd3 (0xd89ebf84)) [<c010953f>] system_call [kernel] 0x33 (0xd89ebfc0)) ekowell 2 00000000 1360 2768 2767 (L-TLB) Call Trace: [<c011fa2e>] do_exit [kernel] 0x26e (oxde943f94)) [<c011fb54>] do_group_exit [kernel] 0x54 (0xde943fb0)) [<c010953f>] system_call [kernel] 0x33 (0xde943fc0)) ekowell R CAE87FC4 0 2769 1 2770 2765 (NOTLB) Call Trace: [<c01095ed>] reschedule [kernel] 0x5 (0xcae87fc0)) ekowell R current 0 2770 2769 (NOTLB) Call Trace: *********************************************************** Alt-SysRq-P =========== Call Trace [<c0197364>] rs_interrupt [kernel] 0xc4 (0xd3c53ce8)) [<c010aac5>] handle_IRQ_event [kernel] 0x45 (0xd3c53d14)) [<c010aa64>] do_IRQ [kernel] 0x84 (0xd3c53d34)) [<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53d54)) [<c010aab0>] handle_IRQ_event [kernel] 0x30 (0xd3c53d80)) [<c010ac64>] do_IRQ [kernel] 0x84 (0xd3c53da8)) [<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53dc8)) [<c01217ee>] do_softirq [kernel] 0x3e (0xd3c53df4)) [<c010ac9e>] do_IRQ [kernel] 0xbe (0xd3c53e14)) [<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53e34)) [<c0190068>] con_clear_unimap [kernel] 0x78 (0xd3c53e58)) [<c0196b2e>] serial_out [kernel] 0x1e (0xd3c53e60)) [<c0198769>] change_speed [kernel] 0x2c9 (0xd3c53e70)) [<c019a12f>] rs_set_termios [kernel] 0x3f (0xd3c53ea4)) [<c0188be9>] change_termios [kernel] 0x1a9 (0xd3c53ec8)) [<c0188d62>] set_termios [kernel] 0x112 (0xd3c53f14)) [<c0189172>] n_tty_ioctl [kernel] 0x1e2 (0xd3c53f48)) [<c0185597>] tty_ioctl [kernel] 0x2e7 (0xd3c53f64)) [<c0156919>] sys_ioctl [kernel] 0xc9 (0xd3c53f94)) [<c010953f>] system_call [kernel] 0x33 (0xd3c53fc0)) After typing the above, the system did not respond to magic keys anymore... Nothing on kernel log.
Just to update that I got today first freeze with the blocking reads. Seems that it only made hangs more infrequent.
Problems exists still with kernel 2.4.20-18.9
Does it occur if you avoiid the USB module loading "nousb". This looks like some kind of shared IRQ problem at first glance
I rebooted the server with "nousb". Let's see what happens. I now have to leave town for two days - I really do hope that the server does not do its hang soon...
Unfortunately "nousb" did not help. Server froze again in half an hour. Serial lines do not share interrupts with other devices any more.
After unsuccessfull nousb trial I leaft server with only one process reading /dev/ttyS4 instead of the two separate processes for ttyS4 & ttyS6. When I came back server was healthy (?) and uptime over 50 hours. I stopped the above mentioned process and started it again to read ttyS6. Instant freeze. Alt-SysRq keys worked *a while* again. I was not able to hand copy the information now, but Alt-SysRq-P togled between trace of check_modem_status, rs_interrupt and transmit_char. The stack was similar to what is quoted above on Comment #3. (But shorter on top). alt-sysrq-t: (Shift-PageUp does not work on this freezed state so I see only one screenfull). There was again 'Ekowell' in R state and trace something like: call_do_IRQ, do_softirq, do_IRQ, call_do_IRQ, con_clear_unimap, serial_out, change_speed, rs_set_termios, change_termios, set_termios, n_tty_ioctl, tty_ioctl, sys_ioctl, system_call This looks to me like "tcsetattr" in my serial line setup during program initialition. Top screen showed following end of some trace: write_chan, do_page_fault, read_chan, tty_read, sys_read, system_call (I am still "nousb": # cat /proc/interrupts CPU0 0: 141000 XT-PIC timer 1: 4 XT-PIC keyboard 2: 0 XT-PIC cascade 3: 267 XT-PIC serial 4: 18037 XT-PIC serial 5: 21683 XT-PIC bttv 8: 1 XT-PIC rtc 9: 16074 XT-PIC 3ware Storage Controller, eth0 10: 34 XT-PIC HiSax 11: 9109 XT-PIC serial 12: 42 XT-PIC PS/2 Mouse 15: 0 XT-PIC ide1 NMI: 0 ERR: 0 # cat /proc/tty/driver/serial serinfo:1.0 driver:5.05c revision:2001-07-08 0: uart:16550A port:3F8 irq:4 baud:19200 tx:8233 rx:111941 fe:10 brk:3 RTS|CTS|DTR|DSR|CD 1: uart:16550A port:2F8 irq:3 baud:2400 tx:591 rx:1274 RTS|DTR 4: uart:16550A port:A800 irq:11 baud:9600 tx:0 rx:0 5: uart:16550A port:A400 irq:11 baud:2400 tx:4796 rx:28659 CTS|DTR|DSR|CD 6: uart:16550A port:A000 irq:11 baud:9600 tx:0 rx:2689 RTS|DTR 7: uart:16550A port:9800 irq:11 baud:57600 tx:108 rx:65 RTS|CTS|DTR|DSR (#4 now unused as mentioned)
Some observations I have found when trying to find the maximum set of processes to keep my server stable. Due to quite low frequency of crashes these obvervations migth not be statistically very significant... -When running two proces 'X's on ttyS4 and ttyS6 I got kernel hang 3-5 times per 24 hours. -When I run one proces 'X' on ttyS4 uptime was over 50 hours before I made changes to try something else. -When I run one process 'X' on ttyS6 I got a hang within 12 hours. -Now I changed serial lines so that I run the two processes 'X' on ttyS1 and ttyS4. I changed previous device/proces from ttyS1 to ttyS6 which communicates a little less that 'X' (Less than the 16 chars -LILO size- at once...) Currently I have no crashes after 40 hours with everything running. Also it should be mentioned that the devices which 'X' communicates with are connected with *about* 15 meter (50 feet. hand-wired...) cables. Therefore there could maybe be some noice or wrong signals on the line. Maybe error- path has some problems... This makes me suspect a) Many serial lines having same interrupt (11 in my case) migth cause some problems if 1) data arrives on different ports at the same time 2) There is line noice at certain time during certain other activity 3) LILO gets full on some line with some other activity. b) Maybe the ioport A000 conflicts with something under higher usage.
Just to inform that after swapping the two serial lines (Comment #10) I have now uptime of over 14 days with no problems observed at all.
Error path might explain its rarity if its noise on the carrier especially. The traces are useful, although they dont immediately tell me what is going on, they do suggest candidates
Uptime now 31 days with no problems. Concerning noise I have not seen any corrupted data coming from serial lines during this time period. However please note that I do not validate the incoming data extensively (I only check that correct amount (26) of comma separated fields have arrived and that the line starts with certain two chars)
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/