Bug 91307
Summary: | (SERIAL)Serial line usage sometimes freezes the system | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | jouni |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 9 | CC: | alan, dwmw2, redhat |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-09-30 15:40:58 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
jouni
2003-05-21 08:20:46 UTC
Now after starting to run a daemon constantly which reads the two last ports the server hanged 4 times in 24 hours. This is very bad for me. Typing: "echo 1 >/proc/sys/kernel/sysrq" ... [hang] ... 'alt', 'sysrq', 't' Did not have any reaction (on unlogged terminal) After lots of testing to find out what might have caused the kernel hangs by adding a new running daemon I have found following additional information. The program after which I started to notice freezes used to open two serial lines /dev/ttyS4 and /dev/ttys6. The serial lines were opened with NONBLOCK'ing flag and program waited in select call for data to arrive. Then 'read' was called in loop until either '\n' is read or errno == EWOULDBLOCK. I splitted program for two separate processes to make one process read only one serial line. Still freezes happened (mentioned 3-5 times per 24hours). Now I tried to open the descriptors BLOCKING and let the read calls to block untill data arrives. Now I got 36 hours continouos run without a hitch. After stopping the program again and testing again with the version using select and nonblocking line -> 30 minitutes and freeze again! One extra info: Once during the testing I noticed that 'make' stopped to work. I realized that when saving the file from emacs the new contents were saved on disk (buffer?) but the modification time which is shown by 'ls -l' stopped changing any more. Something got corrupted in kernel... (One nigth the system time suddenly jumped 4295 - reported by ntpd. I rebooted rigth away, but it looks like the kernel was hanged for that time but recovered somehow at 6:00 am) I hope this helps you a little to find the problem. Please note that I still have a program running which uses one NONBLOCKING serial line (ttyS0) and blocks in select... But calls 'read' only once after select shows data is available. With above mentioned blocking reads I have had no problems lately. However, now when I tried to restart the program (2 process shutdowns and 2 process starts) I got an hang again. This time I was able to capture following information from console (Just hand- copied one screenfulls showing: alt-sysrq-t =========== ekowell S D7927000 4 2765 1 2769 19142 (NOTLB) Call Trace [<c0125e1d>] schedule timeout [kernel] 0xad (0xcfa05ee0)) [<c018782d>] read_chan [kernel] 0x28d (0xcfa05f18)) [<c0183885>] tty_read [kernel] 0xd5 (0xcfa05f74)) [<c0146f83>] sys_read [kernel] 0xa3 (0xcfa05f94)) [<c010953f>] system_call [kernel] 0x33 (0xcfa05fc0)) initlog S 00000292 8 2767 2759 2768 (NOTLB) Call Trace: [<c0125dcd>] schedule_timeput [kernel] 0x5d (oxd89ebf64)) [<c0125d60>] process_timeout [kernel] 0x0 (0xd89ebf64)) [<c0125f03>] sys_nanosleep [kernel] 0xd3 (0xd89ebf84)) [<c010953f>] system_call [kernel] 0x33 (0xd89ebfc0)) ekowell 2 00000000 1360 2768 2767 (L-TLB) Call Trace: [<c011fa2e>] do_exit [kernel] 0x26e (oxde943f94)) [<c011fb54>] do_group_exit [kernel] 0x54 (0xde943fb0)) [<c010953f>] system_call [kernel] 0x33 (0xde943fc0)) ekowell R CAE87FC4 0 2769 1 2770 2765 (NOTLB) Call Trace: [<c01095ed>] reschedule [kernel] 0x5 (0xcae87fc0)) ekowell R current 0 2770 2769 (NOTLB) Call Trace: *********************************************************** Alt-SysRq-P =========== Call Trace [<c0197364>] rs_interrupt [kernel] 0xc4 (0xd3c53ce8)) [<c010aac5>] handle_IRQ_event [kernel] 0x45 (0xd3c53d14)) [<c010aa64>] do_IRQ [kernel] 0x84 (0xd3c53d34)) [<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53d54)) [<c010aab0>] handle_IRQ_event [kernel] 0x30 (0xd3c53d80)) [<c010ac64>] do_IRQ [kernel] 0x84 (0xd3c53da8)) [<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53dc8)) [<c01217ee>] do_softirq [kernel] 0x3e (0xd3c53df4)) [<c010ac9e>] do_IRQ [kernel] 0xbe (0xd3c53e14)) [<c010d778>] call_do_IRQ [kernel] 0x5 (0xd3c53e34)) [<c0190068>] con_clear_unimap [kernel] 0x78 (0xd3c53e58)) [<c0196b2e>] serial_out [kernel] 0x1e (0xd3c53e60)) [<c0198769>] change_speed [kernel] 0x2c9 (0xd3c53e70)) [<c019a12f>] rs_set_termios [kernel] 0x3f (0xd3c53ea4)) [<c0188be9>] change_termios [kernel] 0x1a9 (0xd3c53ec8)) [<c0188d62>] set_termios [kernel] 0x112 (0xd3c53f14)) [<c0189172>] n_tty_ioctl [kernel] 0x1e2 (0xd3c53f48)) [<c0185597>] tty_ioctl [kernel] 0x2e7 (0xd3c53f64)) [<c0156919>] sys_ioctl [kernel] 0xc9 (0xd3c53f94)) [<c010953f>] system_call [kernel] 0x33 (0xd3c53fc0)) After typing the above, the system did not respond to magic keys anymore... Nothing on kernel log. Just to update that I got today first freeze with the blocking reads. Seems that it only made hangs more infrequent. Problems exists still with kernel 2.4.20-18.9 Does it occur if you avoiid the USB module loading "nousb". This looks like some kind of shared IRQ problem at first glance I rebooted the server with "nousb". Let's see what happens. I now have to leave town for two days - I really do hope that the server does not do its hang soon... Unfortunately "nousb" did not help. Server froze again in half an hour. Serial lines do not share interrupts with other devices any more. After unsuccessfull nousb trial I leaft server with only one process reading /dev/ttyS4 instead of the two separate processes for ttyS4 & ttyS6. When I came back server was healthy (?) and uptime over 50 hours. I stopped the above mentioned process and started it again to read ttyS6. Instant freeze. Alt-SysRq keys worked *a while* again. I was not able to hand copy the information now, but Alt-SysRq-P togled between trace of check_modem_status, rs_interrupt and transmit_char. The stack was similar to what is quoted above on Comment #3. (But shorter on top). alt-sysrq-t: (Shift-PageUp does not work on this freezed state so I see only one screenfull). There was again 'Ekowell' in R state and trace something like: call_do_IRQ, do_softirq, do_IRQ, call_do_IRQ, con_clear_unimap, serial_out, change_speed, rs_set_termios, change_termios, set_termios, n_tty_ioctl, tty_ioctl, sys_ioctl, system_call This looks to me like "tcsetattr" in my serial line setup during program initialition. Top screen showed following end of some trace: write_chan, do_page_fault, read_chan, tty_read, sys_read, system_call (I am still "nousb": # cat /proc/interrupts CPU0 0: 141000 XT-PIC timer 1: 4 XT-PIC keyboard 2: 0 XT-PIC cascade 3: 267 XT-PIC serial 4: 18037 XT-PIC serial 5: 21683 XT-PIC bttv 8: 1 XT-PIC rtc 9: 16074 XT-PIC 3ware Storage Controller, eth0 10: 34 XT-PIC HiSax 11: 9109 XT-PIC serial 12: 42 XT-PIC PS/2 Mouse 15: 0 XT-PIC ide1 NMI: 0 ERR: 0 # cat /proc/tty/driver/serial serinfo:1.0 driver:5.05c revision:2001-07-08 0: uart:16550A port:3F8 irq:4 baud:19200 tx:8233 rx:111941 fe:10 brk:3 RTS|CTS|DTR|DSR|CD 1: uart:16550A port:2F8 irq:3 baud:2400 tx:591 rx:1274 RTS|DTR 4: uart:16550A port:A800 irq:11 baud:9600 tx:0 rx:0 5: uart:16550A port:A400 irq:11 baud:2400 tx:4796 rx:28659 CTS|DTR|DSR|CD 6: uart:16550A port:A000 irq:11 baud:9600 tx:0 rx:2689 RTS|DTR 7: uart:16550A port:9800 irq:11 baud:57600 tx:108 rx:65 RTS|CTS|DTR|DSR (#4 now unused as mentioned) Some observations I have found when trying to find the maximum set of processes to keep my server stable. Due to quite low frequency of crashes these obvervations migth not be statistically very significant... -When running two proces 'X's on ttyS4 and ttyS6 I got kernel hang 3-5 times per 24 hours. -When I run one proces 'X' on ttyS4 uptime was over 50 hours before I made changes to try something else. -When I run one process 'X' on ttyS6 I got a hang within 12 hours. -Now I changed serial lines so that I run the two processes 'X' on ttyS1 and ttyS4. I changed previous device/proces from ttyS1 to ttyS6 which communicates a little less that 'X' (Less than the 16 chars -LILO size- at once...) Currently I have no crashes after 40 hours with everything running. Also it should be mentioned that the devices which 'X' communicates with are connected with *about* 15 meter (50 feet. hand-wired...) cables. Therefore there could maybe be some noice or wrong signals on the line. Maybe error- path has some problems... This makes me suspect a) Many serial lines having same interrupt (11 in my case) migth cause some problems if 1) data arrives on different ports at the same time 2) There is line noice at certain time during certain other activity 3) LILO gets full on some line with some other activity. b) Maybe the ioport A000 conflicts with something under higher usage. Just to inform that after swapping the two serial lines (Comment #10) I have now uptime of over 14 days with no problems observed at all. Error path might explain its rarity if its noise on the carrier especially. The traces are useful, although they dont immediately tell me what is going on, they do suggest candidates Uptime now 31 days with no problems. Concerning noise I have not seen any corrupted data coming from serial lines during this time period. However please note that I do not validate the incoming data extensively (I only check that correct amount (26) of comma separated fields have arrived and that the line starts with certain two chars) Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/ |