|Summary:||Processes start hanging after a long while in 2300 and 2307, but not 2288|
|Product:||[Fedora] Fedora||Reporter:||Bruno Wolff III <bruno>|
|Component:||kernel||Assignee:||Kernel Maintainer List <kernel-maint>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Brian Brock <bbrock>|
|Fixed In Version:||2.6.24||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2008-02-03 21:29:10 UTC||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Bruno Wolff III 2007-04-03 15:17:04 UTC
Description of problem:
After a significant period of uptime (about one day on one machine and about a week on another), processes start hanging. However, new processes can still be started and the network stack continues to function. Although this effectively hoses the system once services start hanging, some things do continue to work.

Version-Release number of selected component (if applicable):
I have seen this when using kernel-doc-2.6.20-1.2300.fc5 and kernel-doc-2.6.20-1.2307.fc5, but not with earlier kernels. In particular, falling back to kernel-doc-2.6.19-1.2288.2.4.fc5 seems to alleviate the problem.

How reproducible:
I haven't found a way to speed up the process, but it seems that leaving a system running long enough will result in it happening. The machine that started having hangs within a day is typically busier than the one that hangs after a week. I have another machine that isn't very busy and that I reboot for testing fairly regularly, on which I haven't seen the problem occur.

Steps to Reproduce:
1. Leave the system running under one of the affected kernels.

Actual results:
The system runs normally for a long time and then processes start hanging. Once this starts happening, it seems to happen a lot.

Expected results:
The system continues to run normally no matter how long it is up.

Additional info:
Comment 1 Chuck Ebbert 2007-04-03 15:27:08 UTC
Can you do echo "t" >/proc/sysrq-trigger when processes freeze and post the syslog messages generated? Also post output of the "lsmod" command.
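The request above could be scripted roughly as follows. This is only a sketch: the function name collect_hang_info and the TRIGGER and OUTDIR variables are made up for this example, it must run as root, and it assumes dmesg is available and that the kernel ring buffer is large enough to hold the task dump. It reads the kernel log with dmesg rather than from /var/log/messages, so it does not depend on syslogd being able to write.

```shell
#!/bin/sh
# Sketch: trigger a SysRq task dump and save the results for a bug report.
# collect_hang_info, TRIGGER and OUTDIR are hypothetical names for this example.

collect_hang_info() {
    trigger=${TRIGGER:-/proc/sysrq-trigger}
    outdir=${OUTDIR:-/tmp/hang-report}

    mkdir -p "$outdir" || return 1

    # Writing "t" asks the kernel to dump the state of every task to the kernel log.
    echo t > "$trigger" || return 1

    # Save the kernel ring buffer and the loaded module list.
    # Failures here are ignored so that a missing tool does not abort the run.
    dmesg > "$outdir/dmesg.txt" 2>/dev/null || true
    lsmod > "$outdir/lsmod.txt" 2>/dev/null || true
}
```

Calling collect_hang_info as root while processes are frozen should leave dmesg.txt and lsmod.txt in the output directory. If the hang involves disk I/O, pointing OUTDIR at a memory-backed file system may be safer.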
Comment 2 Bruno Wolff III 2007-04-03 15:38:16 UTC
Created attachment 151577 [details]
lsmod output

I will try this, but it will take a while. I don't want the machine that takes about a day to hang to do so, as that causes me a bit too much grief. I'll leave the machine that takes about a week to hang running 2307 and try the above when it happens again. Unfortunately, I just had to reboot it to clear the problem. I'll attach the output from lsmod, as that should be the same now as when the problem occurred, and it might rule out some possible causes.
Comment 3 Bruno Wolff III 2007-04-11 18:09:49 UTC
The lockup happened again while I had access to the console, but writing to sysrq-trigger didn't have any noticeable effect. When I did a tail -f of /var/log/messages, that terminal window hung. After I rebooted, I didn't see any unusual log messages in /var/log/messages or in other log files last updated around the time the machine was having problems. How much output will setting this flag produce? Is it reasonable to run a machine with the flag set for several days with around 5GB of space to devote to the extra logs?
Comment 4 Chuck Ebbert 2007-04-11 18:51:27 UTC
(In reply to comment #3)
> How much output will setting this flag produce? Is it reasonable to run a
> machine with the flag set for several days with around 5GB of space to devote
> to the extra logs?

It triggers output, once, when you write to it. If you have the kernel-doc package installed, the description is in /usr/share/doc/kernel-doc-<version>/Documentation/sysrq.txt
Comment 5 Bruno Wolff III 2007-04-11 19:40:21 UTC
I tried the sysrq command after the machine had been rebooted, and the output ended up in the /var/log/messages file. So I was probably too late in this last case. Next time it happens, I'll try again and maybe I'll get something useful.
Comment 6 Bruno Wolff III 2007-04-28 17:38:11 UTC
It also happens with 2312. I had another hang after about 7 days of uptime. Unfortunately, by the time I realized that the hang was starting, I could no longer log in using ssh or use su to get a root shell, so I didn't get to try the sysrq dump. I am going to update the machine to Fedora 7 now so I can verify that this isn't a problem there. It is therefore unlikely that I will see this bug again in FC5, as I am not doing kernel updates on one machine, and the other doesn't seem to have the problem occur. (Probably a combination of it not being loaded and frequent reboots to test rawhide installs and deal with 3d video lockups.)
Comment 7 Chuck Ebbert 2007-04-30 22:55:06 UTC
With access to the console, alt-sysrq-t (hold down alt and sysrq, then hit T while holding them) will also give trace output. But you need to do "echo 1 >/proc/sys/kernel/sysrq" first, so putting that in /etc/rc.local is necessary.
Comment 8 Bruno Wolff III 2007-05-01 14:43:08 UTC
I might be able to give this a try. My upgrade to f7 crashed before starting the install, so I am probably going to need to do a fresh install. That combined with seeing some kernel bug messages from a test f7 over the weekend makes me nervous about rushing that upgrade. I set /proc/sys/kernel/sysrq and tested that it works. I also put kernel.sysrq = 1 in /etc/sysctl.conf so it will be set after a reboot.
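The two steps described in comments 7 and 8 (enable SysRq now, and keep it enabled after a reboot) might be combined as in the sketch below. The function name enable_sysrq and the SYSRQ_FILE and SYSCTL_CONF variables are invented for this example, and it has to run as root. It only appends the setting when no kernel.sysrq line exists yet; it does not rewrite an existing line that has a different value.

```shell
#!/bin/sh
# Sketch: enable the SysRq key immediately and persist the setting.
# enable_sysrq, SYSRQ_FILE and SYSCTL_CONF are hypothetical names for this example.

enable_sysrq() {
    sysrq_file=${SYSRQ_FILE:-/proc/sys/kernel/sysrq}
    sysctl_conf=${SYSCTL_CONF:-/etc/sysctl.conf}

    # Takes effect right away, but is lost on reboot.
    echo 1 > "$sysrq_file" || return 1

    # Persist across reboots, appending only if no kernel.sysrq line is present.
    if ! grep -q '^kernel\.sysrq[[:space:]]*=' "$sysctl_conf" 2>/dev/null; then
        echo 'kernel.sysrq = 1' >> "$sysctl_conf"
    fi
}
```

Running enable_sysrq more than once leaves a single kernel.sysrq line in the configuration file.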
Comment 9 Bruno Wolff III 2007-05-13 20:31:06 UTC
Created attachment 154616 [details]
logs from sysrq t

I think I was finally able to get a sysrq t dump while my system was in the screwed-up state. I had upgraded to 2316, so that is the kernel version that was running when this happened. One possible clue is that the length of time between a reboot and when the system starts hanging is fairly consistent for a particular system. This one appears to take about 7 or 8 days. The other system, on which I can't let this happen, was taking about 1 1/2 days the couple of times it happened before I went back to 2288. Hopefully this has enough info for you to figure out where the problem is.
Comment 10 Bruno Wolff III 2007-05-21 13:35:45 UTC
This happened again with about the same 7-8 day gap. It looks like the log file stopped recording information sometime yesterday. Alt Sysrq t dumped some output on the console, but there was nothing in /var/log/messages after the reboot. There was also a warning message on the console about the audit backlog limit being reached.
Comment 11 Bruno Wolff III 2007-06-26 14:35:29 UTC
I have now seen something similar in F7. It is different enough that I am not sure it is the same problem. I have seen things start to lock up where I can't start processes by clicking on icons on the menu bar, but I can change windows. I have also had the tail command hang. However, I am not seeing the time regularity I did previously. Twice the lockup has occurred after installing the wine-core update (the second one was forced, because I wasn't sure the first one had completed). I still have issues getting sysrq output, as it doesn't seem to write the output to /var/log/messages when this problem is occurring. I can get it to dump to the screen when using a virtual terminal. In the most recent case I was able to restart X (ctrl backspace), and that seemed to clear things up. Since there seems to be a possible tie to disk I/O, I should mention that I use ext3 file systems with write barriers enabled (which isn't the default) on top of software raid using mirrored pata disks with write caching enabled.
Comment 12 Chuck Ebbert 2007-06-26 15:47:10 UTC
There are lots of cron jobs stuck, apparently trying to write to the syslog. I can't find the references but this is some kind of known bug with crond and/or syslogd.
Comment 13 Bruno Wolff III 2007-07-10 20:25:07 UTC
Created attachment 158891 [details]
tail of /var/log/messages with some firewall warnings removed.

I had another occurrence today after about 10 days of uptime. Windows were slowly starting to lock up. I switched to a virtual terminal and tried to log in as root, but the login hung. I then went back to X and restarted the X server, after which the VT login completed and things started working normally again. I did several sysrq t's during the above, but I am not sure whether what was written to /var/log/messages reflects any from before things started working normally again. There were also some selinux syslog errors, of which I included a sample, but I doubt they are connected to the problem. The kernel version is: 2.6.21-1.3228.fc7
Comment 14 Bruno Wolff III 2007-10-17 18:11:57 UTC
This appears to be still happening in 18.104.22.168-4.fc7. I was remotely connected and was able to see a lot of processes in the D wait state (including syslog).
Comment 15 Bruno Wolff III 2007-10-19 19:46:19 UTC
I had the situation start happening again today and was able to get a list of processes in a D state before there were too many of them, so that might be useful. The list did include syslog, but that might be because syslog does writes reasonably often, not because it starts the problem. I wasn't able to shut down syslog immediately, but after doing a telinit 3 it finished shutting down. I went back to run level 5 without syslog and managed to get hung up again, and I ended up needing to restart X with ctrl-alt-bs to get things going again. I don't believe syslog was running during this time. I was able to restart it again after X came back up.

I seem to be able to lock things up on a fairly regular basis while doing yum updates. It also seems (though this is just a subjective impression) that this is even more likely if I am using firefox at the same time.

Here is the ps auxww output, listing just the processes that were in a D state when I started having problems with firefox and yum locking up:

USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
root      1879  0.0  0.1   1800    596 ?        Ds   Oct17   0:04 syslogd -m 0
qmails    2608  0.0  0.0   1772    444 ?        D    Oct17   0:44 qmail-send
qmaill    2609  0.0  0.0   1732    376 ?        D    Oct17   0:01 /usr/local/bin/multilog t /var/log/qmail/smtpd
dnslog    2617  0.0  0.0   1732    380 ?        D    Oct17   0:17 multilog t ./main
bruno    30914  6.7  7.1 136828  36572 ?        Dl   13:29   0:31 /usr/lib/firefox-22.214.171.124/firefox-bin
root     30933  8.2 41.6 224872 214476 pts/0    D+   13:29   0:36 /usr/bin/python /usr/bin/yum update
qmailr   30976  0.0  0.1   1828    532 ?        D    13:32   0:00 qmail-remote devida.gob.pe firstname.lastname@example.org
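A listing like the one in comment 15 can be produced by filtering ps auxww on the STAT column. The sketch below is one way to do it, not necessarily how the listing was generated; the function name filter_d_state is made up for this example, and it assumes the usual ps aux column order, in which STAT is the eighth whitespace-separated field.

```shell
#!/bin/sh
# Sketch: keep only processes in uninterruptible sleep (STAT starting with D).
# filter_d_state is a hypothetical name; it reads `ps auxww` output on stdin.

filter_d_state() {
    # NR == 1 keeps the header line; field 8 is the STAT column of `ps aux`.
    awk 'NR == 1 || $8 ~ /^D/'
}

# Typical use while the hang is developing:
#   ps auxww | filter_d_state
```

Running it early, before too many processes are stuck, gives a shorter list that is easier to read.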
Comment 16 Bruno Wolff III 2007-12-12 22:11:57 UTC
When this is happening there are processes in a 'D' state that show up with 99.99% usage according to iotop. More than one process can show up with 99.99% usage each. This is happening with 126.96.36.199-63.fc8. This issue is likely a duplicate of 249563, which I am following.
Comment 17 Bruno Wolff III 2008-01-28 20:03:08 UTC
I updated the version since FC 5 is long out of support. The problem still exists on F8, but I upgraded the machine to rawhide. So far the problem hasn't reoccurred, but it hasn't been long enough yet to say the problem appears to be fixed.
Comment 18 Bruno Wolff III 2008-02-03 17:03:28 UTC
My machine went a week before I rebooted to start using a kernel update. I think it is very likely the 2.6.24 kernel fixes this problem. I am not going to be able to test this easily if F7 or F8 gets a 2.6.24 kernel (unless it happens real soon) as I am moving the affected machines to rawhide.