Red Hat Bugzilla – Bug 235043
Processes start hanging after a long while in 2300 and 2307, but not 2288
Last modified: 2008-02-03 16:29:10 EST
Description of problem:
After a significant period of time (about one day on one machine and about a
week on another) processes start hanging. However, new processes can be started
and the network stack continues to function. While this effectively hoses the
system when services start hanging, some things do continue to work.
Version-Release number of selected component (if applicable):
I have seen this when using kernel-doc-2.6.20-1.2300.fc5 and
kernel-doc-2.6.20-1.2307.fc5, but not with earlier kernels. In particular,
falling back to kernel-doc-2.6.19-1.2288.2.4.fc5 seems to alleviate the problem.
How reproducible:
I haven't found a way to speed up the process, but it seems that leaving a
system running long enough will result in it happening. The machine that started
having hangs in a day is typically busier than the one that hangs in a week. I
have another machine that isn't very busy and that I reboot for testing fairly
regularly, on which I haven't seen the problem occur.
Steps to Reproduce:
1. Leave the system running under one of the affected kernels.
Actual results:
System runs normally for a long time and then processes start hanging. Once this
starts happening, it seems to happen a lot.
Expected results:
System continues to run normally no matter how long it is up.
Can you do
echo "t" >/proc/sysrq-trigger
when processes freeze and post the syslog messages generated?
Also post output of the "lsmod" command.
Created attachment 151577
I'll try this, but it will take a while. I don't want the machine that takes about
a day to hang to hang, as that causes me a bit too much grief. I'll leave the
machine that takes about a week to hang running 2307 and try the above when
it happens again. Unfortunately I just had to reboot it to clear the problem.
I'll attach the output from lsmod, as that should be the same now as when the
problem occurred and it might rule out some possible causes.
The lockup happened again while I had access to the console, but writing to
sysrq-trigger didn't have any noticeable effect. When I did a tail -f of
/var/log/messages that terminal window hung. After I rebooted I didn't see any
unusual log messages in /var/log/messages or other log files last updated around
when the machine was having problems.
How much output will setting this flag produce? Is it reasonable to run a
machine with the flag set for several days with around 5GB of space to devote
to the extra logs?
(In reply to comment #3)
> How much output will setting this flag produce? Is it reasonable to run a
> machine with the flag set for several days with around 5GB of space to devote
> to the extra logs?
It triggers output, once, when you write to it.
If you have the kernel-doc package installed the description is in
I tried the sysrq command after the machine had been rebooted and the output ended
up in the /var/log/messages file. So I probably was too late in this last case.
Next time it happens, I'll try again and maybe I'll get something useful.
It also happens with 2312. I had another hang after about 7 days of uptime.
Unfortunately before I realized that the hang was starting, I could no longer
login again using ssh or use su to get a root shell. So I didn't get to try the
I am going to be updating the machine to Fedora 7 now so I can verify that this
isn't a problem there. So it will be unlikely that I will see this bug again in
FC5 as I am not doing kernel updates on one machine, and the other doesn't seem
to have the problem occur. (Probably a combination of it not being loaded and
frequent reboots to test rawhide installs and deal with 3D video lockups.)
With access to the console, alt-sysrq-t (hold down alt and sysrq, then hit
T while holding them) will also give trace output.
But you need to do "echo 1 >/proc/sys/kernel/sysrq" first, so putting
that in /etc/rc.local is necessary.
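For anyone checking the setting before relying on alt-sysrq-t, here is a minimal Python sketch. It assumes the bitmask meaning given in the kernel's sysrq documentation, where 1 enables every function and bit 0x8 enables process dumps. The function and constant names are made up for illustration, and older kernels may treat the value as a plain on/off switch.

```python
# Sketch only: interpret the value stored in /proc/sys/kernel/sysrq.
# Assumption: bit 0x8 gates the task dump ('t'), as in the kernel's
# sysrq documentation; older kernels may only distinguish 0 from non-zero.
from pathlib import Path

SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")
SYSRQ_ENABLE_DUMP = 0x8  # "debugging dumps of processes" bit (assumed)


def keyboard_task_dump_allowed(value: int) -> bool:
    """Return True if alt-sysrq-t should be honoured for this setting."""
    # 1 means "all functions enabled"; otherwise the value is a bitmask.
    return value == 1 or bool(value & SYSRQ_ENABLE_DUMP)


def read_sysrq_setting(path: Path = SYSRQ_PATH) -> int:
    """Read the current setting; raises OSError if the file is missing."""
    return int(path.read_text().strip())
```

Calling keyboard_task_dump_allowed(read_sysrq_setting()) on the affected machine should return True once "echo 1 >/proc/sys/kernel/sysrq" has been run.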
I might be able to give this a try. My upgrade to f7 crashed before starting the
install, so I am probably going to need to do a fresh install. That combined
with seeing some kernel bug messages from a test f7 over the weekend makes me
nervous about rushing that upgrade.
I set /proc/sys/kernel/sysrq and tested that it works. I also put
kernel.sysrq = 1 in /etc/sysctl.conf so it will be set after a reboot.
Created attachment 154616
logs from sysrq t
I think I was finally able to get a sysrq t dump while my system was in the
screwed up state.
I had upgraded to 2316, so that is the kernel version that was running when
One possible clue is that the length of time between reboot and when the system
starts hanging is fairly consistent for a particular system. This one
appears to take about 7 or 8 days. The other system, which I can't let this
happen on, took about 1 1/2 days the couple of times it happened
before I went back to 2288.
Hopefully this has enough info for you to figure out where the problem is.
This happened again with about the same 7-8 day gap. It looks like the log file
stopped recording information sometime yesterday. Alt Sysrq t dumped some output
on the console, but there was nothing in /var/log/messages after the reboot.
There was also a warning message on the console about the audit backlog limit
I have now seen something similar in F7. It is different enough that I am not
sure it is the same problem.
I have seen things start to lock up where I can't start processes by clicking on
icons on the menu bar, but I can change windows. I have also had the tail
However, I am not seeing the time regularity I did previously. Twice the lock up
has occurred after installing the wine-core update (the second one was forced,
because I wasn't sure the first one had completed).
I still have issues getting sysrq output as it doesn't seem to write the output
to /var/log/messages when this problem is occurring. I can get it to dump to the
screen when using a virtual terminal. In the most recent case I was able to
restart X (ctrl-alt-backspace) and that seemed to clear things up.
Since there seems to be a possible tie to disk I/O, I should mention that I use
ext3 file systems with write barriers enabled (which isn't the default) on top
of software raid using mirrored pata disks with write caching enabled.
There are lots of cron jobs stuck, apparently trying to write to the syslog.
I can't find the references but this is some kind of known bug with crond and/or
Created attachment 158891
tail of /var/log/messages with some firewall warnings removed.
I had another occurrence today after about 10 days of uptime. I had windows
slowly starting to lock up. I switched to a virtual terminal and tried to login as
root, but the login hung. I then went back to X and restarted the X server and
then the VT login completed and things started working normally again. I did
several sysrq t's during the above, but I am not sure if what was written to
/var/log/messages reflected any from before things started working normally
again. There were also some selinux syslog errors of which I included a sample,
but I doubt they are connected to the problem. The kernel version is:
This appears to be still happening in 126.96.36.199-4.fc7. I was remotely connected
and was able to see a lot of processes in the D wait state (including syslog).
I had the situation start happening again today and was able to get a list of
processes in a D state before there were too many of them, so that might be useful.
The list did include syslog, but that might be because syslog does writes
reasonably often, not because it starts the problem. I wasn't able to shut down
syslog immediately, but after doing a telinit 3 it finished shutting down. I
went back to run level 5 without syslog and managed to get hung up again and I
ended up needing to restart X with ctrl-alt-bs to get things going again. I
don't believe syslog was running during this time. I was able to restart it
again after X came back up.
I seem to be able to lock things up on a fairly regular basis while doing yum
updates. It also seems (though this is just a subjective impression) that this is
even more likely if I am using firefox at the same time.
Here is the ps auxww output that just lists processes that were in a D state
when I started having problems with firefox and yum locking up:
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
root      1879  0.0  0.1   1800    596 ?        Ds   Oct17   0:04 syslogd -m 0
qmails    2608  0.0  0.0   1772    444 ?        D    Oct17   0:44 qmail-send
qmaill    2609  0.0  0.0   1732    376 ?        D    Oct17   0:01 /usr/local/bin/multilog t /var/log/qmail/smtpd
dnslog    2617  0.0  0.0   1732    380 ?        D    Oct17   0:17 multilog t ./main
bruno    30914  6.7  7.1 136828  36572 ?        Dl   13:29   0:31
root     30933  8.2 41.6 224872 214476 pts/0    D+   13:29   0:36 /usr/bin/python
qmailr   30976  0.0  0.1   1828    532 ?        D    13:32   0:00 qmail-remote
When this is happening there are processes in a 'D' state that show up with
99.99% usage according to iotop. More than one process can show up with 99.99%
usage each. This is happening with 188.8.131.52-63.fc8.
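In case it helps anyone else capture the same kind of list before too many processes pile up, here is a small Python sketch that picks the D-state rows out of "ps auxww" output. It is only an illustration: the function name is made up, and it assumes the usual eleven-column "ps aux" layout with STAT in the eighth column.

```python
# Sketch only: filter `ps auxww` output down to processes in
# uninterruptible sleep (STAT starting with 'D').
from typing import Iterable, List, Tuple

STAT_COLUMN = 7      # position of STAT in `ps aux` style output (assumed)
COMMAND_COLUMN = 10  # COMMAND is the last column and may contain spaces


def uninterruptible(ps_lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (pid, stat, command) for every row whose STAT starts with 'D'."""
    found = []
    for line in ps_lines:
        fields = line.split(None, COMMAND_COLUMN)
        # Skip the header, blank lines, and rows too short to carry a STAT.
        if len(fields) <= STAT_COLUMN or not fields[1].isdigit():
            continue
        if fields[STAT_COLUMN].startswith("D"):
            command = fields[COMMAND_COLUMN] if len(fields) > COMMAND_COLUMN else ""
            found.append((int(fields[1]), fields[STAT_COLUMN], command.rstrip()))
    return found
```

Feeding it the lines from subprocess.run(["ps", "auxww"], capture_output=True, text=True).stdout.splitlines() should give a list like the one above. Running it from a shell that was already open before the hang is probably safest, since new logins may not complete.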
This issue is likely a duplicate of 249563, which I am following.
I updated the version since FC 5 is long out of support. The problem still
exists on F8, but I upgraded the machine to rawhide. So far the problem hasn't
reoccurred, but it hasn't been long enough yet to say the problem appears to be
My machine went a week before I rebooted to start using a kernel update. I think
it is very likely the 2.6.24 kernel fixes this problem. I am not going to be
able to test this easily if F7 or F8 gets a 2.6.24 kernel (unless it happens
real soon) as I am moving the affected machines to rawhide.
Closing as per bug #249563, which you indicate this may well be a duplicate of.
Also, you do not appear to be having the issues with 2.6.24 kernels, so I will hope
this is resolved. If not, please re-open.