Bug 235043
Summary: Processes start hanging after a long while in 2300 and 2307, but not 2288
Product: [Fedora] Fedora
Component: kernel
Version: 8
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Bruno Wolff III <bruno>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Brian Brock <bbrock>
CC: chris.brown
Fixed In Version: 2.6.24
Doc Type: Bug Fix
Last Closed: 2008-02-03 21:29:10 UTC

Description
Bruno Wolff III
2007-04-03 15:17:04 UTC
Can you do echo "t" >/proc/sysrq-trigger when processes freeze and post the syslog messages generated? Also post output of the "lsmod" command.

Created attachment 151577
lsmod output
I'll try this, but it will take a while. I don't want to let the machine that
takes about a day to hang do so, as that causes me a bit too much grief. I'll
leave the machine that takes about a week to hang running 2307 and try the
above when it happens again. Unfortunately I just had to reboot it to clear the
problem. I'll attach the output from lsmod as that should be the same now as
when the problem occurred and it might rule out some possible causes.

The lockup happened again while I had access to the console, but changing sysrq-trigger didn't have any noticeable effect. When I did a tail -f of /var/log/messages that terminal window hung. After I rebooted I didn't see any unusual log messages in /var/log/messages or other log files last updated around when the machine was having problems.

How much output will setting this flag produce? Is it reasonable to run a machine with the flag set for several days with around 5GB of space to devote to the extra logs?

(In reply to comment #3)
> How much output will setting this flag produce? Is it reasonable to run a
> machine with the flag set for several days with around 5GB of space to devote
> to the extra logs?

It triggers output, once, when you write to it. If you have the kernel-doc package installed the description is in /usr/share/doc/kernel-doc-<version>/Documentation/sysrq.txt

I tried the sysrq command after the machine had been rebooted and the output ended up in the /var/log/messages file. So I probably was too late in this last case. Next time it happens, I'll try again and maybe I'll get something useful.

It also happens with 2312.

I had another hang after about 7 days of uptime. Unfortunately before I realized that the hang was starting, I could no longer login again using ssh or use su to get a root shell. So I didn't get to try the sysrq dump.

I am going to be updating the machine to Fedora 7 now so I can verify that this isn't a problem there. So it will be unlikely that I will see this bug again in FC5 as I am not doing kernel updates on one machine, and the other doesn't seem to have the problem occur. (Probably a combination of it not being loaded and frequent reboots to test rawhide installs and deal with 3d video lockups.)

With access to the console, alt-sysrq-t (hold down alt and sysrq, then hit T while holding them) will also give trace output. But you need to do "echo 1 >/proc/sys/kernel/sysrq" first, so putting that in /etc/rc.local is necessary.
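
For anyone who wants to script the two writes discussed above rather than type them at a console that may be about to hang, here is a minimal Python sketch. The function names are hypothetical and not part of any tool; only the two /proc paths come from this thread. The path arguments exist only so the functions can be tried against scratch files, and on a real system both writes need root.

```python
from pathlib import Path

# Kernel interfaces named in this thread.
SYSRQ_ENABLE = "/proc/sys/kernel/sysrq"
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def enable_sysrq(path=SYSRQ_ENABLE):
    """Equivalent of: echo 1 >/proc/sys/kernel/sysrq (hypothetical helper)."""
    Path(path).write_text("1\n")


def request_task_dump(path=SYSRQ_TRIGGER):
    """Equivalent of: echo t >/proc/sysrq-trigger (hypothetical helper)."""
    Path(path).write_text("t\n")
```

Calling enable_sysrq() and then request_task_dump() as root should produce one task dump per call in the kernel log. As the comments in this bug show, that output may never reach /var/log/messages if syslogd itself is already stuck.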

I might be able to give this a try. My upgrade to f7 crashed before starting the install, so I am probably going to need to do a fresh install. That combined with seeing some kernel bug messages from a test f7 over the weekend makes me nervous about rushing that upgrade.

I set /proc/sys/kernel/sysrq and tested that it works. I also put kernel.sysrq = 1 in /etc/sysctl.conf so it will be set after a reboot.

Created attachment 154616
logs from sysrq t
I think I was finally able to get a sysrq t dump while my system was in the
screwed up state.
I had upgraded to 2316, so that is the kernel version that was running when
this happened.
One possible clue is that the length of time between reboot and when the system
starts hanging up is fairly consistent for a particular system. This one
appears to take about 7 or 8 days. The other system, which I can't let this
happen on, was taking about 1 and 1/2 days for the couple of times it happened
before I went back to 2288.
Hopefully this has enough info for you to figure out where the problem is.

This happened again with about the same 7-8 day gap. It looks like the log file stopped recording information sometime yesterday. Alt Sysrq t dumped some output on the console, but there was nothing in /var/log/messages after the reboot. There was also a warning message on the console about the audit backlog limit being reached.

I have now seen something similar in F7. It is different enough that I am not sure it is the same problem. I have seen things start to lock up where I can't start processes by clicking on icons on the menu bar, but I can change windows. I have also had the tail command hang. However, I am not seeing the time regularity I did previously. Twice the lock up has occurred after installing the wine-core update (the second one was forced, because I wasn't sure the first one had completed). I still have issues getting sysrq output as it doesn't seem to write the output to /var/log/messages when this problem is occurring. I can get it to dump to the screen when using a virtual terminal. In the most recent case I was able to restart X (ctrl backspace) and that seemed to clear things up.

Since there seems to be a possible tie to disk I/O, I should mention that I use ext3 file systems with write barriers enabled (which isn't the default) on top of software raid using mirrored pata disks with write caching enabled.

There are lots of cron jobs stuck, apparently trying to write to the syslog. I can't find the references but this is some kind of known bug with crond and/or syslogd.

Created attachment 158891
tail of /var/log/messages with some firewall warnings removed.
I had another occurrence today after about 10 days of uptime. I had windows
slowly starting to lock up. I switched to a virtual terminal and tried to login as
root, but the login hung. I then went back to X and restarted the X server and
then the VT login completed and things started working normally again. I did
several sysrq t's during the above, but I am not sure if what was written to
/var/log/messages reflected any from before things started working normally
again. There were also some selinux syslog errors of which I included a sample,
but I doubt they are connected to the problem. The kernel version is:
2.6.21-1.3228.fc7

This appears to be still happening in 2.6.23.1-4.fc7. I was remotely connected and was able to see a lot of processes in the D wait state (including syslog).

I had the situation start happening again today and was able to get a list of processes in a D state before there were too many of them, so that might be useful. The list did include syslog, but that might be because syslog does writes reasonably often, not because it starts the problem. I wasn't able to shutdown syslog immediately, but after doing a telinit 3 it finished shutting down. I went back to run level 5 without syslog and managed to get hung up again and I ended up needing to restart X with ctrl-alt-bs to get things going again. I don't believe syslog was running during this time. I was able to restart it again after X came back up.

I seem to be able to lock things up on a fairly regular basis while doing yum updates. It seems (though this is just a subjective impression) even more likely if I am using firefox at the same time. Here is the ps auxww output that just lists processes that were in a D state when I started having problems with firefox and yum locking up:

USER     PID %CPU %MEM    VSZ    RSS TTY   STAT START TIME COMMAND
root    1879  0.0  0.1   1800    596 ?     Ds   Oct17 0:04 syslogd -m 0
qmails  2608  0.0  0.0   1772    444 ?     D    Oct17 0:44 qmail-send
qmaill  2609  0.0  0.0   1732    376 ?     D    Oct17 0:01 /usr/local/bin/multilog t /var/log/qmail/smtpd
dnslog  2617  0.0  0.0   1732    380 ?     D    Oct17 0:17 multilog t ./main
bruno  30914  6.7  7.1 136828  36572 ?     Dl   13:29 0:31 /usr/lib/firefox-2.0.0.5/firefox-bin
root   30933  8.2 41.6 224872 214476 pts/0 D+   13:29 0:36 /usr/bin/python /usr/bin/yum update
qmailr 30976  0.0  0.1   1828    532 ?     D    13:32 0:00 qmail-remote devida.gob.pe jquinones.pe

When this is happening there are processes in a 'D' state that show up with 99.99% usage according to iotop. More than one process can show up with 99.99% usage each. This is happening with 2.6.23.8-63.fc8.
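
The D-state list in the comment above was picked out of full ps auxww output. A minimal Python sketch of that filtering step follows, in case it helps someone capture the list quickly before too many processes pile up. The function name is hypothetical, and the sketch assumes the 11-column ps auxww layout shown in this bug. The sample rows are taken from the listing above, except the init row, which is made up as a non-D example.

```python
def d_state_rows(ps_output):
    """Return `ps auxww` rows whose STAT field starts with 'D'.

    Hypothetical helper: assumes the first line is the ps header and that
    COMMAND is the last column (it may itself contain spaces).
    """
    lines = ps_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    ncols = len(header)
    rows = []
    for line in lines[1:]:
        # Split on whitespace but keep the COMMAND column in one piece.
        fields = line.split(None, ncols - 1)
        if len(fields) < ncols:
            continue  # skip malformed or truncated lines
        row = dict(zip(header, fields))
        if row["STAT"].startswith("D"):
            rows.append(row)
    return rows


SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1879 0.0 0.1 1800 596 ? Ds Oct17 0:04 syslogd -m 0
root 1 0.0 0.0 2000 600 ? Ss Oct17 0:01 init
bruno 30914 6.7 7.1 136828 36572 ? Dl 13:29 0:31 /usr/lib/firefox-2.0.0.5/firefox-bin
"""
```

Feeding the function the captured text of ps auxww returns one dictionary per uninterruptible process, keyed by the ps column names. With SAMPLE it keeps the syslogd and firefox rows and drops the init row.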

This issue is likely a duplicate of 249563, which I am following.

I updated the version since FC 5 is long out of support.

The problem still exists on F8, but I upgraded the machine to rawhide. So far the problem hasn't reoccurred, but it hasn't been long enough yet to say the problem appears to be fixed.

My machine went a week before I rebooted to start using a kernel update. I think it is very likely the 2.6.24 kernel fixes this problem. I am not going to be able to test this easily if F7 or F8 gets a 2.6.24 kernel (unless it happens real soon) as I am moving the affected machines to rawhide.

Hi Bruno,

Closing as per bug #249563 which you indicate it may well be a duplicate of. Also you do not appear to be having the issues with 2.6.24 kernels so will hope this is resolved - if not, please re-open.

Cheers
Chris