Bug 235043

Summary:

Processes start hanging after a long while in 2300 and 2307, but not 2288

Product:

[Fedora] Fedora

Reporter:

Bruno Wolff III <bruno>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Priority:

medium

Version:

CC:

chris.brown

Hardware:

All

OS:

Linux

Fixed In Version:

2.6.24

Doc Type:

Bug Fix

Last Closed:

2008-02-03 21:29:10 UTC

Attachments:

Description	Flags
lsmod output	none
logs from sysrq t	none
tail of /var/log/messages with some firewall warnings removed.	none

Description Bruno Wolff III 2007-04-03 15:17:04 UTC

Description of problem:
After a significant period of time (about one day on one machine and about a
week on another) processes start hanging. However, new processes can be started
and the network stack continues to function. While this effectively hoses the
system when services start hanging, some things do continue to work.

Version-Release number of selected component (if applicable):
I have seen this when using kernel-doc-2.6.20-1.2300.fc5 and
kernel-doc-2.6.20-1.2307.fc5, but not with earlier kernels. In particular,
falling back to kernel-doc-2.6.19-1.2288.2.4.fc5 seems to alleviate the problem.

How reproducible:
I haven't found a way to speed up the process, but it seems that leaving a
system running long enough will result in it happening. The machine that started
having hangs in a day is typically busier than the one that hangs in a week. I
have another machine that isn't very busy and that I reboot for testing fairly
regularly, on which I haven't seen the problem occur.

Steps to Reproduce:
1. Leave the system running under one of the affected kernels.
  
Actual results:
System runs normally for a long time and then processes start hanging. Once this
starts happening, it seems to happen a lot.

Expected results:
System continues to run normally no matter how long it is up.

Additional info:

Comment 1 Chuck Ebbert 2007-04-03 15:27:08 UTC

Can you do
   echo "t" >/proc/sysrq-trigger
when processes freeze and post the syslog messages generated?

Also post output of the "lsmod" command.
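
For reference, the request above could be scripted roughly as follows. This is a minimal sketch, not something posted in this bug: the function name capture_task_dump and the TRIGGER and LOG_CMD overrides are hypothetical and exist only so the steps can be dry-run without root. It reads the kernel ring buffer with dmesg instead of waiting for the messages to reach the syslog file.

```shell
#!/bin/sh
# Sketch: trigger a SysRq task dump and save the kernel messages.
# capture_task_dump, TRIGGER and LOG_CMD are illustrative names (assumptions);
# the defaults point at the real kernel interface and the dmesg command.
capture_task_dump() {
    out=$1
    trigger=${TRIGGER:-/proc/sysrq-trigger}
    log_cmd=${LOG_CMD:-dmesg}

    # Writing "t" asks the kernel to dump the state of every task, once.
    echo t > "$trigger" || return 1

    # Read the kernel ring buffer directly rather than going through syslogd.
    $log_cmd > "$out" || return 1

    # Save the module list next to the trace; tolerate a missing lsmod.
    lsmod > "$out.lsmod" 2>/dev/null || true
}
```

Run as root, for example `capture_task_dump /root/sysrq-t.txt`. Writing the output file still involves disk I/O, so the capture may itself block if the hang affects the target file system.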

Comment 2 Bruno Wolff III 2007-04-03 15:38:16 UTC

Created attachment 151577 [details]
lsmod output

I'll try this, but it will take a while. I don't want the machine that takes about
a day to hang to hang, as that causes me a bit too much grief. I'll leave the
machine that takes about a week to hang running 2307 and try the above when
it happens again. Unfortunately I just had to reboot it to clear the problem.
I'll attach the output from lsmod as that should be the same now as when the
problem occurred and it might rule out some possible causes.

Comment 3 Bruno Wolff III 2007-04-11 18:09:49 UTC

The lockup happened again while I had access to the console, but changing
sysrq-trigger didn't have any noticeable effect. When I did a tail -f of
/var/log/messages that terminal window hung. After I rebooted I didn't see any
unusual log messages in /var/log/messages or other log files last updated around
when the machine was having problems.
How much output will setting this flag produce? Is it reasonable to run a
machine with the flag set for several days with around 5GB of space to devote
to the extra logs?

Comment 4 Chuck Ebbert 2007-04-11 18:51:27 UTC

(In reply to comment #3)
> How much output will setting this flag produce? Is it reasonable to run a
> machine with the flag set for several days with around 5GB of space to devote
> to the extra logs?

It triggers output, once, when you write to it.

If you have the kernel-doc package installed the description is in
/usr/share/doc/kernel-doc-<version>/Documentation/sysrq.txt

Comment 5 Bruno Wolff III 2007-04-11 19:40:21 UTC

I tried the sysrq command after the machine had been rebooted and the output ended
up in the /var/log/messages file. So I probably was too late in this last case.
Next time it happens, I'll try again and maybe I'll get something useful.

Comment 6 Bruno Wolff III 2007-04-28 17:38:11 UTC

It also happens with 2312. I had another hang after about 7 days of uptime.
Unfortunately before I realized that the hang was starting, I could no longer
log in again using ssh or use su to get a root shell. So I didn't get to try the
sysrq dump.
I am going to be updating the machine to Fedora 7 now so I can verify that this
isn't a problem there. So it will be unlikely that I will see this bug again in
FC5 as I am not doing kernel updates on one machine, and the other doesn't seem
to have the problem occur. (Probably a combination of it not being heavily loaded
and frequent reboots to test rawhide installs and deal with 3D video lockups.)

Comment 7 Chuck Ebbert 2007-04-30 22:55:06 UTC

With access to the console, alt-sysrq-t (hold down alt and sysrq, then hit
T while holding them) will also give trace output.

But you need to do "echo 1 >/proc/sys/kernel/sysrq" first, so putting
that in /etc/rc.local is necessary.

Comment 8 Bruno Wolff III 2007-05-01 14:43:08 UTC

I might be able to give this a try. My upgrade to f7 crashed before starting the
install, so I am probably going to need to do a fresh install. That combined
with seeing some kernel bug messages from a test f7 over the weekend makes me
nervous about rushing that upgrade.
I set /proc/sys/kernel/sysrq and tested that it works. I also put
kernel.sysrq = 1 in /etc/sysctl.conf so it will be set after a reboot.
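
The two settings mentioned in comments 7 and 8 can be checked with something like the sketch below. The function name check_sysrq is made up for illustration, and both file paths are parameters (defaulting to the real locations) so the check can be run against copies. It assumes the documented meaning of the value: 0 disables SysRq, 1 enables all functions, and larger values are a bitmask of allowed functions.

```shell
#!/bin/sh
# Sketch: report whether SysRq is enabled right now and whether the setting
# is persisted in sysctl.conf. check_sysrq is an illustrative name.
check_sysrq() {
    runtime=${1:-/proc/sys/kernel/sysrq}
    conf=${2:-/etc/sysctl.conf}

    # Current value; treat an unreadable file as disabled.
    value=$(cat "$runtime" 2>/dev/null) || value=0
    if [ "${value:-0}" -ne 0 ] 2>/dev/null; then
        echo "runtime: enabled ($value)"
    else
        echo "runtime: disabled"
    fi

    # Look for an uncommented "kernel.sysrq = <non-zero>" line.
    if grep -Eq '^[[:space:]]*kernel\.sysrq[[:space:]]*=[[:space:]]*[1-9]' "$conf" 2>/dev/null; then
        echo "persistent: yes"
    else
        echo "persistent: no"
    fi
}
```

Calling `check_sysrq` with no arguments inspects the live system; passing two paths inspects copies instead.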

Comment 9 Bruno Wolff III 2007-05-13 20:31:06 UTC

Created attachment 154616 [details]
logs from sysrq t

I think I was finally able to get a sysrq t dump while my system was in the
screwed up state.
I had upgraded to 2316, so that is the kernel version that was running when
this happened.
One possible clue is that the length of time between reboot and when the system
starts hanging up is fairly consistent for a particular system. This one
appears to take about 7 or 8 days. The other system, which I can't let this
happen on, was taking about 1 and 1/2 days for the couple of times it happened
before I went back to 2288.
Hopefully this has enough info for you to figure out where the problem is.

Comment 10 Bruno Wolff III 2007-05-21 13:35:45 UTC

This happened again with about the same 7-8 day gap. It looks like the log file
stopped recording information sometime yesterday. Alt Sysrq t dumped some output
on the console, but there was nothing in /var/log/messages after the reboot.
There was also a warning message on the console about the audit backlog limit
being reached.

Comment 11 Bruno Wolff III 2007-06-26 14:35:29 UTC

I have now seen something similar in F7. It is different enough that I am not
sure it is the same problem.
I have seen things start to lock up where I can't start processes by clicking on
icons on the menu bar, but I can change windows. I have also had the tail
command hang.
However, I am not seeing the time regularity I did previously. Twice the lockup
has occurred after installing the wine-core update (the second one was forced,
because I wasn't sure the first one had completed).
I still have issues getting sysrq output as it doesn't seem to write the output
to /var/log/messages when this problem is occurring. I can get it to dump to the
screen when using a virtual terminal. In the most recent case I was able to
restart X (ctrl backspace) and that seemed to clear things up.
Since there seems to be a possible tie to disk I/O, I should mention that I use
ext3 file systems with write barriers enabled (which isn't the default) on top
of software raid using mirrored pata disks with write caching enabled.

Comment 12 Chuck Ebbert 2007-06-26 15:47:10 UTC

There are lots of cron jobs stuck, apparently trying to write to the syslog.
I can't find the references but this is some kind of known bug with crond and/or
syslogd.

Comment 13 Bruno Wolff III 2007-07-10 20:25:07 UTC

Created attachment 158891 [details]
tail of /var/log/messages with some firewall warnings removed.

I had another occurrence today after about 10 days of uptime. I had windows
slowly starting to lock up. I switched to a virtual terminal and tried to log in as
root, but the login hung. I then went back to X and restarted the X server and
then the VT login completed and things started working normally again. I did
several sysrq t's during the above, but I am not sure if what was written to
/var/log/messages reflected any from before things started working normally
again. There were also some selinux syslog errors of which I included a sample,
but I doubt they are connected to the problem. The kernel version is:
2.6.21-1.3228.fc7

Comment 14 Bruno Wolff III 2007-10-17 18:11:57 UTC

This appears to be still happening in 2.6.23.1-4.fc7. I was remotely connected
and was able to see a lot of processes in the D wait state (including syslog).

Comment 15 Bruno Wolff III 2007-10-19 19:46:19 UTC

I had the situation start happening again today and was able to get a list of
processes in a D state before there were too many of them, so that might be useful.
The list did include syslog, but that might be because syslog does writes
reasonably often, not because it starts the problem. I wasn't able to shut down
syslog immediately, but after doing a telinit 3 it finished shutting down. I
went back to run level 5 without syslog and managed to get hung up again and I
ended up needing to restart X with ctrl-alt-bs to get things going again. I
don't believe syslog was running during this time. I was able to restart it
again after X came back up.
I seem to be able to lock things up on a fairly regular basis while doing yum
updates. It seems (though this is just a subjective impression) that this is
even more likely if I am using firefox at the same time.
Here is the ps auxww output, listing only the processes that were in a D state
when I started having problems with firefox and yum locking up:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1879  0.0  0.1   1800   596 ?        Ds   Oct17   0:04 syslogd -m 0
qmails    2608  0.0  0.0   1772   444 ?        D    Oct17   0:44 qmail-send
qmaill    2609  0.0  0.0   1732   376 ?        D    Oct17   0:01 /usr/local/bin/multilog t /var/log/qmail/smtpd
dnslog    2617  0.0  0.0   1732   380 ?        D    Oct17   0:17 multilog t ./main
bruno    30914  6.7  7.1 136828 36572 ?        Dl   13:29   0:31 /usr/lib/firefox-2.0.0.5/firefox-bin
root     30933  8.2 41.6 224872 214476 pts/0   D+   13:29   0:36 /usr/bin/python /usr/bin/yum update
qmailr   30976  0.0  0.1   1828   532 ?        D    13:32   0:00 qmail-remote devida.gob.pe  jquinones.pe
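
A listing like the one above can be produced by filtering ps output on the STAT column. The following is a minimal sketch; the function name filter_d_state is hypothetical, and it assumes the "ps aux" column layout shown above, where STAT is the eighth field.

```shell
#!/bin/sh
# Sketch: keep the header line plus every row whose STAT field begins
# with "D" (uninterruptible sleep). filter_d_state is an illustrative name.
filter_d_state() {
    awk 'NR == 1 || $8 ~ /^D/'
}

# Usage (not run here): ps auxww | filter_d_state
```

Matching on the start of the field also catches variants such as Ds, Dl and D+, which appear in the listing above.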

Comment 16 Bruno Wolff III 2007-12-12 22:11:57 UTC

When this is happening there are processes in a 'D' state that show up with
99.99% usage according to iotop. More than one process can show up with 99.99%
usage each. This is happening with 2.6.23.8-63.fc8.
This issue is likely a duplicate of 249563, which I am following.

Comment 17 Bruno Wolff III 2008-01-28 20:03:08 UTC

I updated the version since FC 5 is long out of support. The problem still
exists on F8, but I upgraded the machine to rawhide. So far the problem hasn't
reoccurred, but it hasn't been long enough yet to say the problem appears to be
fixed.

Comment 18 Bruno Wolff III 2008-02-03 17:03:28 UTC

My machine went a week before I rebooted to start using a kernel update. I think
it is very likely the 2.6.24 kernel fixes this problem. I am not going to be
able to test this easily if F7 or F8 gets a 2.6.24 kernel (unless it happens
real soon) as I am moving the affected machines to rawhide.

Comment 19 Christopher Brown 2008-02-03 21:29:10 UTC

Hi Bruno,

Closing as per bug #249563, which you indicate this may well be a duplicate of.
Also you do not appear to be having the issues with 2.6.24 kernels so will hope
this is resolved - if not, please re-open.

Cheers
Chris