127048 – Processes stuck in a system call

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-06-30 23:13 UTC by Michal Jaegermann
Modified:	2015-01-04 22:07 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-01-14 06:11:17 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
output from "sysrq t" with stuck processes (48.31 KB, text/plain) 2004-06-30 23:15 UTC, Michal Jaegermann
strace of konsole startup (58.51 KB, application/x-bzip2) 2004-08-01 21:40 UTC, Ellen Shull

Description Michal Jaegermann 2004-06-30 23:13:57 UTC

Description of problem:

An attempt to ssh to a machine that had been sitting idle for
a while (on the order of two hours) resulted in a stuck bash
process. A ps output showed the following:


root      2098    Ss   12:42   0:00 /usr/sbin/sshd
root      9589    Ss   16:22   0:00  \_ sshd: michal [priv]
michal    9591    S    16:22   0:00      \_ sshd: michal@notty
michal   10622    Ss   16:22   0:00          \_ -bash

while an attempt to strace gave this:

Process 10622 attached - interrupt to quit
read(0,

and that was all.

Trying to open a new terminal window, for a
change, bogged down in gconfd-2.  A new window showed up
but it remained empty while ps showed

root     10663    S    16:25   0:00 /usr/libexec/gconfd-2 14

with strace like this:

Process 10663 attached - interrupt to quit
poll(

Getting new login shells on a text console luckily was
not a problem.  The whole "Show State" from sysrq is
attached.

The trouble is not obvious to reproduce.  After
rebooting into another kernel (2.6.7-1.457) I could
open terminal windows and ssh to a machine in question
without further troubles but ... I can say the same
after I rebooted with 2.6.7-1.459 again.

It is quite possible that I misattributed what I
reported in bug #126814 and that this was really
another manifestation of the same trouble.  That one,
for a change, showed up immediately after a yum run
when a bunch of updates was performed. I failed to notice
such behaviour in any circumstances on an i386
installation (but maybe I was just "lucky").

Version-Release number of selected component (if applicable):
kernel-2.6.7-1.459

Comment 1 Michal Jaegermann 2004-06-30 23:15:42 UTC

Created attachment 101548
output from "sysrq t" with stuck processes

Comment 2 Michal Jaegermann 2004-07-04 17:58:38 UTC

I got myself today in the same situation but running 2.6.7-1.469.
Symptoms are the same, i.e. sys_read() is sitting and not reading
anything.  I have results of "Show State" saved if somebody wants
to have a look.
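
For anyone trying to confirm the same state on their own machine: the S flag in the ps listing in the description means an interruptible sleep, and on Linux the same letter is the third field of /proc/<pid>/stat. Below is a minimal sketch of pulling it out of such a line. The function name parse_stat is made up for illustration, and the sample line is constructed from the PIDs quoted in the description, not copied from a real /proc file.

```python
def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the first "(" and the last ")".
    """
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


# Hypothetical sample line, built from the PIDs in the ps output above.
sample = "10622 (bash) S 9591 10622 10622 0 -1"
print(parse_stat(sample))
```

A state of S with a read() that never returns matches what strace showed here; a D state (uninterruptible sleep) would point in a different direction.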

Comment 3 Michal Jaegermann 2004-07-22 20:33:13 UTC

The same problem struck again after running a bunch of recent
updates.  This time the kernel was 2.6.7-1.492.  Symptoms do not
change.  Cannot ssh to a stricken machine as bash is blocked on
'read()', cannot open a new gnome terminal window because
gconfd-2 waits for 'poll()' to return.  No problems with a new
login on a local text console.

The only possible clue is that a spawned sshd process shows
'sshd: michal@notty' instead of an expected 'sshd: michal@pts/0'
or something like that.  But devpts still seems to be mounted
on /dev/pts and nothing appears to be wrong with it.

Although running yum with a bunch of updates seems like a pretty
good way to generate that condition this happens on other occasions
too even if I have no way to do that on demand.

Comment 4 Michal Jaegermann 2004-07-23 19:07:08 UTC

The same problem struck again after running a bunch of recent
updates.  This time the kernel was 2.6.7-1.492.  Symptoms do not
change.  Cannot ssh to a stricken machine as bash is blocked on
'read()', cannot open a new gnome terminal window because
gconfd-2 waits for 'poll()' to return.  No problems with a new
login on a local text console.

The only possible clue is that a spawned sshd process shows
'sshd: michal@notty' instead of an expected 'sshd: michal@pts/0'
or something like that.  But devpts still seems to be mounted
on /dev/pts and nothing appears to be wrong with it.

Although running yum with a bunch of updates seems like a pretty
good way to generate that condition this happens on other occasions
too even if I have no way to do that on demand.

Comment 5 Michal Jaegermann 2004-07-23 19:30:22 UTC

(The other comment was supposed to be mailed yesterday but it got
stuck somewhere and I failed to notice :-).

The same happened again after updates from fedora-3-0.20040723.
But this time I tried to logout from X and on a console to
unmount and mount again devpts on /dev/pts.  After that operation
I can log in remotely via ssh and from a terminal window on
a console again.  There is still something weird, though.
Namely 'who' shows this:

michal   :0           Jul 23 13:05
michal   pts/0        Jul 23 13:05 (:0.0)
root     pts/1        Jul 23 12:57 (:0.0)
michal   pts/1        Jul 23 13:06 ("remote")

"root" login at 12:57 is from _before_ things got tied into a knot;
all other are _after_ /dev/pts got remounted.  No root login is
active at this moment and both /dev/pts/0 and /dev/pts/1, which
are the only pseudo-ttys at the moment, are owned by me. 
OTOH 'last' has this to say about that "root" login in question:

root     pts/1        :0.0             Fri Jul 23 12:57 - 13:06 (0:09)

with a logout timestamp coinciding with the moment when pts/1 was
"taken over" by my remote login.

Comment 6 Michal Jaegermann 2004-07-27 19:42:51 UTC

I got hit by the problem again (kernel 2.6.7-1.494 this time).
I can only add that I was able to remount devpts, making it read-only
and read-write again, but this did not have any effect.  Only
unmounting devpts and mounting it again helps.

When the problem struck I also tried, from an already existing
window, something like 'gnome-terminal -e tcsh'.  Although a terminal
window eventually opens, after a very long wait, I could not find
any traces of csh in a ps output even after quite some time.  This
is likely not very surprising as 'tcsh' seems to be starting in such
a situation as a child of bash (or rather probably of $SHELL) which
is waiting for a never coming pseudo-tty.  Pseudo-ttys already
in use are fine.

There was also this report:
http://www.redhat.com/archives/fedora-test-list/2004-July/msg00577.html
(which seems to be for x86) but I have no idea if it is related.

Comment 7 Ellen Shull 2004-07-29 00:54:57 UTC

I have experienced this as well several times, only since upgrading 
to FC3/rawhide, on kernel .492 and probably earlier ones as well.  It 
always strikes after the system has been up for some time; almost 12 
days the last time it occurred, but I normally open a bunch of shells 
right after I boot and then don't open any more, so I couldn't say 
when the condition actually started. 
 
Note that I am running the i686 kernels on an Athlon XP, so this 
definitely affects more than just x86_64. 
 
Before I had no idea how to approach the problem...  next time it
happens I'll try to verify the various tests that Michal tried above.

Comment 8 Ellen Shull 2004-08-01 21:39:39 UTC

Less than 24 hours after my latest reboot (updated to stuff from 
20040729), the problem hit again. 
 
Currently using: 
kernel-2.6.7-1.499.i686.rpm 
glibc-2.3.3-39.i686.rpm 
devlabel-0.48.03-2 
device-mapper-1.00.19-1 
udev-030-3 
dev-3.8.2-1 
(I'm not entirely sure which of the last four are germane; you'll see 
why I listed /dev-related stuff shortly) 
 
After some testing, I'm beginning to think that I may not have the 
same problem as Michal.  LMK if I should file this as a separate bug. 
 
First, the problem occurs for me when trying to open another tab in 
Konsole (haven't tried ssh'ing in, don't have sshd configured).  The 
tab opens, but you just get the cursor, no shell prompt. 
 
Unlike what Michal reports, I don't get another process hanging; I 
don't get another process at all.  I did a "strace -f -ff -F -ttt -o 
konsole.strace konsole" to see what was going on; that's attached (I 
got no .pid files so either I'm running it wrong or it didn't get 
around to forking a bash).  In the strace, after all the KDE Krap, 
you'll notice it spends a good ten seconds trying to open a pty, with 
no joy. 
 
As Michal reports, I can still login fine on the text console.  Also, 
"bash" or "sh" or whatever from one of the already-open shells in 
konsole works fine. 
 
All the /dev/pt* are 0666, so it doesn't appear to be a permissions 
problem. 
 
$ xterm 
xterm: Error 32, errno 2: No such file or directory 
Reason: get_pty: not enough ptys 
 
... which also points to some kind of pty exhaustion. 
 
I'm not sure what else to check.  If you can't reproduce this and 
being able to telnet in (yeah I really should set up sshd) would be 
helpful, let me know; my system is still otherwise stable so I'm 
keeping it running in this state.
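
The xterm failure above suggests that pty allocation itself is what breaks, so it may be useful to probe that step without any terminal emulator in the way. The following is only a sketch: probe_pty is a made-up name, and it assumes that Python's os.openpty() exercises the same allocation path as xterm and konsole, which has not been checked against this bug.

```python
import os


def probe_pty():
    """Try to allocate one pseudo-terminal pair.

    Returns (True, slave_name) on success or (False, error_text) on failure.
    """
    try:
        master_fd, slave_fd = os.openpty()
    except OSError as exc:
        return False, f"errno {exc.errno}: {exc.strerror}"
    try:
        name = os.ttyname(slave_fd)
    except OSError:
        name = "(unnamed)"
    finally:
        # Always release the pair so the probe itself does not leak ptys.
        os.close(slave_fd)
        os.close(master_fd)
    return True, name
```

Running probe_pty() once from a shell that was opened before the problem started, and comparing the result before and after remounting devpts, would show whether allocation is the failing step.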

Comment 9 Ellen Shull 2004-08-01 21:40:50 UTC

Created attachment 102342
strace of konsole startup

Comment 10 Michal Jaegermann 2004-08-01 22:32:09 UTC

I never said that this is a permission problem; only that unmounting
devpts and mounting it again makes things work.  Just remount is
not sufficient.

The fact that Wes got "not enough ptys" from xterm, where likely
only a few were really in use, seems to indicate that this is another
manifestation of the same trouble.

The next time I run into that I will check whether I can get the
same error from 'xterm' and also what 'sysctl kernel.pty.nr'
and 'sysctl kernel.pty.max' have to say.

Comment 11 Rik van Riel 2004-08-01 23:50:05 UTC

Would also be interesting to know if 'who' shows anything funny ...

Comment 12 Ellen Shull 2004-08-02 07:44:47 UTC

(system is still in the bad state) 
 
$ cat /proc/sys/kernel/pty/nr 
7 
$ cat /proc/sys/kernel/pty/max 
4096 
 
doesn't look like pty exhaustion, assuming the accounting is 
working... 
 
Output of "who" is completely normal, and does not change no matter 
how many hung-trying-to-start-shell konsole tabs I have open--but I 
haven't used Michal's unmount-mount devpts trick.
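
The two counters read above can also be collected by a small script, which makes it easier to log them every time the problem shows up. This is a sketch only: the names pty_usage and looks_exhausted and the proc_dir parameter are invented here, and the only thing taken from this report is the location of the nr and max files.

```python
from pathlib import Path


def pty_usage(proc_dir="/proc/sys/kernel/pty"):
    """Return (in_use, maximum) as reported by the kernel's pty counters."""
    base = Path(proc_dir)
    in_use = int((base / "nr").read_text().strip())
    maximum = int((base / "max").read_text().strip())
    return in_use, maximum


def looks_exhausted(in_use, maximum):
    """True only when the counters themselves say no ptys are left."""
    return in_use >= maximum
```

With the values quoted above (7 in use out of 4096), looks_exhausted(7, 4096) is False, which agrees with the reading that this is not plain exhaustion, assuming the accounting can be trusted.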

Comment 13 Michal Jaegermann 2004-08-02 19:30:56 UTC

I just ran an update through yum which updated 25 packages.  After that
I got stuck in "no-new-shell mode" again although mozilla, for
example, got up without any problems. 'who' says

root     :0           Aug  2 12:20
root     pts/0        Aug  2 12:20 (:0.0)
root     pts/1        Aug  2 12:25 (:0.0)

(with the current local time being roughly an hour later), and
'sysctl' shows:

kernel.pty.nr = 2
kernel.pty.max = 4096

Nothing unusual in permissions and no visible changes in mount output.
Shucks!  Forgot to look directly at what 'cat /proc/mounts' has
to say but probably nothing special.

After logging out, unmounting devpts and mounting it again I am
back in business.  More and more baffled with every incident.

Comment 14 Michal Jaegermann 2004-08-02 21:30:08 UTC

I got into the same funk once again.  In the meantime I
ran rsync to grab some more packages and other than that the machine
was sitting idle for a while. 'cat /proc/mounts' does not show
anything unusual; in particular it has "none /dev/pts devpts rw 0 0".
An output of 'who' is a bit more interesting:

root     tty1         Aug  2 13:18
root     :0           Aug  2 13:19
root     pts/0        Aug  2 13:19 (:0.0)
root     pts/1        Aug  2 12:25 (:0.0)
michal   pts/1        Aug  2 13:19 ("remote")

but this could be just 'who' a bit confused.

BTW - kernel this time is 2.6.7-1.501
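
The /proc/mounts check mentioned above is easy to script, so that it can be recorded each time the problem strikes. The sketch below is only an illustration: devpts_mounts is a made-up name, and it takes the file contents as a string so it can be tried on the line quoted in this comment.

```python
def devpts_mounts(mounts_text):
    """Return (mount_point, options) for each devpts entry in /proc/mounts text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "devpts":
            found.append((fields[1], fields[3].split(",")))
    return found


# The devpts line quoted in this comment.
print(devpts_mounts("none /dev/pts devpts rw 0 0\n"))
```

On a live system the input would come from reading /proc/mounts. An empty result, or a mount point other than /dev/pts, would be worth noting in the report.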

Comment 15 Michal Jaegermann 2004-08-02 21:44:52 UTC

I should add that after logging out and unmounting /dev/pts
'w' says "3 users" but it shows only the current login on tty1.
OTOH 'who' "remembers" now non-existent logins from "remote"
and from ":0.0".  In any case these numbers are far from
kernel.pty.max.
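
The odd 'who' listings in this report all share one feature: the same pts line appears under two different users. A short script can flag that automatically. This is a sketch under the assumption that the first two whitespace-separated columns of 'who' output are the user and the line; shared_ttys is a made-up name.

```python
from collections import defaultdict


def shared_ttys(who_text):
    """Map each tty that 'who' lists under more than one user to those users."""
    owners = defaultdict(list)
    for line in who_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            user, tty = fields[0], fields[1]
            if user not in owners[tty]:
                owners[tty].append(user)
    return {tty: users for tty, users in owners.items() if len(users) > 1}


# The 'who' output quoted in comment 14.
sample = """\
root     tty1         Aug  2 13:18
root     :0           Aug  2 13:19
root     pts/0        Aug  2 13:19 (:0.0)
root     pts/1        Aug  2 12:25 (:0.0)
michal   pts/1        Aug  2 13:19 ("remote")
"""
print(shared_ttys(sample))
```

For the sample this reports pts/1 as claimed by both root and michal, which is the stale entry described above.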

Comment 16 Bill Nottingham 2004-08-05 21:52:32 UTC

This pty bug is already in bugzilla somewhere...

Comment 17 Michal Jaegermann 2004-08-05 22:04:30 UTC

Do you mean bug #128154? Surely it was not there when I filed
the original report. :-)

Comment 18 Ellen Shull 2004-08-30 20:59:52 UTC

I haven't experienced this bug in a while.  Is anyone else (who like
me is keeping up with new kernel builds) still seeing it?  Over on
bug #128154 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128154)
Warren says he hasn't seen it since 525...  Maybe time to close both of
these bugs with resolution rawhide?

Comment 19 Dave Jones 2005-01-14 05:38:52 UTC

any recurrence of this bug with the last few update kernels ?

Comment 20 Ellen Shull 2005-01-14 05:56:10 UTC

No recurrences here yet; it still appears to be fixed.  I'm currently using
2.6.10-1.1076_FC4, haven't had a chance to reboot to today's rawhide kernel
(1087) yet.  Rest assured you'll hear about it if I do see the bug again ;-)

Comment 21 Dave Jones 2005-01-14 06:11:17 UTC

ok, I'll close this, feel free to reopen if it reoccurs.

Comment 22 Michal Jaegermann 2005-01-14 16:40:20 UTC

> any reoccurance of this bug with the last few update kernels ?
I am not sure if bug #145021 gives a new instance of the same
problem or if this is something new.
