Description of problem:

An attempt to ssh to a machine that had been sitting idle for a while (on the order of two hours) resulted in a stuck bash process. ps output showed the following:

  root      2098  Ss  12:42  0:00 /usr/sbin/sshd
  root      9589  Ss  16:22  0:00  \_ sshd: michal [priv]
  michal    9591  S   16:22  0:00      \_ sshd: michal@notty
  michal   10622  Ss  16:22  0:00          \_ -bash

while an attempt to strace the bash process gave this:

  Process 10622 attached - interrupt to quit
  read(0,

and that was all.

Trying to open a new terminal window bogged down, for a change, in gconfd-2. A new window showed up but remained empty, while ps reported

  root     10663  S   16:25  0:00 /usr/libexec/gconfd-2 14

with strace like this:

  Process 10663 attached - interrupt to quit
  poll(

Getting new login shells on a text console luckily was not a problem. The whole "Show State" output from sysrq is attached.

The trouble is not obvious to reproduce. After rebooting into another kernel (2.6.7-1.457) I could open terminal windows and ssh to the machine in question without further trouble, but ... I can say the same after I rebooted with 2.6.7-1.459 again.

It is quite possible that I misattributed what I reported in bug #126814 and that it was really another manifestation of the same trouble. That one, for a change, showed up immediately after a yum run in which a bunch of updates were installed.

I have not noticed such behaviour under any circumstances on an i386 installation (but maybe I was just "lucky").

Version-Release number of selected component (if applicable):
kernel-2.6.7-1.459
Created attachment 101548 [details] output from "sysrq t" with stuck processes
Today I got into the same situation again, this time running 2.6.7-1.469. Symptoms are the same, i.e. sys_read() is sitting there and not reading anything. I have the results of "Show State" saved if somebody wants to have a look.
The same problem struck again after running a bunch of recent updates. This time the kernel was 2.6.7-1.492. The symptoms have not changed: I cannot ssh to a stricken machine because bash is blocked in 'read()', and I cannot open a new gnome terminal window because gconfd-2 waits for 'poll()' to return. No problems with a new login on a local text console.

The only possible clue is that the spawned sshd process shows 'sshd: michal@notty' instead of the expected 'sshd: michal@pts/0' or something like that. But devpts still seems to be mounted on /dev/pts and nothing appears to be wrong with it.

Although running yum with a bunch of updates seems like a pretty good way to trigger the condition, it happens on other occasions too, and I have no way to produce it on demand.
(The other comment was supposed to be mailed yesterday but it got stuck somewhere and I failed to notice :-).

The same happened again after updates from fedora-3-0.20040723. But this time I tried logging out from X and, on a console, unmounting devpts and mounting it again on /dev/pts. After that operation I can again log in remotely via ssh and open a terminal window on the console.

There is still something weird, though. Namely, 'who' shows this:

  michal   :0      Jul 23 13:05
  michal   pts/0   Jul 23 13:05 (:0.0)
  root     pts/1   Jul 23 12:57 (:0.0)
  michal   pts/1   Jul 23 13:06 ("remote")

The "root" login at 12:57 is from _before_ things got tied into a knot; all the others are from _after_ /dev/pts got remounted. No root login is active at this moment, and both /dev/pts/0 and /dev/pts/1, which are the only pseudo-ttys at the moment, are owned by me. OTOH 'last' has this to say about the "root" login in question:

  root     pts/1   :0.0   Fri Jul 23 12:57 - 13:06  (0:09)

with a logout timestamp coinciding with the moment when pts/1 was "taken over" by my remote login.
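For anyone who wants to look for the same oddity on their own machine, here is a minimal Python sketch that flags terminals appearing more than once in 'who' output, as pts/1 does above. The function name duplicate_ttys is made up for illustration, and the parsing assumes the user name and terminal are the first two whitespace-separated fields, which matches the output quoted in this report but may not hold for every 'who' implementation.

```python
from collections import defaultdict


def duplicate_ttys(who_output):
    """Return {tty: [users]} for every terminal listed more than once.

    Assumes each line starts with "<user> <tty> ..." as in the report.
    """
    users_by_tty = defaultdict(list)
    for line in who_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user, tty = fields[0], fields[1]
        users_by_tty[tty].append(user)
    return {tty: users for tty, users in users_by_tty.items() if len(users) > 1}


# The 'who' output quoted in the comment above.
sample = """\
michal :0 Jul 23 13:05
michal pts/0 Jul 23 13:05 (:0.0)
root pts/1 Jul 23 12:57 (:0.0)
michal pts/1 Jul 23 13:06 ("remote")
"""

print(duplicate_ttys(sample))  # {'pts/1': ['root', 'michal']}
```

A duplicate here only says that the login records disagree with the current tty ownership; it does not by itself show where the stale entry came from.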
I got hit by the problem again (kernel 2.6.7-1.494 this time). I can only add that I was able to remount devpts, making it read-only and then read-write again, but this did not have any effect. Only unmounting devpts and mounting it again helps.

When the problem struck I also tried, from an already existing window, something like 'gnome-terminal -e tcsh'. Although a terminal window eventually opens, after a very long wait, I could not find any trace of csh in ps output even after quite some time. This is probably not very surprising, as 'tcsh' seems to be started in this situation as a child of bash (or rather, probably, of $SHELL), which is waiting for a pseudo-tty that never comes. Pseudo-ttys already in use are fine.

There was also this report: http://www.redhat.com/archives/fedora-test-list/2004-July/msg00577.html (which seems to be for x86) but I have no idea if it is related.
I have experienced this as well several times, only since upgrading to FC3/rawhide, on kernel .492 and probably earlier ones as well. It always strikes after the system has been up for some time; almost 12 days the last time it occurred, but I normally open a bunch of shells right after I boot and then don't open any more, so I couldn't say when the condition actually started.

Note that I am running the i686 kernels on an Athlon XP, so this definitely affects more than just x86_64.

Until now I had no idea how to approach the problem... the next time it happens I'll try to repeat the various tests that Michal tried above.
Less than 24 hours after my latest reboot (updated to stuff from 20040729), the problem hit again. Currently using:

  kernel-2.6.7-1.499.i686.rpm
  glibc-2.3.3-39.i686.rpm
  devlabel-0.48.03-2
  device-mapper-1.00.19-1
  udev-030-3
  dev-3.8.2-1

(I'm not entirely sure which of the last four are germane; you'll see why I listed /dev-related stuff shortly)

After some testing, I'm beginning to think that I may not have the same problem as Michal. LMK if I should file this as a separate bug.

First, the problem occurs for me when trying to open another tab in Konsole (haven't tried ssh'ing in, don't have sshd configured). The tab opens, but you just get the cursor, no shell prompt. Unlike what Michal reports, I don't get another process hanging; I don't get another process at all.

I did a "strace -f -ff -F -ttt -o konsole.strace konsole" to see what was going on; that's attached (I got no .pid files so either I'm running it wrong or it didn't get around to forking a bash). In the strace, after all the KDE Krap, you'll notice it spends a good ten seconds trying to open a pty, with no joy.

As Michal reports, I can still login fine on the text console. Also, "bash" or "sh" or whatever from one of the already-open shells in konsole works fine.

All the /dev/pt* are 0666, so it doesn't appear to be a permissions problem.

  $ xterm
  xterm: Error 32, errno 2: No such file or directory
  Reason: get_pty: not enough ptys

... which also points to some kind of pty exhaustion.

I'm not sure what else to check. If you can't reproduce this and being able to telnet in (yeah I really should set up sshd) would be helpful, let me know; my system is still otherwise stable so I'm keeping it running in this state.
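A quick way to test pty allocation without starting a terminal emulator might be a small probe like the Python sketch below. It is only an illustration: probe_pty is a made-up helper name, it uses Python's os.openpty(), and nothing here is taken from what konsole or xterm actually do internally. The opener parameter exists only so the helper can be exercised without touching real ptys.

```python
import errno
import os
import time


def probe_pty(opener=os.openpty):
    """Try to allocate one pty pair and report how it went.

    Returns (ok, errno_name, seconds). `opener` is a hypothetical hook
    that defaults to os.openpty and must return two file descriptors.
    """
    start = time.monotonic()
    try:
        master, slave = opener()
    except OSError as exc:
        # Translate the numeric errno (e.g. 2) into its symbolic name (ENOENT).
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return False, name, time.monotonic() - start
    # Allocation worked; release both ends again immediately.
    os.close(master)
    os.close(slave)
    return True, None, time.monotonic() - start


if __name__ == "__main__":
    ok, err, elapsed = probe_pty()
    status = "succeeded" if ok else "failed (%s)" % err
    print("pty allocation %s in %.3f s" % (status, elapsed))
```

On a machine in the bad state one would presumably expect either a failure or a long delay, comparable to the ten seconds visible in the strace, but that is a guess and not something verified here.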
Created attachment 102342 [details] strace of konsole startup
I never said that this is a permission problem; only that unmounting devpts and mounting it again makes things work. Just a remount is not sufficient.

The fact that Wes got "not enough ptys" from xterm, when likely only a few were really in use, seems to indicate that this is another manifestation of the same trouble. The next time I run into this I will check whether I get the same error from 'xterm', and also what 'sysctl kernel.pty.nr' and 'sysctl kernel.pty.max' have to say.
Would also be interesting to know if 'who' shows anything funny ...
(system is still in the bad state)

  $ cat /proc/sys/kernel/pty/nr
  7
  $ cat /proc/sys/kernel/pty/max
  4096

Doesn't look like pty exhaustion, assuming the accounting is working...

Output of "who" is completely normal, and does not change no matter how many hung-trying-to-start-shell konsole tabs I have open, but I haven't used Michal's unmount-mount devpts trick.
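The same comparison can be scripted, for example to log the numbers periodically while waiting for the problem to reappear. The following is a rough Python sketch; pty_usage and read_pty_usage are hypothetical helper names, and the only thing taken from this report is the pair of /proc files read above.

```python
from pathlib import Path

# Directory holding the two counters shown in the comment above.
PTY_DIR = Path("/proc/sys/kernel/pty")


def pty_usage(nr_text, max_text):
    """Parse the contents of 'nr' and 'max'; return (nr, max, fraction used)."""
    nr = int(nr_text.strip())
    maximum = int(max_text.strip())
    return nr, maximum, nr / maximum


def read_pty_usage(base=PTY_DIR):
    """Read the live counters from /proc (Linux only)."""
    return pty_usage((base / "nr").read_text(), (base / "max").read_text())


# Values quoted in the comment above.
nr, maximum, fraction = pty_usage("7\n", "4096\n")
print(f"{nr} of {maximum} ptys allocated ({fraction:.2%})")
# prints: 7 of 4096 ptys allocated (0.17%)
```

Calling read_pty_usage() on an affected machine would give the live values; as noted above, this only helps if the kernel's accounting is itself correct.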
I just ran an update through yum which updated 25 packages. After that I got stuck in "no-new-shell mode" again, although mozilla, for example, came up without any problems. 'who' says

  root     :0      Aug  2 12:20
  root     pts/0   Aug  2 12:20 (:0.0)
  root     pts/1   Aug  2 12:25 (:0.0)

(with the current local time being roughly an hour later), and 'sysctl' shows:

  kernel.pty.nr = 2
  kernel.pty.max = 4096

Nothing unusual in permissions and no visible changes in mount. Shucks! I forgot to look directly at what 'cat /proc/mounts' has to say, but probably nothing special.

After logging out, unmounting devpts and mounting it again I am back in business. More and more baffled with every incident.
I got into the same funk once again. In the meantime I ran rsync to grab some more packages, and other than that the machine was sitting idle for a while. 'cat /proc/mounts' does not show anything unusual; in particular it contains "none /dev/pts devpts rw 0 0". The output of 'who' is a bit more interesting:

  root     tty1    Aug  2 13:18
  root     :0      Aug  2 13:19
  root     pts/0   Aug  2 13:19 (:0.0)
  root     pts/1   Aug  2 12:25 (:0.0)
  michal   pts/1   Aug  2 13:19 ("remote")

but this could be just 'who' being a bit confused.

BTW - the kernel this time is 2.6.7-1.501
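Checking for the devpts entry can also be automated. Below is a minimal Python sketch that pulls devpts lines out of /proc/mounts-style text; devpts_mounts is a made-up name, and the second sample line is a hypothetical entry added only for contrast, not something taken from the affected machine.

```python
def devpts_mounts(mounts_text):
    """Return (mount_point, options) for every devpts entry.

    Expects /proc/mounts format: device, mount point, fs type, options, ...
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "devpts":
            found.append((fields[1], fields[3].split(",")))
    return found


sample = (
    "none /dev/pts devpts rw 0 0\n"  # line quoted in the comment above
    "none /proc proc rw 0 0\n"       # hypothetical extra entry for contrast
)

print(devpts_mounts(sample))  # [('/dev/pts', ['rw'])]
```

On a live system the input would be the contents of /proc/mounts. As this report shows, a normal-looking entry does not rule the problem out; it only excludes a missing or read-only mount.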
I should add that after logging out and unmounting /dev/pts, 'w' says "3 users" but shows only the current login on tty1. OTOH 'who' "remembers" the now non-existent logins from "remote" and from ":0.0". In any case these numbers are far from kernel.pty.max.
This pty bug is already in bugzilla somewhere...
Do you mean bug #128154? Surely it was not there when I filed the original report. :-)
I haven't experienced this bug in a while. Is anyone else (who like me is keeping up with new kernel builds) still seeing it? Over on bug #128154 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128154) Warren says he hasn't seen it since 525... Maybe time to close both of these bugs with resolution rawhide?
Any recurrence of this bug with the last few update kernels?
No recurrences here yet; it still appears to be fixed. I'm currently using 2.6.10-1.1076_FC4, haven't had a chance to reboot to today's rawhide kernel (1087) yet. Rest assured you'll hear about it if I do see the bug again ;-)
OK, I'll close this; feel free to reopen if it recurs.
> any recurrence of this bug with the last few update kernels ?

I am not sure whether bug #145021 is a new instance of the same problem or something new.