Bug 127048
| Summary: | Processes stuck in a system call | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Michal Jaegermann <michal> | ||||||
| Component: | kernel | Assignee: | Dave Jones <davej> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | rawhide | CC: | barryn, ellenshull, pfrields | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2005-01-14 06:11:17 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
Description
Michal Jaegermann
2004-06-30 23:13:57 UTC
Created attachment 101548 [details]
output from "sysrq t" with stuck processes
I got myself today in the same situation but running 2.6.7-1.469. Symptoms are the same, i.e. sys_read() is sitting and not reading anything. I have results of "Show State" saved if somebody wants to have a look.

The same problem struck again after running a bunch of recent updates. This time the kernel was 2.6.7-1.492. Symptoms do not change. I cannot ssh to a stricken machine as bash is blocked on 'read()', and I cannot open a new gnome terminal window because gconfd-2 waits for 'poll()' to return. There are no problems with a new login on a local text console. The only possible clue is that a spawned sshd process shows 'sshd: michal@notty' instead of the expected 'sshd: michal@pts/0' or something like that. But devpts still seems to be mounted on /dev/pts and nothing appears to be wrong with it. Although running yum with a bunch of updates seems like a pretty good way to trigger that condition, it happens on other occasions too, and I have no way to reproduce it on demand.

(The other comment was supposed to be mailed yesterday but it got
stuck somewhere and I failed to notice :-).
The same happened again after updates from fedora-3-0.20040723.
But this time I tried to logout from X and on a console to
unmount and mount again devpts on /dev/pts. After that operation
I can log from a remote via ssh and from a terminal window on
a console again. There is still something weird, though.
Namely 'who' shows this:
michal :0 Jul 23 13:05
michal pts/0 Jul 23 13:05 (:0.0)
root pts/1 Jul 23 12:57 (:0.0)
michal pts/1 Jul 23 13:06 ("remote")
"root" login at 12:57 is from _before_ things got tied into a knot;
all others are _after_ /dev/pts got remounted. No root login is
active at this moment and both /dev/pts/0 and /dev/pts/1, which
are the only pseudo-ttys at the moment, are owned by me.
OTOH 'last' has this to say about that "root" login in question:
root pts/1 :0.0 Fri Jul 23 12:57 - 13:06 (0:09)
with a logout timestamp coinciding with the moment when pts/1 was
"taken over" by my remote login.
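
The oddity in the 'who' listing above (pts/1 listed for both root and michal) can be spotted mechanically. What follows is only a minimal sketch, assuming that the first whitespace-separated column of 'who' output is the user and the second is the terminal; the function name `shared_ttys` is made up for illustration and is not part of any tool mentioned in this report.

```python
from collections import defaultdict


def shared_ttys(who_output):
    """Return {tty: [users]} for every terminal listed under more than one user."""
    users_by_tty = defaultdict(list)
    for line in who_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user, tty = fields[0], fields[1]
        if user not in users_by_tty[tty]:
            users_by_tty[tty].append(user)
    # Keep only terminals that more than one user appears to hold.
    return {tty: users for tty, users in users_by_tty.items() if len(users) > 1}


# The 'who' listing quoted in the comment above.
sample = """\
michal :0 Jul 23 13:05
michal pts/0 Jul 23 13:05 (:0.0)
root pts/1 Jul 23 12:57 (:0.0)
michal pts/1 Jul 23 13:06 ("remote")
"""

print(shared_ttys(sample))  # {'pts/1': ['root', 'michal']}
```

A non-empty result only says that the utmp records are stale or inconsistent; it does not by itself show which login is live.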
I got hit by the problem again (kernel 2.6.7-1.494 this time). I can only add that I was able to remount devpts, making it read-only and read-write again, but this did not have any effect. Only unmounting devpts and mounting it again helps.

When the problem struck I also tried, from an already existing window, something like 'gnome-terminal -e tcsh'. Although a terminal window eventually opens, after a very long wait, I could not find any traces of csh in the ps output even after quite some time. This is likely not very surprising, as 'tcsh' seems to be started in such a situation as a child of bash (or rather probably of $SHELL), which is waiting for a pseudo-tty that never comes. Pseudo-ttys already in use are fine.

There was also this report: http://www.redhat.com/archives/fedora-test-list/2004-July/msg00577.html (which seems to be for x86) but I have no idea if it is related.

I have experienced this as well several times, only since upgrading to FC3/rawhide, on kernel .492 and probably earlier ones as well. It always strikes after the system has been up for some time; almost 12 days the last time it occurred, but I normally open a bunch of shells right after I boot and then don't open any more, so I couldn't say when the condition actually started. Note that I am running the i686 kernels on an Athlon XP, so this definitely affects more than just x86_64. Before, I had no idea how to approach the problem... next time it happens I'll try to verify the various tests that Michal tried above.

Less than 24 hours after my latest reboot (updated to stuff from 20040729), the problem hit again. Currently using:
kernel-2.6.7-1.499.i686.rpm
glibc-2.3.3-39.i686.rpm
devlabel-0.48.03-2
device-mapper-1.00.19-1
udev-030-3
dev-3.8.2-1
(I'm not entirely sure which of the last four are germane; you'll see why I listed /dev-related stuff shortly)

After some testing, I'm beginning to think that I may not have the same problem as Michal. LMK if I should file this as a separate bug.
First, the problem occurs for me when trying to open another tab in Konsole (haven't tried ssh'ing in, don't have sshd configured). The tab opens, but you just get the cursor, no shell prompt. Unlike what Michal reports, I don't get another process hanging; I don't get another process at all.

I did a "strace -f -ff -F -ttt -o konsole.strace konsole" to see what was going on; that's attached (I got no .pid files so either I'm running it wrong or it didn't get around to forking a bash). In the strace, after all the KDE Krap, you'll notice it spends a good ten seconds trying to open a pty, with no joy.

As Michal reports, I can still login fine on the text console. Also, "bash" or "sh" or whatever from one of the already-open shells in konsole works fine. All the /dev/pt* are 0666, so it doesn't appear to be a permissions problem.

$ xterm
xterm: Error 32, errno 2: No such file or directory
Reason: get_pty: not enough ptys

... which also points to some kind of pty exhaustion. I'm not sure what else to check. If you can't reproduce this and being able to telnet in (yeah I really should set up sshd) would be helpful, let me know; my system is still otherwise stable so I'm keeping it running in this state.

Created attachment 102342 [details]
strace of konsole startup
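
The failing step reported above (konsole and xterm unable to obtain a pty) can be probed directly without starting a terminal emulator. This is a hedged sketch, not what konsole or xterm actually do internally: it assumes Python's standard `os.openpty()` is an acceptable stand-in for their pty allocation, and `probe_pty` is a hypothetical helper name. The `opener` parameter exists only so the function can be exercised without a real pty.

```python
import errno
import os


def probe_pty(opener=os.openpty):
    """Try to allocate one pseudo-terminal pair and report what happened.

    Returns ("ok", slave_name_or_None) on success, or
    ("failed", symbolic_errno) if the allocation raises OSError.
    """
    try:
        master_fd, slave_fd = opener()
    except OSError as exc:
        # Map the numeric errno (e.g. 2) to its symbolic name (e.g. "ENOENT").
        return ("failed", errno.errorcode.get(exc.errno, str(exc.errno)))
    try:
        try:
            name = os.ttyname(slave_fd)
        except OSError:
            name = None  # descriptor is open but has no terminal name
        return ("ok", name)
    finally:
        # Always release the descriptors so the probe itself does not leak ptys.
        os.close(master_fd)
        os.close(slave_fd)
```

Calling `probe_pty()` on a healthy system should return "ok" together with a name under /dev/pts. On a machine in the bad state described here, a "failed" result with ENOENT would match the "errno 2" that xterm printed, although that correspondence is an assumption rather than something verified in this report.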
I never said that this is a permission problem; only that unmounting devpts and mounting it again makes things work. Just a remount is not sufficient. The fact that Wes got "not enough ptys" from xterm, where likely only a few were really in use, seems to indicate that this is another manifestation of the same trouble. The next time I run into that I will check whether I can get the same error from 'xterm' and also what 'sysctl kernel.pty.nr' and 'sysctl kernel.pty.max' have to say. It would also be interesting to know if 'who' shows anything funny ...

(system is still in the bad state)
$ cat /proc/sys/kernel/pty/nr
7
$ cat /proc/sys/kernel/pty/max
4096
doesn't look like pty exhaustion, assuming the accounting is working... Output of "who" is completely normal, and does not change no matter how many hung-trying-to-start-shell konsole tabs I have open--but I haven't used Michal's unmount-mount devpts trick.

I just ran an update through yum which updated 25 packages. After that I got stuck in "no-new-shell mode" again although mozilla, for example, came up without any problems. 'who' says
root :0 Aug 2 12:20
root pts/0 Aug 2 12:20 (:0.0)
root pts/1 Aug 2 12:25 (:0.0)
(with the current local time being roughly an hour later), and 'sysctl' shows:
kernel.pty.nr = 2
kernel.pty.max = 4096
Nothing unusual in permissions and no visible changes in mount. Shucks! Forgot to look directly at what 'cat /proc/mounts' has to say but probably nothing special. After logging out, unmounting devpts and mounting it again I am back in business. More and more baffled with every incident.

I got into the same funk once again. In the meantime I
ran rsync to grab some more packages and other than that the machine
was sitting idle for a while. 'cat /proc/mounts' does not show
anything unusual, and shows "none /dev/pts devpts rw 0 0" in particular.
The output of 'who' is a bit more interesting:
root tty1 Aug 2 13:18
root :0 Aug 2 13:19
root pts/0 Aug 2 13:19 (:0.0)
root pts/1 Aug 2 12:25 (:0.0)
michal pts/1 Aug 2 13:19 ("remote")
but this could be just 'who' a bit confused.
BTW - kernel this time is 2.6.7-1.501
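
The checks repeated by hand in the comments above (kernel.pty.nr against kernel.pty.max, and the devpts line in /proc/mounts) can be gathered into one snapshot for the next incident. This is a minimal sketch under stated assumptions: the paths are the ones quoted in this report, /proc/mounts is assumed to use the usual "device mountpoint fstype options dump pass" layout, and the names `pty_snapshot` and `read_live` are illustrative only.

```python
from pathlib import Path


def pty_snapshot(nr_text, max_text, mounts_text):
    """Summarise pty accounting and devpts mounts from raw /proc text."""
    nr = int(nr_text)
    maximum = int(max_text)
    devpts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "devpts":
            devpts.append((fields[1], fields[3]))
    return {
        "nr": nr,
        "max": maximum,
        "exhausted": nr >= maximum,  # True only if the counter hit the limit
        "devpts": devpts,            # list of (mount point, options)
    }


def read_live(proc_root="/proc"):
    """Read the three files from a live system and build the snapshot."""
    root = Path(proc_root)
    return pty_snapshot(
        (root / "sys/kernel/pty/nr").read_text(),
        (root / "sys/kernel/pty/max").read_text(),
        (root / "mounts").read_text(),
    )


# Values quoted in the comments above.
snapshot = pty_snapshot("2\n", "4096\n", "none /dev/pts devpts rw 0 0\n")
print(snapshot)
# {'nr': 2, 'max': 4096, 'exhausted': False, 'devpts': [('/dev/pts', 'rw')]}
```

With the numbers reported in this bug, the snapshot shows no exhaustion and a normally mounted devpts, which is consistent with the observation that the accounting looks fine while new ptys still cannot be obtained.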
I should add that after logging out and unmounting /dev/pts, 'w' says "3 users" but it shows only the current login on tty1. OTOH 'who' "remembers" the now non-existent logins from "remote" and from ":0.0". In any case these numbers are far from kernel.pty.max.

This pty bug is already in bugzilla somewhere...

Do you mean bug #128154? Surely it was not there when I filed the original report. :-)

I haven't experienced this bug in a while. Is anyone else (who like me is keeping up with new kernel builds) still seeing it?

Over on #128154 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128154) Warren says he hasn't seen it since 525... Maybe time to close both of these bugs with resolution rawhide?

Any recurrence of this bug with the last few update kernels?

No recurrences here yet; it still appears to be fixed. I'm currently using 2.6.10-1.1076_FC4, haven't had a chance to reboot to today's rawhide kernel (1087) yet. Rest assured you'll hear about it if I do see the bug again ;-)

ok, I'll close this, feel free to reopen if it reoccurs.

> Any recurrence of this bug with the last few update kernels?
I am not sure if bug #145021 is a new instance of the same problem or something new.