Description of problem:
    "su -" directly from a user account works, and "su" directly from a user account works, but "su" followed by "su -" hangs.
Version-Release number of selected component (if applicable):
    coreutils-4.5.3-10
    rawhide-release-20030111-1
How reproducible:
    100%
Steps to Reproduce:
    1. su
    2. su -
    3.
Actual results:
    hang
Expected results:
    root's shell and env
Additional info:
Noticed this myself earlier today too. stty seems to get stuck by SIGSTOP.
Or more likely SIGTTOU.
This seems to be a regression relative to Phoebe. But coreutils-4.5.3-10 was in Phoebe too. Perhaps a kernel issue of some sort?
Could be. I'm using kernel-2.4.20-2.12. I noticed that the bug is exhibited in gnome-terminal and in the text consoles, but not in KDE's Konsole or in xterm. I tried downgrading to vte-0.10.7-2 and gnome-terminal-2.1.3-2 (phoebe), but that did not fix the problem.
This seems to have some similarities to bug #79859. Maybe coincidence.
Using rawhide-release-20030112-1 on:
    kernel-2.4.18-19.8.0   OK
    kernel-2.4.20-2.9      OK
    kernel-2.4.20-2.12     NOT OK
I agree that it's a kernel problem introduced after 2.4.20-2.9.
This problem started after I added a scheduler enhancement in the area of run-parent-first / run-child-first. With that optimization reverted, the problem does not trigger, but this clearly shows that it's a userspace bug triggered by the kernel's choice of parent/child execution order after fork().
Whenever it hangs, stty has the process group of the bash run from su, the session of the bash from which su ran, and the tty process group belongs to 'su -'.
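For anyone wanting to check the same three values on their own box: on Linux they can be read from /proc/<pid>/stat. The following is only a rough sketch; the function names and the sample line in the usage note are made up for illustration, and the field order is the one documented in proc(5).

```python
def parse_job_control_fields(stat_line):
    """Extract job-control fields from one line of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split at the last ')'.
    head, _, tail = stat_line.rpartition(")")
    pid = int(head.split("(", 1)[0])
    fields = tail.split()
    # Fields following the command name: state ppid pgrp session tty_nr tpgid
    return {
        "pid": pid,
        "state": fields[0],
        "ppid": int(fields[1]),
        "pgrp": int(fields[2]),
        "session": int(fields[3]),
        "tty_nr": int(fields[4]),
        "tpgid": int(fields[5]),
    }


def is_background(info):
    """True if the process is not in its terminal's foreground process group."""
    return info["tpgid"] != -1 and info["pgrp"] != info["tpgid"]
```

With a hypothetical line such as "3549 (stty) T 3535 3540 3400 34817 3560 0", is_background() returns True: the process group (3540) differs from the tty's foreground group (3560), which is the situation in which changing terminal settings raises SIGTTOU.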
Forcing BASH_SYS_PGRP_SYNC on in aclocal.m4 seems to at least work around the problem. bash-2.05b-14.
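To illustrate the general idea of synchronizing parent and child around process-group setup, here is a minimal sketch. It is not bash's actual code and is not claimed to match what BASH_SYS_PGRP_SYNC enables; spawn_in_own_group and child_main are hypothetical names. The child blocks on a pipe until the parent has finished setting up its process group, so the result no longer depends on which side the scheduler runs first after fork().

```python
import os


def spawn_in_own_group(child_main):
    """Fork a child, place it in its own process group, then release it."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: wait until the parent signals that setup is complete.
        os.close(w)
        os.read(r, 1)  # returns once the parent writes (or closes) the pipe
        os.close(r)
        try:
            code = child_main()
        except BaseException:
            code = 127
        os._exit(code)
    # Parent: do the process-group setup before letting the child run.
    os.close(r)
    os.setpgid(pid, pid)
    # An interactive shell would also hand over the terminal at this point,
    # e.g. os.tcsetpgrp(tty_fd, pid); omitted because it needs a real tty.
    os.write(w, b"x")
    os.close(w)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Without the pipe, a child that runs first could touch the terminal before the parent has made its group the foreground group, which is the kind of window the scheduler change would widen.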
Seems to be working with bash-2.05b-14 and kernel-2.4.20-2.23.
This time, normal (unnested) "su -" and "su" seem to be broken with kernel-2.4.20-2.47, bash-2.05b-10 through -20, coreutils-4.5.3-11 through -14, glibc-2.3.1-41 through -43.
Forgot one thing: kernel-2.4.20-2.40 worked (though I don't know what versions of other packages -- large upgrade)
With what symptom? I don't see this.
Symptom(s): I'm not able to use su (neither is postgresql's init script BTW), but that's not all. As I just found out, I'm not able to log in via ssh either, so this might be a more generic problem. Unfortunately I can't tell you any package version beyond what I stated above ATM because I can't log in.
Latest pam_krb5?
I don't know whether I have installed pam_krb5 at all (I'm not using Kerberos on that machine). I have synced the beehive tree roughly at about midnight from Stuttgart and if I have pam_krb5 it will be that version.
Symptoms clarified: Any attempt to use su just hangs indefinitely (unless interrupted) after asking for the password, regardless of interactive use or init script use. Likewise with ssh login attempts and IIRC (it was late, er, early this morning) login attempts on text consoles -- I'm able to type in the password, but that's all, it hangs there. I ran an strace on su and it hung in some rt_sig* (rt_sigaction?) call.
I just confirmed that I don't have pam_krb5 installed. An "strace su" as root brings this:
    [...]
    getpid() = 5303
    rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
    rt_sigsuspend([]
    [...]
Then it hangs (until interrupted). With regard to the text consoles and ssh login: the text console login seems to hang in the login process, which times out, and mingetty restarts itself. sshd has many sleeping sshd processes (user root), each with a zombie child (user sshd):
    root@wombat:~> ps auxw|grep sshd
    root  1818  0.0  0.2   3484 1216 ?      S  08:29  0:00 /usr/sbin/sshd
    root  9458  0.0  0.6  10104 3140 ?      S  14:22  0:00 /usr/sbin/sshd
    sshd  9459  0.0  0.0      0    0 ?      Z  14:22  0:00 [sshd <defunct>]
    root  9460  0.0  0.6  10100 3116 ?      S  14:22  0:00 /usr/sbin/sshd
    sshd  9461  0.0  0.0      0    0 ?      Z  14:22  0:00 [sshd <defunct>]
    root  9671  0.0  0.6  10104 3100 ?      S  14:26  0:00 /usr/sbin/sshd
    sshd  9672  0.0  0.0      0    0 ?      Z  14:26  0:00 [sshd <defunct>]
    root  9675  0.0  0.6  10104 3100 ?      S  14:26  0:00 /usr/sbin/sshd
    sshd  9676  0.0  0.0      0    0 ?      Z  14:26  0:00 [sshd <defunct>]
    root  9948  0.0  0.6  10104 3144 ?      S  14:31  0:00 /usr/sbin/sshd
    sshd  9949  0.0  0.0      0    0 ?      Z  14:31  0:00 [sshd <defunct>]
    root  9950  0.0  0.5  10104 2752 ?      S  14:31  0:00 /usr/sbin/sshd
    sshd  9951  0.0  0.0      0    0 ?      Z  14:31  0:00 [sshd <defunct>]
    root  5789  0.0  0.1   3576  648 pts/1  S  20:41  0:00 grep sshd
kernel 2.4.20-2.47.1 (as opposed to -2.47) doesn't show these problems.
It happens for me on 2.4.20-2.48. The su ; su - sequence didn't trigger it, but su ; su - ; su - did.
Ben: what version of bash do you have? I think there are two issues getting intermingled in this report.
strace on the stuck stty shows an infinite stream of:
    --- SIGTTOU (Stopped (tty output)) ---
    --- SIGTTOU (Stopped (tty output)) ---
    ioctl(0, SNDCTL_TMR_STOP, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
    --- SIGTTOU (Stopped (tty output)) ---
    --- SIGTTOU (Stopped (tty output)) ---
    ioctl(0, SNDCTL_TMR_STOP, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
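A note on reading the trace: the argument printed after SNDCTL_TMR_STOP is a termios structure, which suggests the call is really a terminal ioctl whose request number collides with a sound-timer name in strace's lookup table. The sketch below only checks the arithmetic. The numeric values are the Linux x86 ones (TCGETS and TCSETSW from the kernel's ioctls.h, the SNDCTL_TMR_* definitions from soundcard.h); the helper names are made up, and the assumption that strace matched on the type and number bits alone is mine.

```python
def ioc(direction, type_char, nr, size):
    """Build a Linux ioctl request number (x86 bit layout)."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr


def type_and_nr(code):
    """Keep only the type and number bits, dropping direction and size."""
    return code & 0xFFFF


# Terminal ioctls use fixed legacy values on Linux.
TCGETS = 0x5401
TCSETSW = 0x5403

# Sound-timer ioctls as defined via _IOWR('T', 1, int) and _IO('T', 3).
SNDCTL_TMR_TIMEBASE = ioc(3, "T", 1, 4)
SNDCTL_TMR_STOP = ioc(0, "T", 3, 0)
```

SNDCTL_TMR_STOP and TCSETSW come out as the same number (0x5403), and SNDCTL_TMR_TIMEBASE shares its type and number bits with TCGETS. Read that way, the stuck call would be stty's attempt to set terminal attributes being restarted after each SIGTTOU.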
Ben: what version of bash do you have?
bash-2.05b-5. Still, this is a kernel bug if a syscall is infinitely being restarted.
Created attachment 90094: strace of hung su -
Ingo, this is a kernel problem -- can you look at the trace?
Ben, do you see 'stty' looping infinitely? I think the infinite restarts are just an strace artifact. What might have happened is that stty tried to write to a terminal that was already closed (or something like that), due to some user-space race. The fact that turning on additional synchronization in bash solves the problem strengthens this theory.
Take a look at the whole strace, specifically the part copied below: wait4() is returning -ECHILD when a child has just exited.
    [pid 10469] ioctl(0, SNDCTL_TMR_STOP, {B38400 opost isig icanon echo ...}) = 0
    [pid 10469] ioctl(0, SNDCTL_TMR_TIMEBASE, {B38400 opost isig icanon echo ...}) = 0
    [pid 10469] _exit(0) = ?
    [pid 10426] <... wait4 resumed> 0xbfffed38, 0, NULL) = -1 ECHILD (No child processes)
    [pid 10426] ioctl(255, SNDCTL_TMR_TIMEBASE, 0xbfffecf0) = -1 ENOTTY (Inappropriate ioctl for device)
    [pid 10426] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
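For comparison, this is the ordinary way a wait call ends up with ECHILD: the child has already been reaped, so nothing is left to wait for. A small sketch (reap_twice is a hypothetical name, and it assumes SIGCHLD has its default disposition):

```python
import errno
import os


def reap_twice():
    """Fork a child, reap it, then try to reap it again."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    # First wait collects the exit status as expected.
    _, status = os.waitpid(pid, 0)
    first = os.WEXITSTATUS(status)
    try:
        # Second wait has no child left to report on.
        os.waitpid(pid, 0)
    except ChildProcessError as exc:
        return first, exc.errno
    return first, None
```

Here the first wait succeeds and only the second reports ECHILD. In the trace above the wait that was already in progress returns ECHILD right as the child calls _exit(), which is why it looks wrong.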
The steps described here do not reproduce the problem for me, but the following one does: do su inside su, 100 times. (yeah, it's boring.) Then do Ctrl-D and keep it pressed, to exit the 100 shells. The whole chain of shell-exits will proceed as expected, until a point when an 'stty' process hangs. Roughly 30 shells exited before the race was triggered. i have the very latest rawhide packages, kernel -2.49, latest glibc, etc. in this hung state, stty produces the strange strace output described by Ben.
The stty process has this state:
    Name:      stty
    State:     T (stopped)
    Tgid:      3549
    Pid:       3549
    PPid:      3535
    TracerPid: 0
    Uid:       0 0 0 0
    Gid:       0 0 0 0
    FDSize:    256
    Groups:    0 1 2 3 4 6 10
    VmSize:    3628 kB
    VmLck:     0 kB
    VmRSS:     476 kB
    VmData:    16 kB
    VmStk:     12 kB
    VmExe:     32 kB
    VmLib:     1292 kB
    SigPnd:    0000000000000000
    SigBlk:    0000000000000000
    SigIgn:    8000000000000000
    SigCgt:    0000000000000000
    CapInh:    0000000000000000
    CapPrm:    00000000fffffeff
    CapEff:    00000000fffffeff
stty has the following kernel state:
    stty T 00000000 2384 3549 3535 (NOTLB)
    Call Trace:
    [<c0127686>] finish_stop [kernel] 0x36 (0xe2f4deec))
    [<c0127a02>] get_signal_to_deliver [kernel] 0x212 (0xe2f4def8))
    [<c01092f4>] do_signal [kernel] 0x64 (0xe2f4df20))
    [<c01860f7>] tty_ioctl [kernel] 0x2e7 (0xe2f4df64))
    [<c0156847>] sys_ioctl [kernel] 0x97 (0xe2f4df94))
    [<c010d5f1>] syscall_trace [kernel] 0x51 (0xe2f4dfac))
    [<c0109570>] signal_return [kernel] 0x14 (0xe2f4dfc0))
Tim: is /usr/share/libtool/libltdl/aclocal.m4 supposed to have BASH_SYS_PGRP_SYNC set? It isn't on my box.
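The signal masks in that status dump can be decoded by hand: bit n-1 of each hex mask corresponds to signal number n. A rough sketch (decode_sigmask and default_action_applies are illustrative names, not an existing tool):

```python
import signal


def decode_sigmask(hex_mask):
    """Turn a /proc/<pid>/status signal mask into a list of signal numbers."""
    bits = int(hex_mask, 16)
    return [n for n in range(1, 65) if bits & (1 << (n - 1))]


def default_action_applies(signum, blocked, ignored, caught):
    """True if the signal is neither blocked, ignored, nor caught."""
    return all(signum not in decode_sigmask(m) for m in (blocked, ignored, caught))
```

For the values shown, decode_sigmask("8000000000000000") gives [64], so only signal 64 is ignored, and SIGTTOU is not blocked, ignored, or caught. Its default action (stop the process) therefore applies, which is consistent with the T (stopped) state.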
No. Get the bash from 8.0; that will reliably trigger the behaviour, as described.
Well, i can reproduce the hang even with current bash - which has the workaround installed. This makes this bug quite worrisome.
I have not reproduced the problem myself using any of the methods, though I still need to try with everything updated appropriately. However, I found some suspicious code in the kernel for TIOCSPGRP that could possibly explain this. If someone who can reproduce the bug can hack their kernel to show the backtrace from kill_pg when SIGTTOU is sent, that will be very helpful.
I was deluded, still have no clue. Will keep trying to reproduce it.
I have had no luck trying to reproduce this with Ingo's method. My machine has all current rawhide bits and kernel 2.49. I have tried both SMP and UP kernels. I ssh to the box as a non-root user, su with password, then type su again 100 or 200 times, then hold down C-d until back to the non-root prompt. No hangs.
Roland: if you use the bash from 8.0 you will find this extremely easy to reproduce. See comment #33.
No dice. bash-2.05b-5 (on otherwise rawhide system) does not make "su" followed by "su -" fail for me.
The way I reproduced it on my UP machine (which never showed this problem before, in ssh) was to do it under X, in gnome-terminal. Maybe that somehow influences timings. I was using a UP kernel.
No dice in gnome-terminal on 2.49 UP either. But I am confused as to how that scenario would do it. When does stty get run from shells exiting? It runs at shell startup because of /etc/bashrc.
looks like this is fixed...