81653 – "su -" from "su" session hangs.

Summary: "su -" from "su" session hangs.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Raw Hide
Classification:	Retired
Component:	kernel
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	79578 84225

Reported:	2003-01-12 06:21 UTC by John Ellson
Modified:	2007-04-18 16:49 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-12-15 19:07:14 UTC
Embargoed:

Attachments
strace of hung su - (298.03 KB, text/plain) 2003-02-14 17:49 UTC, Ben LaHaise	no flags

Description John Ellson 2003-01-12 06:21:47 UTC

Description of problem:
"su -" direct from user account works,
"su" direct from user account works,
but "su" followed by "su -" hangs.

Version-Release number of selected component (if applicable):
coreutils-4.5.3-10
rawhide-release-20030111-1

How reproducible:
100%

Steps to Reproduce:
1. su
2. su -
3.
    
Actual results:
hang

Expected results:
root's shell and env

Additional info:

Comment 1 Tim Waugh 2003-01-12 14:56:49 UTC

Noticed this myself earlier today too.  stty seems to get stopped by SIGSTOP.

Comment 2 Tim Waugh 2003-01-13 11:16:27 UTC

Or more likely SIGTTOU.

Comment 3 Tim Waugh 2003-01-13 13:52:25 UTC

This seems to be a regression relative to Phoebe.  But coreutils-4.5.3-10 was in
Phoebe too.

Perhaps a kernel issue of some sort?

Comment 4 John Ellson 2003-01-13 14:23:25 UTC

Could be.   I'm using kernel-2.4.20-2.12

I noticed that the bug is exhibited in gnome-terminal and in the text consoles,
but not in KDE's Konsole or in xterm.

I tried downgrading to vte-0.10.7-2 and gnome-terminal-2.1.3-2 (phoebe),
but that did not fix the problem.

Comment 5 Tim Waugh 2003-01-13 14:27:58 UTC

This seems to have some similarities to bug #79859.  Maybe coincidence.

Comment 6 John Ellson 2003-01-13 14:36:53 UTC

Using rawhide-release-20030112-1 on:
  kernel-2.4.18-19.8.0    OK
  kernel-2.4.20-2.9       OK
  kernel-2.4.20-2.12      NOT OK

I agree that it's a kernel problem introduced after 2.4.20-2.9.

Comment 7 Ingo Molnar 2003-01-15 11:52:52 UTC

This problem started after I added a scheduler enhancement in the area of
run-parent-first / run-child-first.

With that optimization reverted, the problem does not trigger, but this clearly
shows that it's a userspace bug triggered by the kernel's choice of whether the
parent or the child runs first after fork().

Comment 8 Tim Waugh 2003-01-15 15:27:03 UTC

Whenever it hangs, stty is in the process group of the bash run from su and in
the session of the bash from which su ran, while the tty's foreground process
group belongs to 'su -'.
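
The three facts compared above (stty's process group, its session, and the tty's foreground process group) can be read for any process on Linux from fields 3 to 8 of /proc/<pid>/stat: state, ppid, pgrp, session, tty_nr and tpgid, as documented in proc(5). The Python below is a minimal sketch assuming a Linux /proc; parse_proc_stat and job_control_info are hypothetical helper names, not tools used in this report. A process whose pgrp differs from tpgid is in the background with respect to its terminal, and a background process that changes terminal settings (as stty does) is sent SIGTTOU unless it blocks or ignores that signal.

```python
import os

def parse_proc_stat(line: str) -> dict:
    """Parse the leading fields of a /proc/<pid>/stat line (layout per proc(5))."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' instead of on whitespace.
    head, _, rest = line.rpartition(")")
    pid_str, _, comm = head.partition(" (")
    state, ppid, pgrp, session, tty_nr, tpgid = rest.split()[:6]
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": state,
        "ppid": int(ppid),
        "pgrp": int(pgrp),
        "session": int(session),
        "tty_nr": int(tty_nr),
        "tpgid": int(tpgid),
    }

def job_control_info(pid: int) -> dict:
    """Read /proc/<pid>/stat and flag a process stopped in the background.

    Assumes a Linux /proc filesystem.
    """
    with open(f"/proc/{pid}/stat") as f:
        info = parse_proc_stat(f.read())
    # tpgid is the foreground process group of the controlling tty (-1 if none).
    info["in_foreground"] = info["tpgid"] == info["pgrp"]
    info["stopped_in_background"] = info["state"] == "T" and not info["in_foreground"]
    return info
```

Pointed at the pid of a hung stty, job_control_info() should show the same mismatch between pgrp and tpgid that is described in this comment.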

Comment 9 Tim Waugh 2003-01-15 18:34:09 UTC

Forcing BASH_SYS_PGRP_SYNC on in aclocal.m4 seems to at least work around the
problem.  The workaround is in bash-2.05b-14.
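
The kind of race BASH_SYS_PGRP_SYNC guards against can be sketched as follows. As far as can be told from bash's sources, that configure check turns on bash's PGRP_PIPE code, in which a forked child waits on a pipe until the parent has finished its job-control setup, so the outcome no longer depends on whether the scheduler runs the parent or the child first after fork() (the ordering changed in comment 7). The Python below is a hypothetical illustration of that pattern, not bash's code; spawn_in_own_group is an invented name, and the terminal hand-off an interactive shell would perform is only indicated in a comment.

```python
import os

def spawn_in_own_group(argv: list) -> int:
    """Fork and exec argv as the leader of a new process group.

    The child blocks on a pipe until the parent has finished its own
    process-group setup, so the result does not depend on whether the
    scheduler runs the parent or the child first after fork().
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: set our own pgid too; both sides call setpgid() so it is
        # correct whichever process gets there first.
        os.setpgid(0, 0)
        os.close(w)
        os.read(r, 1)  # returns b"" once the parent closes its write end
        os.close(r)
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    # Parent: the child is still blocked on the pipe and cannot have exec'd,
    # so this setpgid() cannot fail with EACCES.
    os.setpgid(pid, pid)
    # An interactive shell starting a foreground job would also hand the
    # terminal to the new group here, e.g. os.tcsetpgrp(tty_fd, pid).
    os.close(r)
    os.close(w)  # EOF on the pipe releases the child
    return pid
```

Because both processes call setpgid(), the group is usually right even without the pipe; the pipe closes the remaining window in which the child could exec (and, for a program like stty, touch the terminal) before the parent is done.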

Comment 10 Jay Turner 2003-01-21 00:21:57 UTC

Seems to be working with bash-2.05b-14 and kernel-2.4.20-2.23.

Comment 11 Nils Philippsen 2003-02-12 02:16:44 UTC

This time, normal (unnested) "su -" and "su" seem to be broken with
kernel-2.4.20-2.47, bash-2.05b-10 through -20, coreutils-4.5.3-11 through -14,
glibc-2.3.1-41 through -43.

Comment 12 Nils Philippsen 2003-02-12 02:20:12 UTC

Forgot one thing: kernel-2.4.20-2.40 worked (though I don't know which versions
of the other packages I had then; it was a large upgrade).

Comment 13 Tim Waugh 2003-02-12 09:41:38 UTC

With what symptom?  I don't see this.

Comment 14 Nils Philippsen 2003-02-12 13:46:09 UTC

Symptom(s): I'm not able to use su (neither is postgresql's init script, BTW),
but that's not all. As I just found out, I'm not able to log in via ssh either,
so this might be a more generic problem. Unfortunately I can't tell you any
package versions beyond what I stated above at the moment because I can't log in.

Comment 15 Tim Waugh 2003-02-12 13:47:23 UTC

Latest pam_krb5?

Comment 16 Nils Philippsen 2003-02-12 14:01:00 UTC

I don't know whether I have pam_krb5 installed at all (I'm not using Kerberos on
that machine). I synced the beehive tree at roughly midnight from
Stuttgart, so if I have pam_krb5 it will be that version.

Comment 17 Nils Philippsen 2003-02-12 14:11:31 UTC

Symptoms clarified:

Any attempt to use su just hangs indefinitely (unless interrupted) after asking
for the password, regardless of interactive use or init script use. Likewise with
ssh login attempts and, IIRC (it was late, er, early this morning), login attempts
on text consoles: I'm able to type in the password, but that's all, it hangs
there.

I ran strace on su and it hung in some rt_sig* (rt_sigaction?) call.

Comment 18 Nils Philippsen 2003-02-12 19:46:03 UTC

I just confirmed that I don't have pam_krb5 installed.

An "strace su" as root brings this:

[...]
getpid()                                = 5303
rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
rt_sigsuspend([]
[...]

Then it hangs (until interrupted).

With regard to the text consoles and ssh login: the text console login seems to
hang in the login process; this times out and mingetty restarts itself. sshd
has many sleeping sshd processes (user root), which have zombie children
(user sshd):

root@wombat:~> ps auxw|grep sshd
root      1818  0.0  0.2  3484 1216 ?        S    08:29   0:00 /usr/sbin/sshd
root      9458  0.0  0.6 10104 3140 ?        S    14:22   0:00 /usr/sbin/sshd
sshd      9459  0.0  0.0     0    0 ?        Z    14:22   0:00 [sshd <defunct>]
root      9460  0.0  0.6 10100 3116 ?        S    14:22   0:00 /usr/sbin/sshd
sshd      9461  0.0  0.0     0    0 ?        Z    14:22   0:00 [sshd <defunct>]
root      9671  0.0  0.6 10104 3100 ?        S    14:26   0:00 /usr/sbin/sshd
sshd      9672  0.0  0.0     0    0 ?        Z    14:26   0:00 [sshd <defunct>]
root      9675  0.0  0.6 10104 3100 ?        S    14:26   0:00 /usr/sbin/sshd
sshd      9676  0.0  0.0     0    0 ?        Z    14:26   0:00 [sshd <defunct>]
root      9948  0.0  0.6 10104 3144 ?        S    14:31   0:00 /usr/sbin/sshd
sshd      9949  0.0  0.0     0    0 ?        Z    14:31   0:00 [sshd <defunct>]
root      9950  0.0  0.5 10104 2752 ?        S    14:31   0:00 /usr/sbin/sshd
sshd      9951  0.0  0.0     0    0 ?        Z    14:31   0:00 [sshd <defunct>]
root      5789  0.0  0.1  3576  648 pts/1    S    20:41   0:00 grep sshd

Comment 19 Nils Philippsen 2003-02-13 07:44:56 UTC

kernel 2.4.20-2.47.1 (as opposed to -2.47) doesn't show these problems.

Comment 20 Ben LaHaise 2003-02-14 17:24:47 UTC

It happens for me on 2.4.20-2.48.  The su ; su - sequence didn't trigger it, but
su ; su - ; su - did.

Comment 21 Tim Waugh 2003-02-14 17:27:29 UTC

Ben: what version of bash do you have?

I think there are two issues getting intermingled in this report.

Comment 22 Ben LaHaise 2003-02-14 17:28:37 UTC

strace on the stuck stty shows an infinite stream of:

--- SIGTTOU (Stopped (tty output)) ---
--- SIGTTOU (Stopped (tty output)) ---
ioctl(0, SNDCTL_TMR_STOP, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) ---
--- SIGTTOU (Stopped (tty output)) ---
ioctl(0, SNDCTL_TMR_STOP, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
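
A note on reading these traces: strace of this vintage appears to look up ioctl requests by their type and number bits only, so it printed OSS sound-timer names that share the same low 16 bits as tty requests. If that reading is right, SNDCTL_TMR_TIMEBASE in these traces is really TCGETS (tcgetattr) and SNDCTL_TMR_STOP is TCSETSW (tcsetattr with TCSADRAIN), which fits stty being stopped by SIGTTOU while changing terminal settings. The sketch below only does the arithmetic with the Linux _IOC encoding; the tty constants are the x86 values, and the helper names ioc and type_and_nr are invented for illustration.

```python
# Linux ioctl request layout (x86 and most architectures):
# bits 0-7 nr, bits 8-15 type, bits 16-29 size, bits 30-31 direction.
IOC_NONE, IOC_WRITE, IOC_READ = 0, 1, 2

def ioc(direction: int, type_char: str, nr: int, size: int) -> int:
    """Build a request number the way the kernel's _IOC() macro does."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr

def type_and_nr(request: int) -> int:
    """Keep only the type and nr bits, masking off size and direction."""
    return request & 0xFFFF

# OSS sound-timer requests as defined in <linux/soundcard.h>
SNDCTL_TMR_TIMEBASE = ioc(IOC_READ | IOC_WRITE, "T", 1, 4)  # _SIOWR('T', 1, int)
SNDCTL_TMR_STOP = ioc(IOC_NONE, "T", 3, 0)                  # _SIO('T', 3)

# Terminal requests on x86 Linux
TCGETS, TCSETSW = 0x5401, 0x5403

for name, value, tty_name, tty_value in [
    ("SNDCTL_TMR_TIMEBASE", SNDCTL_TMR_TIMEBASE, "TCGETS", TCGETS),
    ("SNDCTL_TMR_STOP", SNDCTL_TMR_STOP, "TCSETSW", TCSETSW),
]:
    print(f"{name} = {value:#010x}, type+nr {type_and_nr(value):#06x}; "
          f"{tty_name} = {tty_value:#06x}")
```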

Comment 23 Tim Waugh 2003-02-14 17:34:22 UTC

Ben: what version of bash do you have?

Comment 24 Ben LaHaise 2003-02-14 17:40:09 UTC

bash-2.05b-5.  Still, this is a kernel bug if a syscall is infinitely being
restarted.

Comment 25 Ben LaHaise 2003-02-14 17:49:24 UTC

Created attachment 90094
strace of hung su -

Comment 26 Ben LaHaise 2003-02-14 17:50:16 UTC

Ingo, this is a kernel problem -- can you look at the trace?

Comment 27 Ingo Molnar 2003-02-14 18:51:10 UTC

Ben, do you see 'stty' looping infinitely? I think the infinite restarts are
just an strace artifact.

What might have happened is that stty tried to write to a terminal that is
already closed (or something like that), due to some user-space race. The fact
that turning on additional synchronization in bash solves the problem
strengthens this theory.

Comment 28 Ben LaHaise 2003-02-14 19:15:06 UTC

Take a look at the whole strace, specifically the part copied below: wait4() is
returning -ECHILD when a child has just exited.

[pid 10469] ioctl(0, SNDCTL_TMR_STOP, {B38400 opost isig icanon echo ...}) = 0
[pid 10469] ioctl(0, SNDCTL_TMR_TIMEBASE, {B38400 opost isig icanon echo ...}) = 0
[pid 10469] _exit(0)                    = ?
[pid 10426] <... wait4 resumed> 0xbfffed38, 0, NULL) = -1 ECHILD (No child processes)
[pid 10426] ioctl(255, SNDCTL_TMR_TIMEBASE, 0xbfffecf0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 10426] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
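
For comparison when reading this trace: wait4() fails with ECHILD whenever the caller has no unwaited-for child matching the request, for example because the child was already collected by an earlier wait (in a shell, possibly one made from its SIGCHLD handler). The minimal Python sketch below shows only that documented behaviour, using a hypothetical helper reap_twice; it does not establish whether that is what happened in the trace above.

```python
import errno
import os

def reap_twice() -> tuple:
    """Fork a child that exits at once, wait for it, then wait for it again.

    The first waitpid() collects the exit status; the second finds no such
    child left and fails with ECHILD (ChildProcessError in Python).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with a recognisable status
    _, status = os.waitpid(pid, 0)  # reaps the zombie
    first = os.WEXITSTATUS(status)
    try:
        os.waitpid(pid, 0)  # same pid again: nothing left to collect
    except ChildProcessError as e:
        return first, errno.errorcode[e.errno]
    raise RuntimeError("second wait unexpectedly succeeded")
```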

Comment 29 Ingo Molnar 2003-02-18 15:51:30 UTC

The steps described here do not reproduce the problem for me, but the following
one does:

Do su inside su, 100 times (yeah, it's boring). Then press Ctrl-D and keep it
pressed to exit the 100 shells. The chain of shell exits proceeds as
expected until, at some point, an 'stty' process hangs. Roughly 30 shells exited
before the race was triggered.

I have the very latest rawhide packages: kernel -2.49, latest glibc, etc.

In this hung state, stty produces the strange strace output described by Ben.

Comment 30 Ingo Molnar 2003-02-18 15:53:41 UTC

The stty process has this state:

Name:   stty
State:  T (stopped)
Tgid:   3549
Pid:    3549
PPid:   3535
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 256
Groups: 0 1 2 3 4 6 10
VmSize:     3628 kB
VmLck:         0 kB
VmRSS:       476 kB
VmData:       16 kB
VmStk:        12 kB
VmExe:        32 kB
VmLib:      1292 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 8000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff
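
The SigPnd, SigBlk, SigIgn and SigCgt lines are 64-bit hex masks in which bit n-1 stands for signal n. Decoding them, as in the small Python sketch below (decode_sigmask is a hypothetical helper), shows that stty blocks and catches nothing and ignores only signal 64, so SIGTTOU (22 on Linux) keeps its default action of stopping the process, consistent with "State: T (stopped)".

```python
import signal

def decode_sigmask(hex_mask: str) -> list:
    """Return the signal numbers set in a /proc/<pid>/status signal mask.

    Bit n-1 of the mask corresponds to signal number n.
    """
    mask = int(hex_mask, 16)
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]

# Values copied from the stty status above.
stty_masks = {
    "SigPnd": "0000000000000000",
    "SigBlk": "0000000000000000",
    "SigIgn": "8000000000000000",
    "SigCgt": "0000000000000000",
}
for field, value in stty_masks.items():
    print(field, decode_sigmask(value))

# True would mean SIGTTOU is pending, blocked, ignored or caught.
print("SIGTTOU in any mask:",
      any(signal.SIGTTOU in decode_sigmask(v) for v in stty_masks.values()))
```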

Comment 31 Ingo Molnar 2003-02-18 15:54:38 UTC

stty has the following kernel state:

stty          T 00000000  2384  3549   3535                     (NOTLB)
Call Trace:   [<c0127686>] finish_stop [kernel] 0x36 (0xe2f4deec))
[<c0127a02>] get_signal_to_deliver [kernel] 0x212 (0xe2f4def8))
[<c01092f4>] do_signal [kernel] 0x64 (0xe2f4df20))
[<c01860f7>] tty_ioctl [kernel] 0x2e7 (0xe2f4df64))
[<c0156847>] sys_ioctl [kernel] 0x97 (0xe2f4df94))
[<c010d5f1>] syscall_trace [kernel] 0x51 (0xe2f4dfac))
[<c0109570>] signal_return [kernel] 0x14 (0xe2f4dfc0))

Comment 32 Ingo Molnar 2003-02-18 15:57:28 UTC

Tim: is /usr/share/libtool/libltdl/aclocal.m4 supposed to have
BASH_SYS_PGRP_SYNC set? It isn't on my box.

Comment 33 Tim Waugh 2003-02-18 16:02:28 UTC

No.

Get the bash from 8.0; that will reliably trigger the behaviour, as described.

Comment 34 Ingo Molnar 2003-02-18 16:06:22 UTC

Well, I can reproduce the hang even with the current bash, which has the workaround
installed. This makes this bug quite worrisome.

Comment 35 Roland McGrath 2003-02-18 20:18:00 UTC

I have not reproduced the problem myself using any of the methods, though I
still need to try with everything updated appropriately.

However, I found some suspicious code in the kernel for TIOCSPGRP
that could possibly explain this.  If someone who can reproduce the bug
can hack their kernel to show the backtrace from kill_pg when SIGTTOU is sent,
that will be very helpful.

Comment 36 Roland McGrath 2003-02-18 20:24:13 UTC

I was deluded; I still have no clue.  Will keep trying to reproduce it.

Comment 37 Roland McGrath 2003-02-19 22:05:33 UTC

I have had no luck trying to reproduce this with Ingo's method.
My machine has all current rawhide bits and kernel 2.49.
I have tried both SMP and UP kernels.

I ssh to the box as a non-root user, su with password, then type su again 100 or
200 times, then hold down C-d until back to the non-root prompt.  No hangs.

Comment 38 Tim Waugh 2003-02-19 22:42:19 UTC

Roland: if you use the bash from 8.0 you will find this extremely easy to
reproduce.  See comment #33.

Comment 39 Roland McGrath 2003-02-20 02:11:22 UTC

No dice.  bash-2.05b-5 (on otherwise rawhide system) does not make
"su" followed by "su -" fail for me.

Comment 40 Ingo Molnar 2003-02-20 09:23:40 UTC

The way I reproduced it on my UP machine (which had never shown this problem
before over ssh) was to do it under X, in gnome-terminal. Maybe that somehow
influences the timing. I was using a UP kernel.

Comment 41 Roland McGrath 2003-02-20 22:06:17 UTC

No dice in gnome-terminal on 2.49 UP either.  But I am confused as to how
that scenario would do it.  When does stty get run from shells exiting?
It runs at shell startup because of /etc/bashrc.

Comment 42 Matt Wilson 2003-10-16 18:59:07 UTC

Looks like this is fixed...
