As part of troubleshooting bug 598631 we got a workaround into util-linux-ng that makes the symptoms of that bug go away. We should make sure we revert that patch once the root cause is found, since the patch could have farther-reaching implications than are immediately apparent.
Relevant context below:
+++ This bug was initially created as a clone of Bug #598631 +++
Description of problem:
None of shutdown, reboot, halt and Ctrl-Alt-Delete actually achieve anything useful.
If I do any of them, I see a broadcast message:
[root@andromeda linux-2.6.32.i686]# shutdown now
Broadcast message from firstname.lastname@example.org
(/dev/pts/0) at 19:24 ...
The system is going down for maintenance NOW!
The terminals on /dev/tty# and /dev/ttyS# are killed, but that's all. Nothing else happens. My ssh login is still there, and I can still do things.
Looking in ps, I see:
1591 ? Ss 0:00 /usr/sbin/atd
1710 ? Ss 0:00 /bin/sh -e /dev/fd/10
1714 ? S 0:00 \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
1715 ? Ss 0:00 /bin/sh -e /dev/fd/11
1716 ? S 0:00 \_ /sbin/plymouthd --mode=shutdown
1717 ? S 0:00 \_ /sbin/plymouthd --mode=shutdown
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Do any of shutdown, reboot, halt or C-A-D.
The machine should shut down.
May need to be running an X session.
--- Additional comment from email@example.com on 2010-06-18 19:02:35 EDT ---
I have the same problem with plymouth-0.8.3-5.el6.x86_64
It simply hangs here:
# ps -ef | grep plymouth
root 25120 25119 0 14:55 ? 00:00:00 /sbin/plymouthd --mode=shutdown
root 25121 25120 0 14:55 ? 00:00:00 /sbin/plymouthd --mode=shutdown
# strace -p 25121
Process 25121 attached - interrupt to quit
Killing that child plymouthd allows reboot to complete.
--- Additional comment from firstname.lastname@example.org on 2010-07-09 18:30:42 EDT ---
I've looked into this a bit today, but still need to investigate more.
It looks like the CLOCAL flag on the serial console tty is somehow getting cleared.
If I run
stty clocal --file=/dev/ttyS0
then shutdown works fine.
I've also added a
stty -a --file=/dev/ttyS0
call to /etc/rc.local and indeed clocal is properly set at that point (right before agetty is run).
Adding -L to /etc/init/serial.conf also makes reboot work properly. This means that CLOCAL is getting cleared some time after rc.local runs and some time before agetty calls termio_init().
It could be upstart, plymouth, the kernel, or something else clearing CLOCAL.
I've got to run now, but I'll investigate more on Monday.
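The stty check described in this comment can also be done from a program. The following is a minimal sketch, assuming a POSIX system; the helper names cflag_has_clocal and tty_has_clocal are illustrative only and do not come from any package discussed in this bug. The device is opened with O_NONBLOCK so that the open() itself cannot hang when CLOCAL is clear.

```c
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Pure helper: returns 1 if CLOCAL is set in the given control flags. */
static int cflag_has_clocal(tcflag_t cflag)
{
    return (cflag & CLOCAL) != 0;
}

/*
 * Returns 1 if CLOCAL is set on the tty at `path`, 0 if it is clear,
 * and -1 if the device cannot be opened or is not a terminal.
 * O_NONBLOCK keeps open() from waiting on the modem lines;
 * O_NOCTTY avoids acquiring the device as a controlling terminal.
 */
static int tty_has_clocal(const char *path)
{
    struct termios tio;
    int fd = open(path, O_RDONLY | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    int rc = tcgetattr(fd, &tio);
    close(fd);
    if (rc < 0)
        return -1;

    return cflag_has_clocal(tio.c_cflag);
}
```

Calling tty_has_clocal("/dev/ttyS0") at different points during boot would give the same information as the stty -a calls, without parsing stty output.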
--- Additional comment from email@example.com on 2010-07-09 23:42:30 EDT ---
So it looks like this is agetty itself dropping CLOCAL.
In termio_init, it does:
tp->c_cflag = CS8 | HUPCL | CREAD;
which wipes out all existing control flags from the tty (including CLOCAL). The only reason it doesn't hang itself when trying to open the tty is that it uses O_NONBLOCK. I think we need to do one of:
1) Fix the assignment above to:
tp->c_cflag |= CS8 | HUPCL | CREAD;
2) Fix the assignment above to:
tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);
3) change initscripts to pass -L to agetty when the tty is initialized with CLOCAL by the kernel
4) change plymouth to open its tty with O_NONBLOCK
I'm not sure what the "right" answer is out of these 4 (or if some 5th possibility is the right answer), but since 2 of the 4 involve changing agetty, I'm moving this to util-linux-ng
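To make the difference between the current assignment and options 1) and 2) concrete, here is a small sketch that applies each expression to a set of pre-existing control flags. The function names are hypothetical and the code operates on a plain tcflag_t value rather than a real tty; it is meant only to illustrate the bit arithmetic, not to reproduce agetty's termio_init().

```c
#include <termios.h>

/* Current behaviour: every existing control flag is discarded, CLOCAL included. */
static tcflag_t cflag_reset(tcflag_t cur)
{
    (void)cur;
    return CS8 | HUPCL | CREAD;
}

/* Option 1: OR into the existing flags, so everything previously set survives. */
static tcflag_t cflag_or(tcflag_t cur)
{
    return cur | CS8 | HUPCL | CREAD;
}

/* Option 2: reset the flags, but carry CLOCAL over if it was already set. */
static tcflag_t cflag_keep_clocal(tcflag_t cur)
{
    return CS8 | HUPCL | CREAD | (cur & CLOCAL);
}
```

With an input of CLOCAL | PARENB, the first function drops both bits, the second keeps both, and the third keeps only CLOCAL. Option 1 is therefore the broader change, since it also preserves unrelated settings such as parity.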
--- Additional comment from firstname.lastname@example.org on 2010-07-12 13:33:41 EDT ---
So, I've been reading up a bit more about serial protocol, and I think the CLOCAL "fix" may just be a workaround. I'd appreciate it if someone who knows more about this stuff would chime in, though. Here's an explanation of how I think it works:
As I understand it, CLOCAL means "don't block waiting for carrier detect and dataset ready signals". These are two different bits that deal with hardware flow control. I think they mean something like this:
- When carrier detect is asserted it means there is a connection between the remote serial console and the running machine.
- When dataset ready is asserted it means the serial console is reporting that it's ready to receive data from the running machine.
stty -a --file=/dev/ttyS0
on an affected machine outputs:
speed 115200 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread -clocal -crtscts cdtrdsr
-ignbrk brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff
-iuclc -ixany imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
The "cdtrdsr" there suggests the tty is configured for hardware flow control. Despite this, I'm guessing we're not getting a "dataset ready" signal from the remote serial console which is what's causing the open() call to block. When CLOCAL is set, the kernel doesn't bother waiting for the "dataset ready" signal, so no blocking will happen. Note, I don't believe the remote end will assert "dataset ready" until the kernel asserts "data terminal ready" (a different control line) in response to an open() call from the application. So the problem could be the kernel isn't ever asserting "data terminal ready" and so the remote end isn't ever asserting "dataset ready" in response. Or it could be something else.
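One way to test the guess above would be to read the modem control lines directly. The sketch below assumes Linux, where the TIOCMGET ioctl and the TIOCM_* bit names are available through <sys/ioctl.h>; the helper names are made up for illustration. It opens the device with O_NONBLOCK so the open() cannot block on the very condition being investigated.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Formats the state of the three lines discussed above into buf,
 * for example "DTR=1 DSR=0 DCD=0". Returns snprintf()'s result.
 */
static int format_modem_lines(int bits, char *buf, size_t len)
{
    return snprintf(buf, len, "DTR=%d DSR=%d DCD=%d",
                    (bits & TIOCM_DTR) != 0,   /* data terminal ready (driven by us) */
                    (bits & TIOCM_DSR) != 0,   /* dataset ready (driven by remote end) */
                    (bits & TIOCM_CAR) != 0);  /* carrier detect */
}

/*
 * Reads the modem-control bits of the tty at `path` into *bits.
 * Returns 0 on success, -1 if the device cannot be opened or queried.
 */
static int read_modem_lines(const char *path, int *bits)
{
    int fd = open(path, O_RDONLY | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, TIOCMGET, bits);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

If read_modem_lines("/dev/ttyS0", &bits) succeeds on an affected machine, the formatted output would show whether DTR is being asserted locally and whether DSR and carrier detect ever come back from the remote end.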
Anyway, this stuff is a bit outside my knowledge domain, so I'd appreciate feedback from someone more knowledgeable here.
Also, I just noticed Karel committed a patch that does suggestion 2) from comment 25 in util-linux-ng-2.17.2-5.el6. I'll test that and report its results.
--- Additional comment from email@example.com on 2010-07-12 13:52:52 EDT ---
Just to follow up, I can confirm util-linux-ng-2.17.2-5.el6 makes the problem go away.
--- Additional comment from firstname.lastname@example.org on 2010-07-12 14:23:29 EDT ---
Created an attachment (id=431237)
dmesg output after echo t > /proc/sysrq-trigger while cat /usr/share/pango*/HELLO.txt > /dev/ttyS0 is blocked
I've talked to aris about this issue, and he believes this is a kernel bug. He's asked for the above output.
--- Additional comment from email@example.com on 2010-07-12 14:50:04 EDT ---
Note for QA & PM: it seems that the change in agetty is a workaround for kernel bug #613756. I'm going to keep the patch in util-linux-ng until we find a better solution.
This proposed exception is missing an ack and remains unapproved and unresolved.
Snapshot 8 is the last snapshot for approved exceptions and commits and brew builds are due by Jul 14 noon Eastern.
If you plan to fix this in Snapshot 8, please request your acks of the appropriate RHEL 6.0 leads.
If this issue is not approved and resolved by Jul 14, it will be denied for RHEL 6.0.
We should drop the patch if the kernel bug gets fixed. If the kernel bug doesn't get fixed, or gets demoted, we should keep this patch. Moving to blocker? so this doesn't get autoclosed.
I talked to aris about this today and it sounds like we will probably need to revert this patch:
<halfline> aris: the work around we put in util-linux-ng is to set CLOCAL on the tty
<halfline> i have a suspicion that unconditionally setting CLOCAL is wrong
<halfline> and will break certain setups, however
<halfline> aris: can you make a call on the validity of the CLOCAL work around?
<aris> halfline, I'm pretty sure it'll break serial console
<aris> but I don't know for sure until I look at it
<notting> solution: apply etherkiller to serial ports on affected machines
<halfline> by "serial console" you mean a normal serial console over a physical serial cable, versus these remote serial console things?
<aris> normal serial console
(In reply to comment #3)
> I talked to aris about this today and it sounds like we will probably need to
> revert this patch:
> <halfline> aris: the work around we put in util-linux-ng is to set CLOCAL on
> the tty
That's not true. We don't set the flag -- we keep the flag if it is already set by the kernel. So:
tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);
This means that userspace follows the kernel's opinion about the tty.
There may be another workaround which is acceptable:
[11:26:14] <aris> halfline, there's something that can be done: IIRC, the setting for DTR/DSR flow on the serial console is 'd'. unless it's set on /proc/cmdline, DTR/DSR flow control can be disabled by default
[11:26:20] <aris> yes, it's ugly
[11:30:39] <halfline> aris: so you're saying the fix may be to disable dtr/dsr flow control in all cases unless the user has console=/dev/ttyS0,9600d or whatever?
[11:34:19] <aris> adding something on util-linux or whatever as workaround
[11:34:21] <halfline> oh you're suggesting we do this from user-space?
[11:34:23] <aris> is not pretty
[11:34:26] <aris> yes
[11:34:29] <aris> instead of CLOCAL
[11:34:38] <aris> basically if you use serial console with flow control
[11:34:45] <aris> and you need the hardware flow control
[11:34:48] <aris> CLOCAL will disable it
[11:35:25] <halfline> this dtr/dsr issue is going to cause other problems with serial devices too though yea?
[11:35:30] <halfline> not just serial console?
[11:35:36] <aris> not really
[11:35:52] <aris> because anything you need to use, you'll need to setup the serial the way you want
[11:36:03] <halfline> it won't break serial printers that use rts/cts?
[11:36:05] <halfline> oh i see
[11:36:12] <aris> for some reason, the implementation of serial port on those sun boxes won't have DTR always on or something
[11:36:35] <aris> that's why it didn't cause problems on other machines
[11:36:43] <aris> that's my guess from what I've seen so far
[11:37:05] <halfline> if you enter the escape sequence in the console client it has an option for toggling flow control
[11:37:09] <halfline> but i think it's software flow control
[11:38:34] <halfline> aris: so if we get kzak to add the ugly hack to util-linux-ng then we can demote the kernel bug from blocker status?
[11:39:20] <aris> yes
[11:40:25] <halfline> okay
[11:42:53] <halfline> that might be the best solution given mchehab is on vacation and you're overbooked
ok, after comment #4 I'm convinced the workaround is good.
(In reply to comment #6)
> ok, after comment #4 I'm convinced the workaround is good.
Should the workaround be applied to util-linux-ng upstream?
BTW, is it a workaround or a bugfix? -- in other words, is it correct that
we remove (reset) all tty flags in agetty?
Closing, I don't think that we have to remove this workaround from the package. I have added the patch to upstream code too.
Note that the upstream version also supports a new option "-c":
-c Don’t reset terminal cflags (control modes). See termios(3) for more details.
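As a rough illustration of what an option like "-c" implies, the sketch below gates the control-flag reset on a boolean. This is an assumption about the general shape of such a change, not the upstream code: the function name init_cflag and the keep_cflags parameter are hypothetical.

```c
#include <termios.h>

/*
 * Hypothetical helper, not agetty's actual implementation.
 * When keep_cflags is non-zero (as a "-c"-style option might request),
 * the control modes are left exactly as found. Otherwise they are reset
 * to CS8 | HUPCL | CREAD, preserving CLOCAL if it was already set.
 */
static void init_cflag(struct termios *tp, int keep_cflags)
{
    if (keep_cflags)
        return;

    tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);
}
```

Under this sketch, a tty that starts with PARENB | CLOCAL keeps both bits when keep_cflags is set, and ends up with CS8 | HUPCL | CREAD | CLOCAL when it is not.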